Learning and exploring the workflow of RNA-SEQ analysis-a remark for myself

Learned RNA-SEQ-Workflow using C. Difficile data from a published study 🧬. RAW processes by Fastp → Kallisto → DESEQ2 pipeline. Results corresponded to the findings of the original article, with clear differential expression between mucus and control conditions 📊.

Motivations:

To be honest, I don’t really know what RNA-Seq was until I learned more about it and the potential of it! We have viewed in our earlier learning processes
Learn antimicrobial resistance (AMR) genes with bioconductor”
Phylogenetic analysis”
Building DNA sequence”
Assembly DNA sequence, learning blazing and MLST Learning”
Long series of on -work flow exploring. Note that this is all related to DNA. This article, we will explore the world of RNA! In my simplistic vision, RNA-SEQ enables us to explore and discover transcripts of certain conditions, whether it can tell us a little more of the gene expression based on certain conditions. The potential is huge! Imagine that you can see the difference between colonization versus a real infection in a clinical setting, I wonder if differential expression analysis of transcriptoma can offer us some more information! Differentiating infection versus contamination (more accurate hai definition 🤷 agrees). Let’s dive into the shallow pool of the unknown world of RNA-SEQ and at least learn the basics! Let’s go!

Game plan:

Let’s take a look Clostridioides difficile And see if we can find raw CDNA sequences on NCBI. After some searching I thought this
Clostridioides Difficile-Mucus Interactions include shifts in gene expression, metabolism and biofilm formation That can potentially be a good one to look at and see if we can somewhat reproduce what has been found.

Safeguard:

I am not a biostatistician, nor do I work in a laboratory. I try to understand the Bioinformatics workflow of an RNA-SEQ. If you find information here that is displayed here, let me know so that I can learn. Also check the information presented here. This is a documentation of my learning process and a comment for my future self to reproduce the workflow that I have investigated

Objectives:

What is RNA-SEQ?

RNA-SEQ, or RNA sequencing, is a powerful technique that is used to analyze the transcriptom of a cell or organism. The transcriptom refers to the full set of RNA molecules, including Messenger-RNA (MRNA), ribosomal RNA (RRNA), transfer RNA (TRNA) and non-coding RNAs, which are present in a cell at a specific time. RNA-SEQ enables researchers to study gene expression patterns, to identify differential genes made and to discover new transcripts.
Read more about Wiki. It is usually a challenge to sequence from RNA itself because of its instability and sensitivity to breakdown. That is why RNA is typically converted into complementary DNA (CDNA) with the help of a process called reverse transcription before sequencing. This CDNA is then used as the input for sequencing platforms. For a short RNA sequence, nowadays with ONT, it does offer flow cells that RNA can immediately sequence can sequence without the need for conversion to CDNA.

Let’s look at existing data

Let’s dive into the
bioproject. To download the RAW series, we have to go through different clicks. And also understand what some of these abbreviations means.

Sequence read archive (SRA):
SRR – Individual sequencing runs (unprocessed data files).
SRX – Experiments (groups of related runs).
SRS – Samples (organic samples).
SRP – projects/studies (collections of related experiments).

Genexpression for everyone (GEO):
GSM – Individual samples with expression data.
GSE – Series/data sets (collections of related examples).
GPL – Platforms (descriptions of microarray or sequencing platform).
GDS – Curated data sets (processed GSE data).

Assembly database:
ASM – Genoomsemblies.
GCA – Genbank -AsMblies.
GCF – Refseq – assemblies.

Core sequence databases:
NC_ – Refseq -chromosomes/complete taken.
NM_ – Refseq mrna sequences.
NP_ – Refseq -protein sequences.
XM_/XP_ – Model/predicted Refseq sequences.
AC_ – Genbank completed genomic sequences.

Bioproject & Biosample:
PRJNA – Bioproject – accessions (research projects).
SAMN – Biosample accessions (organic samples).

Okay, with that from the way, we are interested in SRRthat must be under SRA. We can see on the Bio project page that there is SRA experiments. Click on that and it takes you to a page of SRXS. Because they are very large files, in terms of performances, we cannot simply download from the website, we need sra-tools

## Install follow this link instruction 
## https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit

## download SRRs example
fasterq-dump --progress --threads 4 SRR27792604

Now repeat the above for the other SRRs And you get all the rough sequences fastq Format. With our practice we download alone SRR27792597“SRR27792603“SRR27792598AndSRR27792604. Two slime and two controls. It will take a while, that’s why I think the --progress parameter. Okay, after we have downloaded the sequences, let’s continue with our official workflow.

files <- c("kallisto_SRR27792597/abundance.h5",
           "kallisto_SRR27792603/abundance.h5", 
           "kallisto_SRR27792598/abundance.h5",
           "kallisto_SRR27792604/abundance.h5")

The workflow

Tl; Dr Quality Control (FASTP) -> Create Transcriptome Reference (Kallisto) -> Together Raw Order (Kallisto) -> Analyze (PCA & DESEQ2)

QC read from raw

# install fastp

# run QC
fastp -i SRR27792597_1.fastq -h SRR27792597_1.html --verbose --stdout --thread 8 > /dev/null

If you are, like me, you like to know what’s going on while running, make sure you record --verbose And I found --thread Is very helpful, otherwise it would take somewhere. Repeat for everyone fastq Files and inspect.

What is considered acceptable?

I will be honest, I’m not sure. 🤔 Read quality statistics such as Quality Scores (Phred -Scores): Q30+:> 80% is good,> 90% is excellent, Q20+: must be> 95%. Total Reads: Depends on our goals: Differential expression: 10-30m reads per example, while new transcripts coverage: 50-100m+ reads. Contamination and Artifacts: Adapter content: <5%: goed, 5-20%: matig (moet trimmen),> 20%: High contamination. Duplication Rate: <30%: laag (genomisch DNA-achtig), 30-70%: normaal voor RNA-seq,> 70%: Possible problematic. Composition of the series:
GC Content: Must correspond to the expected organism: people/mouse: ~ 42%, E. coli: ~ 50%, yeast: ~ 38%; Quality per base: must remain> Q28 about the reading length. Red flags to pay attention to poor quality data: <10m reads for differential expression <80% q20 bases, severe 3 'decrease in quality (under Q20), unusual GC content peaks, 20% adapter pollution after cropping.

Our QC report is not included, but a quick look, it looks pretty good!

Getting a reference from transcriptom

On the supplement of the article we found that they used FN545816.1 As a reference tree. We can download them from
here. Click download and include gff also.

# Install gffread if you don't have it
brew install gffread  # on macOS

# install kallisto
# go here https://pachterlab.github.io/kallisto/download.html for info

# extract transcriptome sequences
gffread -w transcriptome.fa -g GCA_000027105.1_ASM2710v1_genomic.fna genomic.gff
kallisto index -i transcriptome.idx transcriptome.fa

You should have transcriptome.idx And transcriptome.fa In your current workbook. What this does is that it removes the transcripta sequences from the genome based on the gff file, in our case it is a specific
Clostridioides difficile R20291 Meeting with accession FN545816.1. The gff File contains information about the locations of genes and other functions on the genome. The -w flag specifies the output file for the transcripta sequences and the -g Vlag indicates the input genoma sequence file. The resulting transcriptome.fa File contains the nucleotid checks of all transcriptions annotated in the gff file.

Now that we have the reference. Let us compile our raw order into something readable.

To assemble

# assemble, make sure to include all the other samples
kallisto quant -i transcriptome.idx \
 -o kallisto_SRR27792604 \
 --threads 8 \
 SRR27792604_1.fastq SRR27792604_2.fastq

We should go through all the unprocessed sequences that we have downloaded and mounted. We then see folders like kallisto_SRR27792604. And we see in the folder abundance.h5 And abundance.tsv. The reason I prefer kallisto over STAR Is mainly because of the speed. When I tried Star for the first time, it was very slow and not a progress bar. The output of these files is actually quite small. Let’s take a look!

library(tximport)
library(DESeq2)

# Create file paths for all samples
files <- c("kallisto_SRR27792597/abundance.h5",
           "kallisto_SRR27792603/abundance.h5", 
           "kallisto_SRR27792598/abundance.h5",
           "kallisto_SRR27792604/abundance.h5")

# Name them
names(files) <- c("mucus_rep1", "control_rep1", "mucus_rep2", "control_rep2")

# Import data
txi <- tximport(files, type = "kallisto", txOut = TRUE)

coldata <- data.frame(
  condition = c("mucus", "control", "mucus", "control"),  # or however they're grouped
  row.names = names(files)
)

dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)

Just like a comment, there is another way we can make Matrix and also count tsv.

PCA

# Perform variance stabilizing transformation
vsd <- vst(dds, blind = FALSE)

# Plot PCA
plotPCA(vsd, intgroup = "condition")

Wow, it seems that the treatment and control are quite well separated! Let us dive directly into differential expression analysis. If we look at Figure 2b about the article, it is very similar, although we do not record the full sequences.

DESEQ2

library(tidyverse)

# Run DESeq2 analysis
dds <- DESeq(dds)

# Get results
res <- results(dds, alpha = 0.01, tidy = T)

# View summary
summary(res)

##      row               baseMean       log2FoldChange         lfcSE        
##  Length:3570        Min.   :      0   Min.   :-8.42507   Min.   :0.06850  
##  Class :character   1st Qu.:    326   1st Qu.:-0.51037   1st Qu.:0.08493  
##  Mode  :character   Median :   1313   Median : 0.00074   Median :0.10664  
##                     Mean   :  12154   Mean   :-0.01934   Mean   :0.16100  
##                     3rd Qu.:   4999   3rd Qu.: 0.53025   3rd Qu.:0.16896  
##                     Max.   :3272163   Max.   : 3.75721   Max.   :4.98958  
##                                       NA's   :1          NA's   :1        
##       stat               pvalue               padj          
##  Min.   :-80.75478   Min.   :0.0000000   Min.   :0.0000000  
##  1st Qu.: -4.39763   1st Qu.:0.0000000   1st Qu.:0.0000000  
##  Median :  0.00566   Median :0.0001034   Median :0.0002068  
##  Mean   : -0.71728   Mean   :0.1229994   Mean   :0.1366817  
##  3rd Qu.:  3.44035   3rd Qu.:0.0970157   3rd Qu.:0.1293422  
##  Max.   : 28.59200   Max.   :0.9978324   Max.   :0.9978324  
##  NA's   :1           NA's   :1           NA's   :1

# positive
res |> 
  filter(log2FoldChange > 0) |>
  arrange(padj, desc(log2FoldChange)) |>
  head(10)

##                   row  baseMean log2FoldChange      lfcSE     stat
## 1  gene-CDR20291_1626  2872.357       2.999390 0.10490313 28.59200
## 2  gene-CDR20291_2014 32895.724       2.043527 0.07250627 28.18414
## 3  gene-CDR20291_0509 54339.793       1.925179 0.07046733 27.32016
## 4  gene-CDR20291_2174 13209.067       1.957986 0.07539631 25.96926
## 5  gene-CDR20291_2495 24664.096       2.321013 0.08978192 25.85168
## 6  gene-CDR20291_0508  8811.393       2.021692 0.07967500 25.37423
## 7  gene-CDR20291_2738  9037.752       1.859512 0.07781080 23.89787
## 8  gene-CDR20291_1446  4410.271       1.982683 0.08581234 23.10487
## 9  gene-CDR20291_2871  3862.148       1.912250 0.08437678 22.66322
## 10 gene-CDR20291_2017  5272.620       1.784124 0.08208607 21.73479
##           pvalue          padj
## 1  8.448830e-180 1.076924e-177
## 2  9.148740e-175 1.053286e-172
## 3  2.444054e-164 2.643282e-162
## 4  1.102039e-148 9.832939e-147
## 5  2.329704e-147 2.027979e-145
## 6  4.856404e-142 3.939206e-140
## 7  3.223217e-126 2.170502e-124
## 8  4.136645e-118 2.636372e-116
## 9  1.033340e-113 6.250829e-112
## 10 9.621927e-105 5.283178e-103

# negative
res |> 
  filter(log2FoldChange < 0) |>
  arrange(padj, log2FoldChange) |>
  head(10)

##                   row   baseMean log2FoldChange      lfcSE      stat pvalue
## 1  gene-CDR20291_3145 195320.938      -8.425071 0.10432907 -80.75478      0
## 2  gene-CDR20291_2142  14441.266      -5.528640 0.08551642 -64.65004      0
## 3  gene-CDR20291_1557  21279.449      -4.484919 0.11271527 -39.78981      0
## 4  gene-CDR20291_0332  11550.169      -4.080126 0.08702073 -46.88683      0
## 5  gene-CDR20291_0331   4196.984      -3.907433 0.09507966 -41.09641      0
## 6  gene-CDR20291_2206   6931.606      -3.821015 0.08700146 -43.91897      0
## 7  gene-CDR20291_1078  28192.247      -3.531056 0.08204064 -43.04033      0
## 8  gene-CDR20291_2237   6355.265      -3.491135 0.08384686 -41.63704      0
## 9  gene-CDR20291_0330   6786.174      -3.481683 0.08283612 -42.03097      0
## 10 gene-CDR20291_2238   6034.666      -3.430734 0.08338778 -41.14193      0
##    padj
## 1     0
## 2     0
## 3     0
## 4     0
## 5     0
## 6     0
## 7     0
## 8     0
## 9     0
## 10    0

Interesting thing about the pvalue, we have to look instead to adjust Pval because we make multiple comparison. Pasvalue is via Benjamini-Hochberg method. This adjusts the multitude of comparison.

Volcanic plot

library(ggplot2)
library(ggrepel)

# Define specific genes to label
genes_to_label <- c("gene-CDR20291_1626", "gene-CDR20291_0508", "gene-CDR20291_1446", 
                    "gene-CDR20291_2495", "gene-CDR20291_0455", "gene-CDR20291_0876",
                    "gene-CDR20291_0877", "gene-CDR20291_1275", "gene-CDR20291_0875",
                    "gene-CDR20291_3145")

# Create labeled volcano plot
res |>
  filter(!is.na(padj)) |>
  mutate(label = ifelse(row %in% genes_to_label, str_extract(row, "\\d+$"), "")) |>
  mutate(color = case_when(
    log2FoldChange < -1 & padj <= 0.01 ~ "negative",
    log2FoldChange > 1 & padj <= 0.01 ~ "positive",
    TRUE ~ "neutral"
  )) |>
  ggplot(aes(x = log2FoldChange, y = -log10(padj), color = color)) +
  geom_point(alpha = 0.2) +
  geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
  geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
  geom_text_repel(aes(label = label), 
                  size = 5, 
                  # box.padding = 0.3,
                  max.overlaps = 20) +
  labs(title = "Volcano Plot - Mucus vs Control",
       x = "Log2 Fold Change", 
       y = "-Log10 Adjusted P-value") +
  scale_color_manual(values = c("positive" = "red", "negative" = "blue", "neutral" = "grey")) +
  theme_bw() +
  theme(legend.position = "none")