Learned RNA-SEQ-Workflow using C. Difficile data from a published study 🧬. RAW processes by Fastp → Kallisto → DESEQ2 pipeline. Results corresponded to the findings of the original article, with clear differential expression between mucus and control conditions 📊.
Motivations:
To be honest, I don’t really know what RNA-Seq was until I learned more about it and the potential of it! We have viewed in our earlier learning processes
Learn antimicrobial resistance (AMR) genes with bioconductor”
Phylogenetic analysis”
Building DNA sequence”
Assembly DNA sequence, learning blazing and MLST Learning”
Long series of on -work flow exploring. Note that this is all related to DNA. This article, we will explore the world of RNA! In my simplistic vision, RNA-SEQ enables us to explore and discover transcripts of certain conditions, whether it can tell us a little more of the gene expression based on certain conditions. The potential is huge! Imagine that you can see the difference between colonization versus a real infection in a clinical setting, I wonder if differential expression analysis of transcriptoma can offer us some more information! Differentiating infection versus contamination (more accurate hai definition 🤷 agrees). Let’s dive into the shallow pool of the unknown world of RNA-SEQ and at least learn the basics! Let’s go!
Game plan:
Let’s take a look Clostridioides difficile And see if we can find raw CDNA sequences on NCBI. After some searching I thought this
Clostridioides Difficile-Mucus Interactions include shifts in gene expression, metabolism and biofilm formation That can potentially be a good one to look at and see if we can somewhat reproduce what has been found.
Safeguard:
I am not a biostatistician, nor do I work in a laboratory. I try to understand the Bioinformatics workflow of an RNA-SEQ. If you find information here that is displayed here, let me know so that I can learn. Also check the information presented here. This is a documentation of my learning process and a comment for my future self to reproduce the workflow that I have investigated
Objectives:
What is RNA-SEQ?
RNA-SEQ, or RNA sequencing, is a powerful technique that is used to analyze the transcriptom of a cell or organism. The transcriptom refers to the full set of RNA molecules, including Messenger-RNA (MRNA), ribosomal RNA (RRNA), transfer RNA (TRNA) and non-coding RNAs, which are present in a cell at a specific time. RNA-SEQ enables researchers to study gene expression patterns, to identify differential genes made and to discover new transcripts.
Read more about Wiki. It is usually a challenge to sequence from RNA itself because of its instability and sensitivity to breakdown. That is why RNA is typically converted into complementary DNA (CDNA) with the help of a process called reverse transcription before sequencing. This CDNA is then used as the input for sequencing platforms. For a short RNA sequence, nowadays with ONT, it does offer flow cells that RNA can immediately sequence can sequence without the need for conversion to CDNA.
Let’s look at existing data
Let’s dive into the
bioproject. To download the RAW series, we have to go through different clicks. And also understand what some of these abbreviations means.
Sequence read archive (SRA): SRR – Individual sequencing runs (unprocessed data files). SRX – Experiments (groups of related runs). SRS – Samples (organic samples).SRP – projects/studies (collections of related experiments).
Genexpression for everyone (GEO): GSM – Individual samples with expression data.GSE – Series/data sets (collections of related examples).GPL – Platforms (descriptions of microarray or sequencing platform). GDS – Curated data sets (processed GSE data).
Assembly database: ASM – Genoomsemblies. GCA – Genbank -AsMblies. GCF – Refseq – assemblies.
Core sequence databases: NC_ – Refseq -chromosomes/complete taken. NM_ – Refseq mrna sequences. NP_ – Refseq -protein sequences. XM_/XP_ – Model/predicted Refseq sequences. AC_ – Genbank completed genomic sequences.
Bioproject & Biosample: PRJNA – Bioproject – accessions (research projects). SAMN – Biosample accessions (organic samples).
Okay, with that from the way, we are interested in SRRthat must be under SRA. We can see on the Bio project page that there is SRA experiments. Click on that and it takes you to a page of SRXS. Because they are very large files, in terms of performances, we cannot simply download from the website, we need sra-tools
## Install follow this link instruction ## https://github.com/ncbi/sra-tools/wiki/02.-Installing-SRA-Toolkit ## download SRRs example fasterq-dump --progress --threads 4 SRR27792604
Now repeat the above for the other SRRs And you get all the rough sequences fastq Format. With our practice we download alone SRR27792597“SRR27792603“SRR27792598AndSRR27792604. Two slime and two controls. It will take a while, that’s why I think the --progress parameter. Okay, after we have downloaded the sequences, let’s continue with our official workflow.
files <- c("kallisto_SRR27792597/abundance.h5",
"kallisto_SRR27792603/abundance.h5",
"kallisto_SRR27792598/abundance.h5",
"kallisto_SRR27792604/abundance.h5")
The workflow
Tl; Dr Quality Control (FASTP) -> Create Transcriptome Reference (Kallisto) -> Together Raw Order (Kallisto) -> Analyze (PCA & DESEQ2)
QC read from raw
# install fastp # run QC fastp -i SRR27792597_1.fastq -h SRR27792597_1.html --verbose --stdout --thread 8 > /dev/null
If you are, like me, you like to know what’s going on while running, make sure you record --verbose And I found --thread Is very helpful, otherwise it would take somewhere. Repeat for everyone fastq Files and inspect.
What is considered acceptable?
I will be honest, I’m not sure. 🤔 Read quality statistics such as Quality Scores (Phred -Scores): Q30+:> 80% is good,> 90% is excellent, Q20+: must be> 95%. Total Reads: Depends on our goals: Differential expression: 10-30m reads per example, while new transcripts coverage: 50-100m+ reads. Contamination and Artifacts: Adapter content: <5%: goed, 5-20%: matig (moet trimmen),> 20%: High contamination. Duplication Rate: <30%: laag (genomisch DNA-achtig), 30-70%: normaal voor RNA-seq,> 70%: Possible problematic. Composition of the series:
GC Content: Must correspond to the expected organism: people/mouse: ~ 42%, E. coli: ~ 50%, yeast: ~ 38%; Quality per base: must remain> Q28 about the reading length. Red flags to pay attention to poor quality data: <10m reads for differential expression <80% q20 bases, severe 3 'decrease in quality (under Q20), unusual GC content peaks, 20% adapter pollution after cropping.
Our QC report is not included, but a quick look, it looks pretty good!
Getting a reference from transcriptom
On the supplement of the article we found that they used FN545816.1 As a reference tree. We can download them from
here. Click download and include gff also.
# Install gffread if you don't have it brew install gffread # on macOS # install kallisto # go here https://pachterlab.github.io/kallisto/download.html for info # extract transcriptome sequences gffread -w transcriptome.fa -g GCA_000027105.1_ASM2710v1_genomic.fna genomic.gff kallisto index -i transcriptome.idx transcriptome.fa
You should have transcriptome.idx And transcriptome.fa In your current workbook. What this does is that it removes the transcripta sequences from the genome based on the gff file, in our case it is a specific
Clostridioides difficile R20291 Meeting with accession FN545816.1. The gff File contains information about the locations of genes and other functions on the genome. The -w flag specifies the output file for the transcripta sequences and the -g Vlag indicates the input genoma sequence file. The resulting transcriptome.fa File contains the nucleotid checks of all transcriptions annotated in the gff file.
Now that we have the reference. Let us compile our raw order into something readable.
To assemble
# assemble, make sure to include all the other samples kallisto quant -i transcriptome.idx \ -o kallisto_SRR27792604 \ --threads 8 \ SRR27792604_1.fastq SRR27792604_2.fastq
We should go through all the unprocessed sequences that we have downloaded and mounted. We then see folders like kallisto_SRR27792604. And we see in the folder abundance.h5 And abundance.tsv. The reason I prefer kallisto over STAR Is mainly because of the speed. When I tried Star for the first time, it was very slow and not a progress bar. The output of these files is actually quite small. Let’s take a look!
library(tximport)
library(DESeq2)
# Create file paths for all samples
files <- c("kallisto_SRR27792597/abundance.h5",
"kallisto_SRR27792603/abundance.h5",
"kallisto_SRR27792598/abundance.h5",
"kallisto_SRR27792604/abundance.h5")
# Name them
names(files) <- c("mucus_rep1", "control_rep1", "mucus_rep2", "control_rep2")
# Import data
txi <- tximport(files, type = "kallisto", txOut = TRUE)
coldata <- data.frame(
condition = c("mucus", "control", "mucus", "control"), # or however they're grouped
row.names = names(files)
)
dds <- DESeqDataSetFromTximport(txi, colData = coldata, design = ~ condition)
Just like a comment, there is another way we can make Matrix and also count tsv.
PCA
# Perform variance stabilizing transformation vsd <- vst(dds, blind = FALSE) # Plot PCA plotPCA(vsd, intgroup = "condition")
![]()
Wow, it seems that the treatment and control are quite well separated! Let us dive directly into differential expression analysis. If we look at Figure 2b about the article, it is very similar, although we do not record the full sequences.
DESEQ2
library(tidyverse) # Run DESeq2 analysis dds <- DESeq(dds) # Get results res <- results(dds, alpha = 0.01, tidy = T) # View summary summary(res) ## row baseMean log2FoldChange lfcSE ## Length:3570 Min. : 0 Min. :-8.42507 Min. :0.06850 ## Class :character 1st Qu.: 326 1st Qu.:-0.51037 1st Qu.:0.08493 ## Mode :character Median : 1313 Median : 0.00074 Median :0.10664 ## Mean : 12154 Mean :-0.01934 Mean :0.16100 ## 3rd Qu.: 4999 3rd Qu.: 0.53025 3rd Qu.:0.16896 ## Max. :3272163 Max. : 3.75721 Max. :4.98958 ## NA's :1 NA's :1 ## stat pvalue padj ## Min. :-80.75478 Min. :0.0000000 Min. :0.0000000 ## 1st Qu.: -4.39763 1st Qu.:0.0000000 1st Qu.:0.0000000 ## Median : 0.00566 Median :0.0001034 Median :0.0002068 ## Mean : -0.71728 Mean :0.1229994 Mean :0.1366817 ## 3rd Qu.: 3.44035 3rd Qu.:0.0970157 3rd Qu.:0.1293422 ## Max. : 28.59200 Max. :0.9978324 Max. :0.9978324 ## NA's :1 NA's :1 NA's :1 # positive res |> filter(log2FoldChange > 0) |> arrange(padj, desc(log2FoldChange)) |> head(10) ## row baseMean log2FoldChange lfcSE stat ## 1 gene-CDR20291_1626 2872.357 2.999390 0.10490313 28.59200 ## 2 gene-CDR20291_2014 32895.724 2.043527 0.07250627 28.18414 ## 3 gene-CDR20291_0509 54339.793 1.925179 0.07046733 27.32016 ## 4 gene-CDR20291_2174 13209.067 1.957986 0.07539631 25.96926 ## 5 gene-CDR20291_2495 24664.096 2.321013 0.08978192 25.85168 ## 6 gene-CDR20291_0508 8811.393 2.021692 0.07967500 25.37423 ## 7 gene-CDR20291_2738 9037.752 1.859512 0.07781080 23.89787 ## 8 gene-CDR20291_1446 4410.271 1.982683 0.08581234 23.10487 ## 9 gene-CDR20291_2871 3862.148 1.912250 0.08437678 22.66322 ## 10 gene-CDR20291_2017 5272.620 1.784124 0.08208607 21.73479 ## pvalue padj ## 1 8.448830e-180 1.076924e-177 ## 2 9.148740e-175 1.053286e-172 ## 3 2.444054e-164 2.643282e-162 ## 4 1.102039e-148 9.832939e-147 ## 5 2.329704e-147 2.027979e-145 ## 6 4.856404e-142 3.939206e-140 ## 7 3.223217e-126 2.170502e-124 ## 8 4.136645e-118 2.636372e-116 ## 9 1.033340e-113 6.250829e-112 ## 10 9.621927e-105 5.283178e-103 # negative res |> filter(log2FoldChange < 0) |> arrange(padj, log2FoldChange) |> head(10) ## row baseMean log2FoldChange lfcSE stat pvalue ## 1 gene-CDR20291_3145 195320.938 -8.425071 0.10432907 -80.75478 0 ## 2 gene-CDR20291_2142 14441.266 -5.528640 0.08551642 -64.65004 0 ## 3 gene-CDR20291_1557 21279.449 -4.484919 0.11271527 -39.78981 0 ## 4 gene-CDR20291_0332 11550.169 -4.080126 0.08702073 -46.88683 0 ## 5 gene-CDR20291_0331 4196.984 -3.907433 0.09507966 -41.09641 0 ## 6 gene-CDR20291_2206 6931.606 -3.821015 0.08700146 -43.91897 0 ## 7 gene-CDR20291_1078 28192.247 -3.531056 0.08204064 -43.04033 0 ## 8 gene-CDR20291_2237 6355.265 -3.491135 0.08384686 -41.63704 0 ## 9 gene-CDR20291_0330 6786.174 -3.481683 0.08283612 -42.03097 0 ## 10 gene-CDR20291_2238 6034.666 -3.430734 0.08338778 -41.14193 0 ## padj ## 1 0 ## 2 0 ## 3 0 ## 4 0 ## 5 0 ## 6 0 ## 7 0 ## 8 0 ## 9 0 ## 10 0
Interesting thing about the pvalue, we have to look instead to adjust Pval because we make multiple comparison. Pasvalue is via Benjamini-Hochberg method. This adjusts the multitude of comparison.
Volcanic plot
library(ggplot2)
library(ggrepel)
# Define specific genes to label
genes_to_label <- c("gene-CDR20291_1626", "gene-CDR20291_0508", "gene-CDR20291_1446",
"gene-CDR20291_2495", "gene-CDR20291_0455", "gene-CDR20291_0876",
"gene-CDR20291_0877", "gene-CDR20291_1275", "gene-CDR20291_0875",
"gene-CDR20291_3145")
# Create labeled volcano plot
res |>
filter(!is.na(padj)) |>
mutate(label = ifelse(row %in% genes_to_label, str_extract(row, "\\d+$"), "")) |>
mutate(color = case_when(
log2FoldChange < -1 & padj <= 0.01 ~ "negative",
log2FoldChange > 1 & padj <= 0.01 ~ "positive",
TRUE ~ "neutral"
)) |>
ggplot(aes(x = log2FoldChange, y = -log10(padj), color = color)) +
geom_point(alpha = 0.2) +
geom_vline(xintercept = c(-1, 1), linetype = "dashed") +
geom_hline(yintercept = -log10(0.01), linetype = "dashed") +
geom_text_repel(aes(label = label),
size = 5,
# box.padding = 0.3,
max.overlaps = 20) +
labs(title = "Volcano Plot - Mucus vs Control",
x = "Log2 Fold Change",
y = "-Log10 Adjusted P-value") +
scale_color_manual(values = c("positive" = "red", "negative" = "blue", "neutral" = "grey")) +
theme_bw() +
theme(legend.position = "none")
![]()
If we look at Figure 2a, it looks very similar again! To interpret the volcano plot, we in principle look at the top right and top left in the plot. These are those who are considerably differentially expressed. At the top right are the registered genes, while at the top left are the downward genes. We can see that gene-CDR20291_1626 Is the most regulated gene, while gene-CDR20291_3145 Is the most downward gene. 🙌
Looking at NCBI, looks like 1626 is one
alleged sodium/phosphate transporter [Clostridioides difficile R20291]. And 3145 is
probable. What this means is that when exposed to mucus in comparison with control, the alleged sodium/phosphate transporter brought to expression gene was found more (in the slime group), while the likely protease gene was found less (in Mucusgroep).
Opportunities for improvement
- I would probably have to rewrite the above to a script and with function, so that we can easily reproduce every analysis in the future
- involve
esearchorentrezTo gain access metadata for a more accurate label - involve
heatmap - Pathway analysis. Yes, we can analyze these genes, but what do they actually mean? They use nutrients differently etc.
Learned lessons
- learned the basis of
sra-tools fasterq-dump“fastp”kallisto“DESeq2. - It took me a while to find the right
referenceInsulate. Found in their supplement material and got the right one. For future reference, do not assume that they use the popular Refseq, go through their procedure and get it specifically. - Learned from Raw RNA-SEQ QC
- learned to interpret volcano plot
If you like this article:
Related
#Learning #exploring #workflow #RNASEQ #analysisa #remark #RBloggers


