Ever wondered how DNA alignment actually works under the hood? 🧬 We re-coded the Needleman-Wunsch algorithm from scratch, walking through the scoring matrix by hand with a simple example ("CAT" versus "CT") before testing on real E. coli sequences. Quite neat to see the magic happen! ✨
Motivations
We have previously looked at AMR gene detection and a phylogenetic analysis workflow. Because we like to look under the hood to see how things work, let's see today how alignment really works.
Disclaimer:
I am not a bioinformatician and do not work directly with genes. This article and the method presented are my attempt to better understand what happens behind alignment. Please take this with a grain of salt and verify the information presented. If you notice an error in this article, please let me know so that I can learn! The code was also not run while knitting the R Markdown, because it added a significant delay to every knit, but the results published here should be reproducible. Please let me know if they are not.
Objectives
What is alignment?
Sequence alignment is a fundamental technique in bioinformatics that arranges two or more biological sequences (such as DNA, RNA or protein sequences) to identify regions of similarity and difference. By introducing gaps, shown as dashes, algorithms place corresponding characters from different sequences in the same columns, revealing evolutionary relationships, functional similarities and structural conservation. For example, aligning "CCCCGGGGGG" with "CCCGGG" can show that the shorter sequence is missing certain bases but shares a common core with the longer one.
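To make the idea of "same columns" concrete, here is a minimal sketch in base R. The aligned strings below are one possible alignment written by hand for illustration, not the output of any tool, and `compare_columns` is a hypothetical helper name.

```r
# Two sequences written as an alignment: equal length, "-" marks a gap.
# This particular alignment is hand-written for illustration only.
aligned1 <- "CCCCGGGGGG"
aligned2 <- "-CCCGGG---"

# Hypothetical helper: classify every column of a pairwise alignment.
compare_columns <- function(a, b) {
  x <- strsplit(a, "")[[1]]
  y <- strsplit(b, "")[[1]]
  stopifnot(length(x) == length(y))  # aligned strings must have equal length
  ifelse(x == "-" | y == "-", "gap",
         ifelse(x == y, "match", "mismatch"))
}

cols <- compare_columns(aligned1, aligned2)
table(cols)
```

In this hand-written alignment, 6 columns are matches and 4 columns contain a gap, which is the "common core plus missing bases" picture described above.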
Which methods can be used for alignment?
Sequence alignment methods include dynamic programming algorithms such as Needleman-Wunsch (global) and Smith-Waterman (local), which give optimal but slow results; fast heuristic methods such as BLAST and FASTA for database searches; progressive approaches such as ClustalW for multiple sequences; and specialized techniques, including dot-matrix visualization, structural methods and hidden Markov models.
source
A global alignment method such as Needleman-Wunsch should be a good one for learning how dynamic programming works. This YouTube tutorial by Professor Hendrix is very useful for building the scoring matrix, the traceback matrix and the optimal alignment.
Let’s go through an example
What we will do here is go through a simple example by hand and then write code that reproduces it. It would not surprise me if some of this is confusing, because it is; it took me a few days of working on paper to follow the procedure. If you find it confusing, have a look at the YouTube tutorial by Professor Hendrix. I thought it was very helpful.
Let's set up an example. We have 2 very simple sequences, and keeping the end result in mind will be useful.
CAT
C-T
As you can see above, we have 2 sequences. The first row is CAT and the second row is CT. The - is what we call a gap. If the algorithm works, we should see dynamic programming do its magic and automatically insert a gap into our second sequence.
Step 1: Set scores
Match = +2
Mismatch = -1
Gap = -2
What the above means is that when we compare our first and second sequences element by element, we use these scores to fill in the score matrix below. For example, if it is a match (e.g. A-A), we add +2 to the diagonal score. If it is a mismatch, we add -1 to the diagonal score. For the scores above and to the left, we add -2 (the gap penalty). Confusing? I'm with you. See the example below for a more visual walkthrough.
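The rule above can be sketched as a small base R helper. The function name `cell_candidates` and its arguments are made up for illustration; the neighbour values in the usage example are arbitrary numbers chosen to show a mismatch case.

```r
# Scores from Step 1
match_score    <- 2
mismatch_score <- -1
gap_penalty    <- -2

# Hypothetical helper: the three candidate scores for one cell, given the
# already-filled neighbours and the two characters being compared.
cell_candidates <- function(diag, up, left, a, b) {
  s <- if (a == b) match_score else mismatch_score
  c(diag = diag + s,
    up   = up + gap_penalty,
    left = left + gap_penalty)
}

# A cell comparing A against T (a mismatch) whose neighbours are
# diag = 2, up = 0 and left = 0
cand <- cell_candidates(diag = 2, up = 0, left = 0, a = "A", b = "T")
cand       # diag = 1, up = -2, left = -2
max(cand)  # 1, the value that goes into the cell
```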
Step 2: Calculate the score matrix
Because we have the end in mind, this is the score matrix we should end up with for our sequences above.
![]()
But of course, when we first start, we will have an empty matrix, like so.
![]()
Let's look at the steps one at a time, like so.
![]()
Note that matrix[1,1] is always 0, and we name that column and row ".". Then we fill matrix[1,1:3] with a cumulative -2. The same goes for matrix[1:4,1]. Then we start filling our first score, matrix[2,2], by taking the maximum of 3 calculations (diagonal, up, left); see the picture above for the actual calculation. To be explicit, for matrix[2,2] we first assess whether the column name and row name of matrix[2,2] are a match or a mismatch. What do you think? matrix[2,2] is C-C, hence a match! For the diagonal (meaning matrix[1,1]), we take the matrix[1,1] score and add +2 (0 + 2 = 2). For the cell above (up), matrix[1,2], we add -2 (-2 - 2 = -4). The same applies to the left, matrix[2,1]: we add -2, because these are gaps. One thing that really confused me was these ups and lefts. But don't forget that we always take the neighbouring score and add our gap penalty to it (in our case -2). After all these calculations, the maximum of these 3 (diag, up, left) is 2, which comes from diag. Therefore matrix[2,2] is 2. Then we move on to the next cell, matrix[3,2], and so on.
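The border initialization and the first inner cell described above can be sketched in base R as follows. This is a minimal illustration under the Step 1 scores; variable names such as `row_names` and `gap` are assumptions made for this sketch.

```r
seq1 <- "CAT"
seq2 <- "CT"
gap  <- -2  # gap penalty from Step 1

# "." labels the extra first row and column
row_names <- c(".", strsplit(seq1, "")[[1]])
col_names <- c(".", strsplit(seq2, "")[[1]])

mat <- matrix(NA_real_, nrow = length(row_names), ncol = length(col_names),
              dimnames = list(row_names, col_names))

# First row and first column: cumulative gap penalties
mat[1, ] <- gap * (seq_along(col_names) - 1)  # 0, -2, -4
mat[, 1] <- gap * (seq_along(row_names) - 1)  # 0, -2, -4, -6

# First inner cell, C against C (a match, so +2 on the diagonal)
mat[2, 2] <- max(mat[1, 1] + 2,    # diagonal
                 mat[1, 2] + gap,  # up
                 mat[2, 1] + gap)  # left
mat
```

The remaining `NA` cells are filled the same way, one cell at a time, once their diagonal, upper and left neighbours are known.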
Code:
library(stringr)  # for str_split_1()

# the two example sequences from above
seq1 <- "CAT"
seq2 <- "CT"

nrow <- nchar(seq1) + 1
ncol <- nchar(seq2) + 1
mat <- matrix(nrow = nrow, ncol = ncol,
              dimnames = list(c(".", str_split_1(seq1, "")),
                              c(".", str_split_1(seq2, ""))))
# gap penalty: first row and first column are cumulative -2
mat[1, 1:ncol] <- seq(0, -2 * (ncol - 1), -2)
mat[1:nrow, 1] <- seq(0, -2 * (nrow - 1), -2)
# calc score
colname <- colnames(mat)
rowname <- rownames(mat)
row_list <- 2:nrow
col_list <- 2:ncol
for (i in row_list) {
  for (j in col_list) {
    same <- rowname[i] == colname[j]
    ## calc diag: +2 for a match, -1 for a mismatch
    if (same) {
      score_diag <- mat[i - 1, j - 1] + 2
    } else {
      score_diag <- mat[i - 1, j - 1] - 1
    }
    ## calc up
    score_up <- mat[i - 1, j] - 2
    ## calc left
    score_left <- mat[i, j - 1] - 2
    ## keep the best of the three
    score_i <- max(score_diag, score_up, score_left)
    mat[i, j] <- score_i
    # print(mat)
  }
}
Step 3: Traceback to get the optimal alignment
Okay, our final score matrix looks like this.
![]()
We start at the bottom right and then follow the maximum scores, like so.
![]()
So we first start at the bottom right. First we check whether the column name and row name are the same; in this case they are, which is why we align them as T-T. Then we look at diag, up and left and see which is the maximum. In this case, 1 is the highest, which is why we move up. When we move up, it means the row name is kept while we put a gap in the column-name sequence of our alignment. This took me a while to get the intuition for, but note that the row index decreases while the column index stays the same. That is why we add a gap to our column-name sequence. Then we do the same again: the maximum is the diagonal move to matrix[2,2], and because the column name and the row name are the same, C-C, we align them as such.
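The two moves described above can be checked with a few lines of base R. The matrix below is typed in by hand from the Step 1 scores applied to CAT (rows) versus CT (columns); the helper name `neighbours_of` is made up for this sketch.

```r
# Final score matrix for CAT (rows) vs CT (columns)
mat <- matrix(c( 0, -2, -4,
                -2,  2,  0,
                -4,  0,  1,
                -6, -2,  2),
              nrow = 4, byrow = TRUE,
              dimnames = list(c(".", "C", "A", "T"), c(".", "C", "T")))

# Hypothetical helper: the diag, up and left neighbours of cell [i, j]
neighbours_of <- function(mat, i, j) {
  c(diag = mat[i - 1, j - 1], up = mat[i - 1, j], left = mat[i, j - 1])
}

# Bottom-right cell [4, 3]: the largest neighbour is "up" (score 1)
step1 <- neighbours_of(mat, 4, 3)
names(which.max(step1))  # "up"

# After moving up to [3, 3]: the largest neighbour is "diag" (score 2)
step2 <- neighbours_of(mat, 3, 3)
names(which.max(step2))  # "diag"
```

Note that `which.max` returns the first maximum when there is a tie, whereas the function in this article picks one of the tied moves at random.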
There you go! Those are the basic principles of the Needleman-Wunsch algorithm! 🙌
Code:
library(tidyverse)
library(Biostrings)

ken_tedious_alignment <- function(seq1, seq2) {
  ## build the score matrix
  {
    nrow <- nchar(seq1) + 1
    ncol <- nchar(seq2) + 1
    mat <- matrix(nrow = nrow, ncol = ncol,
                  dimnames = list(c(".", str_split_1(seq1, "")),
                                  c(".", str_split_1(seq2, ""))))
    # gap penalty
    mat[1, 1:ncol] <- seq(0, -2 * (ncol - 1), -2)
    mat[1:nrow, 1] <- seq(0, -2 * (nrow - 1), -2)
    # calc score
    colname <- colnames(mat)
    rowname <- rownames(mat)
    row_list <- 2:nrow
    col_list <- 2:ncol
    for (i in row_list) {
      for (j in col_list) {
        same <- rowname[i] == colname[j]
        ## calc diag
        if (same) {
          score_diag <- mat[i - 1, j - 1] + 2
        } else {
          score_diag <- mat[i - 1, j - 1] - 1
        }
        ## calc up
        score_up <- mat[i - 1, j] - 2
        ## calc left
        score_left <- mat[i, j - 1] - 2
        score_i <- max(score_diag, score_up, score_left)
        mat[i, j] <- score_i
        # print(mat)
      }
    }
  }

  ## now walk back - start where?
  {
    row_i <- nrow
    col_i <- ncol
    seq_instruct <- c()
    same <- FALSE
    for (repeat_i in 1:(nrow - 1)) {
      # print(paste("row: ",rowname[row_i]," ",row_i, " col: ",colname[col_i]," ",col_i))
      same <- colname[col_i] == rowname[row_i]
      # print(same)
      if (is.null(seq_instruct)) {
        if (same) {
          seq_instruct <- c(seq_instruct, "S")
          next
        }
        if (!same && (nrow > ncol)) {
          seq_instruct <- c(seq_instruct, "A")
        }
        if (!same && (ncol > nrow)) {
          seq_instruct <- c(seq_instruct, "-")
          # print("ncol>nrow")
        }
        # print("this should only show once")
      }

      ## find max
      diag <- mat[row_i - 1, col_i - 1] # 1
      up <- mat[row_i - 1, col_i]       # 2
      left <- mat[row_i, col_i - 1]     # 3
      # print(seq_instruct)
      max_score <- max(diag, up, left)
      max_score_location <- which(c(diag, up, left) == max_score)
      # print(max_score_location)
      if (length(max_score_location) > 1) {
        max_score_location <- sample(max_score_location, 1)
      }

      ## match and find location for new start
      if (max_score_location == 1) {
        row_i <- row_i - 1
        col_i <- col_i - 1
        # seq_instruct <- c(seq_instruct, "S")
      }
      if (max_score_location == 2) {
        row_i <- row_i - 1
        # seq_instruct <- c(seq_instruct, "A")
      }
      if (max_score_location == 3) {
        col_i <- col_i - 1
        # seq_instruct <- c(seq_instruct, "-")
      }
      if (repeat_i != 1) {
        if (max_score_location == 1) { seq_instruct <- c(seq_instruct, "S") }
        if (max_score_location == 2) { seq_instruct <- c(seq_instruct, "A") }
        if (max_score_location == 3) { seq_instruct <- c(seq_instruct, "-") }
      }
      # print(seq_instruct)
    }
    print(mat)
  }

  ## turn instruction back to sequence
  {
    # seq_instruct
    # max_length <- max(nrow-1,ncol-1)
    seq1_align <- seq2_align <- vector(mode = "character", length = length(seq_instruct))
    max_i <- length(seq_instruct)
    i <- nrow
    j <- ncol
    for (count in 1:length(seq_instruct)) {
      if (seq_instruct[count] == "S") {
        seq1_align[max_i] <- rowname[i]
        seq2_align[max_i] <- colname[j]
        i <- i - 1
        j <- j - 1
      }
      if (seq_instruct[count] == "A") {
        seq1_align[max_i] <- rowname[i]
        seq2_align[max_i] <- "-"
        i <- i - 1
      }
      if (seq_instruct[count] == "-") {
        seq1_align[max_i] <- "-"
        seq2_align[max_i] <- colname[j]
        j <- j - 1
      }
      max_i <- max_i - 1
    }
  }

  tryCatch(
    expr = {
      print(DNAString(seq1_align |> paste(collapse = "")))
      print(DNAString(seq2_align |> paste(collapse = "")))
      return(list(seq1_align, seq2_align))
    },
    error = function(e) {
      print(seq1_align)
      print(seq2_align)
    }
  )
}
## Some examples, uncomment to try
# seq1 <- "dogcathorse"
# seq2 <- "dogcthrs"
#
# seq1 <- "CCAGCCAGGACTACGTAAGTCA"
# seq2 <- "CCGCGGACTCGTATCA"
# ken_tedious_alignment(seq1,seq2)
Let’s try our code with more sequences
seq1 <- "CCAGCCAGGACTACGTAAGTCA"
seq2 <- "CCGCGGACTCGTATCA"
ken_tedious_alignment(seq1,seq2)
![]()
The first example, with longer sequences. Not bad. It seems to work.
seq1 <- "AAATCCATATGCCACAGA"
seq2 <- "AATTCGATCCATATATTTGCCAAATTCCAGA"
ken_tedious_alignment(seq1,seq2)
![]()
Second example: this time we made seq1 the shorter one. Not bad either! It works!
seq1 <- "GATATAGCGGGTTTAACCGTTAAA"
seq2 <- "GATATAGCGGGTTTAACCGTT"
ken_tedious_alignment(seq1,seq2)
![]()
Third example: we want to make sure that we can have gaps at the end. Yes, it works too!
seq1 <- "dogcathorse"
seq2 <- "dgcthrs"
ken_tedious_alignment(seq1,seq2)
![]()
Let's just try regular words and see if it works. Yes, it does!
Now, here is an example that does not quite work, which might be a good opportunity to debug.
seq1 <- "GATATAGCGGGTTTAACCGTTAAA"
seq2 <- "AGCGGGTTTAACCGTT"
ken_tedious_alignment(seq1,seq2)
![]()
Yes, that looks really wrong. 😑 It could be a good one to debug another time! Because the sequences above were just letters typed in at random, let's check 2 E. coli 16S rRNA sequences.
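One possible starting point for that debugging, offered as a sketch rather than a fix to the function above: the textbook Needleman-Wunsch traceback does not move to the largest neighbour. Instead it moves to the neighbour from which the current cell's score was actually computed, and it keeps going until it reaches the top-left corner. The function `nw_align` below is a hypothetical base R helper written for this illustration; it breaks ties by preferring diagonal, then up, then left, which is an arbitrary choice.

```r
# Hypothetical helper, not the article's function: global alignment with a
# traceback that follows the move each cell's score was computed from.
nw_align <- function(seq1, seq2, match = 2, mismatch = -1, gap = -2) {
  a <- strsplit(seq1, "")[[1]]
  b <- strsplit(seq2, "")[[1]]
  n <- length(a)
  m <- length(b)

  # Fill the score matrix (rows = seq1, columns = seq2), as in Step 2.
  # Cell [i + 1, j + 1] compares a[i] with b[j].
  mat <- matrix(0, nrow = n + 1, ncol = m + 1)
  mat[1, ] <- gap * (0:m)
  mat[, 1] <- gap * (0:n)
  for (i in seq_len(n)) {
    for (j in seq_len(m)) {
      s <- if (a[i] == b[j]) match else mismatch
      mat[i + 1, j + 1] <- max(mat[i, j] + s,        # diagonal
                               mat[i, j + 1] + gap,  # up
                               mat[i + 1, j] + gap)  # left
    }
  }

  # Traceback from the bottom-right corner until the top-left corner
  i <- n
  j <- m
  out1 <- character(0)
  out2 <- character(0)
  while (i > 0 || j > 0) {
    s <- if (i > 0 && j > 0 && a[i] == b[j]) match else mismatch
    if (i > 0 && j > 0 && mat[i + 1, j + 1] == mat[i, j] + s) {
      # diagonal: align both characters
      out1 <- c(a[i], out1)
      out2 <- c(b[j], out2)
      i <- i - 1
      j <- j - 1
    } else if (i > 0 && mat[i + 1, j + 1] == mat[i, j + 1] + gap) {
      # up: keep the seq1 character, gap in seq2
      out1 <- c(a[i], out1)
      out2 <- c("-", out2)
      i <- i - 1
    } else {
      # left: gap in seq1, keep the seq2 character
      out1 <- c("-", out1)
      out2 <- c(b[j], out2)
      j <- j - 1
    }
  }
  list(score = mat[n + 1, m + 1],
       seq1 = paste(out1, collapse = ""),
       seq2 = paste(out2, collapse = ""))
}

small <- nw_align("CAT", "CT")
small$seq1  # "CAT"
small$seq2  # "C-T"

hard <- nw_align("GATATAGCGGGTTTAACCGTTAAA", "AGCGGGTTTAACCGTT")
hard$score
```

For the small example this reproduces the CAT / C-T alignment from the walkthrough. For the harder pair, several alignments share the best score, so the exact output depends on the tie-breaking order chosen here.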
seq1 <- "TTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGAAAGCAGCTTGCTGCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTTGTTGGTGGGGTAACGGCTCACCAAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTACTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGCCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTT"
seq2 <- "TTGAAGAGTTTGATCATGGCTCAGATTGAACGCTGGCGGCAGGCCTAACACATGCAAGTCGAACGGTAACAGGAAGAAGCTTGCTTCTTTGCTGACGAGTGGCGGACGGGTGAGTAATGTCTGGGAAACTGCCTGATGGAGGGGGATAACTACTGGAAACGGTAGCTAATACCGCATAACGTCGCAAGACCAAAGAGGGGGACCTTCGGGCCTCTTGCCATCGGATGTGCCCAGATGGGATTAGCTAGTAGGTGGGGTAACGGCTCACCTAGGCGACGATCCCTAGCTGGTCTGAGAGGATGACCAGCCACACTGGAACTGAGACACGGTCCAGACTCCTACGGGAGGCAGCAGTGGGGAATATTGCACAATGGGCGCAAGCCTGATGCAGCCATGCCGCGTGTATGAAGAAGGCCTTCGGGTTGTAAAGTACTTTCAGCGGGGAGGAAGGGAGTAAAGTTAATACCTTTACTCATTGACGTTACCCGCAGAAGAAGCACCGGCTAACTCCGTGCCAGCAGCCGCGGTAATACGGAGGGTGCAAGCGTTAATCGGAATTACTGGGCGTAAAGCGCACGCAGGCGGTTTGTTAAGTCAGATGTGAAATCCCCGGGCTCAACCTGGGAACTGCATCTGATACTGGCAAGCTTGAGTCTCGTAGAGGGGGGTAGAATTCCAGGTGTAGCGGTGAAATGCGTAGAGATCTGGAGGAATACCGGTGGCGAAGGCGGCCCCCTGGACGAAGACTGACGCTCAGGTGCGAAAGCGTGGGGAGCAAACAGGATTAGATACCCTGGTAGTCCACGCCGTAAACGATGTCGACTTGGAGGTTGTGCCCTTGAGGCGTGGCTTCCGGAGCTAACGCGTTAAGTCGACCGCCTGGGGAGTACGGCCGCAAGGTTAAAACTCAAATGAATTGACGGGGGCCCGCACAAGCGGTGGAGCATGTGGTTTAATTCGATGCAACGCGAAGAACCTTACCTGGTCTTGACATCCACGGAAGTTTTCAGAGATGAGAATGTGCCTTCGGGAACCGTGAGACAGGTGCTGCATGGCTGTCGTCAGCTCGTGTTGTGAAATGTTGGGTTAAGTCCCGCAACGAGCGCAACCCTTATCCTTTGTTGCCAGCGGTCCGGCCGGGAACTCAAAGGAGACTGCCAGTGATAAACTGGAGGAAGGTGGGGATGACGTCAAGTCATCATGGCCCTTACGACCAGGGCTACACACGTGCTACAATGGCGCATACAAAGAGAAGCGACCTCGCGAGAGCAAGCGGACCTCATAAAGTGCGTCGTAGTCCGGATTGGAGTCTGCAACTCGACTCCATGAAGTCGGAATCGCTAGTAATCGTGGATCAGAATGCCACGGTGAATACGTTCCCGGGCCTTGTACACACCGCCCGTCACACCATGGGAGTGGGTTGCAAAAGAAGTAGGTAGCTTAACCTTCGGGAGGGCGCTTACCACTTTGTGATTCATGACTGGGGTGAAGTCGTAACAAGGTAACCGTAGGGGAACCTGCGGTTGGATCACCTCCTT"
seq_list <- ken_tedious_alignment(seq1,seq2)
idx <- which((seq1 |> str_split_1("")) != (seq2 |> str_split_1("")))
for (i in idx) {
  print(DNAString(paste(seq_list[[1]][(i-3):(i+3)], collapse = "")))
  print(DNAString(paste(seq_list[[2]][(i-3):(i+3)], collapse = "")))
  cat("------------------\n\n")
}
![]()
![]()
The above are the differences between the 2 strains and how our algorithm aligned them. Pretty neat!
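The loop above finds differing positions by comparing the raw, unaligned sequences, which assumes both have the same length and that no difference sits within 3 positions of either end. A slightly more defensive variant, sketched here in base R with made-up helper names (`diff_columns`, `show_diffs`) and a toy input, compares the aligned columns directly and clamps the window to the ends of the alignment.

```r
# Hypothetical helper: columns where two aligned sequences differ.
# Expects two character vectors of equal length, one element per column.
diff_columns <- function(aln1, aln2) {
  stopifnot(length(aln1) == length(aln2))
  which(aln1 != aln2)
}

# Hypothetical helper: print a window of `flank` columns around each
# difference, clamped so that it never indexes outside the alignment.
show_diffs <- function(aln1, aln2, flank = 3) {
  for (i in diff_columns(aln1, aln2)) {
    window <- max(1, i - flank):min(length(aln1), i + flank)
    cat(paste(aln1[window], collapse = ""), "\n")
    cat(paste(aln2[window], collapse = ""), "\n")
    cat("------------------\n")
  }
}

# Toy aligned pair, made up for illustration
x <- strsplit("ACGTTGCA", "")[[1]]
y <- strsplit("ACGATG-A", "")[[1]]
diff_columns(x, y)  # 4 7
show_diffs(x, y)
```

With the output of `ken_tedious_alignment`, the two list elements would play the role of `x` and `y`.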
Final thoughts
Wow, it was quite worthwhile to go through the dynamic programming procedure. It was tedious, but worth it. I am curious which other problems would benefit from this method. 🤔 Let me know if you have used this in a field outside bioinformatics!
Opportunities for improvement
- Learn more about local alignment, clustering, BLAST, etc.
- Learn how to use DECIPHER to align
- Learn how to make this more efficient
- The next stop would be the distance calculation after alignment, and then growing a tree 🌲
Lessons learned
- Learned the basics of DNA sequence alignment
- Re-coding the algorithm from scratch really helped me understand the procedure


