The Needleman-Wunsch algorithm for
sequence alignment
7th Melbourne Bioinformatics Course
Vladimir Likić, Ph.D.
e-mail: vlikic@unimelb.edu.au
Bio21 Molecular Science and Biotechnology Institute
The University of Melbourne
The Needleman-Wunsch algorithm for sequence alignment – p.1/46
Outline of the talk
Sequence comparison and sequence alignment.
Types of sequence alignment.
The scoring scheme, substitution matrices, gaps.
The Needleman-Wunsch algorithm for global sequence
alignment.
The Needleman-Wunsch algorithm for sequence alignment – p.2/46
Sequence comparison
To observe patterns of conservation (or variability).
To find the common motifs present in both sequences.
To assess whether it is likely that two sequences evolved
from the same sequence.
To find out which sequences from the database are similar
to the sequence at hand.
The Needleman-Wunsch algorithm for sequence alignment – p.3/46
Two routes for sequence comparison
dotplot – visual, qualitative
sequence alignment – exact and quantitative. Involves:
1. Construction of the best alignment between the
sequences.
2. Assessment of the similarity from the alignment.
There are three different types of sequence alignment:
Global alignment
Local alignment
Multiple sequence alignment
The Needleman-Wunsch algorithm for sequence alignment – p.4/46
Global sequence alignment
The best alignment over the entire length of two
sequences
Suitable when the two sequences are of similar length,
with a significant degree of similarity throughout.
Example:
SIMILARITY
PI-LLAR---
The Needleman-Wunsch algorithm for sequence alignment – p.5/46
Local sequence alignment
Involving stretches that are shorter than the entire
sequences, possibly more than one.
Suitable when comparing substantially different sequences,
which possibly differ significantly in length, and have only
a short patches of similarity.
For example, the local alignment of SIMILARITY and
PILLAR:
MILAR
ILLAR
The Needleman-Wunsch algorithm for sequence alignment – p.6/46
Multiple sequence alignment
Simultaneous alignment of more than two sequences.
Suitable when searching for subtle conserved sequence
patterns in a protein family, and when more than two
sequences of the protein family are available.
For example:
SIMILARITY
PI-LLAR---
--MOLARITY
The Needleman-Wunsch algorithm for sequence alignment – p.7/46
Alignment ”by eye”
Consider the ”best” alignment of ATGGCGT and ATGAGT
ATGGCGT
*** !**
ATG-AGT
Intuitively we seek an alignment to maximize the number
of residue-to-residue matches.
The Needleman-Wunsch algorithm for sequence alignment – p.8/46
A mathematical framework
Sequence alignment is the establishment of
residue-to-residue correspondence between two or more
sequences such that the order of residues in each sequence
is preserved.
A gap, which indicates a residue-to-nothing match, may
be introduced in either sequence.
A gap-to-gap match is meaningless and is not allowed.
The Needleman-Wunsch algorithm for sequence alignment – p.9/46
The scoring scheme
Give two sequences we need a number to associate with
each possible alignment (i.e. the alignment score =
goodness of alignment).
The scoring scheme is a set of rules which assigns the
alignment score to any given alignment of two sequences.
1. The scoring scheme is residue based: it consists of residue
substitution scores (i.e. score for each possible residue
alignment), plus penalties for gaps.
2. The alignment score is the sum of substitution scores and
gap penalties.
The Needleman-Wunsch algorithm for sequence alignment – p.10/46
A simple scoring scheme
Use +1 as a reward for a match, -1 as the penalty
for a mismatch, and ignore gaps
The best alignment ”by eye” from before:
ATGGCGT
ATG-AGT score: +1 + 1 + 1 + 0 − 1 + 1 + 1 = 4
An alternative alignment:
ATGGCGT
A-TGAGT score: +1 + 0 − 1 + 1 − 1 + 1 + 1 = 2
The Needleman-Wunsch algorithm for sequence alignment – p.11/46
The substitution matrix
A concise way to express the residue substitution costs can
be achieved with a N × N matrix (N is 4 for DNA and 20
for proteins).
The substitution matrix for the simple scoring scheme:
C T A G
C 1 -1 -1 -1
T -1 1 -1 -1
A -1 -1 1 -1
G -1 -1 -1 1
The Needleman-Wunsch algorithm for sequence alignment – p.12/46
A better substitution matrix
A, G are purines (pyrimidine ring fused to an imidazole
ring), T, C are pyrimidines (one six-membered ring).
Assume we believe that from evolutionary standpoint
purine/pyrimidine mutations are less likely to occur
compared to purine/purine (pyrimidine/pyrimidine)
mutations. Can we capture this in a substitution matrix?
C T A G
C 2 1 -1 -1
T 1 2 -1 -1
A -1 -1 2 1
G -1 -1 1 2
The Needleman-Wunsch algorithm for sequence alignment – p.13/46
Protein substitution matrices
Protein substitution matrices are significantly more
complex than DNA scoring matrices.
Proteins are composed of twenty amino acids, and
physico-chemical properties of individual amino acids vary
considerably.
A protein substitution matrix can be based on any property
of amino acids: size, polarity, charge, hydrophobicity.
In practice the most important are evolutionary
substitution matrices.
The Needleman-Wunsch algorithm for sequence alignment – p.14/46
Evolutionary substitution matrices
PAM (”point accepted mutation”) family
PAM250, PAM120, etc.
BLOSUM (”Blocks substitution matrix”) family
BLOSUM62, BLOSUM50, etc.
The substitution scores of both PAM and BLOSUM
matrices are derived from the analysis of known
alignments of closely related proteins.
The BLOSUM matrices are newer and considered better.
The Needleman-Wunsch algorithm for sequence alignment – p.15/46
BLOSUM62 substitution matrix
The Needleman-Wunsch algorithm for sequence alignment – p.16/46
Gaps
So far we ignored gaps (amounts to gap penalty of 0)
A gap corresponds to an insertion or a deletion of a
residue
A conventional wisdom dictates that the penalty for a gap
must be several times greater than the penalty for a
mutation. That is because a gap/extra residue
Interrupts the entire polymer chain
In DNA shifts the reading frame
The Needleman-Wunsch algorithm for sequence alignment – p.17/46
Gap initiation and extension
The conventional wisdom: the creation of a new gap
should be strongly disfavored.
However, once created insertions/deletions of chunks of
more than one residue should be much less expensive (i.e.
insertion of domains often occurs).
A simple yet effective solution is affine gap penalties:
γ(n) = −o − (n − 1)e
The Needleman-Wunsch algorithm for sequence alignment – p.18/46
Affine gaps: a physical insight
Affine gaps favor the alignment:
ATGTAGTGTATAGTACATGCA
ATGTAG-------TACATGCA
Over the alignment:
ATGTAGTGTATAGTACATGCA
ATGTA--G--TA---CATGCA
Exactly what we want from the biological viewpoint.
The Needleman-Wunsch algorithm for sequence alignment – p.19/46
The alignment score with BLOSUM62
Consider two alternative alignments of ANRGDFS and
ANREFS with the gap opening penalty of 10:
ANRGDFS
ANR-EFS score: 4+6+5−10+2+6+4 = 17
ANRGDFS
ANRE-FS score: 4+6+5−2−10+6+4 = 13
The scoring scheme provides us with the quantitative
measure of how good is some alignment relative to
alternative alignments. However the scoring scheme
does not tell us how to find the best alignment.
The Needleman-Wunsch algorithm for sequence alignment – p.20/46
How do we find the best alignment?
Brute-force approach:
Generate the list all possible alignments between two
sequences, score them
Select the alignment with the best score
The number of possible global alignments between two
sequences of length N is
22N
√
πN
For two sequences of 250 residues this is ∼ 10149
The Needleman-Wunsch algorithm for sequence alignment – p.21/46
The Needleman-Wunsch algorithm
A smart way to reduce the massive number of possibilities
that need to be considered, yet still guarantees that the
best solution will be found (Saul Needleman and Christian
Wunsch, 1970).
The basic idea is to build up the best alignment by using
optimal alignments of smaller subsequences.
The Needleman-Wunsch algorithm is an example of
dynamic programming, a discipline invented by Richard
Bellman (an American mathematician) in 1953!
The Needleman-Wunsch algorithm for sequence alignment – p.22/46
How does dynamic programming work?
A divide-and-conquer strategy:
Break the problem into smaller subproblems.
Solve the smaller problems optimally.
Use the sub-problem solutions to construct an optimal
solution for the original problem.
Dynamic programming can be applied only to problems
exhibiting the properties of overlapping subproblems.
Examples include
Trevelling salesman problem
Finding the best chess move
The Needleman-Wunsch algorithm for sequence alignment – p.23/46
The mathematics
A matrix D(i, j) indexed by residues of each sequence is
built recursively, such that
D(i, j) = max





D(i − 1, j − 1) + s(xi, yj)
D(i − 1, j) + g
D(i, j − 1) + g
subject to a boundary conditions. s(i, j) is the substitution
score for residues i and j, and g is the gap penalty.
The Needleman-Wunsch algorithm for sequence alignment – p.24/46
A walk-through: an overview
We consider all possible pairs of residue from two
sequences (this gives rise to a 2D matrix representation).
We will have two matrices: the score matrix and traceback
matrix.
The Needleman-Wunsch algorithm consists of three steps:
1. Initialisation of the score matrix
2. Calculation of scores and filling the traceback matrix
3. Deducing the alignment from the traceback matrix
The Needleman-Wunsch algorithm for sequence alignment – p.25/46
Consider the simple example
The alignment of two sequences SEND and AND with the
BLOSUM62 substitution matrix and gap opening penalty
of 10 (no gap extension):
SEND
-AND score: +1
A-ND score: +3 ← the best
AN-D score: -3
AND- score: -8
The Needleman-Wunsch algorithm for sequence alignment – p.26/46
The score and traceback matrices
The cells of the score matrix are labelled C(i, j) where
i = 1, 2, ..., N and j = 1, 2, ..., M









































































































































































































































































































































































































































































































































































































S E N D
A
N
D
S E N D
C(4,1) C(4,2) C(4,3) C(4,4) C(4,5)
C(3,1) C(3,2) C(3,3) C(3,4) C(3,5)
C(1,1) C(1,2) C(1,3) C(1,4) C(1,5)
C(2,1) C(2,2) C(2,3) C(2,4) C(2,5)
The Needleman-Wunsch algorithm for sequence alignment – p.27/46
Initialization
The first row and the first column of the score and
traceback matrices are filled during the initialization.
S E N D
A
N
D
S E N D
0 −10 −20 −40
−20
−30
−10
left left left left
up
up
up
done
−30
The Needleman-Wunsch algorithm for sequence alignment – p.28/46
Scoring
The score matrix cells are filled by row starting from the
cell C(2, 2)
The score of any cell C(i, j) is the maximum of:
qdiag = C(i − 1, j − 1) + S(i, j)
qup = C(i − 1, j) + g
qleft = C(i, j − 1) + g
where S(i, j) is the substitution score for letters i and j,
and g is the gap penalty.
The Needleman-Wunsch algorithm for sequence alignment – p.29/46
Scoring– a pictorial representation
The value of the cell C(i, j) depends only on the values of
the immediately adjacent northwest diagonal, up, and left
cells:




















































































































































































C(i−1,j−1) C(i−1,j)
C(i,j−1) C(i,j)
The Needleman-Wunsch algorithm for sequence alignment – p.30/46
The Needleman-Wunsch progression
The first step is to calculate the value of C(2, 2):
S E N D
A
N
D
S E N D
0 −10 −20 −40
−20
−30
−10
left left left left
up
up
up
done
−30
? ?
The Needleman-Wunsch algorithm for sequence alignment – p.31/46
The value of C(2, 2)
The calculation for the cell C(2, 2):
qdiag = C(1, 1) + S(S, A) = 0 + 1 = 1
qup = C(1, 2) + g = −10 + (−10) = −20
qleft = C(2, 1) + g = −10 + (−10) = −20
Where C(1, 1), C(1, 2), and C(2, 1) are read from the
score matrix, and S(S, A) is the score for the S ↔ A
taken from the BLOSUM62 matrix.
The Needleman-Wunsch algorithm for sequence alignment – p.32/46
Filling the score and traceback matrices
For the score matrix C(2, 2) = qdiag which is 1. The
corresponding cell of the traceback matrix is ”diag”:
S E N D
A
N
D
S E N D
0 −10 −20 −40
−20
−30
−10
left left left left
up
up
up
done
−30
1 diag
The Needleman-Wunsch algorithm for sequence alignment – p.33/46
The progression is recursive
The next step is to calculate C(2, 3):
S E N D
A
N
D
S E N D
0 −10 −20 −40
−20
−30
−10
left left left left
up
up
up
done
−30
1 diag
? ?
The Needleman-Wunsch algorithm for sequence alignment – p.34/46
The value of C(2, 3)
The calculation for the cell C(2, 3)
qdiag = C(1, 2) + S(E, A) = −10 + −1 = −11
qup = C(1, 3) + g = −20 + (−10) = −30
qleft = C(2, 2) + g = 1 + (−10) = −9
Thus C(2, 3) = −9 and the corresponding cell of the
traceback matrix is ”left”.
The Needleman-Wunsch algorithm for sequence alignment – p.35/46
The final score and traceback matrices
After all cells are filled, the score and traceback matrices
are:
S E N D
A
N
D
S E N D
0 −10 −20 −40
−20
−30
−10
left left left left
up
up
up
done
−30
1 diag
−9 −19 −29
−9 −1 −3 −13
−19 −11 2 3
left left left
diag diag diag left
up diag diag
diag
The Needleman-Wunsch algorithm for sequence alignment – p.36/46
The traceback
Traceback = the process of deduction of the best
alignment from the traceback matrix.
The traceback always begins with the last cell to be filled
with the score, i.e. the bottom right cell.
One moves according to the traceback value written in the
cell.
There are three possible moves: diagonally (toward the
top-left corner of the matrix), up, or left.
The traceback is completed when the first, top-left cell of
the matrix is reached (”done” cell).
The Needleman-Wunsch algorithm for sequence alignment – p.37/46
The traceback path
The traceback performed on the completed traceback
matrix:








































































A
N
D
S E N D
diag left left left
up
done left left left left
up diag diag diag left
up up diag diag
diag starts here
Traceback
The Needleman-Wunsch algorithm for sequence alignment – p.38/46
The best alignment
The alignment is deduced from the values of cells along
the traceback path, by taking into account the values of
the cell in the traceback matrix:
diag – the letters from two sequences are aligned
left – a gap is introduced in the left sequence
up – a gap is introduced in the top sequence
Sequences are aligned backwards.
The Needleman-Wunsch algorithm for sequence alignment – p.39/46
The traceback step-by-step (1)
The first cell from the traceback path is ”diag” implying
that the corresponding letters are aligned:
D
D


































































A
N
D
S E N D
diag left left left
up
done left left left left
up diag diag diag left
up up diag diag
diag starts here
Traceback
The Needleman-Wunsch algorithm for sequence alignment – p.40/46
The traceback step-by-step (2)
The second cell from the traceback path is also ”diag”
implying that the corresponding letters are aligned:
ND
ND


































































A
N
D
S E N D
diag left left left
up
done left left left left
up diag diag diag left
up up diag diag
diag starts here
Traceback
The Needleman-Wunsch algorithm for sequence alignment – p.41/46
The traceback step-by-step (3)
The third cell from the traceback path is ”left” implying
the gap in the left sequence (i.e. we stay on the letter A
from the left sequence):
END
-ND


































































A
N
D
S E N D
diag left left left
up
done left left left left
up diag diag diag left
up up diag diag
diag starts here
Traceback
The Needleman-Wunsch algorithm for sequence alignment – p.42/46
The traceback step-by-step (4)
The fourth cell from the traceback path is also ”diag”
implying that the corresponding letters are aligned. We
consider the letter A again, this time it is aligned with S:
SEND
A-ND


































































A
N
D
S E N D
diag left left left
up
done left left left left
up diag diag diag left
up up diag diag
diag starts here
Traceback
The Needleman-Wunsch algorithm for sequence alignment – p.43/46
Compare with the exhaustive search
The best alignment via the Needleman-Wunsch algorithm:
SEND
A-ND
The exhaustive search:
SEND
-AND score: +1
A-ND score: +3 ← the best
AN-D score: -3
AND- score: -8
The Needleman-Wunsch algorithm for sequence alignment – p.44/46
A few observations
It was much easier to align SEND and AND by the
exhaustive search!
As we consider longer sequences the situation quickly
turns against the exhaustive search:
Two 12 residue sequences would require considering
∼ 1 million alignments.
Two 150 residue sequences would require considering
∼ 1088 alignments (∼ 1078 is the estimated number of
atoms in the Universe).
For two 150 residue sequences the Needleman-Wunsch
algorithm requires filling a 150 × 150 matrix.
The Needleman-Wunsch algorithm for sequence alignment – p.45/46
The summary
The alignment is over the entire length of two sequences:
the traceback starts from the lower right corner of the
traceback matrix, and completes in the upper left cell of
the matrix.
The Needleman-Wunsch algorithm works in the same way
regardless of the length or complexity of sequences, and
guarantees to find the best alignment.
The Needleman-Wunsch algorithm is appropriate for
finding the best alignment of two sequences which are (i)
of the similar length; (ii) similar across their entire lengths.
The Needleman-Wunsch algorithm for sequence alignment – p.46/46

More Related Content

PPT
20100515 bioinformatics kapushesky_lecture07
PDF
PCB_Lect02_Pairwise_allign (1).pdf
PPT
Seq alignment
PPT
Alignments
PPTX
Needleman-wunch algorithm harshita
PPTX
2016 bioinformatics i_alignments_wim_vancriekinge
PPTX
2015 bioinformatics alignments_wim_vancriekinge
PDF
Neural Networks: Least Mean Square (LSM) Algorithm
20100515 bioinformatics kapushesky_lecture07
PCB_Lect02_Pairwise_allign (1).pdf
Seq alignment
Alignments
Needleman-wunch algorithm harshita
2016 bioinformatics i_alignments_wim_vancriekinge
2015 bioinformatics alignments_wim_vancriekinge
Neural Networks: Least Mean Square (LSM) Algorithm

Similar to NeedlemanWunsch.pdf (20)

PDF
Sequence alignment
PDF
SVD and the Netflix Dataset
PPTX
pptx - Psuedo Random Generator for Halfspaces
PPTX
pptx - Psuedo Random Generator for Halfspaces
PPTX
Second Order Heuristics in ACGP
PPT
B.sc biochem i bobi u 3.2 algorithm + blast
PPT
B.sc biochem i bobi u 3.2 algorithm + blast
PDF
Seminar Report (Final)
PPTX
Pairwise Sequence Alignment is alignment.pptx
PDF
sequence alignment
PDF
Group Project
PPTX
Sequence Alignment - Data Bioinformatics Introduction
PDF
A quantum-inspired optimization heuristic for the multiple sequence alignment...
PPT
Needleman wunsch computional ppt
PDF
Ch06 alignment
PDF
Swaati algorithm of alignment ppt
PDF
A simple method to find a robust output feedback controller by random search ...
PDF
Intro to Quant Trading Strategies (Lecture 7 of 10)
PPT
Analysis of Algorithm
PDF
02-alignment.pdf
Sequence alignment
SVD and the Netflix Dataset
pptx - Psuedo Random Generator for Halfspaces
pptx - Psuedo Random Generator for Halfspaces
Second Order Heuristics in ACGP
B.sc biochem i bobi u 3.2 algorithm + blast
B.sc biochem i bobi u 3.2 algorithm + blast
Seminar Report (Final)
Pairwise Sequence Alignment is alignment.pptx
sequence alignment
Group Project
Sequence Alignment - Data Bioinformatics Introduction
A quantum-inspired optimization heuristic for the multiple sequence alignment...
Needleman wunsch computional ppt
Ch06 alignment
Swaati algorithm of alignment ppt
A simple method to find a robust output feedback controller by random search ...
Intro to Quant Trading Strategies (Lecture 7 of 10)
Analysis of Algorithm
02-alignment.pdf
Ad

Recently uploaded (20)

PPTX
4. Diagnosis and treatment planning in RPD.pptx
PPTX
Diploma pharmaceutics notes..helps diploma students
PDF
Myanmar Dental Journal, The Journal of the Myanmar Dental Association (2015).pdf
PDF
Everyday Spelling and Grammar by Kathi Wyldeck
PDF
Journal of Dental Science - UDMY (2020).pdf
PPTX
UNIT_2-__LIPIDS[1].pptx.................
PPTX
Climate Change and Its Global Impact.pptx
PDF
Solved Past paper of Pediatric Health Nursing PHN BS Nursing 5th Semester
PDF
Nurlina - Urban Planner Portfolio (english ver)
PDF
fundamentals-of-heat-and-mass-transfer-6th-edition_incropera.pdf
PDF
LIFE & LIVING TRILOGY- PART (1) WHO ARE WE.pdf
PDF
0520_Scheme_of_Work_(for_examination_from_2021).pdf
PDF
Compact First Student's Book Cambridge Official
PDF
faiz-khans about Radiotherapy Physics-02.pdf
PPTX
Reproductive system-Human anatomy and physiology
PDF
PUBH1000 - Module 6: Global Health Tute Slides
PPTX
BSCE 2 NIGHT (CHAPTER 2) just cases.pptx
PDF
FYJC - Chemistry textbook - standard 11.
PDF
Disorder of Endocrine system (1).pdfyyhyyyy
PPTX
2025 High Blood Pressure Guideline Slide Set.pptx
4. Diagnosis and treatment planning in RPD.pptx
Diploma pharmaceutics notes..helps diploma students
Myanmar Dental Journal, The Journal of the Myanmar Dental Association (2015).pdf
Everyday Spelling and Grammar by Kathi Wyldeck
Journal of Dental Science - UDMY (2020).pdf
UNIT_2-__LIPIDS[1].pptx.................
Climate Change and Its Global Impact.pptx
Solved Past paper of Pediatric Health Nursing PHN BS Nursing 5th Semester
Nurlina - Urban Planner Portfolio (english ver)
fundamentals-of-heat-and-mass-transfer-6th-edition_incropera.pdf
LIFE & LIVING TRILOGY- PART (1) WHO ARE WE.pdf
0520_Scheme_of_Work_(for_examination_from_2021).pdf
Compact First Student's Book Cambridge Official
faiz-khans about Radiotherapy Physics-02.pdf
Reproductive system-Human anatomy and physiology
PUBH1000 - Module 6: Global Health Tute Slides
BSCE 2 NIGHT (CHAPTER 2) just cases.pptx
FYJC - Chemistry textbook - standard 11.
Disorder of Endocrine system (1).pdfyyhyyyy
2025 High Blood Pressure Guideline Slide Set.pptx
Ad

NeedlemanWunsch.pdf

  • 1. The Needleman-Wunsch algorithm for sequence alignment 7th Melbourne Bioinformatics Course Vladimir Likić, Ph.D. e-mail: [email protected] Bio21 Molecular Science and Biotechnology Institute The University of Melbourne The Needleman-Wunsch algorithm for sequence alignment – p.1/46
  • 2. Outline of the talk Sequence comparison and sequence alignment. Types of sequence alignment. The scoring scheme, substitution matrices, gaps. The Needleman-Wunsch algorithm for global sequence alignment. The Needleman-Wunsch algorithm for sequence alignment – p.2/46
  • 3. Sequence comparison To observe patterns of conservation (or variability). To find the common motifs present in both sequences. To assess whether it is likely that two sequences evolved from the same sequence. To find out which sequences from the database are similar to the sequence at hand. The Needleman-Wunsch algorithm for sequence alignment – p.3/46
  • 4. Two routes for sequence comparison dotplot – visual, qualitative sequence alignment – exact and quantitative. Involves: 1. Construction of the best alignment between the sequences. 2. Assessment of the similarity from the alignment. There are three different types of sequence alignment: Global alignment Local alignment Multiple sequence alignment The Needleman-Wunsch algorithm for sequence alignment – p.4/46
  • 5. Global sequence alignment The best alignment over the entire length of two sequences Suitable when the two sequences are of similar length, with a significant degree of similarity throughout. Example: SIMILARITY PI-LLAR--- The Needleman-Wunsch algorithm for sequence alignment – p.5/46
  • 6. Local sequence alignment Involving stretches that are shorter than the entire sequences, possibly more than one. Suitable when comparing substantially different sequences, which possibly differ significantly in length, and have only a short patches of similarity. For example, the local alignment of SIMILARITY and PILLAR: MILAR ILLAR The Needleman-Wunsch algorithm for sequence alignment – p.6/46
  • 7. Multiple sequence alignment Simultaneous alignment of more than two sequences. Suitable when searching for subtle conserved sequence patterns in a protein family, and when more than two sequences of the protein family are available. For example: SIMILARITY PI-LLAR--- --MOLARITY The Needleman-Wunsch algorithm for sequence alignment – p.7/46
  • 8. Alignment ”by eye” Consider the ”best” alignment of ATGGCGT and ATGAGT ATGGCGT *** !** ATG-AGT Intuitively we seek an alignment to maximize the number of residue-to-residue matches. The Needleman-Wunsch algorithm for sequence alignment – p.8/46
  • 9. A mathematical framework Sequence alignment is the establishment of residue-to-residue correspondence between two or more sequences such that the order of residues in each sequence is preserved. A gap, which indicates a residue-to-nothing match, may be introduced in either sequence. A gap-to-gap match is meaningless and is not allowed. The Needleman-Wunsch algorithm for sequence alignment – p.9/46
  • 10. The scoring scheme Give two sequences we need a number to associate with each possible alignment (i.e. the alignment score = goodness of alignment). The scoring scheme is a set of rules which assigns the alignment score to any given alignment of two sequences. 1. The scoring scheme is residue based: it consists of residue substitution scores (i.e. score for each possible residue alignment), plus penalties for gaps. 2. The alignment score is the sum of substitution scores and gap penalties. The Needleman-Wunsch algorithm for sequence alignment – p.10/46
  • 11. A simple scoring scheme Use +1 as a reward for a match, -1 as the penalty for a mismatch, and ignore gaps The best alignment ”by eye” from before: ATGGCGT ATG-AGT score: +1 + 1 + 1 + 0 − 1 + 1 + 1 = 4 An alternative alignment: ATGGCGT A-TGAGT score: +1 + 0 − 1 + 1 − 1 + 1 + 1 = 2 The Needleman-Wunsch algorithm for sequence alignment – p.11/46
  • 12. The substitution matrix A concise way to express the residue substitution costs can be achieved with a N × N matrix (N is 4 for DNA and 20 for proteins). The substitution matrix for the simple scoring scheme: C T A G C 1 -1 -1 -1 T -1 1 -1 -1 A -1 -1 1 -1 G -1 -1 -1 1 The Needleman-Wunsch algorithm for sequence alignment – p.12/46
  • 13. A better substitution matrix A, G are purines (pyrimidine ring fused to an imidazole ring), T, C are pyrimidines (one six-membered ring). Assume we believe that from evolutionary standpoint purine/pyrimidine mutations are less likely to occur compared to purine/purine (pyrimidine/pyrimidine) mutations. Can we capture this in a substitution matrix? C T A G C 2 1 -1 -1 T 1 2 -1 -1 A -1 -1 2 1 G -1 -1 1 2 The Needleman-Wunsch algorithm for sequence alignment – p.13/46
  • 14. Protein substitution matrices Protein substitution matrices are significantly more complex than DNA scoring matrices. Proteins are composed of twenty amino acids, and physico-chemical properties of individual amino acids vary considerably. A protein substitution matrix can be based on any property of amino acids: size, polarity, charge, hydrophobicity. In practice the most important are evolutionary substitution matrices. The Needleman-Wunsch algorithm for sequence alignment – p.14/46
  • 15. Evolutionary substitution matrices PAM (”point accepted mutation”) family PAM250, PAM120, etc. BLOSUM (”Blocks substitution matrix”) family BLOSUM62, BLOSUM50, etc. The substitution scores of both PAM and BLOSUM matrices are derived from the analysis of known alignments of closely related proteins. The BLOSUM matrices are newer and considered better. The Needleman-Wunsch algorithm for sequence alignment – p.15/46
  • 16. BLOSUM62 substitution matrix The Needleman-Wunsch algorithm for sequence alignment – p.16/46
  • 17. Gaps So far we ignored gaps (amounts to gap penalty of 0) A gap corresponds to an insertion or a deletion of a residue A conventional wisdom dictates that the penalty for a gap must be several times greater than the penalty for a mutation. That is because a gap/extra residue Interrupts the entire polymer chain In DNA shifts the reading frame The Needleman-Wunsch algorithm for sequence alignment – p.17/46
  • 18. Gap initiation and extension The conventional wisdom: the creation of a new gap should be strongly disfavored. However, once created insertions/deletions of chunks of more than one residue should be much less expensive (i.e. insertion of domains often occurs). A simple yet effective solution is affine gap penalties: γ(n) = −o − (n − 1)e The Needleman-Wunsch algorithm for sequence alignment – p.18/46
  • 19. Affine gaps: a physical insight Affine gaps favor the alignment: ATGTAGTGTATAGTACATGCA ATGTAG-------TACATGCA Over the alignment: ATGTAGTGTATAGTACATGCA ATGTA--G--TA---CATGCA Exactly what we want from the biological viewpoint. The Needleman-Wunsch algorithm for sequence alignment – p.19/46
  • 20. The alignment score with BLOSUM62 Consider two alternative alignments of ANRGDFS and ANREFS with the gap opening penalty of 10: ANRGDFS ANR-EFS score: 4+6+5−10+2+6+4 = 17 ANRGDFS ANRE-FS score: 4+6+5−2−10+6+4 = 13 The scoring scheme provides us with the quantitative measure of how good is some alignment relative to alternative alignments. However the scoring scheme does not tell us how to find the best alignment. The Needleman-Wunsch algorithm for sequence alignment – p.20/46
  • 21. How do we find the best alignment? Brute-force approach: Generate the list all possible alignments between two sequences, score them Select the alignment with the best score The number of possible global alignments between two sequences of length N is 22N √ πN For two sequences of 250 residues this is ∼ 10149 The Needleman-Wunsch algorithm for sequence alignment – p.21/46
  • 22. The Needleman-Wunsch algorithm A smart way to reduce the massive number of possibilities that need to be considered, yet still guarantees that the best solution will be found (Saul Needleman and Christian Wunsch, 1970). The basic idea is to build up the best alignment by using optimal alignments of smaller subsequences. The Needleman-Wunsch algorithm is an example of dynamic programming, a discipline invented by Richard Bellman (an American mathematician) in 1953! The Needleman-Wunsch algorithm for sequence alignment – p.22/46
  • 23. How does dynamic programming work? A divide-and-conquer strategy: Break the problem into smaller subproblems. Solve the smaller problems optimally. Use the sub-problem solutions to construct an optimal solution for the original problem. Dynamic programming can be applied only to problems exhibiting the properties of overlapping subproblems. Examples include Trevelling salesman problem Finding the best chess move The Needleman-Wunsch algorithm for sequence alignment – p.23/46
  • 24. The mathematics A matrix D(i, j) indexed by residues of each sequence is built recursively, such that D(i, j) = max      D(i − 1, j − 1) + s(xi, yj) D(i − 1, j) + g D(i, j − 1) + g subject to a boundary conditions. s(i, j) is the substitution score for residues i and j, and g is the gap penalty. The Needleman-Wunsch algorithm for sequence alignment – p.24/46
  • 25. A walk-through: an overview We consider all possible pairs of residue from two sequences (this gives rise to a 2D matrix representation). We will have two matrices: the score matrix and traceback matrix. The Needleman-Wunsch algorithm consists of three steps: 1. Initialisation of the score matrix 2. Calculation of scores and filling the traceback matrix 3. Deducing the alignment from the traceback matrix The Needleman-Wunsch algorithm for sequence alignment – p.25/46
  • 26. Consider the simple example The alignment of two sequences SEND and AND with the BLOSUM62 substitution matrix and gap opening penalty of 10 (no gap extension): SEND -AND score: +1 A-ND score: +3 ← the best AN-D score: -3 AND- score: -8 The Needleman-Wunsch algorithm for sequence alignment – p.26/46
  • 27. The score and traceback matrices The cells of the score matrix are labelled C(i, j) where i = 1, 2, ..., N and j = 1, 2, ..., M S E N D A N D S E N D C(4,1) C(4,2) C(4,3) C(4,4) C(4,5) C(3,1) C(3,2) C(3,3) C(3,4) C(3,5) C(1,1) C(1,2) C(1,3) C(1,4) C(1,5) C(2,1) C(2,2) C(2,3) C(2,4) C(2,5) The Needleman-Wunsch algorithm for sequence alignment – p.27/46
  • 28. Initialization The first row and the first column of the score and traceback matrices are filled during the initialization. S E N D A N D S E N D 0 −10 −20 −40 −20 −30 −10 left left left left up up up done −30 The Needleman-Wunsch algorithm for sequence alignment – p.28/46
  • 29. Scoring The score matrix cells are filled by row starting from the cell C(2, 2) The score of any cell C(i, j) is the maximum of: qdiag = C(i − 1, j − 1) + S(i, j) qup = C(i − 1, j) + g qleft = C(i, j − 1) + g where S(i, j) is the substitution score for letters i and j, and g is the gap penalty. The Needleman-Wunsch algorithm for sequence alignment – p.29/46
  • 30. Scoring– a pictorial representation The value of the cell C(i, j) depends only on the values of the immediately adjacent northwest diagonal, up, and left cells: C(i−1,j−1) C(i−1,j) C(i,j−1) C(i,j) The Needleman-Wunsch algorithm for sequence alignment – p.30/46
  • 31. The Needleman-Wunsch progression The first step is to calculate the value of C(2, 2): S E N D A N D S E N D 0 −10 −20 −40 −20 −30 −10 left left left left up up up done −30 ? ? The Needleman-Wunsch algorithm for sequence alignment – p.31/46
  • 32. The value of C(2, 2) The calculation for the cell C(2, 2): qdiag = C(1, 1) + S(S, A) = 0 + 1 = 1 qup = C(1, 2) + g = −10 + (−10) = −20 qleft = C(2, 1) + g = −10 + (−10) = −20 Where C(1, 1), C(1, 2), and C(2, 1) are read from the score matrix, and S(S, A) is the score for the S ↔ A taken from the BLOSUM62 matrix. The Needleman-Wunsch algorithm for sequence alignment – p.32/46
  • 33. Filling the score and traceback matrices For the score matrix C(2, 2) = qdiag which is 1. The corresponding cell of the traceback matrix is ”diag”: S E N D A N D S E N D 0 −10 −20 −40 −20 −30 −10 left left left left up up up done −30 1 diag The Needleman-Wunsch algorithm for sequence alignment – p.33/46
  • 34. The progression is recursive The next step is to calculate C(2, 3): S E N D A N D S E N D 0 −10 −20 −40 −20 −30 −10 left left left left up up up done −30 1 diag ? ? The Needleman-Wunsch algorithm for sequence alignment – p.34/46
  • 35. The value of C(2, 3) The calculation for the cell C(2, 3) qdiag = C(1, 2) + S(E, A) = −10 + −1 = −11 qup = C(1, 3) + g = −20 + (−10) = −30 qleft = C(2, 2) + g = 1 + (−10) = −9 Thus C(2, 3) = −9 and the corresponding cell of the traceback matrix is ”left”. The Needleman-Wunsch algorithm for sequence alignment – p.35/46
  • 36. The final score and traceback matrices After all cells are filled, the score and traceback matrices are: S E N D A N D S E N D 0 −10 −20 −40 −20 −30 −10 left left left left up up up done −30 1 diag −9 −19 −29 −9 −1 −3 −13 −19 −11 2 3 left left left diag diag diag left up diag diag diag The Needleman-Wunsch algorithm for sequence alignment – p.36/46
  • 37. The traceback Traceback = the process of deduction of the best alignment from the traceback matrix. The traceback always begins with the last cell to be filled with the score, i.e. the bottom right cell. One moves according to the traceback value written in the cell. There are three possible moves: diagonally (toward the top-left corner of the matrix), up, or left. The traceback is completed when the first, top-left cell of the matrix is reached (”done” cell). The Needleman-Wunsch algorithm for sequence alignment – p.37/46
  • 38. The traceback path The traceback performed on the completed traceback matrix: A N D S E N D diag left left left up done left left left left up diag diag diag left up up diag diag diag starts here Traceback The Needleman-Wunsch algorithm for sequence alignment – p.38/46
  • 39. The best alignment The alignment is deduced from the values of cells along the traceback path, by taking into account the values of the cell in the traceback matrix: diag – the letters from two sequences are aligned left – a gap is introduced in the left sequence up – a gap is introduced in the top sequence Sequences are aligned backwards. The Needleman-Wunsch algorithm for sequence alignment – p.39/46
  • 40. The traceback step-by-step (1) The first cell from the traceback path is ”diag” implying that the corresponding letters are aligned: D D A N D S E N D diag left left left up done left left left left up diag diag diag left up up diag diag diag starts here Traceback The Needleman-Wunsch algorithm for sequence alignment – p.40/46
  • 41. The traceback step-by-step (2) The second cell from the traceback path is also ”diag” implying that the corresponding letters are aligned: ND ND A N D S E N D diag left left left up done left left left left up diag diag diag left up up diag diag diag starts here Traceback The Needleman-Wunsch algorithm for sequence alignment – p.41/46
  • 42. The traceback step-by-step (3) The third cell from the traceback path is ”left” implying the gap in the left sequence (i.e. we stay on the letter A from the left sequence): END -ND A N D S E N D diag left left left up done left left left left up diag diag diag left up up diag diag diag starts here Traceback The Needleman-Wunsch algorithm for sequence alignment – p.42/46
  • 43. The traceback step-by-step (4) The fourth cell from the traceback path is also ”diag” implying that the corresponding letters are aligned. We consider the letter A again, this time it is aligned with S: SEND A-ND A N D S E N D diag left left left up done left left left left up diag diag diag left up up diag diag diag starts here Traceback The Needleman-Wunsch algorithm for sequence alignment – p.43/46
  • 44. Compare with the exhaustive search The best alignment via the Needleman-Wunsch algorithm: SEND A-ND The exhaustive search: SEND -AND score: +1 A-ND score: +3 ← the best AN-D score: -3 AND- score: -8 The Needleman-Wunsch algorithm for sequence alignment – p.44/46
  • 45. A few observations It was much easier to align SEND and AND by the exhaustive search! As we consider longer sequences the situation quickly turns against the exhaustive search: Two 12 residue sequences would require considering ∼ 1 million alignments. Two 150 residue sequences would require considering ∼ 1088 alignments (∼ 1078 is the estimated number of atoms in the Universe). For two 150 residue sequences the Needleman-Wunsch algorithm requires filling a 150 × 150 matrix. The Needleman-Wunsch algorithm for sequence alignment – p.45/46
  • 46. The summary The alignment is over the entire length of two sequences: the traceback starts from the lower right corner of the traceback matrix, and completes in the upper left cell of the matrix. The Needleman-Wunsch algorithm works in the same way regardless of the length or complexity of sequences, and guarantees to find the best alignment. The Needleman-Wunsch algorithm is appropriate for finding the best alignment of two sequences which are (i) of the similar length; (ii) similar across their entire lengths. The Needleman-Wunsch algorithm for sequence alignment – p.46/46