SlideShare a Scribd company logo
Pairwise Sequence Alignment Misha Kapushesky Slides:  Stuart M. Brown, Fourie Joubert, NYU St. Petersburg Russia 2010
Protein Evolution “ For many protein sequences, evolutionary history can be traced back 1-2 billion years” -William Pearson When we align sequences, we assume that they share a common ancestor They are then homologous Protein fold is much more conserved than protein sequence DNA sequences tend to be less informative than protein sequences
Definition Homology: related by descent Homologous sequence positions ATTGCGC    ATTGCGC    ATCCGC C ATTGCGC AT-CCGC  ATTGCGC
Orthologous and paralogous Orthologous sequences differ because they are found in different species (a speciation event) Paralogous sequences differ due to a gene duplication event Sequences may be both orthologous and paralogous
Pairwise Alignment The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem.  There are lots of possible alignments. Two sequences can  always  be aligned. Sequence alignments have to be  scored . Often there is  more than one  solution with the same score.
Methods of Alignment By hand - slide sequences on two lines of a word processor Dot plot with windows Rigorous mathematical approach Dynamic programming (slow, optimal) Heuristic methods (fast, approximate) BLAST and FASTA  Word matching and hash tables0
Align by Hand GATC GCCTA_TTACGTCCTGGAC  <-- --> AGGC A TAC G TA_G C CCT TTCGC You still need some kind of scoring system to find the best alignment
Percent Sequence Identity The extent to which two nucleotide or amino acid sequences are invariant A C  C  T G  A  G  –  A G  A C  G  T G  –  G  C  A G 70% identical mismatch indel
Dotplot: A              T         T         C            A             C        A             T         A             T A C A T T A C G T A C Sequence 1 Sequence 2 A dotplot gives an overview of all possible alignments
Dotplot: A              T         T         C            A             C        A             T         A             T A C A T T A C G T A C T A C A T T A C G T A C A T A C A C T T A Sequence 1 Sequence 2 One possible alignment: In a dotplot each diagonal corresponds to a possible (ungapped) alignment
Insertions / Deletions in a Dotplot T A C T G T C A T T  A  C  T  G  T  T  C  A  T Sequence 1 Sequence 2 T A C T G   -   T C A T | | | | |  | | | | T A C T G T T C A T
Hemoglobin   -chain Hemoglobin  -chain Dotplot   (Window = 130 /  Stringency = 9)
Word Size Algorithm T  A C G  G T A T G A C A  G T A T C T A  C G G  T A T G A  C A G  T A T C T A C  G G T  A T G A C  A G T  A T C T A C G  G T A  T G A C A  G T A  T C C         T         A       T          G       A               C   A              T A C G G T A T G Word Size = 3 
PTHPL ASKTQILPEDLA SEDLTI PTHPL AGERAIGLARLA EEDFGM Score = 7  PTH PLASKTQILPED LASEDLTI PTH PLAGERAIGLAR LAEEDFGM Score = 11  Matrix: PAM250 Window = 12  Stringency = 9 Scoring Matrix Filtering PTHP LASKTQILPEDL ASEDLTI PTHP LAGERAIGLARL AEEDFGM Score = 11  Window / Stringency
Dotplot   (Window = 18  /  Stringency = 10) Hemoglobin  -chain Hemoglobin   -chain
Considerations The window/stringency method is more sensitive than the wordsize method (ambiguities are permitted). The smaller the window, the larger the weight of statistical (unspecific) matches. With large windows the sensitivity for short sequences is reduced. Insertions/deletions are not treated explicitly.
Alignment methods Rigorous algorithms = Dynamic Programming Needleman-Wunsch  (global) Smith-Waterman  (local) Heuristic algorithms  (faster but approximate) BLAST FASTA
The Rocks game N rocks, 2 piles, 2 players Player can Remove 1 rock from either pile Remove 1 rock from each pile Last to remove a rock wins Assume 10 rocks in each pile – winning strategy?
Dynamic Programming
Basic principles of dynamic programming - Creation of an  alignment path matrix -  Stepwise  calculation of score values -  Backtracking  (evaluation of the optimal path)
Dynamic Programming Dynamic Programming is a very general programming technique.  It is applicable when a large search space can be structured into a succession of stages, such that:  the initial stage contains trivial solutions to sub-problems  each partial solution in a later stage can be calculated by recurring a fixed number of partial solutions in an earlier stage the final stage contains the overall solution
Creation of an alignment path matrix Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences Construct matrix  F  indexed by  i  and  j  (one index for each sequence) F ( i,j ) is the score of the best alignment between the initial  segment  x 1... i   of  x  up to  x i   and the initial segment  y 1... j   of  y  up to  y j Build  F ( i,j ) recursively beginning with  F (0,0) = 0
 
If  F ( i -1, j -1),  F ( i -1, j ) and  F ( i , j -1) are known we can calculate  F ( i , j ) Three possibilities: x i   and  y j  are aligned,  F ( i , j ) =  F ( i -1, j -1) + s( x i  , y j ) x i  is aligned to a gap,  F ( i , j ) =  F ( i -1, j ) -  d y j   is aligned to a gap,  F ( i , j ) =  F ( i , j -1) -  d The best score up to ( i , j ) will be the  largest  of the three options Creation of an alignment path matrix
H  E  A  G  A  W  G  H  E  E  0  -8  -16  -24  -32  -40  -48  -56  -64  -72  -80  P  -8  -2  -9  -17  -25  -33  -42  -49  -57  -65  -73 A -16  -10  -3  -4  -12  -20  -28  -36  -44  -52  -60 W -24  -18  -11  -6  -7  -15  -5  -13  -21  -29  -37 H -32  -14  -18  -13  -8  -9  -13  -7  -3  -11  -19 E -40  -22  -8  -16  -16  -9  -12  -15  -7  3  -5 A -48  -30  -16  -3  -11  -11  -12  -12  -15  -5  2 E -56  -38  -24  -11  -6  -12  -14  -15  -12  -9  1 Backtracking -5 1 - A E E H H G - W W A A G - A P E - H - 0 -25 -5 -20 -13 -3 3 -8 -16 -17 Optimal global alignment: E E
Global vs.  Local  Alignments Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached. Local alignment algorithms finds the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
 
needle  (Needleman & Wunsch)   creates an  end-to-end  alignment. Global Alignment Two closely related sequences:
Two sequences sharing several regions of local similarity: 1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67 1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG  70 ||||||||||||||  |  |  | ||||  ||  | |  | ||   Global Alignment
Global Alignment (Needleman-Wunsch) The the Needleman-Wunsch algorithm creates a global alignment over the length of both sequences  (needle) Global algorithms are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence.  Sometimes two sequences may be derived from ancient  recombination events where only a single functional domain is shared. Global methods are useful when you want to force two sequences to align over their entire length
Local Alignment (Smith-Waterman) Local alignment Identify the most similar sub-region shared between two sequences Smith-Waterman EMBOSS:  water
Parameters of  Sequence  Alignment Scoring Systems: Each symbol pairing is assigned a numerical value, based on a symbol comparison table. Gap Penalties: Opening: The cost to introduce a gap Extension: The cost to elongate a gap
DNA Scoring Systems -very simple Sequence 1 Sequence 2 A G C T A 1 0 0 0 G 0 1 0 0 C 0 0 1 0 T 0 0 0 1 Match: 1 Mismatch: 0 Score = 5 actaccagttcatttgatacttctcaaa taccattaccgtgttaactgaaaggacttaaagact
Protein Scoring Systems PTHPLASKTQILPEDLASEDLTI PTHPLAGERAIGLARLAEEDFGM Sequence 1 Sequence 2 Scoring matrix T:G =  -2  T:T  =  5 Score =  48   C S T P A G N D . . C  9 S -1  4 T -1  1  5 P -3 -1 -1  7 A  0  1  0 -1  4 G -3  0 -2 -2  0  6 N -3  1  0 -2 -2  0  5 D -3  0 -1 -1 -2 -1  1  6  . .   C S T P A G N D . . C  9 S -1  4 T -1  1  5 P -3 -1 -1  7 A  0  1  0 -1  4 G -3  0 -2 -2  0  6 N -3  1  0 -2 -2  0  5 D -3  0 -1 -1 -2 -1  1  6  . .
Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution. C P G G A V I L M F Y W H K R E Q D N S T C SH S+S positive charged polar aliphatic aromatic small tiny hydrophobic Protein Scoring Systems
Scoring matrices reflect: # of mutations to convert one to another chemical similarity –  observed mutation frequencies –  the probability of occurrence of each amino acid Widely used scoring matrices: PAM  BLOSUM Protein Scoring Systems
PAM matrices Family of matrices PAM 80, PAM 120, PAM 250 The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based Greater numbers denote greater distances
PAM  ( P ercent  A ccepted  M utations)  matrices The numbers of replacements were used to compute a so-called PAM-1 matrix. The PAM-1 matrix reflects an average change of 1% of all amino  acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix. PAM250 = 250 mutations per 100 residues. Greater numbers mean bigger evolutionary distance
PAM  ( P ercent  A ccepted  M utations)  matrices Derived from global alignments of  protein families  . Family members share at least 85% identity ( Dayhoff  et al ., 1978 ). Construction of phylogenetic tree and ancestral sequences of each protein family Computation of number of replacements for each pair of amino acids
A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V  B  Z A  2 -2  0  0 -2  0  0  1 -1 -1 -2 -1 -1 -3  1  1  1 -6 -3  0  2  1  R  -2  6  0 -1 -4  1 -1 -3  2 -2 -3  3  0 -4  0  0 -1  2 -4 -2  1  2  N  0  0  2  2 -4  1  1  0  2 -2 -3  1 -2 -3  0  1  0 -4 -2 -2  4  3  D  0 -1  2  4 -5  2  3  1  1 -2 -4  0 -3 -6 -1  0  0 -7 -4 -2  5  4  C  -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3  0 -2 -8  0 -2 -3 -4  Q  0  1  1  2 -5  4  2 -1  3 -2 -2  1 -1 -5  0 -1 -1 -5 -4 -2  3  5  E  0 -1  1  3 -5  2  4  0  1 -2 -3  0 -2 -5 -1  0  0 -7 -4 -2  4  5  G  1 -3  0  1 -3 -1  0  5 -2 -3 -4 -2 -3 -5  0  1  0 -7 -5 -1  2  1  H  -1  2  2  1 -3  3  1 -2  6 -2 -2  0 -2 -2  0 -1 -1 -3  0 -2  3  3  I  -1 -2 -2 -2 -2 -2 -2 -3 -2  5  2 -2  2  1 -2 -1  0 -5 -1  4 -1 -1  L  -2 -3 -3 -4 -6 -2 -3 -4 -2  2  6 -3  4  2 -3 -3 -2 -2 -1  2 -2 -1  K  -1  3  1  0 -5  1  0 -2  0 -2 -3  5  0 -5 -1  0  0 -3 -4 -2  2  2  M  -1  0 -2 -3 -5 -1 -2 -3 -2  2  4  0  6  0 -2 -2 -1 -4 -2  2 -1  0  F  -3 -4 -3 -6 -4 -5 -5 -5 -2  1  2 -5  0  9 -5 -3 -3  0  7 -1 -3 -4  P  1  0  0 -1 -3  0 -1  0  0 -2 -3 -1 -2 -5  6  1  0 -6 -5 -1  1  1  S  1  0  1  0  0 -1  0  1 -1 -1 -3  0 -2 -3  1  2  1 -2 -3 -1  2  1  T  1 -1  0  0 -2 -1  0  0 -1  0 -2  0 -1 -3  0  1  3 -5 -3  0  2  1  W  -6  2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4  0 -6 -2 -5 17  0 -6 -4 -4  Y  -3 -4 -2 -4  0 -4 -4 -5  0 -1 -1 -4 -2  7 -5 -3 -3  0 10 -2 -2 -3  V  0 -2 -2 -2 -2 -2 -2 -1 -2  4  2 -2  2 -1 -1 -1  0 -6 -2  4  0  0  B  2  1  4  5 -3  3  4  2  3 -1 -2  2 -1 -3  1  2  2 -4 -2  0  6  5  Z  1  2  3  4 -4  5  5  1  3 -1 -1  2  0 -4  1  1  1 -4 -3  0  5  6   PAM 250 C -8 17 W W
PAM - limitations  Based on only one original dataset Examines proteins with few differences (85% identity) Based mainly on small globular proteins so the matrix is biased
BLOSUM matrices Different BLOSUM n  matrices are calculated independently from BLOCKS (ungapped local alignments) BLOSUM n  is based on a cluster of BLOCKS of sequences that share at least  n  percent identity BLOSUM 62  represents closer sequences than BLOSUM 45
Derived from alignments of domains of  distantly  related proteins ( Henikoff & Henikoff,1992 ). Occurrences of each amino acid pair in each column of each block alignment is counted. The numbers derived from all blocks were  used to compute the BLOSUM matrices. A A C E C A - C  = 4 A - E  = 2 C - E  = 2 A - A  = 1 C - C  = 1 BLOSUM  ( Blo cks  Su bstitution  M atrix) A A C E C
The Blosum50 Scoring Matrix
BLOSUM  ( Blo cks  Su bstitution  M atrix) Sequences within blocks are clustered according to their level of identity. Clusters are counted as a single sequence. Different BLOSUM matrices differ in the percentage of sequence identity used in clustering. The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix. Greater numbers mean smaller evolutionary distance.
PAM Vs. BLOSUM PAM100  =  BLOSUM90 PAM120  =  BLOSUM80 PAM160  =  BLOSUM60 PAM200  =  BLOSUM52 PAM250  =  BLOSUM45 More distant sequences BLOSUM62 for general use BLOSUM80 for close relations BLOSUM45 for distant relations PAM120 for general use PAM60 for close relations  PAM250 for distant relations
TIPS on choosing a scoring matrix Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches ( Henikoff & Henikoff, 1993 ). When comparing  closely   related   proteins one should use  lower   PAM   or higher BLOSUM  matrices, for  distantly   related   proteins  higher PAM or lower BLOSUM  matrices. For database searching the commonly used matrix is BLOSUM62.
T A T G T  G G  A A T G A  Scoring Insertions and Deletions A T G T  - -  A A T G C A A T G T A A T G C A T A T G T G G A A T G A  The creation of a gap is  penalized  with a negative score value. insertion / deletion
1 GTGATAGACACAGACCGGTGGCATTGTGG 29 |||  |  | |||  |  || || | 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29 Why Gap Penalties?   Gaps allowed but not penalized   Score:  88  Gaps not permitted  Score:  0 1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29 ||| || | | | |||  ||  |  |  || || | 1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29 Match = 5 Mismatch = -4
The optimal alignment of two similar sequences is usually that which maximizes  the number of matches and minimizes  the number of gaps. There is a tradeoff between these two - adding gaps reduces mismatches Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of  non-homologous  sequences. Penalizing gaps forces alignments to have relatively few gaps. Why Gap Penalties?
Gap Penalties   How to balance gaps with mismatches? Gaps must get a steep penalty, or else you’ll end up with nonsense alignments. In real sequences, multi-base (or amino acid) gaps are quite common genetic insertion/deletion events “ Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.
Scoring Insertions and Deletions A T G T T A T A C T A T G T G C G T A T A   Total Score:  4 Gap parameters: d  = 3 (gap opening) e  = 0.1 (gap extension) g  = 3 (gap lenght)  (g)  = -3 - (3 -1) 0.1 = -3.2   T A T G T G C G T A T A  A T G T - - - T A T A C insertion / deletion match = 1 mismatch = 0 Total Score: 8 - 3.2 = 4.8
Modification of Gap Penalties 1 V...LSPADKFLTNV 12 |  |||| |  | | 1 VFTELSPA.K..T.V 11 1 ...VLSPADKFLTNV 12 ||||  1 VFTELSPAKTV.... 11 gap opening penalty =  0 gap extension penalty =  0.1 score =  11.3 Score Matrix: BLOSUM62 gap opening penalty =  3 gap extension penalty =  0.1 score =  6.3
BLAST Algorithm Basic Local Alignment Search Tool Fast alignment technique(s) Similar to FASTA algorithms (not used much now) There are more accurate ones, but they’re slower BLAST makes a big use of lookup tables  Idea: statistically significant alignments (hits) Will have regions of at least 3 letters same Or at least high scoring with respect to BLOSUM matrix Based on small local alignments CCNDHRKMTCSPNDNNRK TTNDHRMTACSPDNNNKH CCNDHRKMTCSPNDNNRK YTNHHMMTTYSLDNNNKK more likely than
BLAST Overview Given a query sequence Q Seven main stages Remove (filter) low complexity regions from Q Harvest k-tuples (triples) from Q Expand each triple into ~50 high scoring words Seed a set of possible alignments Generate high scoring pairs (HSPs) from the seeds Test significance of matches from HSPs Report the alignments found from the HSPs
BLAST Algorithm Part 1  Removing Low-complexity Segments Imagine matching  HHHHHHHHKMAY and HHHHHHHHURHD The KMAY and URHD are the interesting parts But this pair score highly using BLOSUM It’s a good idea to remove the HHHHHHHs From the query sequence (low complexity) SEG program does this kind of thing Comes with most BLAST implementations Often doesn’t do much, and it can be turned off
Removing Low-complexity Segments Given a segment of length L With each amino acid occurring n 1  n 2  … n 20  times Use the following measure for “compositional complexity”: To use this measure Slide a “window” of ~12 residues along Query Sequence Q Use a threshold to determine low complexity windows Use a minimise routine to replace the segment With an optimal minimised segment (or just an X)
BLAST Algorithm Part 2 Harvesting k-tuples Collect all the k-tuples of elements in Q k set to 3 for residues and 11 for DNA (can vary) Triples are called ‘words’. Call this set W S T S L S T S D K L M R STS TSL SLS LST
BLAST Algorithm Part 3 Finding High Scoring Triples Given a word w from W Find all other words w’ of same length (3), which: Appear in some database sequence  Blosum(w,w’) > a threshold T Choose T to limit number to around 50 Call these the high scoring triples (words) for w Example: letting w=PQG, set T to be 13 Suppose that PQG, PEG, PSG, PQA are found in database Blosum(PQG,PQG) = 18, Blosum(PQG,PEG) = 15 Blosum(PQG,PSG) = 13, Blosum(PQG,PQA) = 12 Hence, PQG and PEG  only  are kept
Finding High Scoring Triples For each w in W, find all the high scoring words Organise these sets of words  Remembering all the places where w was found in Q Each high scoring triple is going to be a seed  In order to generate possible alignment(s) One seed can generate more than one alignment End of the first half of the algorithm Going to find alignments now
BLAST Algorithm Part 4 Seeding Possible Alignments Look at first triple V in query sequence Q Actually from Q (not from W - which has omissions) Retrieve the set of ~50 high scoring words Call this set H V Retrieve the list of places in Q where V occurs Call this set P V For every pair ( word, pos ) Where  word  is from H V  and  pos  is from P V Find all the database sequences D Which have an  exact  match with  word  at position  pos ’ Store an alignment between Q and D With V   matched at  pos  in Q and  pos ’ in D Repeat this for the second triple in Q, and so on
Seeding Possible Alignments Example Suppose Q = QQGPHUIQEGQQG Suppose V = QQG, H V  = {QQG, QEG} Then P V  = {1, 11}  Suppose we are looking in the database at: D = PKLMMQQGKQEG Then the alignments seeded are: QQGPHUIQEGQQG  word=QQG  QQGPHUIQEGQQG  word=QQG PKLMMQQGKQEG  pos=1  PKLMMQQGKQEG  pos=11 QQGPHUIQEGQQG  word=QEG  QQGPHUIQEGQQG  word=QEG  PKLMMQQGKQEG  pos=1  PKLMMQQGKQEG  pos=11
BLAST Algorithm Part 5 Generating High Scoring Pairs (HSPs) For each alignment A Where sequences Q and D are matched Original region matching was M Extend M to the left Until the Blosum score begins to decrease Extend M to the right Until the Blosum score begins to decrease Larger stretch of sequence now matches May have higher score than the original triple Call these high scoring pairs Throw away any alignments for which the score S of the extended region M is lower than some cutoff score
Extending Alignment Regions Example QQGPHUIQEGQQGKEEDPP  Blosum(QQG,QQG) = 16 PKLMMQQGKQEGM  QQGPHUIQEGQQGKEEDPP  Blosum(QQGK,QQGK) = 21 PKLMMQQGKQEGM  QQGPHUIQEGQQGKEEDPP  Blosum(QQGKE,QQGKQ) = 23 PKLMMQQGKQEGM  QQGPHUIQEGQQGKEEDPP  Blosum(QQGKEE,QQGKQE) = 28 PKLMMQQGKQEGM  QQGPHUIQEGQQGKEEDPP  Blosum(QQGKEED,QQGKQEG) = 27 PKLMMQQGKQEGM  So, the extension to the right stops here  HSP (before left extension) is QQGKEE, scoring 28
BLAST Algorithm Part 6 Checking Statistical Significance Reason we extended alignment regions Give a more accurate picture of the probability of that BLOSUM score occurring by chance Question: is a HSP significant? Suppose we have a HSP such that It scores S for a region of length L in sequences Q & D Then the probability of two random sequences Q’ and D’ scoring S in a region of length L is calculated Where Q’ is same length as Q and D’ is same length as D This probability needs to be low for significance
BLAST Algorithm Part 7 Reporting the Alignments For each statistically significant HSP The alignment is reported If a sequence D has two HSPs with Query Q Two different alignments are reported Later versions of BLAST Try and unify the two alignments

More Related Content

Viewers also liked (9)

PDF
20071007 efficientalgorithms kulikov_lecture03
Computer Science Club
 
PDF
20081019 auctions nikolenko_lecture04
Computer Science Club
 
PDF
20091004 cryptoprotocols nikolenko_lecture04
Computer Science Club
 
PDF
20090920 cryptoprotocols nikolenko_lecture01
Computer Science Club
 
PDF
20071104 verification konev_lecture14
Computer Science Club
 
PDF
20100314 virtualization igotti_lecture06
Computer Science Club
 
PDF
20090215 hardnessvsrandomness itsykson_lecture01
Computer Science Club
 
PDF
20100925 ontology konev_lecture02
Computer Science Club
 
PDF
20100930 proof complexity_hirsch_lecture03
Computer Science Club
 
20071007 efficientalgorithms kulikov_lecture03
Computer Science Club
 
20081019 auctions nikolenko_lecture04
Computer Science Club
 
20091004 cryptoprotocols nikolenko_lecture04
Computer Science Club
 
20090920 cryptoprotocols nikolenko_lecture01
Computer Science Club
 
20071104 verification konev_lecture14
Computer Science Club
 
20100314 virtualization igotti_lecture06
Computer Science Club
 
20090215 hardnessvsrandomness itsykson_lecture01
Computer Science Club
 
20100925 ontology konev_lecture02
Computer Science Club
 
20100930 proof complexity_hirsch_lecture03
Computer Science Club
 

Similar to 20100515 bioinformatics kapushesky_lecture07 (20)

PDF
Basics of bioinformatics
Abhishek Vatsa
 
PPT
Seq alignment
Nagendrasahu6
 
PPTX
4. sequence alignment.pptx
ArupKhakhlari1
 
PPTX
DYNAMIC PROGRAMMING, Bioinformatics.pptx
PraisePedzai
 
PDF
PCB_Lect02_Pairwise_allign (1).pdf
ssusera1eccd
 
PDF
Sequence Alignment
Ravi Gandham
 
PDF
Ch06 alignment
BioinformaticsInstitute
 
PDF
Sequence-analysis-pairwise-alignment.pdf
sriaisvariyasundar
 
PPTX
Sequence alignment
Zeeshan Hanjra
 
PDF
Sequence alignment
Vidya Kalaivani Rajkumar
 
DOCX
Bioinformatics_Sequence Analysis
Sangeeta Das
 
PDF
sequence alignment
ammar kareem
 
PPTX
Sequence alignment for bio informatics.pptx
kavinviswa72
 
PPT
B.sc biochem i bobi u 3.1 sequence alignment
Rai University
 
PPT
B.sc biochem i bobi u 3.1 sequence alignment
Rai University
 
PPTX
Sequence Alignment
Meghaj Mallick
 
PPTX
Sequence alignment unit 3
gyanikashukla
 
PPTX
Bioinformatics life sciences_v2015
Prof. Wim Van Criekinge
 
PPT
seq alignment.ppt
AmandeepKaur836413
 
PPTX
Pairwise Sequence Alignment is alignment.pptx
shalevwachsman
 
Basics of bioinformatics
Abhishek Vatsa
 
Seq alignment
Nagendrasahu6
 
4. sequence alignment.pptx
ArupKhakhlari1
 
DYNAMIC PROGRAMMING, Bioinformatics.pptx
PraisePedzai
 
PCB_Lect02_Pairwise_allign (1).pdf
ssusera1eccd
 
Sequence Alignment
Ravi Gandham
 
Ch06 alignment
BioinformaticsInstitute
 
Sequence-analysis-pairwise-alignment.pdf
sriaisvariyasundar
 
Sequence alignment
Zeeshan Hanjra
 
Sequence alignment
Vidya Kalaivani Rajkumar
 
Bioinformatics_Sequence Analysis
Sangeeta Das
 
sequence alignment
ammar kareem
 
Sequence alignment for bio informatics.pptx
kavinviswa72
 
B.sc biochem i bobi u 3.1 sequence alignment
Rai University
 
B.sc biochem i bobi u 3.1 sequence alignment
Rai University
 
Sequence Alignment
Meghaj Mallick
 
Sequence alignment unit 3
gyanikashukla
 
Bioinformatics life sciences_v2015
Prof. Wim Van Criekinge
 
seq alignment.ppt
AmandeepKaur836413
 
Pairwise Sequence Alignment is alignment.pptx
shalevwachsman
 
Ad

More from Computer Science Club (20)

PDF
20141223 kuznetsov distributed
Computer Science Club
 
PDF
Computer Vision
Computer Science Club
 
PDF
20140531 serebryany lecture01_fantastic_cpp_bugs
Computer Science Club
 
PDF
20140531 serebryany lecture02_find_scary_cpp_bugs
Computer Science Club
 
PDF
20140531 serebryany lecture01_fantastic_cpp_bugs
Computer Science Club
 
PDF
20140511 parallel programming_kalishenko_lecture12
Computer Science Club
 
PDF
20140427 parallel programming_zlobin_lecture11
Computer Science Club
 
PDF
20140420 parallel programming_kalishenko_lecture10
Computer Science Club
 
PDF
20140413 parallel programming_kalishenko_lecture09
Computer Science Club
 
PDF
20140329 graph drawing_dainiak_lecture02
Computer Science Club
 
PDF
20140329 graph drawing_dainiak_lecture01
Computer Science Club
 
PDF
20140310 parallel programming_kalishenko_lecture03-04
Computer Science Club
 
PDF
20140223-SuffixTrees-lecture01-03
Computer Science Club
 
PDF
20140216 parallel programming_kalishenko_lecture01
Computer Science Club
 
PDF
20131106 h10 lecture6_matiyasevich
Computer Science Club
 
PDF
20131027 h10 lecture5_matiyasevich
Computer Science Club
 
PDF
20131027 h10 lecture5_matiyasevich
Computer Science Club
 
PDF
20131013 h10 lecture4_matiyasevich
Computer Science Club
 
PDF
20131006 h10 lecture3_matiyasevich
Computer Science Club
 
PDF
20131006 h10 lecture3_matiyasevich
Computer Science Club
 
20141223 kuznetsov distributed
Computer Science Club
 
Computer Vision
Computer Science Club
 
20140531 serebryany lecture01_fantastic_cpp_bugs
Computer Science Club
 
20140531 serebryany lecture02_find_scary_cpp_bugs
Computer Science Club
 
20140531 serebryany lecture01_fantastic_cpp_bugs
Computer Science Club
 
20140511 parallel programming_kalishenko_lecture12
Computer Science Club
 
20140427 parallel programming_zlobin_lecture11
Computer Science Club
 
20140420 parallel programming_kalishenko_lecture10
Computer Science Club
 
20140413 parallel programming_kalishenko_lecture09
Computer Science Club
 
20140329 graph drawing_dainiak_lecture02
Computer Science Club
 
20140329 graph drawing_dainiak_lecture01
Computer Science Club
 
20140310 parallel programming_kalishenko_lecture03-04
Computer Science Club
 
20140223-SuffixTrees-lecture01-03
Computer Science Club
 
20140216 parallel programming_kalishenko_lecture01
Computer Science Club
 
20131106 h10 lecture6_matiyasevich
Computer Science Club
 
20131027 h10 lecture5_matiyasevich
Computer Science Club
 
20131027 h10 lecture5_matiyasevich
Computer Science Club
 
20131013 h10 lecture4_matiyasevich
Computer Science Club
 
20131006 h10 lecture3_matiyasevich
Computer Science Club
 
20131006 h10 lecture3_matiyasevich
Computer Science Club
 
Ad

20100515 bioinformatics kapushesky_lecture07

  • 1. Pairwise Sequence Alignment Misha Kapushesky Slides: Stuart M. Brown, Fourie Joubert, NYU St. Petersburg Russia 2010
  • 2. Protein Evolution “ For many protein sequences, evolutionary history can be traced back 1-2 billion years” -William Pearson When we align sequences, we assume that they share a common ancestor They are then homologous Protein fold is much more conserved than protein sequence DNA sequences tend to be less informative than protein sequences
  • 3. Definition Homology: related by descent Homologous sequence positions ATTGCGC  ATTGCGC  ATCCGC C ATTGCGC AT-CCGC  ATTGCGC
  • 4. Orthologous and paralogous Orthologous sequences differ because they are found in different species (a speciation event) Paralogous sequences differ due to a gene duplication event Sequences may be both orthologous and paralogous
  • 5. Pairwise Alignment The alignment of two sequences (DNA or protein) is a relatively straightforward computational problem. There are lots of possible alignments. Two sequences can always be aligned. Sequence alignments have to be scored . Often there is more than one solution with the same score.
  • 6. Methods of Alignment By hand - slide sequences on two lines of a word processor Dot plot with windows Rigorous mathematical approach Dynamic programming (slow, optimal) Heuristic methods (fast, approximate) BLAST and FASTA Word matching and hash tables0
  • 7. Align by Hand GATC GCCTA_TTACGTCCTGGAC <-- --> AGGC A TAC G TA_G C CCT TTCGC You still need some kind of scoring system to find the best alignment
  • 8. Percent Sequence Identity The extent to which two nucleotide or amino acid sequences are invariant A C C T G A G – A G A C G T G – G C A G 70% identical mismatch indel
  • 9. Dotplot: A     T     T     C    A     C    A     T     A     T A C A T T A C G T A C Sequence 1 Sequence 2 A dotplot gives an overview of all possible alignments
  • 10. Dotplot: A     T     T     C    A     C    A     T     A     T A C A T T A C G T A C T A C A T T A C G T A C A T A C A C T T A Sequence 1 Sequence 2 One possible alignment: In a dotplot each diagonal corresponds to a possible (ungapped) alignment
  • 11. Insertions / Deletions in a Dotplot T A C T G T C A T T A C T G T T C A T Sequence 1 Sequence 2 T A C T G - T C A T | | | | | | | | | T A C T G T T C A T
  • 12. Hemoglobin  -chain Hemoglobin  -chain Dotplot (Window = 130 / Stringency = 9)
  • 13. Word Size Algorithm T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C T A C G G T A T G A C A G T A T C C T A T  G A C A T A C G G T A T G Word Size = 3 
  • 14. PTHPL ASKTQILPEDLA SEDLTI PTHPL AGERAIGLARLA EEDFGM Score = 7 PTH PLASKTQILPED LASEDLTI PTH PLAGERAIGLAR LAEEDFGM Score = 11  Matrix: PAM250 Window = 12 Stringency = 9 Scoring Matrix Filtering PTHP LASKTQILPEDL ASEDLTI PTHP LAGERAIGLARL AEEDFGM Score = 11  Window / Stringency
  • 15. Dotplot (Window = 18 / Stringency = 10) Hemoglobin  -chain Hemoglobin  -chain
  • 16. Considerations The window/stringency method is more sensitive than the wordsize method (ambiguities are permitted). The smaller the window, the larger the weight of statistical (unspecific) matches. With large windows the sensitivity for short sequences is reduced. Insertions/deletions are not treated explicitly.
  • 17. Alignment methods Rigorous algorithms = Dynamic Programming Needleman-Wunsch (global) Smith-Waterman (local) Heuristic algorithms (faster but approximate) BLAST FASTA
  • 18. The Rocks game N rocks, 2 piles, 2 players Player can Remove 1 rock from either pile Remove 1 rock from each pile Last to remove a rock wins Assume 10 rocks in each pile – winning strategy?
  • 20. Basic principles of dynamic programming - Creation of an alignment path matrix - Stepwise calculation of score values - Backtracking (evaluation of the optimal path)
  • 21. Dynamic Programming Dynamic Programming is a very general programming technique. It is applicable when a large search space can be structured into a succession of stages, such that: the initial stage contains trivial solutions to sub-problems each partial solution in a later stage can be calculated by recurring a fixed number of partial solutions in an earlier stage the final stage contains the overall solution
  • 22. Creation of an alignment path matrix Idea: Build up an optimal alignment using previous solutions for optimal alignments of smaller subsequences Construct matrix F indexed by i and j (one index for each sequence) F ( i,j ) is the score of the best alignment between the initial segment x 1... i of x up to x i and the initial segment y 1... j of y up to y j Build F ( i,j ) recursively beginning with F (0,0) = 0
  • 23.  
  • 24. If F ( i -1, j -1), F ( i -1, j ) and F ( i , j -1) are known we can calculate F ( i , j ) Three possibilities: x i and y j are aligned, F ( i , j ) = F ( i -1, j -1) + s( x i , y j ) x i is aligned to a gap, F ( i , j ) = F ( i -1, j ) - d y j is aligned to a gap, F ( i , j ) = F ( i , j -1) - d The best score up to ( i , j ) will be the largest of the three options Creation of an alignment path matrix
  • 25. H E A G A W G H E E 0 -8 -16 -24 -32 -40 -48 -56 -64 -72 -80 P -8 -2 -9 -17 -25 -33 -42 -49 -57 -65 -73 A -16 -10 -3 -4 -12 -20 -28 -36 -44 -52 -60 W -24 -18 -11 -6 -7 -15 -5 -13 -21 -29 -37 H -32 -14 -18 -13 -8 -9 -13 -7 -3 -11 -19 E -40 -22 -8 -16 -16 -9 -12 -15 -7 3 -5 A -48 -30 -16 -3 -11 -11 -12 -12 -15 -5 2 E -56 -38 -24 -11 -6 -12 -14 -15 -12 -9 1 Backtracking -5 1 - A E E H H G - W W A A G - A P E - H - 0 -25 -5 -20 -13 -3 3 -8 -16 -17 Optimal global alignment: E E
  • 26. Global vs. Local Alignments Global alignment algorithms start at the beginning of two sequences and add gaps to each until the end of one is reached. Local alignment algorithms finds the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
  • 27.  
  • 28. needle (Needleman & Wunsch) creates an end-to-end alignment. Global Alignment Two closely related sequences:
  • 29. Two sequences sharing several regions of local similarity: 1 AGGATTGGAATGCTCAGAAGCAGCTAAAGCGTGTATGCAGGATTGGAATTAAAGAGGAGGTAGACCG.... 67 1 AGGATTGGAATGCTAGGCTTGATTGCCTACCTGTAGCCACATCAGAAGCACTAAAGCGTCAGCGAGACCG 70 |||||||||||||| | | | |||| || | | | || Global Alignment
  • 30. Global Alignment (Needleman-Wunsch) The the Needleman-Wunsch algorithm creates a global alignment over the length of both sequences (needle) Global algorithms are often not effective for highly diverged sequences - do not reflect the biological reality that two sequences may only share limited regions of conserved sequence. Sometimes two sequences may be derived from ancient recombination events where only a single functional domain is shared. Global methods are useful when you want to force two sequences to align over their entire length
  • 31. Local Alignment (Smith-Waterman) Local alignment Identify the most similar sub-region shared between two sequences Smith-Waterman EMBOSS: water
  • 32. Parameters of Sequence Alignment Scoring Systems: Each symbol pairing is assigned a numerical value, based on a symbol comparison table. Gap Penalties: Opening: The cost to introduce a gap Extension: The cost to elongate a gap
  • 33. DNA Scoring Systems -very simple Sequence 1 Sequence 2 A G C T A 1 0 0 0 G 0 1 0 0 C 0 0 1 0 T 0 0 0 1 Match: 1 Mismatch: 0 Score = 5 actaccagttcatttgatacttctcaaa taccattaccgtgttaactgaaaggacttaaagact
  • 34. Protein Scoring Systems PTHPLASKTQILPEDLASEDLTI PTHPLAGERAIGLARLAEEDFGM Sequence 1 Sequence 2 Scoring matrix T:G = -2 T:T = 5 Score = 48 C S T P A G N D . . C 9 S -1 4 T -1 1 5 P -3 -1 -1 7 A 0 1 0 -1 4 G -3 0 -2 -2 0 6 N -3 1 0 -2 -2 0 5 D -3 0 -1 -1 -2 -1 1 6 . . C S T P A G N D . . C 9 S -1 4 T -1 1 5 P -3 -1 -1 7 A 0 1 0 -1 4 G -3 0 -2 -2 0 6 N -3 1 0 -2 -2 0 5 D -3 0 -1 -1 -2 -1 1 6 . .
  • 35. Amino acids have different biochemical and physical properties that influence their relative replaceability in evolution. C P G G A V I L M F Y W H K R E Q D N S T C SH S+S positive charged polar aliphatic aromatic small tiny hydrophobic Protein Scoring Systems
  • 36. Scoring matrices reflect: # of mutations to convert one to another chemical similarity – observed mutation frequencies – the probability of occurrence of each amino acid Widely used scoring matrices: PAM BLOSUM Protein Scoring Systems
  • 37. PAM matrices Family of matrices PAM 80, PAM 120, PAM 250 The number with a PAM matrix represents the evolutionary distance between the sequences on which the matrix is based Greater numbers denote greater distances
  • 38. PAM ( P ercent A ccepted M utations) matrices The numbers of replacements were used to compute a so-called PAM-1 matrix. The PAM-1 matrix reflects an average change of 1% of all amino acid positions. PAM matrices for larger evolutionary distances can be extrapolated from the PAM-1 matrix. PAM250 = 250 mutations per 100 residues. Greater numbers mean bigger evolutionary distance
  • 39. PAM ( P ercent A ccepted M utations) matrices Derived from global alignments of protein families . Family members share at least 85% identity ( Dayhoff et al ., 1978 ). Construction of phylogenetic tree and ancestral sequences of each protein family Computation of number of replacements for each pair of amino acids
  • 40. A R N D C Q E G H I L K M F P S T W Y V B Z A 2 -2 0 0 -2 0 0 1 -1 -1 -2 -1 -1 -3 1 1 1 -6 -3 0 2 1 R -2 6 0 -1 -4 1 -1 -3 2 -2 -3 3 0 -4 0 0 -1 2 -4 -2 1 2 N 0 0 2 2 -4 1 1 0 2 -2 -3 1 -2 -3 0 1 0 -4 -2 -2 4 3 D 0 -1 2 4 -5 2 3 1 1 -2 -4 0 -3 -6 -1 0 0 -7 -4 -2 5 4 C -2 -4 -4 -5 12 -5 -5 -3 -3 -2 -6 -5 -5 -4 -3 0 -2 -8 0 -2 -3 -4 Q 0 1 1 2 -5 4 2 -1 3 -2 -2 1 -1 -5 0 -1 -1 -5 -4 -2 3 5 E 0 -1 1 3 -5 2 4 0 1 -2 -3 0 -2 -5 -1 0 0 -7 -4 -2 4 5 G 1 -3 0 1 -3 -1 0 5 -2 -3 -4 -2 -3 -5 0 1 0 -7 -5 -1 2 1 H -1 2 2 1 -3 3 1 -2 6 -2 -2 0 -2 -2 0 -1 -1 -3 0 -2 3 3 I -1 -2 -2 -2 -2 -2 -2 -3 -2 5 2 -2 2 1 -2 -1 0 -5 -1 4 -1 -1 L -2 -3 -3 -4 -6 -2 -3 -4 -2 2 6 -3 4 2 -3 -3 -2 -2 -1 2 -2 -1 K -1 3 1 0 -5 1 0 -2 0 -2 -3 5 0 -5 -1 0 0 -3 -4 -2 2 2 M -1 0 -2 -3 -5 -1 -2 -3 -2 2 4 0 6 0 -2 -2 -1 -4 -2 2 -1 0 F -3 -4 -3 -6 -4 -5 -5 -5 -2 1 2 -5 0 9 -5 -3 -3 0 7 -1 -3 -4 P 1 0 0 -1 -3 0 -1 0 0 -2 -3 -1 -2 -5 6 1 0 -6 -5 -1 1 1 S 1 0 1 0 0 -1 0 1 -1 -1 -3 0 -2 -3 1 2 1 -2 -3 -1 2 1 T 1 -1 0 0 -2 -1 0 0 -1 0 -2 0 -1 -3 0 1 3 -5 -3 0 2 1 W -6 2 -4 -7 -8 -5 -7 -7 -3 -5 -2 -3 -4 0 -6 -2 -5 17 0 -6 -4 -4 Y -3 -4 -2 -4 0 -4 -4 -5 0 -1 -1 -4 -2 7 -5 -3 -3 0 10 -2 -2 -3 V 0 -2 -2 -2 -2 -2 -2 -1 -2 4 2 -2 2 -1 -1 -1 0 -6 -2 4 0 0 B 2 1 4 5 -3 3 4 2 3 -1 -2 2 -1 -3 1 2 2 -4 -2 0 6 5 Z 1 2 3 4 -4 5 5 1 3 -1 -1 2 0 -4 1 1 1 -4 -3 0 5 6 PAM 250 C -8 17 W W
  • 41. PAM - limitations Based on only one original dataset Examines proteins with few differences (85% identity) Based mainly on small globular proteins so the matrix is biased
  • 42. BLOSUM matrices Different BLOSUM n matrices are calculated independently from BLOCKS (ungapped local alignments) BLOSUM n is based on a cluster of BLOCKS of sequences that share at least n percent identity BLOSUM 62 represents closer sequences than BLOSUM 45
  • 43. Derived from alignments of domains of distantly related proteins ( Henikoff & Henikoff,1992 ). Occurrences of each amino acid pair in each column of each block alignment is counted. The numbers derived from all blocks were used to compute the BLOSUM matrices. A A C E C A - C = 4 A - E = 2 C - E = 2 A - A = 1 C - C = 1 BLOSUM ( Blo cks Su bstitution M atrix) A A C E C
  • 45. BLOSUM ( Blo cks Su bstitution M atrix) Sequences within blocks are clustered according to their level of identity. Clusters are counted as a single sequence. Different BLOSUM matrices differ in the percentage of sequence identity used in clustering. The number in the matrix name (e.g. 62 in BLOSUM62) refers to the percentage of sequence identity used to build the matrix. Greater numbers mean smaller evolutionary distance.
  • 46. PAM Vs. BLOSUM PAM100 = BLOSUM90 PAM120 = BLOSUM80 PAM160 = BLOSUM60 PAM200 = BLOSUM52 PAM250 = BLOSUM45 More distant sequences BLOSUM62 for general use BLOSUM80 for close relations BLOSUM45 for distant relations PAM120 for general use PAM60 for close relations PAM250 for distant relations
  • 47. TIPS on choosing a scoring matrix Generally, BLOSUM matrices perform better than PAM matrices for local similarity searches ( Henikoff & Henikoff, 1993 ). When comparing closely related proteins one should use lower PAM or higher BLOSUM matrices, for distantly related proteins higher PAM or lower BLOSUM matrices. For database searching the commonly used matrix is BLOSUM62.
  • 48. T A T G T G G A A T G A Scoring Insertions and Deletions A T G T - - A A T G C A A T G T A A T G C A T A T G T G G A A T G A The creation of a gap is penalized with a negative score value. insertion / deletion
  • 49. 1 GTGATAGACACAGACCGGTGGCATTGTGG 29 ||| | | ||| | || || | 1 GTGTCGGGAAGAGATAACTCCGATGGTTG 29 Why Gap Penalties? Gaps allowed but not penalized Score: 88 Gaps not permitted Score: 0 1 GTG.ATAG.ACACAGA..CCGGT..GGCATTGTGG 29 ||| || | | | ||| || | | || || | 1 GTGTAT.GGA.AGAGATACC..TCCG..ATGGTTG 29 Match = 5 Mismatch = -4
  • 50. The optimal alignment of two similar sequences is usually that which maximizes the number of matches and minimizes the number of gaps. There is a tradeoff between these two - adding gaps reduces mismatches Permitting the insertion of arbitrarily many gaps can lead to high scoring alignments of non-homologous sequences. Penalizing gaps forces alignments to have relatively few gaps. Why Gap Penalties?
  • 51. Gap Penalties How to balance gaps with mismatches? Gaps must get a steep penalty, or else you’ll end up with nonsense alignments. In real sequences, multi-base (or amino acid) gaps are quite common genetic insertion/deletion events “ Affine” gap penalties give a big penalty for each new gap, but a much smaller “gap extension” penalty.
  • 52. Scoring Insertions and Deletions A T G T T A T A C T A T G T G C G T A T A Total Score: 4 Gap parameters: d = 3 (gap opening) e = 0.1 (gap extension) g = 3 (gap lenght)  (g) = -3 - (3 -1) 0.1 = -3.2 T A T G T G C G T A T A A T G T - - - T A T A C insertion / deletion match = 1 mismatch = 0 Total Score: 8 - 3.2 = 4.8
  • 53. Modification of Gap Penalties 1 V...LSPADKFLTNV 12 | |||| | | | 1 VFTELSPA.K..T.V 11 1 ...VLSPADKFLTNV 12 |||| 1 VFTELSPAKTV.... 11 gap opening penalty = 0 gap extension penalty = 0.1 score = 11.3 Score Matrix: BLOSUM62 gap opening penalty = 3 gap extension penalty = 0.1 score = 6.3
  • 54. BLAST Algorithm Basic Local Alignment Search Tool Fast alignment technique(s) Similar to FASTA algorithms (not used much now) There are more accurate ones, but they’re slower BLAST makes a big use of lookup tables Idea: statistically significant alignments (hits) Will have regions of at least 3 letters same Or at least high scoring with respect to BLOSUM matrix Based on small local alignments CCNDHRKMTCSPNDNNRK TTNDHRMTACSPDNNNKH CCNDHRKMTCSPNDNNRK YTNHHMMTTYSLDNNNKK more likely than
  • 55. BLAST Overview Given a query sequence Q Seven main stages Remove (filter) low complexity regions from Q Harvest k-tuples (triples) from Q Expand each triple into ~50 high scoring words Seed a set of possible alignments Generate high scoring pairs (HSPs) from the seeds Test significance of matches from HSPs Report the alignments found from the HSPs
  • 56. BLAST Algorithm Part 1 Removing Low-complexity Segments Imagine matching HHHHHHHHKMAY and HHHHHHHHURHD The KMAY and URHD are the interesting parts But this pair score highly using BLOSUM It’s a good idea to remove the HHHHHHHs From the query sequence (low complexity) SEG program does this kind of thing Comes with most BLAST implementations Often doesn’t do much, and it can be turned off
  • 57. Removing Low-complexity Segments Given a segment of length L With each amino acid occurring n 1 n 2 … n 20 times Use the following measure for “compositional complexity”: To use this measure Slide a “window” of ~12 residues along Query Sequence Q Use a threshold to determine low complexity windows Use a minimise routine to replace the segment With an optimal minimised segment (or just an X)
  • 58. BLAST Algorithm Part 2 Harvesting k-tuples Collect all the k-tuples of elements in Q k set to 3 for residues and 11 for DNA (can vary) Triples are called ‘words’. Call this set W S T S L S T S D K L M R STS TSL SLS LST
  • 59. BLAST Algorithm Part 3 Finding High Scoring Triples Given a word w from W Find all other words w’ of same length (3), which: Appear in some database sequence Blosum(w,w’) > a threshold T Choose T to limit number to around 50 Call these the high scoring triples (words) for w Example: letting w=PQG, set T to be 13 Suppose that PQG, PEG, PSG, PQA are found in database Blosum(PQG,PQG) = 18, Blosum(PQG,PEG) = 15 Blosum(PQG,PSG) = 13, Blosum(PQG,PQA) = 12 Hence, PQG and PEG only are kept
  • 60. Finding High Scoring Triples For each w in W, find all the high scoring words Organise these sets of words Remembering all the places where w was found in Q Each high scoring triple is going to be a seed In order to generate possible alignment(s) One seed can generate more than one alignment End of the first half of the algorithm Going to find alignments now
  • 61. BLAST Algorithm Part 4 Seeding Possible Alignments Look at first triple V in query sequence Q Actually from Q (not from W - which has omissions) Retrieve the set of ~50 high scoring words Call this set H V Retrieve the list of places in Q where V occurs Call this set P V For every pair ( word, pos ) Where word is from H V and pos is from P V Find all the database sequences D Which have an exact match with word at position pos ’ Store an alignment between Q and D With V matched at pos in Q and pos ’ in D Repeat this for the second triple in Q, and so on
  • 62. Seeding Possible Alignments Example Suppose Q = QQGPHUIQEGQQG Suppose V = QQG, H V = {QQG, QEG} Then P V = {1, 11} Suppose we are looking in the database at: D = PKLMMQQGKQEG Then the alignments seeded are: QQGPHUIQEGQQG word=QQG QQGPHUIQEGQQG word=QQG PKLMMQQGKQEG pos=1 PKLMMQQGKQEG pos=11 QQGPHUIQEGQQG word=QEG QQGPHUIQEGQQG word=QEG PKLMMQQGKQEG pos=1 PKLMMQQGKQEG pos=11
  • 63. BLAST Algorithm Part 5 Generating High Scoring Pairs (HSPs) For each alignment A Where sequences Q and D are matched Original region matching was M Extend M to the left Until the Blosum score begins to decrease Extend M to the right Until the Blosum score begins to decrease Larger stretch of sequence now matches May have higher score than the original triple Call these high scoring pairs Throw away any alignments for which the score S of the extended region M is lower than some cutoff score
  • 64. Extending Alignment Regions Example QQGPHUIQEGQQGKEEDPP Blosum(QQG,QQG) = 16 PKLMMQQGKQEGM QQGPHUIQEGQQGKEEDPP Blosum(QQGK,QQGK) = 21 PKLMMQQGKQEGM QQGPHUIQEGQQGKEEDPP Blosum(QQGKE,QQGKQ) = 23 PKLMMQQGKQEGM QQGPHUIQEGQQGKEEDPP Blosum(QQGKEE,QQGKQE) = 28 PKLMMQQGKQEGM QQGPHUIQEGQQGKEEDPP Blosum(QQGKEED,QQGKQEG) = 27 PKLMMQQGKQEGM So, the extension to the right stops here HSP (before left extension) is QQGKEE, scoring 28
  • 65. BLAST Algorithm Part 6 Checking Statistical Significance Reason we extended alignment regions Give a more accurate picture of the probability of that BLOSUM score occurring by chance Question: is a HSP significant? Suppose we have a HSP such that It scores S for a region of length L in sequences Q & D Then the probability of two random sequences Q’ and D’ scoring S in a region of length L is calculated Where Q’ is same length as Q and D’ is same length as D This probability needs to be low for significance
  • 66. BLAST Algorithm Part 7 Reporting the Alignments For each statistically significant HSP The alignment is reported If a sequence D has two HSPs with Query Q Two different alignments are reported Later versions of BLAST Try and unify the two alignments

Editor's Notes

  • #3: Protein sequencce comparison and protein evolution ISMB 2000