SEQUENCE ALIGNMENT P.S.CHANDRANAND
Objectives
General Terms
What is Alignment?
Basic Concept of Alignment
Rationale Behind Alignment
Types of Alignment
Comparative Analysis
Biological Significance of Gaps
Some Definitions
Similarity: The extent to which nucleotide or protein sequences are related. The similarity between two sequences may be expressed as percent sequence identity and/or conservation.
Identity: The extent to which two (nucleotide or amino acid) sequences are invariant.
Conservation: A change at a specific position of an amino acid sequence (or, less commonly, a DNA sequence) that preserves the physico-chemical properties of the original residue.
Optimal Alignment: An alignment of two sequences with the highest possible score.
Query: The input sequence that is compared against all entries in a database.
Homologous: A conclusion drawn from the data that two genes or sequences have descended from a common ancestor.
Homologous sequences are of two types:
Orthologous: Homologous sequences in different species that arose from a common ancestral gene during speciation.
Paralogous: Homologous sequences within a single species that arose by gene duplication.
What is Alignment?
An explicit mapping between two or more sequences.
Placing one sequence over another so as to obtain maximum similarity.
An alignment may be a sequence alignment or a structural alignment.
WHY IS ALIGNMENT NECESSARY?
We need to be able to compare sequences for similarities and differences.
Often what we are looking for are not exact matches, but similarities.
Similarity is based on biology.
Conserved regions
Some regions tend to be more conserved than others.
Conserved regions (amino acid residues) may suggest which residues are critical for structure or function, BUT the conservation may just be an accident of history.
Similarity vs. homology
SIMILARITY: an observable quantity that can be expressed as % identity or some other suitable measure.
HOMOLOGY: a conclusion drawn from similarity data regarding shared evolutionary history (sequences either are homologous or are not).
E.g. human myoglobin and tuna myoglobin: some similarities can be found.
Proteins of 100% identity (Human & Xenopus Myoglobin)
Sequence 1:
MGLSDGEWQLVLNVWGKVEADIPGHGQEV LIRLFKGHPETLEKFDKFKHLKSEDEMKA SEDLKKHGATVLTALGGILKKKGHHEAEI KPLAQSHATKHKIPVKYLEFISECIIQVL QSKHPGDFGADAQGAMNKALELFRKDMAS NYKELGFQG
Sequence 2:
MGLSDGEWQLVLNVWGKVEADIPGHGQEV LIRLFKGHPETLEKFDKFKHLKSEDEMKA SEDLKKHGATVLTALGGILKKKGHHEAEI KPLAQSHATKHKIPVKYLEFISECIIQVL QSKHPGDFGADAQGAMNKALELFRKDMAS NYKELGFQG
Proteins with similarity (Horse P02188 & Xenopus)
Sequence 1:
MGLSDGEWQLVLNVWGKVEADIPGHGQEV LIRLFKGHPETLEKFDKFKHLKSEDEMKA SEDLKKHGATVLTALGGILKKKGHHEAEI KPLAQSHATKHKIPVKYLEFISECIIQVL QSKHPGDFGADAQGAMNKALELFRKDMAS NYKELGFQG
Sequence 2:
GLSDGEWQQVLNVWGKVEADIAGHGQEV LIRLFTGHPETLEKFDKFKHLKTEAEMKA SEDLKKHGTVVLTALGGILKKKGHHEAEL KPLAQSHATKHKIPIKYLEFISDAIIHVL HSKHPGDFGADAQGAMTKALELFRNDIAA KYKELGFQG
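Percent identity, the similarity measure named on the earlier slides, can be computed directly from two aligned sequences. The sketch below is illustrative only: the function name `percent_identity` is hypothetical, and it assumes the two strings are already aligned to equal length and that a gap column never counts as a match. The test strings are the first 28 residues of the two sequences above, with the leading M of the first sequence dropped so that the columns line up.

```python
def percent_identity(aligned_a: str, aligned_b: str) -> float:
    """Percentage of alignment columns holding the same residue.

    Assumption: both strings are already aligned (equal length, '-' for gaps);
    a column containing a gap is counted as a non-match.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")
    matches = sum(
        1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-"
    )
    return 100.0 * matches / len(aligned_a)


# First 28 aligned residues of the two myoglobin sequences shown above.
first = "GLSDGEWQLVLNVWGKVEADIPGHGQEV"
second = "GLSDGEWQQVLNVWGKVEADIAGHGQEV"
identity = percent_identity(first, second)  # 26 of 28 columns agree
```

Other conventions exist (for example, dividing by the length of the shorter sequence), so the denominator used here is a choice, not a standard.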
Evolutionary Basis
The presumption is that homologous sequences have diverged from a common ancestor.
But we do not have the ancestral sequence, only the raw sequences from living organisms.
Basic Concept of Alignment
First, the two sequences are matched in an arbitrary way, and the quality of the match is expressed as a score.
Then one of the two sequences is moved with respect to the other and the match is scored again.
This process is repeated until the best-scoring alignment is found.
However, once gaps are allowed, the number of possible alignments of two sequences of length N each (e.g. N = 10,000) grows exponentially with N, so scoring every one of them is computationally infeasible.
Thus we look for the optimal alignment through dynamic programming, which needs on the order of N^2 steps.
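The "slide one sequence along the other and score each position" idea can be sketched in a few lines for the gap-free case. This is a minimal illustration, not the method of any particular program: the function name `best_ungapped_offset` is hypothetical, the score is simply the number of identical residues, and the two short sequences are borrowed from the FASTP slide later in this deck.

```python
def best_ungapped_offset(seq1: str, seq2: str) -> tuple[int, int]:
    """Try every relative shift of seq2 against seq1 (no gaps allowed).

    offset = (position in seq1) - (position in seq2) for paired residues.
    Returns (best_offset, number_of_identical_pairs_at_that_offset).
    """
    best_offset, best_score = 0, -1
    for offset in range(-(len(seq2) - 1), len(seq1)):
        score = 0
        for i, residue in enumerate(seq1):
            j = i - offset  # index in seq2 paired with seq1[i]
            if 0 <= j < len(seq2) and residue == seq2[j]:
                score += 1
        if score > best_score:
            best_offset, best_score = offset, score
    return best_offset, best_score


result = best_ungapped_offset("FLWRTWS", "SWKTWT")
```

With only about 2N shifts to try, this gap-free search is cheap; it is the introduction of gaps that makes exhaustive enumeration infeasible and motivates dynamic programming.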
What is the rationale behind alignment?
The resemblance of two DNA sequences taken from different organisms suggests that the sequences have arisen from one common ancestral DNA through mutation and selection, which modify the DNA sequence in specific ways.
The basic mutational processes are of 3 types:
Insertion: adding a base (letter) or several bases to the sequence.
Deletion: removing one or more bases from the sequence.
Substitution: replacing one base of the sequence with another.
An alignment just reflects the probable evolutionary history of the two genes, as it is presumed that the homologous sequences have diverged from a common ancestral sequence through iterative molecular changes.
ALIGNMENT
Pairwise alignment
Multiple alignment
Why pairwise alignment?
Pairwise alignment is used in database searches. BLAST & FASTA are essentially highly optimized versions of local pairwise alignment.
Pairwise alignment is used to compute evolutionary distances, which are used to build phylogenetic trees.
Pairwise alignment is used for sequence assembly in shotgun sequencing.
Pairwise alignment underlies multiple alignment, which is used to find consensus patterns.
Both amino acid sequences and nucleotide sequences are handled in much the same way.
Why multiple sequence alignment?
Incorporation: Organize data to reflect sequence homology.
Phylogeny: Infer phylogenetic trees from homologous sites.
Motif: Highlight conserved sites/regions.
Structure Prediction: Highlight variable sites/regions.
Extrapolation: Uncover changes in gene structure.
Profile: Summarize information.
The process of aligning sequences is a game of playing off gaps against mismatches.
PAIRWISE ALIGNMENT
Pairwise alignment is either global or local.
Global alignment: places the two complete sequences over one another to find maximum similarity, i.e. global alignment algorithms start at the beginning of the two sequences and add gaps to each until the end of one is reached.
Local alignment: looks for maximum similarity within subsequences, i.e. local alignment algorithms find the region (or regions) of highest similarity between two sequences and build the alignment outward from there.
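To make the local idea concrete, here is a small sketch of the best local alignment score between two sequences. It uses the Smith-Waterman recurrence, which is not named in these slides and is swapped in here purely for illustration; the function name `local_alignment_score` and the match/mismatch/gap values are assumptions, not values from the deck.

```python
def local_alignment_score(a: str, b: str,
                          match: int = 2, mismatch: int = -1,
                          gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman recurrence, linear gaps).

    Assumed, illustrative scoring: +2 match, -1 mismatch, -2 per gap position.
    Cell values are floored at 0, so an alignment may start and end anywhere.
    """
    best = 0
    prev = [0] * (len(b) + 1)  # previous row of the score matrix
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


score = local_alignment_score("FLWRTWS", "SWKTWT")
```

For these two sequences the best-scoring region pairs WRTW with WKTW (three matches and one mismatch under the assumed scores); the flanking residues are left out because including them would lower the score.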
Comparative Analysis of Alignment Techniques
Global Alignment
Aligns the entire sequences.
Identifies all conserved residues.
Requires dynamic programming.
Computationally intensive, much slower than local alignment.
E.g. Needleman & Wunsch method, GAP.
Local Alignment
Identifies short conserved sequences.
A complete alignment is not done.
May miss some important conserved residues.
E.g. BLAST, FASTP.
Global vs. Local Alignment
A model for database searching score probabilities
Scores resulting from searching with a query sequence against a database follow the Extreme Value Distribution (EVD) (Gumbel, 1955).
Using the EVD, the raw alignment scores are converted to a statistical score (E value) that takes into account the database amino acid composition and the scoring scheme (amino acid exchange matrix).
Extreme Value Distribution
Probability density function for the extreme value distribution resulting from parameter values $\mu = 0$ and $\lambda = 1$ [$y = 1 - \exp(-e^{-x})$], where $\mu$ is the characteristic value and $\lambda$ is the decay constant.
$y = 1 - \exp(-e^{-\lambda(x-\mu)})$
Extreme Value Distribution (EVD)
An optimal alignment of two sequences is selected out of many suboptimal alignments, and a database search is also about selecting the best alignment(s). This fits well with the EVD, which has a right tail that falls off more slowly than the left tail. Compared to using the normal distribution, when using the EVD an alignment has to score further away from the expected mean value to become a significant hit.
[Figure: real data compared with the EVD approximation]
Extreme Value Distribution
The probability that a score S is at least a given value x can be calculated from the EVD as:
$P(S \geq x) = 1 - \exp(-e^{-\lambda(x-\mu)})$,
where $\mu = (\ln Kmn)/\lambda$, m and n are the lengths of the query and the database, and K is a constant that can be estimated from the background amino acid distribution and scoring matrix (see Altschul and Gish, 1996, for a collection of values for $\lambda$ and K over a set of widely used scoring matrices).
Extreme Value Distribution
Using the equation for $\mu$ (preceding slide), the probability for the raw alignment score S becomes
$P(S \geq x) = 1 - \exp(-Kmn\,e^{-\lambda x})$.
In practice, the probability $P(S \geq x)$ is estimated using the approximation $1 - \exp(-e^{-x}) \approx e^{-x}$, which is valid for large values of x. This leads to a simplification of the equation for $P(S \geq x)$:
$P(S \geq x) \approx e^{-\lambda(x-\mu)} = Kmn\,e^{-\lambda x}$.
The lower the probability (E value) for a given threshold value x, the more significant the score S.
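The two formulas above are easy to evaluate numerically, which also shows how close the approximation is for high scores. The sketch below is illustrative: the function names are hypothetical, and the values used for K, lambda, m and n are placeholders chosen for the example, not parameters of any real scoring matrix or database.

```python
import math


def e_value(x: float, m: int, n: int, K: float, lam: float) -> float:
    """Approximate tail value K*m*n*exp(-lambda*x) from the slide."""
    return K * m * n * math.exp(-lam * x)


def p_value(x: float, m: int, n: int, K: float, lam: float) -> float:
    """Exact EVD tail P(S >= x) = 1 - exp(-K*m*n*exp(-lambda*x))."""
    # expm1 keeps precision when the probability is very small.
    return -math.expm1(-e_value(x, m, n, K, lam))


# Placeholder parameters (assumptions for illustration only).
K, lam = 0.1, 0.3
m, n = 250, 1_000_000

high = (p_value(100, m, n, K, lam), e_value(100, m, n, K, lam))
low = (p_value(50, m, n, K, lam), e_value(50, m, n, K, lam))
```

For the high score the two numbers are practically identical, while for the low score the approximation exceeds 1 and can no longer be read as a probability, which is why the simplification is stated only for large x.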
Normalised sequence similarity: statistical significance
Database searching is commonly performed with an E-value threshold between 0.1 and 0.001.
A low E-value threshold decreases the number of false positives in a database search but increases the number of false negatives, thereby lowering the sensitivity of the search.
FASTP: Local Alignment Tool
The first widely used program, with a method based on lookup tables (Lipman & Pearson, Science (1985) vol 227, 1435-41).
Sequence 1:  F  L  W  R  T  W  S
Sequence 2:  S  W  K  T  W  T
Construction of the Lookup Table
Pos no.      1  2  3  4  5  6  7
Sequence 1   F  L  W  R  T  W  S
Sequence 2   S  W  K  T  W  T

Residue   Position in Seq 1   Position in Seq 2   Offset (p1 - p2)
F         1                   -                   -
L         2                   -                   -
W         3, 6                2, 5                1 (3,2); 1 (6,5); 4 (6,2); -2 (3,5)
R         4                   -                   -
T         5                   4, 6                1 (5,4); -1 (5,6)
S         7                   1                   6 (7,1)
K         -                   3                   -
Calculation of Offset Frequency
Offset   Frequency
 1       3
 4       1
-1       1
-2       1
 6       1

Final Local Alignment
Pos no.      1  2  3  4  5  6  7
Sequence 1   F  L  W  R  T  W  S
Sequence 2   -  S  W  K  T  W  T
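The lookup-table and offset-frequency steps above can be reproduced with a short script. This is a sketch of the idea as presented on these slides (single-residue lookups, 1-based positions, offset = p1 - p2), not a reimplementation of the FASTP program; the function name `offset_frequencies` is hypothetical.

```python
from collections import Counter, defaultdict


def offset_frequencies(seq1: str, seq2: str) -> Counter:
    """Count how often each offset (p1 - p2) occurs among identical residues.

    Positions are 1-based, as in the lookup table on the slide.
    """
    # Lookup table: residue -> list of its positions in seq2.
    positions_in_seq2 = defaultdict(list)
    for p2, residue in enumerate(seq2, start=1):
        positions_in_seq2[residue].append(p2)

    counts = Counter()
    for p1, residue in enumerate(seq1, start=1):
        for p2 in positions_in_seq2.get(residue, []):
            counts[p1 - p2] += 1
    return counts


counts = offset_frequencies("FLWRTWS", "SWKTWT")
best_offset, best_count = counts.most_common(1)[0]
```

The most frequent offset (1, seen three times) says that Sequence 2 should be shifted right by one position, which is exactly the final local alignment shown above.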
NEEDLEMAN-WUNSCH Algorithm
Needleman and Wunsch (1970) provided the first automatic method.
It uses dynamic programming to find a global alignment.
Global alignment is suited to:
sequences that are single-domain
sequences that have not diverged
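A compact sketch of global alignment by dynamic programming, in the spirit of Needleman-Wunsch, may help here. It is a simplified illustration under stated assumptions: a linear gap penalty, and match/mismatch/gap values (+1, -1, -2) chosen only for the example. The function name `needleman_wunsch` and these parameters are not taken from the slides.

```python
def needleman_wunsch(a: str, b: str, match: int = 1,
                     mismatch: int = -1, gap: int = -2):
    """Global alignment with a linear gap penalty (illustrative scores).

    Returns (score, aligned_a, aligned_b).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,  # align two residues
                              score[i - 1][j] + gap,      # gap in b
                              score[i][j - 1] + gap)      # gap in a

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


result = needleman_wunsch("AGGVLIQVG", "AGGVLIIQVG")
```

When several alignments share the best score, which one is returned depends on the order of the checks in the traceback; here diagonal moves are preferred.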
Gaps
What is the biological significance of gaps?
As explained earlier, changes that occur during evolution are categorized into 3 classes: insertion, deletion and substitution.
So regions where the residues of one sequence correspond to nothing in the other are interpreted as the result of either an insertion in one sequence or a deletion from the other.
A gap is a space introduced into an alignment to compensate for insertions and deletions in one sequence relative to another.
Gaps in an alignment are represented as dashes (-).
Gaps
How long may gaps be in an optimal alignment, and how should they be scored?
Some gaps can be introduced into an alignment to compensate for insertions and deletions, but not too many.
To prevent the accumulation of too many gaps, introducing a gap deducts a fixed amount (the gap score) from the alignment score, so gaps occur in an alignment only when really needed.
Every added gap lowers the alignment score, so the gap penalty is always negative.
For example:
Without a gap:
AGGVLIQVG
AGGVLIIQVG
With a gap:
AGGVL-IQVG
AGGVLIIQVG
Gaps
Two types of gap penalties:
Linear gap penalty: the gap opening penalty (G) and the gap extension penalty (L) are the same.
Affine gap penalty: the gap opening penalty is higher than the gap extension penalty.
Thus, for a gap of length n, total deduction = G + (n-1) L.
BLOSUM 62 matrix: -11 gap opening / -1 gap extension
BLOSUM 50 matrix: -12 gap opening / -1 gap extension
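The deduction formula above is simple enough to check by hand, but a tiny function makes the difference between the two penalty types explicit. The function name `gap_deduction` is hypothetical; the formula is the one on this slide, G + (n-1) L, with both penalties given as negative numbers. The linear values (-2, -2) are an assumption for the example; the affine values are the BLOSUM 62 pair quoted above.

```python
def gap_deduction(length: int, gap_open: int, gap_extend: int) -> int:
    """Total deduction for one gap of the given length: G + (n - 1) * L."""
    if length <= 0:
        return 0  # no gap, nothing deducted
    return gap_open + (length - 1) * gap_extend


# Linear penalty: opening and extension are the same (assumed value -2).
linear = [gap_deduction(n, -2, -2) for n in (1, 2, 3)]

# Affine penalty with the BLOSUM 62 values quoted above (-11 / -1).
affine = [gap_deduction(n, -11, -1) for n in (1, 2, 3)]
```

Under the affine scheme, one gap of length 3 costs far less than three separate gaps of length 1, so an aligner is steered toward a few long gaps rather than many scattered ones.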
Summary
An alignment just reflects the probable evolutionary history of the two genes, as it is presumed that the homologous sequences have diverged from a common ancestral sequence through iterative molecular changes.
Changes that occur during evolution are categorized into 3 classes: insertion, deletion and substitution.
Two types of alignment: global alignment and local alignment.
Two types of gap penalties: linear gap penalty and affine gap penalty.
