SlideShare a Scribd company logo
Basic bioinformatics concepts,
                  databases and tools

                                           Module 3
                                   Sequence analysis
                                            Joachim Jacob

                                http://www.bits.vib.be


Updated Feb 2012
http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod3-intro_H1_2012_SeqAn.pdf
In this third module, we will discuss the
possible analyses of sequences
      
          Module 1
             Sequence databases and keyword searching
      
          Module 2
             Sequence similarity
      
          Module 3
             Sequence analysis: types, interpretation, results
In this third module, we will discuss the
possible analyses of sequences
Sequence analysis tries to read
sequences to infer biological properties

       AGCTACTACGGACTACTAGCAGCTACCTCTCTG
       - is this coding sequence?
       - can this sequence bind a certain TF?
       - what is the melting temperature?
       - what is the GC content?
       - does it fold into a stable secondary structure?
       …
Tools that can predict a biological
feature are trained with examples
     Automatic annotation
                              vs.
                 experimentally verified annotations


       - Training dataset of sequences (← exp. verified)
       - An algorithm defines parameters used for
         prediction
       - The algorithm determines/classifies whether the
          sequence(s) contains the feature (→ automatic
          annotation)
The assumption to being able to read
biological function is the central paradigm

         DNA → protein sequence → structure →
          activity (binding, enzymatic activity,
          regulatory,...)
         
             So the premise to do analysis: biological
             function can be read from the (DNA)
             sequence.
         
             Predictions always serve as a basis for further
             experiments.
Analysis can be as simple as measuring
properties or predicting features
          
              Protein
                 −   Metrics (e.g. how many alanines in my seq)
                 −   Modifications and other predictions
                 −   Domains and motifs
          
              DNA/RNA
                 −   Metrics (e.g. how many GC)
                 −   Predicting
                      
                          Gene prediction
                      
                          Promotor
                      
                          Structure
Simple protein sequence analysis

                      
                          One might be interested in:
                          
                              pI (isoelectric point) prediction
                          
                              Composition metrics
                          
                              Hydrophobicity calculation
                          
                              Reverse translation (protein → dna)
                          
                              Occurrence of simple patterns (e.g. does
                              KDL occurs and how many times)
                          
                              ...
                                             http://en.wikipedia.org/wiki/Hydrophobicity_scales
http://www.sigmaaldrich.com/life-science/metabolomics/learning-center/amino-acid-reference-cha
Protein sequence analysis tools are
gathered on Expasy
      http://www.expasy.org/tools (SIB)
      
          Others:
          
              http://www.ebi.ac.uk/Tools/protein.html
          
              http://bioweb.pasteur.fr/protein/intro-en.html
          
              SMS2
Never trust a tool's output blindly
                  
                      Interpreting depends on the kind of output
                      When a prediction result is obtained, the question
                       arises 'Is it true?' (in biological sense)

   
       Programs giving a 'binary'         
                                              Programs giving score/P-
       result: 1 or 0, a hit or a miss.       value result: the chance that
                                              the 'result' is 'not real' → the
       Approach: You should                   lower, the better
        comparing different
        prediction programs for           Approach: asses the p-value
        higher confidence.
                                          E.g. ScanProsite for a motif
       E.g. SignalP for signal
         peptide prediction.
The basis for the prediction of features
is nearly always a sequence alignment
     Based on experimentally verified sequence
       annotations, a multiple sequence alignment
       is constructed




       Different methods exist to capture the information
       gained from this multiple sequence alignment
Alignment reveals similar residues
   which can indicate identical structure

                                                      Same structure,
                                                      hence most likely
                                                      same function




Most protein pairs with more than 25-
30 out of 100 identical residues were                 Chances are that
found to be structurally similar.                     the structure is not
Also proteins with <10% identity can                  the same
have similar structure.
http://peds.oxfordjournals.org/content/12/2/85.long
The structure of a protein sequence
determines his biological function
Number of                 Primary = AA chain
 Reported
structures   Feb 2012: ~ 535 000 in Swissprot
              Secondary = structural entities
             (helix, beta-strands, beta-sheets,
                                         loops)


                                 Tertiary = 3D
                   Nov 2011: ~ 80 000 in PDB


                   Quaternary = interactions

                                          http://en.wikipedia.org/wiki/Protein_structure
Degree of similarity with other
   sequences varies over the length
                                 More conserved

Homologous
Histone H1
protein sequences
Protein sequences can consist of
structurally different parts
       Domain
        part of the tertiary structure of a protein that can exist,
        function and evolve independently of the rest, linked to a
        certain biological function
       Motif
        part (not necessarily contiguous) of the primary structure of a
        protein that corresponds to the signature of a biological
        function. Can be associated with a domain.
       Feature
        part of the sequence for which some annotation has been
        added. Some features correspond to domain or motif
        assignments.
Based on motifs and domains, proteins
are assigned to families
       Nearly synonymous with gene family
       Evolutionary related proteins
       Significant structural similarity of domains is reflected in
          sequence similarity, and is due to a common ancestral
          sequence part, resulting in domain families.
Domains and motifs are represented by
simple and complex methods
                                                    domain




                                                             Gapped alignment
Motif/domain in silico can be represented by
      1. Regular expression / pattern
      2. Frequency matrix / profile
      3. Machine learning techniques : Hidden Markov Model


                  http://bioinfo.uncc.edu/zhx/binf8312/lecture-7-SequenceAnalyses.pdf
Regular expressions / patterns are the
  simplest way to represent motifs
              A representation of all residues with equal probability.


                123456         Position:    1.    2.      3.     4.      5.     6.
                ATPKAE
                KKPKAA                     [AKT] [AKLT]   P    [AK]     [APT] [ADEKT-]
                AKPKAK
                TKPKPA
                AKPKT-
                AKPAAK         ? Does this sequence match:     AKPKTE
                KLPKAD                      V     V       V      V       V      V
                AKPKAA
Consensus:     AKPKAA          ? And this sequence:            KKPETE
                                         V      V         V      X       V      V

                               ? And what about this one:      TLPATE
For every position the most             V      V         V       V       V      V
Frequently occurring residue
Frequency matrices or profiles include
   the chance of observing the residues
               For every position of a motif, a list of all amino acids is
                 made with their frequency. Position-specific
                 weight/scoring matrix or profile. More sensitive way.
                             Profile
                123456       Position:       1.   2.          3.        4.       5.        6.
                ATPKAE
                KKPKAA                A 0.625     0           0         1/8      6/8       3/8
                AKPKAK                D 0         0           0         0        0         1/8
                TKPKPA                E 0         0           0         0        0         1/8
                AKPKT-                K 0.25      6/8         0         7/8      0         2/8
                AKPAAK                L 0         1/8         0         0        0         0
                KLPKAD                P 0         0           1         0        1/8       0
                AKPKAA                T 1/8       1/8         0         0        1/8       0
 Consensus:     AKPKA-                -  0        0           0         0        0         1/8
                                      Sum 1       1           1         1        1         1
                              ? Query: AKPKTE
                              ? Query: KKPETE
                              ? Query: TLPATE
Example: http://expasy.org/prosite/PS51092             http://prosite.expasy.org/prosuser.html#meth2
How good a sequence matches a
 profile is reported with a score


                      PSWM: scores
             123456   Position:       1.         2.       3.       4.       5.       6.
             ATPKAE
             KKPKAA               A   2.377      -2.358   -2.358   0.257    2.631    1.676
             AKPKAK               D   -2.358     -2.358   -2.358   -2.358   -2.358   0.257
             TKPKPA               E   -2.358     -2.358   -2.358   -2.358   -2.358   0.257
             AKPKT-               K   1.134      2.631    -2.358   2.847    -2.358   1.134
             AKPAAK               L   -2.358     0.257    -2.358   -2.358   -2.358   -2.358
                                  P   -2.358     -2.358   0.257    -2.358   0.257    -2.358
             KLPKAD               T   0.257      0.257    -2.358   -2.358   0.257    -2.358
             AKPKAA
Consensus:   AKPKA-
                      ? Query: AKPKTE            Score = 11.4
                      ? Query: KKPETE            Score = 5.0
                      ? Query: TLPATE            Score = 4.3

                                               http://prosite.expasy.org/prosuser.html#meth2
A hidden Markov Model takes also into
account the gaps in an alignment
     The schematic representation of a HMM




           http://www.myoops.org/twocw/mit/NR/rdonlyres/Electrical-Engineering-
Building a HMM from a multiple
sequence alignment
Use HMMER to very sensitively search
protein database with a HMM




            You can search with a profile in a
                  sequence database
Some profile adjustments to the BLAST
  protocol exist for particular purposes
                PSI-BLAST to identify distantly related proteins
                    PSI-BLAST (position specific iterated)
                    After a search result, a profile is made of the similar
                     sequences, and this is used again to search a database
                PHI-BLAST protein with matching of a pattern
                    PHI-BLAST (pattern hit initiated): you provide a pattern,
                     which all BLAST results should satisfy.
                CSI-BLAST is more sensitive than PSI-BLAST in
                 identifying distantly related proteins

PSI BLAST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=b
PHI BLAST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=b
CSI BLAST http://toolkit.tuebingen.mpg.de/cs_blast
Many databases exist that keep patterns,
profiles or models related to function
   
       Motif / domain databases (see NCBI bookshelf for good overview)
       
           http://www.ebi.ac.uk/interpro/ - integrated db
       
           http://expasy.org/prosite/ (motifs)
       
           PFAM – hidden markov profiles (domains)

       
           CDD (Conserved domains database) (NCBI - integrated)
       
           Prodom (domain) (automatic extraction)
       
           SMART (domain)
       
           PRINTS (motif) sets of local alignments without gaps, used as frequency
           matrices, made by searching manually made "seed alignments" against
           UniProt sequences
Prosite is a database gathering
patterns from sequence alignments
         
             ScanProsite tool : search the
             prosite database for a pattern ( present or
             not )
         Example : [DE](2)-H-S-{P}-x(2)-P-x(2,4)-C>


         You can retrieve sequences which correspond to a pattern,
           you made up yourself, observed in an alignment or an
           known one. The syntax is specific, but not difficult: see
           link below!




        http://prosite.expasy.org/scanprosite/scanprosite-doc.html#pattern_syntax
Interpro classifies the protein data into
families based on the domain and motifs
            Interpro takes all existing motif and domains
              databases as input ('signatures'), and aligns
              them to create protein domain families. This
              reduces redundancy. Each domain is than
              given an identifier IPRxxxxxxx.

            Uneven size of motifs and families between
              families are handled by 'relations' :
                parent - child and contains - found in
                   Families,...                    Regions, domains, ...




                            http://www.ebi.ac.uk/interpro/user_manual.html#type
Interpro summarizes domains and motifs
from a dozen of domain databases




                            http://www.ebi.ac.uk/interpro/databases.html
             ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/README.html#2
InterPro entries are grouped in types
        
            Family
             Entries span complete sequence
        
            Domain
             Biologically functional units
        
            Repeat
        
            Region
        
            Conserved site
        
            Active site
        
            Binding site
        
            PTM site
InterPro entries are grouped in types
You can search your sequence for
known domains on InterProScan




                 Interproscan http://www.ebi.ac.uk/Tools/pfa/iprscan/
A sequence logo provides a visual
summary of a motif
      Creating a sequence logo
        Create a nicely looking logo of a motif sequence:
          size of letters indicated frequency.
        
            Weblogo - a basic web application to create
            colorful logo's
        
            IceLogo - a powerful web application to create
            customized logo's
A sequence logo provides a visual
   summary of a motif
               iceLogo

                123456
                ATPKAE
                KKPKAA
                AKPKAK
                TKPKPA
                AKPKT-
                AKPAAK
                KLPKAD
                AKPKAA
 Consensus:     AKPKA-




http://www.bits.vib.be/wiki/index.php/Exercises_on_multiple_sequence_alignment#Sequence_logo
There is always a chance that a prediction
of a feature by a tool is false


Number                                         Number
  of                                             of
matches                                        matches


          True negatives                                 True negatives



                                   True positives                              True positives

                                     Score                                             Score
                       Threshold                                       Threshold
                                                     False negatives                 False positives


              Ideal situation                              Reality of the databases
Assessing the performance of categorizing
  tools with sensitivity and specificity
                                                 PREDICTION
  “Confusion matrix”                Feature is          Feature is
                                    predicted          NOT predicted



                                       True                 False
Sequence contains feature             positive            Negatives
                                                           “Type I
                                                            error”
  TRUTH


                                       False                    True
Sequence does NOT contain feature     positive                negative
                                      “Type II
                                       error”
Assessing the performance of categorizing
  tools with sensitivity and specificity
                                                  PREDICTION
  “Confusion matrix”                 Feature is          Feature is
                                     predicted          NOT predicted



                                          Sensitivity
Sequence contains feature           True positives/(TP + FN)


  TRUTH


                                        False                    True
Sequence does NOT contain feature      positive                negative
Assessing the performance of categorizing
  tools with sensitivity and specificity
                                             PREDICTION
  “Confusion matrix”                Feature is      Feature is
                                    predicted      NOT predicted




Sequence contains feature

  TRUTH

                                                    Selectivity
Sequence does NOT contain feature
                                                   or Specificity
                                                   TN/(FP + TN)
Assessing the performance of categorizing
  tools with sensitivity and specificity
                                                    PREDICTION
  “Confusion matrix”                  Feature is                Feature is
                                      predicted                NOT predicted




Sequence contains feature
                                                    error rate*
  TRUTH
                                                   FP+FN/total

Sequence does NOT contain feature




                                    * misclassification rate
Assessing the performance of categorizing
  tools with sensitivity and specificity
                                             PREDICTION
  “Confusion matrix”                Feature is      Feature is
                                    predicted      NOT predicted




Sequence contains feature
                                            Accuracy
  TRUTH
                                           TP+TN/total

Sequence does NOT contain feature
Protein sequences can be searched for
  potential modifications
                
                    http://www.expasy.org/tools/
                    
                        e.g. modification (phosphorylation, acetylation,...)


                To deal with the confidence in the results, try different
                  tools, and make a graph (venn diagram) to compare the
                  results
                    
                        E.g. predict secreted proteins by signalP and
                        RPSP, combine results in Venn
                         −   http://bioinformatics.psb.ugent.be/webtools/Venn/
                         −   http://www.cmbi.ru.nl/cdd/biovenn/

Overview SignalPeptide prediction tools: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788353/
Protein sequences can be searched for
secondary structural elements
     Based on know structures, machine learning
       models of secondary structure elements
       are made and can be searched for.
      See http://bioinf.cs.ucl.ac.uk/psipred/
In case of multiple analyses on multiple
sequences, mark instead of filter

                    Starting set of sequences

Worse                                                                             Better
Analysis filter 1
                                                           Analysis filter 1
                                                           Analysis filter 2
                                                           Analysis filter 3

Analysis filter 2




Analysis filter 3

                              After performing all analyses on all sequences, different filters
                              on the results can be applied (e.g. secreted sequence,
                    !         phosphorylated and containing a motif)
NA sequences
NA sequence analyses
        GC% http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::geecee
        Melting temperature
        For primer development, such as with Primer3

        Structure
        Codon usage
              Codon usage table with cusp
              Codon adaptation index calculation with cai
        ...

        A lot of tools can be found at the Mobyle Portal:
Profiles and models are being used to
model biological function in NA seqs
         To detect Transcription factor binding sites
         TRANSFAC : commercial (BIOBASE, Wolfenbüttel, Germany), started as
           work of Edgard Wingender, contains eukaryotic binding sites as
           consensus sequences and as PSSMs. Also TRANSCompel with modules of
           binding sites.
         ooTFD : commercial (IFTI, Pittsburgh PA, USA), started as work of David
            Gosh, contains prokaryotic and eukaroytic binding sites as consensus
            sequences and as PSSMs.
         JASPAR : open access, only representative sets of higher eukaryote binding
            sites as PSSMs. Can be searched against sequence or sequence pair at
            Consite.
         OregAnno : open access, collection of individual eukaryotic binding sites
            with their localization in the genome
         PAZAR : collection of open access TF databanks
Sequence logos can give an insight in
the important residues of binding sites

    DNA: an entry from JASPAR: tata box




     A   [ 61 16 352   3 354 268 360 222 155 56 83 82 82 68 77 ]
     C   [145 46   0 10    0   0   3   2 44 135 147 127 118 107 101 ]
     G   [152 18   2   2   5   0 10 44 157 150 128 128 128 139 140 ]
     T   [ 31 309 35 374 30 121    6 121 33 48 31 52 61 75 71 ]
The RNA world has the Vienna servers


          http://rna.tbi.univie.ac.at/
             −   secondary structure prediction of
                 ribosomal sequences
             −   siRNA design
RNA families can be modeled by
conserved bases and structure
     
         RNA motifs (http://rfam.sanger.ac.uk/search)
     Rfam is a databank of RNA motifs and families. It is
       made at the Sanger Centre (Hinxton, UK), from a
       subset of EMBL (well-annotated standard
       sequences excluding synthetic sequences + the
       WGS) using the INFERNAL suite of Soan Eddy. It
       contains local alignments with gaps with included
       secondary structure annotation + CMs.
Some interesting links

    Nucleic acid structure
      Unafold - Program accessible through webinterface
      After designing primers, you might want to check
      whether the primer product does (not) adapt a stable
      secondary structure.
    Some collections of links
          −   Good overview at http://www.imb-jena.de/RNA.html
          −   European Ribosomale RNA database (VIB PSB)
Prediction of genes in genomes rely on
the integration of multiple signals
        Signals surrounding the gene (transcription factor binding
            sites, promoters, transcription terminators, splice
            sites, polyA sites, ribosome binding sites,...)
        → profile matching
        Differences in composition between coding and
            noncoding DNA (codon preference), the presence of
            an Open Reading Frame (ORF)
        → compositional analyses
        Similarity with known genes, aligning ESTs and (in
            translation) similarity with known proteins and the
            presence of protein motifs
        → similarity searches
Prediction of genes in genomes rely on
the integration of multiple signals
                    Signals
                    Composition
                    Similarity

e.g. potential methylation sites (profiles)
                                              Alignment of ESTs




GC
Software for prediction genes
          EMBOSS
                 −   simple software under EMBOSS : syco (codon frequency), wobble (%GC 3rd base),
                     tcode (Ficket
                     statistic : correlation between bases at distance 3)
          Examples of software using HMM model of gene :
          Wise2 : using also similarity with known proteins
           http://www.ebi.ac.uk/Tools/Wise2
          GENSCAN : commercial (Chris Burge, Stanford U.) but free for academics, has
            models for human/A. thaliana/maize, used at EBI and NCBI for genome
            annotation http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::genscan
          GeneMark : commercial (GeneProbe, Atlanta GA, USA) but free for academic
            users, developed by Mark Borodovsky, has models for many prokaryotic and
            eukaryotic organisms http://exon.gatech.edu




Tutorial on gene prediction http://www.embl.de/~seqanal/courses/spring00/GenePred.00.html
Short addendum about downloading files
    FTP, e.g. ftp://ftp.ebi.ac.uk/pub/databases/interpro/

           –   'file transfer protocol'
           –   Most browsers have integrated ftp 'client'
           –   Free, easy to download files, possibility to
               resume after fails
    HTTP, e.g. http://www.ncbi.nlm.nih.gov/entrez
           Standard protocol for internet traffic,
           Slowest method
     Aspera – for large datasets (>10GB) downloads
           In use in the short read archive (SRA)
           Fastest method available currently
Conclusion

          Prediction vs. experimental verified
              Different algorithms need to be compared
              Predictions need to be validated by independent
               method


              Software <-> Databases
              Questions? Get social!
                → www.seqanswers.com
                → http://biostar.stackexchange.com
              Always only basis for further wet-lab research
Summary   In this third module, we will discuss the possible analyses of sequences
          Sequence analysis tries to read sequences to infer biological properties
          Tools that can predict a biological feature are trained with examples
          The assumption to being able to read biological function is the central paradigm
          Analysis can be as simple as measuring properties or predicting features
          Protein sequence analysis tools are gathered on Expasy
          Never trust a tool's output blindly
          The basis for the prediction of features is nearly always a sequence alignment
          Alignment reveals similar residues which can indicate identical structure
          The structure of a protein sequence determines his biological function
          Degree of similarity with other sequences varies over the length
          Protein sequences can consist of structurally different parts
          Based on motifs and domains, proteins are assigned to families
          Domains and motifs are represented by simple and complex methods
          Regular expressions / patterns are the simplest way to represent motifs
          Frequency matrices or profiles include the chance of observing the residues
          How good a sequence matches a profile is reported with a score
          A hidden Markov Model takes also into account the gaps in an alignment
          Use HMMER to very sensitively search protein database with a HMM
          Some profile adjustments to the BLAST protocol exist for particular purposes
          Many databases exist that keep patterns, profiles or models related to function
          Prosite is a database gathering patterns from sequence alignments
          Interpro classifies the protein data into families based on the domain and motifs
          Interpro summarizes domains and motifs from a dozen of domain databases
          You can search your sequence for known domains on InterProScan
          A sequence logo can provide a visual summary of a motif
          Protein sequences can be searched for potential modifications
          Protein sequences can be searched for secondary structural elements
          In case of multiple analyses on multiple sequences, mark instead of filter
          Profiles and models are being used to model biological function in NA seqs
          Sequence logos can give an insight in the important residues of binding sites
          The RNA world has the Vienna servers
          RNA families can be modeled by conserved bases and structure
          Prediction of genes in genomes rely on the integration of multiple signals

More Related Content

PPT
Clustal
Benittabenny
 
PPTX
Functional genomics
ajay301
 
PDF
Introduction to next generation sequencing
VHIR Vall d’Hebron Institut de Recerca
 
PPTX
Clustal W - Multiple Sequence alignment
The Oxford College Engineering
 
PPTX
Kegg databse
Rashi Srivastava
 
PPTX
Protein fold recognition and ab_initio modeling
Bioinformatics and Computational Biosciences Branch
 
Clustal
Benittabenny
 
Functional genomics
ajay301
 
Introduction to next generation sequencing
VHIR Vall d’Hebron Institut de Recerca
 
Clustal W - Multiple Sequence alignment
The Oxford College Engineering
 
Kegg databse
Rashi Srivastava
 
Protein fold recognition and ab_initio modeling
Bioinformatics and Computational Biosciences Branch
 

What's hot (20)

PPT
Genome annotation 2013
Karan Veer Singh
 
PPTX
Scoring schemes in bioinformatics (blosum)
SumatiHajela
 
PPTX
sequence of file formats in bioinformatics
nadeem akhter
 
PPT
Phylogenetic analysis
National Institute of Biologics
 
PPTX
Sequence database
Dr.M.Prasad Naidu
 
PPTX
Kegg
msfbi1521
 
PPTX
Chou fasman algorithm for protein structure prediction
Roshan Karunarathna
 
PPT
Protein protein interaction
Aashish Patel
 
PPTX
Dynamic programming
Zohaib HUSSAIN
 
PPTX
Sequence alig Sequence Alignment Pairwise alignment:-
naveed ul mushtaq
 
PDF
NGS: Mapping and de novo assembly
Bioinformatics and Computational Biosciences Branch
 
PPTX
Phylogenetic data analysis
Md. Dilshad karim
 
PPTX
Bioinformatics introduction
Hafiz Muhammad Zeeshan Raza
 
PPTX
Comparative genomics
Jajati Keshari Nayak
 
PPTX
Entrez databases
Hafiz Muhammad Zeeshan Raza
 
PPT
RNA secondary structure prediction
Muhammed sadiq
 
PPTX
How to submit a sequence in NCBI
Minhaz Ahmed
 
PPT
Biological databases
Prasanthperceptron
 
PPTX
Sequence comparison techniques
ruchibioinfo
 
PPT
Protein structure prediction (1)
Sabahat Ali
 
Genome annotation 2013
Karan Veer Singh
 
Scoring schemes in bioinformatics (blosum)
SumatiHajela
 
sequence of file formats in bioinformatics
nadeem akhter
 
Phylogenetic analysis
National Institute of Biologics
 
Sequence database
Dr.M.Prasad Naidu
 
Kegg
msfbi1521
 
Chou fasman algorithm for protein structure prediction
Roshan Karunarathna
 
Protein protein interaction
Aashish Patel
 
Dynamic programming
Zohaib HUSSAIN
 
Sequence alig Sequence Alignment Pairwise alignment:-
naveed ul mushtaq
 
NGS: Mapping and de novo assembly
Bioinformatics and Computational Biosciences Branch
 
Phylogenetic data analysis
Md. Dilshad karim
 
Bioinformatics introduction
Hafiz Muhammad Zeeshan Raza
 
Comparative genomics
Jajati Keshari Nayak
 
Entrez databases
Hafiz Muhammad Zeeshan Raza
 
RNA secondary structure prediction
Muhammed sadiq
 
How to submit a sequence in NCBI
Minhaz Ahmed
 
Biological databases
Prasanthperceptron
 
Sequence comparison techniques
ruchibioinfo
 
Protein structure prediction (1)
Sabahat Ali
 
Ad

Viewers also liked (20)

PPT
Sequence Alignment In Bioinformatics
Nikesh Narayanan
 
DOC
Protein databases
bansalaman80
 
PDF
Basics of bioinformatics
Abhishek Vatsa
 
PPT
Bioinformatics
JTADrexel
 
PPT
BITs: Genome browsers and interpretation of gene lists.
BITS
 
PDF
BITS: Basics of sequence databases
BITS
 
PDF
BITS: Basics of Sequence similarity
BITS
 
PDF
BITS: Overview of important biological databases beyond sequences
BITS
 
PPTX
Interview with NCBI Staff Scientist Carol Scott
wookyluvr
 
PPT
L01 ecture 01-
MUBOSScz
 
PDF
Global Bioinformatics Market
David2591
 
PPT
BITS: Introduction to Linux - Software installation the graphical and the co...
BITS
 
PDF
Genevestigator
BITS
 
PDF
Lokala banksystem utan vinstkrav - för tillväxt och hållbar utveckling
Jonas Lagander
 
PDF
Projekt sociala ekonomin i motala - slutrapport 2015
Jonas Lagander
 
PDF
Besök kimstad rapport förstudie
Jonas Lagander
 
PPTX
BITS training - UCSC Genome Browser - Part 2
BITS
 
PPTX
Smart Print & Hybrid Database
akipower
 
PPTX
Bioinformatics t2-databases v2014
Prof. Wim Van Criekinge
 
Sequence Alignment In Bioinformatics
Nikesh Narayanan
 
Protein databases
bansalaman80
 
Basics of bioinformatics
Abhishek Vatsa
 
Bioinformatics
JTADrexel
 
BITs: Genome browsers and interpretation of gene lists.
BITS
 
BITS: Basics of sequence databases
BITS
 
BITS: Basics of Sequence similarity
BITS
 
BITS: Overview of important biological databases beyond sequences
BITS
 
Interview with NCBI Staff Scientist Carol Scott
wookyluvr
 
L01 ecture 01-
MUBOSScz
 
Global Bioinformatics Market
David2591
 
BITS: Introduction to Linux - Software installation the graphical and the co...
BITS
 
Genevestigator
BITS
 
Lokala banksystem utan vinstkrav - för tillväxt och hållbar utveckling
Jonas Lagander
 
Projekt sociala ekonomin i motala - slutrapport 2015
Jonas Lagander
 
Besök kimstad rapport förstudie
Jonas Lagander
 
BITS training - UCSC Genome Browser - Part 2
BITS
 
Smart Print & Hybrid Database
akipower
 
Bioinformatics t2-databases v2014
Prof. Wim Van Criekinge
 
Ad

Similar to BITS: Basics of sequence analysis (20)

PPT
Bioinformatica 08-12-2011-t8-go-hmm
Prof. Wim Van Criekinge
 
PPTX
Bioinformatica t8-go-hmm
Prof. Wim Van Criekinge
 
PPTX
Mapping protein to function
Abhik Seal
 
PDF
Signals of Evolution: Conservation, Specificity Determining Positions and Coe...
Asociación Argentina de Bioinformática y Biología Computacional
 
PPTX
Informal presentation on bioinformatics
Atai Rabby
 
PDF
Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and I...
Asociación Argentina de Bioinformática y Biología Computacional
 
PPT
Project report-on-bio-informatics
Daniela Rotariu
 
PPTX
Bioinformatic, and tools by kk sahu
KAUSHAL SAHU
 
PPTX
Bioinformatics life sciences_v2015
Prof. Wim Van Criekinge
 
PPTX
Bioinformatica t3-scoring matrices
Prof. Wim Van Criekinge
 
PDF
Protein Structure Alignment and Comparison
Natalio Krasnogor
 
PPT
Protein Evolution and Sequence Analysis.ppt
Francis de Castro
 
PPTX
Bioinformatica t7-protein structure
Prof. Wim Van Criekinge
 
PDF
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
sipij
 
PDF
Protein function and bioinformatics
Neil Saunders
 
PDF
57 bio infomark
phdcao
 
PDF
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
ijcseit
 
PPTX
Bioinformatics
Arockiyajainmary
 
PPTX
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
SHEETHUMOLKS
 
PPT
Subtypes of Associated Protein-DNA (Transcription Factor-Transcription Factor...
Cyrus Chan
 
Bioinformatica 08-12-2011-t8-go-hmm
Prof. Wim Van Criekinge
 
Bioinformatica t8-go-hmm
Prof. Wim Van Criekinge
 
Mapping protein to function
Abhik Seal
 
Signals of Evolution: Conservation, Specificity Determining Positions and Coe...
Asociación Argentina de Bioinformática y Biología Computacional
 
Informal presentation on bioinformatics
Atai Rabby
 
Discovery Of Functional Protein Linear Motifs Using a Greaddy Algorithm and I...
Asociación Argentina de Bioinformática y Biología Computacional
 
Project report-on-bio-informatics
Daniela Rotariu
 
Bioinformatic, and tools by kk sahu
KAUSHAL SAHU
 
Bioinformatics life sciences_v2015
Prof. Wim Van Criekinge
 
Bioinformatica t3-scoring matrices
Prof. Wim Van Criekinge
 
Protein Structure Alignment and Comparison
Natalio Krasnogor
 
Protein Evolution and Sequence Analysis.ppt
Francis de Castro
 
Bioinformatica t7-protein structure
Prof. Wim Van Criekinge
 
A Frequency Domain Approach to Protein Sequence Similarity Analysis and Funct...
sipij
 
Protein function and bioinformatics
Neil Saunders
 
57 bio infomark
phdcao
 
HMM’S INTERPOLATION OF PROTIENS FOR PROFILE ANALYSIS
ijcseit
 
Bioinformatics
Arockiyajainmary
 
STRUCTURAL GENOMICS, FUNCTIONAL GENOMICS, COMPARATIVE GENOMICS
SHEETHUMOLKS
 
Subtypes of Associated Protein-DNA (Transcription Factor-Transcription Factor...
Cyrus Chan
 

More from BITS (20)

PDF
RNA-seq for DE analysis: detecting differential expression - part 5
BITS
 
PDF
RNA-seq for DE analysis: extracting counts and QC - part 4
BITS
 
PDF
RNA-seq for DE analysis: the biology behind observed changes - part 6
BITS
 
PDF
RNA-seq: analysis of raw data and preprocessing - part 2
BITS
 
PDF
RNA-seq: general concept, goal and experimental design - part 1
BITS
 
PDF
RNA-seq: Mapping and quality control - part 3
BITS
 
PDF
Productivity tips - Introduction to linux for bioinformatics
BITS
 
PDF
Text mining on the command line - Introduction to linux for bioinformatics
BITS
 
PDF
The structure of Linux - Introduction to Linux for bioinformatics
BITS
 
PDF
Managing your data - Introduction to Linux for bioinformatics
BITS
 
PDF
Introduction to Linux for bioinformatics
BITS
 
PDF
BITS - Genevestigator to easily access transcriptomics data
BITS
 
PDF
BITS - Comparative genomics: the Contra tool
BITS
 
PDF
BITS - Comparative genomics on the genome level
BITS
 
PDF
BITS - Comparative genomics: gene family analysis
BITS
 
PDF
BITS - Introduction to comparative genomics
BITS
 
PDF
BITS - Protein inference from mass spectrometry data
BITS
 
PDF
BITS - Overview of sequence databases for mass spectrometry data analysis
BITS
 
PDF
BITS - Search engines for mass spec data
BITS
 
PDF
BITS - Introduction to proteomics
BITS
 
RNA-seq for DE analysis: detecting differential expression - part 5
BITS
 
RNA-seq for DE analysis: extracting counts and QC - part 4
BITS
 
RNA-seq for DE analysis: the biology behind observed changes - part 6
BITS
 
RNA-seq: analysis of raw data and preprocessing - part 2
BITS
 
RNA-seq: general concept, goal and experimental design - part 1
BITS
 
RNA-seq: Mapping and quality control - part 3
BITS
 
Productivity tips - Introduction to linux for bioinformatics
BITS
 
Text mining on the command line - Introduction to linux for bioinformatics
BITS
 
The structure of Linux - Introduction to Linux for bioinformatics
BITS
 
Managing your data - Introduction to Linux for bioinformatics
BITS
 
Introduction to Linux for bioinformatics
BITS
 
BITS - Genevestigator to easily access transcriptomics data
BITS
 
BITS - Comparative genomics: the Contra tool
BITS
 
BITS - Comparative genomics on the genome level
BITS
 
BITS - Comparative genomics: gene family analysis
BITS
 
BITS - Introduction to comparative genomics
BITS
 
BITS - Protein inference from mass spectrometry data
BITS
 
BITS - Overview of sequence databases for mass spectrometry data analysis
BITS
 
BITS - Search engines for mass spec data
BITS
 
BITS - Introduction to proteomics
BITS
 

Recently uploaded (20)

PPTX
Tips Management in Odoo 18 POS - Odoo Slides
Celine George
 
PPTX
Basics and rules of probability with real-life uses
ravatkaran694
 
PPTX
Dakar Framework Education For All- 2000(Act)
santoshmohalik1
 
PPTX
Artificial-Intelligence-in-Drug-Discovery by R D Jawarkar.pptx
Rahul Jawarkar
 
PPTX
Measures_of_location_-_Averages_and__percentiles_by_DR SURYA K.pptx
Surya Ganesh
 
PPTX
CONCEPT OF CHILD CARE. pptx
AneetaSharma15
 
PDF
Module 2: Public Health History [Tutorial Slides]
JonathanHallett4
 
PDF
Health-The-Ultimate-Treasure (1).pdf/8th class science curiosity /samyans edu...
Sandeep Swamy
 
DOCX
pgdei-UNIT -V Neurological Disorders & developmental disabilities
JELLA VISHNU DURGA PRASAD
 
PPTX
HISTORY COLLECTION FOR PSYCHIATRIC PATIENTS.pptx
PoojaSen20
 
PPTX
Command Palatte in Odoo 18.1 Spreadsheet - Odoo Slides
Celine George
 
DOCX
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
PPTX
How to Close Subscription in Odoo 18 - Odoo Slides
Celine George
 
PPTX
Continental Accounting in Odoo 18 - Odoo Slides
Celine George
 
PPTX
Care of patients with elImination deviation.pptx
AneetaSharma15
 
PPTX
CARE OF UNCONSCIOUS PATIENTS .pptx
AneetaSharma15
 
PPTX
Virus sequence retrieval from NCBI database
yamunaK13
 
PDF
The Minister of Tourism, Culture and Creative Arts, Abla Dzifa Gomashie has e...
nservice241
 
DOCX
Modul Ajar Deep Learning Bahasa Inggris Kelas 11 Terbaru 2025
wahyurestu63
 
PPTX
Introduction to pediatric nursing in 5th Sem..pptx
AneetaSharma15
 
Tips Management in Odoo 18 POS - Odoo Slides
Celine George
 
Basics and rules of probability with real-life uses
ravatkaran694
 
Dakar Framework Education For All- 2000(Act)
santoshmohalik1
 
Artificial-Intelligence-in-Drug-Discovery by R D Jawarkar.pptx
Rahul Jawarkar
 
Measures_of_location_-_Averages_and__percentiles_by_DR SURYA K.pptx
Surya Ganesh
 
CONCEPT OF CHILD CARE. pptx
AneetaSharma15
 
Module 2: Public Health History [Tutorial Slides]
JonathanHallett4
 
Health-The-Ultimate-Treasure (1).pdf/8th class science curiosity /samyans edu...
Sandeep Swamy
 
pgdei-UNIT -V Neurological Disorders & developmental disabilities
JELLA VISHNU DURGA PRASAD
 
HISTORY COLLECTION FOR PSYCHIATRIC PATIENTS.pptx
PoojaSen20
 
Command Palatte in Odoo 18.1 Spreadsheet - Odoo Slides
Celine George
 
SAROCES Action-Plan FOR ARAL PROGRAM IN DEPED
Levenmartlacuna1
 
How to Close Subscription in Odoo 18 - Odoo Slides
Celine George
 
Continental Accounting in Odoo 18 - Odoo Slides
Celine George
 
Care of patients with elImination deviation.pptx
AneetaSharma15
 
CARE OF UNCONSCIOUS PATIENTS .pptx
AneetaSharma15
 
Virus sequence retrieval from NCBI database
yamunaK13
 
The Minister of Tourism, Culture and Creative Arts, Abla Dzifa Gomashie has e...
nservice241
 
Modul Ajar Deep Learning Bahasa Inggris Kelas 11 Terbaru 2025
wahyurestu63
 
Introduction to pediatric nursing in 5th Sem..pptx
AneetaSharma15
 

BITS: Basics of sequence analysis

  • 1. Basic bioinformatics concepts, databases and tools Module 3 Sequence analysis Joachim Jacob http://www.bits.vib.be Updated Feb 2012 http://dl.dropbox.com/u/18352887/BITS_training_material/Link%20to%20mod3-intro_H1_2012_SeqAn.pdf
  • 2. In this third module, we will discuss the possible analyses of sequences  Module 1 Sequence databases and keyword searching  Module 2 Sequence similarity  Module 3 Sequence analysis: types, interpretation, results
  • 3. In this third module, we will discuss the possible analyses of sequences
  • 4. Sequence analysis tries to read sequences to infer biological properties AGCTACTACGGACTACTAGCAGCTACCTCTCTG - is this coding sequence? - can this sequence bind a certain TF? - what is the melting temperature? - what is the GC content? - does it fold into a stable secondary structure? …
  • 5. Tools that can predict a biological feature are trained with examples Automatic annotation vs. experimentally verified annotations - Training dataset of sequences (← exp. verified) - An algorithm defines parameters used for prediction - The algorithm determines/classifies whether the sequence(s) contains the feature (→ automatic annotation)
  • 6. The assumption to being able to read biological function is the central paradigm DNA → protein sequence → structure → activity (binding, enzymatic activity, regulatory,...)  So the premise to do analysis: biological function can be read from the (DNA) sequence.  Predictions always serve as a basis for further experiments.
  • 7. Analysis can be as simple as measuring properties or predicting features  Protein − Metrics (e.g. how many alanines in my seq) − Modifications and other predictions − Domains and motifs  DNA/RNA − Metrics (e.g. how many GC) − Predicting  Gene prediction  Promotor  Structure
  • 8. Simple protein sequence analysis  One might be interested in:  pI (isoelectric point) prediction  Composition metrics  Hydrophobicity calculation  Reverse translation (protein → dna)  Occurrence of simple patterns (e.g. does KDL occurs and how many times)  ... http://en.wikipedia.org/wiki/Hydrophobicity_scales http://www.sigmaaldrich.com/life-science/metabolomics/learning-center/amino-acid-reference-cha
  • 9. Protein sequence analysis tools are gathered on Expasy http://www.expasy.org/tools (SIB)  Others:  http://www.ebi.ac.uk/Tools/protein.html  http://bioweb.pasteur.fr/protein/intro-en.html  SMS2
  • 10. Never trust a tool's output blindly  Interpreting depends on the kind of output When a prediction result is obtained, the question arises 'Is it true?' (in biological sense)  Programs giving a 'binary'  Programs giving score/P- result: 1 or 0, a hit or a miss. value result: the chance that the 'result' is 'not real' → the Approach: You should lower, the better comparing different prediction programs for Approach: asses the p-value higher confidence. E.g. ScanProsite for a motif E.g. SignalP for signal peptide prediction.
  • 11. The basis for the prediction of features is nearly always a sequence alignment Based on experimentally verified sequence annotations, a multiple sequence alignment is constructed Different methods exist to capture the information gained from this multiple sequence alignment
  • 12. Alignment reveals similar residues which can indicate identical structure Same structure, hence most likely same function Most protein pairs with more than 25- 30 out of 100 identical residues were Chances are that found to be structurally similar. the structure is not Also proteins with <10% identity can the same have similar structure. http://peds.oxfordjournals.org/content/12/2/85.long
  • 13. The structure of a protein sequence determines his biological function Number of Primary = AA chain Reported structures Feb 2012: ~ 535 000 in Swissprot Secondary = structural entities (helix, beta-strands, beta-sheets, loops) Tertiary = 3D Nov 2011: ~ 80 000 in PDB Quaternary = interactions http://en.wikipedia.org/wiki/Protein_structure
  • 14. Degree of similarity with other sequences varies over the length More conserved Homologous Histone H1 protein sequences
  • 15. Protein sequences can consist of structurally different parts Domain part of the tertiary structure of a protein that can exist, function and evolve independently of the rest, linked to a certain biological function Motif part (not necessarily contiguous) of the primary structure of a protein that corresponds to the signature of a biological function. Can be associated with a domain. Feature part of the sequence for which some annotation has been added. Some features correspond to domain or motif assignments.
  • 16. Based on motifs and domains, proteins are assigned to families Nearly synonymous with gene family Evolutionary related proteins Significant structural similarity of domains is reflected in sequence similarity, and is due to a common ancestral sequence part, resulting in domain families.
  • 17. Domains and motifs are represented by simple and complex methods domain Gapped alignment Motif/domain in silico can be represented by 1. Regular expression / pattern 2. Frequency matrix / profile 3. Machine learning techniques : Hidden Markov Model http://bioinfo.uncc.edu/zhx/binf8312/lecture-7-SequenceAnalyses.pdf
  • 18. Regular expressions / patterns are the simplest way to represent motifs A representation of all residues with equal probability. 123456 Position: 1. 2. 3. 4. 5. 6. ATPKAE KKPKAA [AKT] [AKLT] P [AK] [APT] [ADEKT-] AKPKAK TKPKPA AKPKT- AKPAAK ? Does this sequence match: AKPKTE KLPKAD V V V V V V AKPKAA Consensus: AKPKAA ? And this sequence: KKPETE V V V X V V ? And what about this one: TLPATE For every position the most V V V V V V Frequently occurring residue
  • 19. Frequency matrices or profiles include the chance of observing the residues For every position of a motif, a list of all amino acids is made with their frequency. Position-specific weight/scoring matrix or profile. More sensitive way. Profile 123456 Position: 1. 2. 3. 4. 5. 6. ATPKAE KKPKAA A 0.625 0 0 1/8 6/8 3/8 AKPKAK D 0 0 0 0 0 1/8 TKPKPA E 0 0 0 0 0 1/8 AKPKT- K 0.25 6/8 0 7/8 0 2/8 AKPAAK L 0 1/8 0 0 0 0 KLPKAD P 0 0 1 0 1/8 0 AKPKAA T 1/8 1/8 0 0 1/8 0 Consensus: AKPKA- - 0 0 0 0 0 1/8 Sum 1 1 1 1 1 1 ? Query: AKPKTE ? Query: KKPETE ? Query: TLPATE Example: http://expasy.org/prosite/PS51092 http://prosite.expasy.org/prosuser.html#meth2
  • 20. How good a sequence matches a profile is reported with a score PSWM: scores 123456 Position: 1. 2. 3. 4. 5. 6. ATPKAE KKPKAA A 2.377 -2.358 -2.358 0.257 2.631 1.676 AKPKAK D -2.358 -2.358 -2.358 -2.358 -2.358 0.257 TKPKPA E -2.358 -2.358 -2.358 -2.358 -2.358 0.257 AKPKT- K 1.134 2.631 -2.358 2.847 -2.358 1.134 AKPAAK L -2.358 0.257 -2.358 -2.358 -2.358 -2.358 P -2.358 -2.358 0.257 -2.358 0.257 -2.358 KLPKAD T 0.257 0.257 -2.358 -2.358 0.257 -2.358 AKPKAA Consensus: AKPKA- ? Query: AKPKTE Score = 11.4 ? Query: KKPETE Score = 5.0 ? Query: TLPATE Score = 4.3 http://prosite.expasy.org/prosuser.html#meth2
  • 21. A hidden Markov Model takes also into account the gaps in an alignment The schematic representation of a HMM http://www.myoops.org/twocw/mit/NR/rdonlyres/Electrical-Engineering-
  • 22. Building a HMM from a multiple sequence alignment
  • 23. Use HMMER to very sensitively search protein database with a HMM You can search with a profile in a sequence database
  • 24. Some profile adjustments to the BLAST protocol exist for particular purposes PSI-BLAST to identify distantly related proteins PSI-BLAST (position specific iterated) After a search result, a profile is made of the similar sequences, and this is used again to search a database PHI-BLAST protein with matching of a pattern PHI-BLAST (pattern hit initiated): you provide a pattern, which all BLAST results should satisfy. CSI-BLAST is more sensitive than PSI-BLAST in identifying distantly related proteins PSI BLAST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=b PHI BLAST http://www.ncbi.nlm.nih.gov/blast/Blast.cgi?CMD=Web&PAGE=Proteins&PROGRAM=b CSI BLAST http://toolkit.tuebingen.mpg.de/cs_blast
  • 25. Many databases exist that keep patterns, profiles or models related to function  Motif / domain databases (see NCBI bookshelf for good overview)  http://www.ebi.ac.uk/interpro/ - integrated db  http://expasy.org/prosite/ (motifs)  PFAM – hidden markov profiles (domains)  CDD (Conserved domains database) (NCBI - integrated)  Prodom (domain) (automatic extraction)  SMART (domain)  PRINTS (motif) sets of local alignments without gaps, used as frequency matrices, made by searching manually made "seed alignments" against UniProt sequences
  • 26. Prosite is a database gathering patterns from sequence alignments  ScanProsite tool : search the prosite database for a pattern ( present or not ) Example : [DE](2)-H-S-{P}-x(2)-P-x(2,4)-C> You can retrieve sequences which correspond to a pattern, you made up yourself, observed in an alignment or an known one. The syntax is specific, but not difficult: see link below! http://prosite.expasy.org/scanprosite/scanprosite-doc.html#pattern_syntax
  • 27. Interpro classifies the protein data into families based on the domain and motifs Interpro takes all existing motif and domains databases as input ('signatures'), and aligns them to create protein domain families. This reduces redundancy. Each domain is than given an identifier IPRxxxxxxx. Uneven size of motifs and families between families are handled by 'relations' : parent - child and contains - found in Families,... Regions, domains, ... http://www.ebi.ac.uk/interpro/user_manual.html#type
  • 28. Interpro summarizes domains and motifs from a dozen of domain databases http://www.ebi.ac.uk/interpro/databases.html ftp://ftp.ebi.ac.uk/pub/software/unix/iprscan/README.html#2
  • 29. InterPro entries are grouped in types  Family Entries span complete sequence  Domain Biologically functional units  Repeat  Region  Conserved site  Active site  Binding site  PTM site
  • 30. InterPro entries are grouped in types
  • 31. You can search your sequence for known domains on InterProScan Interproscan http://www.ebi.ac.uk/Tools/pfa/iprscan/
  • 32. A sequence logo provides a visual summary of a motif Creating a sequence logo Create a nicely looking logo of a motif sequence: size of letters indicated frequency.  Weblogo - a basic web application to create colorful logo's  IceLogo - a powerful web application to create customized logo's
  • 33. A sequence logo provides a visual summary of a motif iceLogo 123456 ATPKAE KKPKAA AKPKAK TKPKPA AKPKT- AKPAAK KLPKAD AKPKAA Consensus: AKPKA- http://www.bits.vib.be/wiki/index.php/Exercises_on_multiple_sequence_alignment#Sequence_logo
  • 34. There is always a chance that a prediction of a feature by a tool is false Number Number of of matches matches True negatives True negatives True positives True positives Score Score Threshold Threshold False negatives False positives Ideal situation Reality of the databases
  • 35. Assessing the performance of categorizing tools with sensitivity and specificity PREDICTION “Confusion matrix” Feature is Feature is predicted NOT predicted True False Sequence contains feature positive Negatives “Type I error” TRUTH False True Sequence does NOT contain feature positive negative “Type II error”
  • 36. Assessing the performance of categorizing tools with sensitivity and specificity PREDICTION “Confusion matrix” Feature is Feature is predicted NOT predicted Sensitivity Sequence contains feature True positives/(TP + FN) TRUTH False True Sequence does NOT contain feature positive negative
  • 37. Assessing the performance of categorizing tools with sensitivity and specificity PREDICTION “Confusion matrix” Feature is Feature is predicted NOT predicted Sequence contains feature TRUTH Selectivity Sequence does NOT contain feature or Specificity TN/(FP + TN)
  • 38. Assessing the performance of categorizing tools with sensitivity and specificity PREDICTION “Confusion matrix” Feature is Feature is predicted NOT predicted Sequence contains feature error rate* TRUTH FP+FN/total Sequence does NOT contain feature * misclassification rate
  • 39. Assessing the performance of categorizing tools with sensitivity and specificity PREDICTION “Confusion matrix” Feature is Feature is predicted NOT predicted Sequence contains feature Accuracy TRUTH TP+TN/total Sequence does NOT contain feature
  • 40. Protein sequences can be searched for potential modifications  http://www.expasy.org/tools/  e.g. modification (phosphorylation, acetylation,...) To deal with the confidence in the results, try different tools, and make a graph (venn diagram) to compare the results  E.g. predict secreted proteins by signalP and RPSP, combine results in Venn − http://bioinformatics.psb.ugent.be/webtools/Venn/ − http://www.cmbi.ru.nl/cdd/biovenn/ Overview SignalPeptide prediction tools: http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2788353/
  • 41. Protein sequences can be searched for secondary structural elements Based on know structures, machine learning models of secondary structure elements are made and can be searched for. See http://bioinf.cs.ucl.ac.uk/psipred/
  • 42. In case of multiple analyses on multiple sequences, mark instead of filter Starting set of sequences Worse Better Analysis filter 1 Analysis filter 1 Analysis filter 2 Analysis filter 3 Analysis filter 2 Analysis filter 3 After performing all analyses on all sequences, different filters on the results can be applied (e.g. secreted sequence, ! phosphorylated and containing a motif)
  • 44. NA sequence analyses GC% http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::geecee Melting temperature For primer development, such as with Primer3 Structure Codon usage Codon usage table with cusp Codon adaptation index calculation with cai ... A lot of tools can be found at the Mobyle Portal:
  • 45. Profiles and models are being used to model biological function in NA seqs To detect Transcription factor binding sites TRANSFAC : commercial (BIOBASE, Wolfenbüttel, Germany), started as work of Edgard Wingender, contains eukaryotic binding sites as consensus sequences and as PSSMs. Also TRANSCompel with modules of binding sites. ooTFD : commercial (IFTI, Pittsburgh PA, USA), started as work of David Gosh, contains prokaryotic and eukaroytic binding sites as consensus sequences and as PSSMs. JASPAR : open access, only representative sets of higher eukaryote binding sites as PSSMs. Can be searched against sequence or sequence pair at Consite. OregAnno : open access, collection of individual eukaryotic binding sites with their localization in the genome PAZAR : collection of open access TF databanks
  • 46. Sequence logos can give an insight in the important residues of binding sites  DNA: an entry from JASPAR: tata box A [ 61 16 352 3 354 268 360 222 155 56 83 82 82 68 77 ] C [145 46 0 10 0 0 3 2 44 135 147 127 118 107 101 ] G [152 18 2 2 5 0 10 44 157 150 128 128 128 139 140 ] T [ 31 309 35 374 30 121 6 121 33 48 31 52 61 75 71 ]
  • 47. The RNA world has the Vienna servers http://rna.tbi.univie.ac.at/ − secondary structure prediction of ribosomal sequences − siRNA design
  • 48. RNA families can be modeled by conserved bases and structure  RNA motifs (http://rfam.sanger.ac.uk/search) Rfam is a databank of RNA motifs and families. It is made at the Sanger Centre (Hinxton, UK), from a subset of EMBL (well-annotated standard sequences excluding synthetic sequences + the WGS) using the INFERNAL suite of Soan Eddy. It contains local alignments with gaps with included secondary structure annotation + CMs.
  • 49. Some interesting links Nucleic acid structure Unafold - Program accessible through webinterface After designing primers, you might want to check whether the primer product does (not) adapt a stable secondary structure. Some collections of links − Good overview at http://www.imb-jena.de/RNA.html − European Ribosomale RNA database (VIB PSB)
  • 50. Prediction of genes in genomes rely on the integration of multiple signals Signals surrounding the gene (transcription factor binding sites, promoters, transcription terminators, splice sites, polyA sites, ribosome binding sites,...) → profile matching Differences in composition between coding and noncoding DNA (codon preference), the presence of an Open Reading Frame (ORF) → compositional analyses Similarity with known genes, aligning ESTs and (in translation) similarity with known proteins and the presence of protein motifs → similarity searches
  • 51. Prediction of genes in genomes rely on the integration of multiple signals Signals Composition Similarity e.g. potential methylation sites (profiles) Alignment of ESTs GC
  • 52. Software for prediction genes EMBOSS − simple software under EMBOSS : syco (codon frequency), wobble (%GC 3rd base), tcode (Ficket statistic : correlation between bases at distance 3) Examples of software using HMM model of gene : Wise2 : using also similarity with known proteins http://www.ebi.ac.uk/Tools/Wise2 GENSCAN : commercial (Chris Burge, Stanford U.) but free for academics, has models for human/A. thaliana/maize, used at EBI and NCBI for genome annotation http://mobyle.pasteur.fr/cgi-bin/portal.py?#forms::genscan GeneMark : commercial (GeneProbe, Atlanta GA, USA) but free for academic users, developed by Mark Borodovsky, has models for many prokaryotic and eukaryotic organisms http://exon.gatech.edu Tutorial on gene prediction http://www.embl.de/~seqanal/courses/spring00/GenePred.00.html
  • 53. Short addendum about downloading files FTP, e.g. ftp://ftp.ebi.ac.uk/pub/databases/interpro/ – 'file transfer protocol' – Most browsers have integrated ftp 'client' – Free, easy to download files, possibility to resume after fails HTTP, e.g. http://www.ncbi.nlm.nih.gov/entrez Standard protocol for internet traffic, Slowest method Aspera – for large datasets (>10GB) downloads In use in the short read archive (SRA) Fastest method available currently
  • 54. Conclusion  Prediction vs. experimental verified  Different algorithms need to be compared  Predictions need to be validated by independent method  Software <-> Databases  Questions? Get social! → www.seqanswers.com → http://biostar.stackexchange.com  Always only basis for further wet-lab research
  • 55. Summary In this third module, we will discuss the possible analyses of sequences Sequence analysis tries to read sequences to infer biological properties Tools that can predict a biological feature are trained with examples The assumption to being able to read biological function is the central paradigm Analysis can be as simple as measuring properties or predicting features Protein sequence analysis tools are gathered on Expasy Never trust a tool's output blindly The basis for the prediction of features is nearly always a sequence alignment Alignment reveals similar residues which can indicate identical structure The structure of a protein sequence determines his biological function Degree of similarity with other sequences varies over the length Protein sequences can consist of structurally different parts Based on motifs and domains, proteins are assigned to families Domains and motifs are represented by simple and complex methods Regular expressions / patterns are the simplest way to represent motifs Frequency matrices or profiles include the chance of observing the residues How good a sequence matches a profile is reported with a score A hidden Markov Model takes also into account the gaps in an alignment Use HMMER to very sensitively search protein database with a HMM Some profile adjustments to the BLAST protocol exist for particular purposes Many databases exist that keep patterns, profiles or models related to function Prosite is a database gathering patterns from sequence alignments Interpro classifies the protein data into families based on the domain and motifs Interpro summarizes domains and motifs from a dozen of domain databases You can search your sequence for known domains on InterProScan A sequence logo can provide a visual summary of a motif Protein sequences can be searched for potential modifications Protein sequences can be searched for secondary structural elements In case of multiple analyses on multiple sequences, mark instead of filter Profiles and models are being used to model biological function in NA seqs Sequence logos can give an insight in the important residues of binding sites The RNA world has the Vienna servers RNA families can be modeled by conserved bases and structure Prediction of genes in genomes rely on the integration of multiple signals