SlideShare a Scribd company logo
Bioinformatics
Bioinformatics
 The analysis of biological information using
computers and statistical techniques.
 The science of developing and utilizing computer
databases and algorithms to accelerate and
enhance biological research
 Marriage of Computer Science and Biology
 Organize, store, analyze, visualize genomic data
 Utilizes methods from Computer Science,
Mathematics, Statistics and Biology
 Margaret Oakley Dayhoff is the father & mother of
Bioinformatics
 At the convergence of two revolutions: the ultra-fast
growth of biological data, and the information
revolution
Bioinformatics
Example 1
 Compare proteins with similar sequences (for
instance –kinases) and understand what the
similarities and differences mean.
Example 2
 Look at the genome and predict where genes are
(promoters; transcription binding sites; introns;
exons)
 Predict the 3-dimensional structure of a protein from
its primary sequence
Example 3
 Correlate between gene expression and disease
Example 4
A gene chip – quantifying gene
expression in different tissues
under different conditions
May be used for personalized
medicine
Genomics
Study of sequences, gene organization &
mutations at the DNA level
the study of information flow within a cell
The Human Genome Project
3 billion bases 30,000 genes
http://www.genome.gov/
Goals Established for the Human Genome
Project
 Began in 1990
 Identify all of the genes in human DNA.
 Determine the sequence of the 3 billion chemical
nucleotide bases that make up human DNA.
 Store this information in data bases.
 Develop faster, more efficient sequencing
technologies.
 Develop tools for data analysis.
 Address the ethical, legal, and social issues (ELSI)
that may arise form the project.
Published
•The International Human Genome Sequencing Consortium
published their results in Nature, 409 (6822): 860-921,
2001.”Initial Sequencing and Analysis of the Human
Genome”
•Celera Genomics published their results in Science, Vol
291(5507): 1304-1351, 2001.“The Sequence of the Human
Genome”
• How to characterize new diseases?
• What new treatments can be discovered?
• How do we treat individual patients? Tailoring
treatments?
Impact of Genomics on
Medicine
Implications for Biomedicine
• Physicians will use genetic information to
diagnose and treat disease.
• Virtually all medical conditions have a
genetic component
• Faster drug development research:
(pharmacogenomics)
• Individualized drugs
• All Biologists/Doctors will use gene sequence
information in their daily work
Proteomics
• Uses information determined by biochemical/crystal
structure methods
• Visualization of protein structure
• Make protein-protein comparisons
• Used to determine:
conformation/folding
antibody binding sites
protein-protein interactions
computer aided drug design
Transcriptomics
The term can be applied to the total set of transcripts in a given
organism, or to the specific subset of transcripts present in a
particular cell type
The transcriptome reflects the genes that are being actively
expressed at any given time
Eg. DNA microarray technology
Database
 A collection of data that needs to be:
 Structured
 Searchable
 Updated (periodically)
 Cross referenced
 Challenge:
 To change “meaningless” data into useful information that
can be accessed and analysed the best way possible.
Biological databases
 Like any other database
 Data organization for optimal analysis
 Data is of different types
 Raw data (DNA, RNA, protein sequences)
 Curated data (DNA, RNA and protein annotated
sequences and structures, expression data)
Raw Biological data
Nucleic Acids (DNA)
Raw Biological data
Amino acid residues (proteins)
Curated Biological Data
DNA, nucleotide sequences
Gene boundaries, topology Gene structure
Introns, exons, ORFs, splicing
Expression data Mass spectometry
Curated Biological data
3D Structures, folds
Bioinformatics final
Primary Seq db’s is collaborative- data exchanged daily
Databases 1: nucleotide sequence
 The main DNA sequence db are
EMBL (Europe)/GenBank (USA) /DDBJ (Japan)
 There are also specialized databases for the different types
of RNAs (i.e. tRNA, rRNA, mi RNA…)
 3D structure (DNA and RNA)
 Others: Aberrant splicing db; Eucaryotic promoter db
(EPD); RNA editing sites, Multimedia Telomere Resource
……
EMBL/GenBank/DDJB
 These 3 db contain mainly the same informations
within 2-3 days (few differences in the format and
syntax)
 Serve as archives containing all sequences (single
genes, ESTs, complete genomes, etc.) derived from:
 Genome projects and sequencing centers
 Individual scientists
 Patent offices (i.e. European Patent Office, EPO)
 Non-confidential data are exchanged daily
 Currently: 8.3 x106 sequences, over 9.7 x109 bp;
 Sequences from > 50’000 different species;
GenBank entry: example
LOCUS HSERPG 3398 bp DNA PRI 22-JUN-1993
DEFINITION Human gene for erythropoietin.
ACCESSION X02158
VERSION X02158.1 GI:31224
KEYWORDS erythropoietin; glycoprotein hormone; hormone; signal peptide.
SOURCE human.
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria;
Primates; Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 3398)
AUTHORS Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J.,
Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F.,
Kawakita,M., Shimizu,T. and Miyake,T.
TITLE Isolation and characterization of genomic and cDNA clones of human
erythropoietin
JOURNAL Nature 313 (6005), 806-810 (1985)
MEDLINE 85137899
COMMENT Data kindly reviewed (24-FEB-1986) by K. Jacobs.
FEATURES Location/Qualifiers
source 1..3398
/organism="Homo sapiens"
/db_xref="taxon:9606"
mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327)
exon 397..627
/number=1
sig_peptide join(615..627,1194..1261)
CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763)
/codon_start=1
/product="erythropoietin"
/protein_id="CAA26095.1"
/db_xref="GI:312304"
/db_xref="SWISS-PROT:P01588"
/translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL
EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL
RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI
…
GenBank entry (cont.)
TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR"
intron 628..1193
/number=1
exon 1194..1339
/number=2
mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2760)
/product="erythropoietin"
intron 1340..1595
/number=2
exon 1596..1682
/number=3
intron 1683..2293
/number=3
exon 2294..2473
/number=4
intron 2474..2607
/number=4
exon 2608..3327
/note="3' untranslated region"
/number=5
BASE COUNT 698 a 1034 c 991 g 675 t
ORIGIN
1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag
61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat
121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg
181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc
241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc
Database 2: protein sequences
 SWISS-PROT: created in 1986 (A.Bairoch) http://www.expasy.org/sprot/
 TrEMBL: created in 1996; complement to SWISS-PROT; derived from
EMBL CDS translations (« proteomic » version of EMBL)
 PIR-PSD: Protein Information Resources http://pir.georgetown.edu/
 Genpept: « proteomic » version of GenBank
 Many specialized protein databases for specific families or groups of
proteins.
 Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM
receptors), IMGT (immune system) YPD (Yeast) etc.
SWISS-PROT
 Collaboration between the SIB (CH) and EMBL/EBI (UK)
 Fully annotated (manually), non-redundant, cross-
referenced, documented protein sequence database.
 ~113 ’000 sequences from more than 6’800 different
species; 70 ’000 references (publications); 550 ’000
cross-references (databases); ~200 Mb of annotations.
 Weekly releases; available from about 50 servers across
the world, the main source being ExPASy
TrEMBL (Translation of EMBL)
 It is impossible to cope with the quantity of newly
generated data AND to maintain the high quality of
SWISS-PROT -> TrEMBL, created in 1996.
 TrEMBL is automatically generated (from annotated EMBL
coding sequences (CDS)) and annotated using software
tools.
 Contains all what is not in SWISS-PROT.
SWISS-PROT + TrEMBL = all known protein sequences.
 Well-structured SWISS-PROT-like resource.
a Swiss-Prot entry…
overview
sequence
Accession
number
Entry name
TrEMBL: example
Original TrEMBL entry which has been integrated into the SWISS-PROT
EPO_HUMAN entry and thus which is not found in TrEMBL anymore.
Bioinformatics final
PDB
 Protein Data Bank, managed by RCSB
 Currently there are ~13’000 structures for about 4’000
different molecules, but far less protein family !
 There are also databases that contain data derived
from PDB. Examples: HSSP (homology-derived
secondary structure of proteins), SWISS-3DIMAGE
(images)…
Restriction enzyme
PDB: example
HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2
COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3
COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4
SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5
AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6
REVDAT 1 15-OCT-92 12CA 0 12CA 7
JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8
JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9
JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10
JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11
JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12
JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13
REMARK 1 12CA 14
REMARK 2 12CA 15
REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16
REMARK 3 12CA 17
REMARK 3 REFINEMENT. 12CA 18
REMARK 3 PROGRAM PROLSQ 12CA 19
REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20
REMARK 3 R VALUE 0.170 12CA 21
REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22
REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23
REMARK 4 12CA 24
REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25
REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26
REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27
………
ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89
ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90
ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91
ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92
ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93
ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94
ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95
ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96
ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97
ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98
ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99
ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100
ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101
ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102
…….
Bioinformatics final
Bioinformatics final
Species-oriented databases
 SGD: Saccharomyces Genome Database; molecular
biology and genetics of the yeast Saccharomyces
cerevisiae (baker's or budding yeast)
 FlyBase: A Database of the Drosophila Genome
 MGI: Mouse Genome Informatics
 RGD: Rat Genome Database
 WormBase - database for the nematode
Caenorhabditis elegans
 The Arabidopsis Information Resource (TAIR) -
database for the brassica family plant Arabidopsis
thaliana
Mouse Genome Informatics
Bioinformatics final
Sequence Alignment
Tools
Database Searching:
BLAST:
NCBI, Web Interface: http://www.ncbi.nlm.nih.gov/BLAST/
WuBLAST http://blast.wustl.edu
FASTA: http://www.ebi.ac.uk/fasta3/
Smith-Waterman
Par-Align: http://dna.uio.no/search/
Multiple Sequence Alignment:
CLUSTALW: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html
DiAlign, Web Interface: http://genomatix.gsf.de/cgi-bin/dialign/dialign.pl
MSA:http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html
Web Interface: http://bioweb.pasteur.fr/seqanal/interfaces/msa-simple.html
Bioinformatics final
Uses of BLAST
Query a database for sequences similar to an
input sequence
Uses of BLAST
 Identify previously characterized sequences
Query a database for sequences similar to an
input sequence
Uses of BLAST
 Identify previously characterized sequences.
 Find phylogenetically related sequences.
Query a database for sequences similar to an
input sequence
Uses of BLAST
 Identify previously characterized sequences.
 Find phylogenetically related sequences.
 Identify possible functions based on similarities
to known sequences.
Query a database for sequences similar to an
input sequence
Sequence Homology Software
 NCBI-BLAST
 Run by the National Center for Biotechnology
Information
 BLAST uses a heuristic algorithm based on the
Smith-Waterman algorithm
 Algorithm searches database for a small string
within the query (default 11 for nucleotide searches),
then when it detects a match, searches for shared
nucleotides at each end of the seed to extend the
match
 Gaps are taken into account, then the matches are
presented in order of statistical significance
 http://www.ncbi.nlm.nih.gov/BLAST/
Different Types of BLAST
 Nucleotide-nucleotide BLAST (BLASTN):
 Basic nucleutide sequence searches
 The BLAST that you used for your sequences
 Protein-protein BLAST (BLASTP):
 Similar technology used to search amino acid
sequences
 Position-Specific Iterative BLAST (PSI-
BLAST):
 A more advance protein BLAST useful for analyzing
relationships between divergently evolved proteins.
Different Types of BLAST
 BLASTX and tBLASTN variants:
 Use six-frame translation for proteins and
nucleotides, respectively, in the search
 MegaBLAST:
 Used for BLASTing several sequences at once to
cut down on processing load and server reporting-
time
Types of BLAST:
Graphic courtesy of Joel Graber.
Interpreting BLAST Results
 Query Coverage
 The percent of the query sequence matched by the
database entry
 Max Ident
 The percent identity, i.e. the percent that the genes
match up within the limits of the full match (e.g.
deletions or additions reduce this value)
FASTA
 Another method for local sequence alignment.
 Maintained by Dr. William Pearson at the
University of Virginia.
 http://www.infobiogen.fr/doc/Fasta/docfasta.html
FASTA Versions
FASTA-nucleotide or protein sequence searching
FASTx/ FASTy compares a translated DNA query sequence
to a protein sequence database (forward or
backward translation of the query)
tFASTx/ tFASTy -compares protein query sequence to
DNA sequence database that has been translated into three
forward and three reverse reading frames
Multiple Sequence Alignment
Software
 Clustal (free)
 ClustalX – Software
 ClustalW – Web
 DNAStar ($$$)
 Functionality is similar, but difference is in
interface, tools, and speed of algorithms
 http://www.ebi.ac.uk/clustalw/
Multiple Sequence Alignment
(MSA)
• Multiple sequence alignment (MSA) can be seen as a
generalization of Pairwise Sequence Alignment - instead
of aligning two sequences, n sequences are aligned
simultaneously, where n is > 2
• Definition: A multiple sequence alignment is an alignment
of n > 2 sequences obtained by inserting gaps (“-”) into
sequences
FastA Format
 A sequence in FASTA format begins with a single-
line description, followed by lines of sequence
data.
 The description line is distinguished from the
sequence data by a greater-than (">") symbol in
the first column.
 It is recommended that all lines of text be shorter
than 80 characters in length.
Other Completed Genomes
 Haemophilus influenzae
 Escherichia coli
 Bacillus subtilus
 Helicobacter pylori
 Borrelia burgdorferi
 Streptococcus pneumoniae
 Saccharomyces cerevisiae
 Caenorhabditis elegans
 Arabidopsis thaliana
 Archaeoglobus fulgidus
 Methanobacterium thermoautotrophicum
 Methanococcus jannaschii
 Mycoplasma pneumoniae
 Mycoplasm genitaliu
 Rickettsia prowazekii
 Mycobacterium tuberculosis
 Treponema pallidum
 Staphylococcus aureus
 And more!
Completed Plant Genomes
 Arabidopsis thaliana
Completed Insect Genomes
 Drosophila melanogaster
Completed Rodent Genomes
 Mus musculus
Protien- Three Structure Levels
Beta
Sheet
Helix
Loop
PDB ID: 12as
Primary structure: sequence of amino
acids
– e.g., DRVYIHPF
Secondary structure: local folding
patterns
– e.g., alpha-helix, beta-sheet, loop
Tertiary structure: complete 3D fold
Different Levels of Protein Structure Prediction
Protein Structure Prediction
 Regular Secondary Structure Prediction (-helix -sheet)
 APSSP2: Highly accurate method for secondary structure prediction
 Combines memory based reasoning ( MBR) and ANN methods
 Irregular secondary structure prediction methods (Tight turns)
 Betatpred: Consensus method for -turns prediction
 Statistical methods combined
 Kaur and Raghava (2001) Bioinformatics
 Bteval : Benchmarking of -turns prediction
 Kaur and Raghava (2002) J. Bioinformatics and Computational Biology, 1:495:504
 BetaTpred2: Highly accurate method for predicting -turns (ANN)
 Multiple alignment and secondary structure information
 Kaur and Raghava (2003) Protein Sci 12:627-34
 BetaTurns: Prediction of -turn types in proteins
 Evolutionary information
 Kaur and Raghava (2004) Bioinformatics 20:2751-8.
 AlphaPred: Prediction of -turns in proteins
 Kaur and Raghava (2004) Proteins: Structure, Function, and Genetics 55:83-90
 GammaPred: Prediction of -turns in proteins
 Kaur and Raghava (2004) Protein Science; 12:923-929.
Protein Structure Prediction
 BhairPred: Prediction of Super secondary structure prediction
 Prediction of Beta Hairpins
 Secondary structure and surface accessibility used as input
 Manish et al. (2005) Nucleic Acids Research
 TBBpred: Prediction of outer membrane proteins
 Prediction of trans membrane beta barrel proteins
 Prediction of beta barrel regions
 Application of ANN and SVM + Evolutionary information
 Natt et al. (2004) Proteins: 56:11-8
 ARNHpred: Analysis and prediction side chain, backbone
interactions
 Prediction of aromatic NH interactions
 Kaur and Raghava (2004) FEBS Letters 564:47-57 .
 SARpred: Prediction of surface accessibility
 Multiple alignment (PSIBLAST) and Secondary structure information
 ANN: Two layered network (sequence-structure-structure)
 Garg et al., (2005) Proteins
 PepStr: Prediction of tertiary structure of Bioactive peptides
Homology Modeling
The Best Match
DRVYIHPFADRVYIHPFAQuery
Sequence:
Protein sequence
classification database
• PSI-BLAST
• HMM
• Smith-Waterman algorithm
Phylogenetic analysis
 Evolution studies
 Systematic biology
 Medical research and epidemiology
 Ecology
Bioinformatics final
Terminology
Rooted tree Unrooted tree
Advantages of molecular
phylogenetic analysis
 Analogous features (share common function, but
NOT common ancestry) can be misleading
 DNA sequences more simple to model, we only
have the four states A, C, G, T
 DNA samples for sequence analysis easy to
prepare
 Softwares used :
 Phylip
 Paup
Modern drug discovery process
Target
identification
Target
validation
Lead
identification
Lead
optimization
Preclinical
phase
Drug
discovery
2-5 years
• Drug discovery is an expensive process involving high R & D cost and
extensive clinical testing
• A typical development time is estimated to be 10-15 years.
6-9 years
Design
 Comparison of Sequences: Identify targets
 Homology modelling: active site prediction
 Systems Biology: Identify targets
 Databases: Manage information
 In silico screening (Ligand based, receptor
based): Iterative steps of Molecular docking.
 Pharmacogenomic databases: assist safety
related issues
Molecular Docking
RL
• Docking is the computational determination of binding affinity
between molecules (protein structure and ligand).
• Given a protein and a ligand find out the binding free energy of
the complex formed by docking them.
L
R
Molecular Docking: classification
 Docking or Computer aided drug designing can be
broadly classified
 Receptor based methods- make use of the structure of the target
protein.
 Ligand based methods- based on the known inhibitors
Receptor based methods
 Uses the 3D structure of the target receptor to search
for the potential candidate compounds that can
modulate the target function.
 These involve molecular docking of each compound
in the chemical database into the binding site of the
target and predicting the electrostatic fit between
them.
 The compounds are ranked using an appropriate
scoring function such that the scores correlate with
the binding affinity.
 Receptor based method has been successfully
applied in many targets
Ligand based strategy
 In the absence of the structural information of the
target, ligand based method make use of the
information provided by known inhibitors for the target
receptor.
 Structures similar to the known inhibitors are identified
from chemical databases.
Ligand based strategy
Search for similar compounds
database known actives structures found
Example: use of Bioinformatics in
Drug discovery
Identification of novel drug targets against human
malaria
Docking Software :
Dock
Autodock
Gold
Top 100 solutions
Out of top 40 only 10 compounds available for purchase
Drug-like compound
library (1,000,00)
Molecular
docking
Ligand docked into protein’s
active site
Insilico identification of novel inhibitors against PfClpQ , a novel
drug target of P.falciparum by high throughput docking
PfclpQ
Rasmol
RasMol2 is a molecular graphics program intended for the visualization of proteins,
nucleic acids and small molecules
The program is aimed at display, teaching and generation of publication quality images
The program reads in molecular co-ordinate files and interactively displays the molecule
the screen in a variety of representations and color schemes
Supported input file formats include Brookhaven Protein Databank (PDB),
Sybyl Mol2 formats, Molecular Design Limited's (MDL) Mol file format,
Minnesota Supercomputer Centre's (MSC)
XYZ (XMol) format CHARMm format files
Bioinformatics final
Stephen James
Department of Bioinformatics
Mar Athanasios College for Advanced Studies(MACFAST)
Email : stephen@macfast.org
Phone: 9746935363

More Related Content

What's hot (20)

PPTX
PAPER 3.1 ~ HUMAN GENOME PROJECT
Nusrat Gulbarga
 
PPTX
Human genome project
sabahayat3
 
PPT
DNA Technology
mgsonline
 
PDF
Pattemore 2015
Julie Pattemore
 
PPTX
Application of genomics in animals
Usman Arshad
 
PPTX
Haendel clingenetics.3.14.14
mhaendel
 
PDF
EU PathoNGenTraceConsortium:cgMLST Evolvement and Challenges for Harmonization
European Centre for Disease Prevention and Control (ECDC)
 
PPTX
GENOMICS AND BIOINFORMATICS
sandeshGM
 
PPTX
Human genome
shoaa311
 
PPT
The Emerging Global Community of Microbial Metagenomics Researchers
Larry Smarr
 
PPTX
Big data nebraska
Adina Chuang Howe
 
PDF
Human genome project (2) converted
GAnchal
 
PPT
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICS
Lubna MRL
 
PPT
Genome Sequencing Project
guestd53a1
 
PPT
Microbial Metagenomics Drives a New Cyberinfrastructure
Larry Smarr
 
PPTX
human genome project by varaprasad
Varaprasad Padala
 
PPTX
Human genome project
SakshamAgrawal34
 
PPTX
Impact of the human genome project on medical advancement in India.
University of Petroleum and Energy studies
 
PPT
Human genome
Veah Pascasio
 
PDF
Targeted RNA Sequencing, Urban Metagenomics, and Astronaut Genomics
QIAGEN
 
PAPER 3.1 ~ HUMAN GENOME PROJECT
Nusrat Gulbarga
 
Human genome project
sabahayat3
 
DNA Technology
mgsonline
 
Pattemore 2015
Julie Pattemore
 
Application of genomics in animals
Usman Arshad
 
Haendel clingenetics.3.14.14
mhaendel
 
EU PathoNGenTraceConsortium:cgMLST Evolvement and Challenges for Harmonization
European Centre for Disease Prevention and Control (ECDC)
 
GENOMICS AND BIOINFORMATICS
sandeshGM
 
Human genome
shoaa311
 
The Emerging Global Community of Microbial Metagenomics Researchers
Larry Smarr
 
Big data nebraska
Adina Chuang Howe
 
Human genome project (2) converted
GAnchal
 
PROKARYOTIC TRANSCRIPTOMICS AND METAGENOMICS
Lubna MRL
 
Genome Sequencing Project
guestd53a1
 
Microbial Metagenomics Drives a New Cyberinfrastructure
Larry Smarr
 
human genome project by varaprasad
Varaprasad Padala
 
Human genome project
SakshamAgrawal34
 
Impact of the human genome project on medical advancement in India.
University of Petroleum and Energy studies
 
Human genome
Veah Pascasio
 
Targeted RNA Sequencing, Urban Metagenomics, and Astronaut Genomics
QIAGEN
 

Similar to Bioinformatics final (20)

PPT
Role of bioinformatics in life sciences research
Anshika Bansal
 
PPTX
Biological database ppt(1).pptx Introuction
RAJESHKUMAR428748
 
PPTX
Biological database ppt(1).pptx Introuction
RAJESHKUMAR428748
 
PPTX
Introduction to databases.pptx
sworna kumari chithiraivelu
 
PPT
Lecture 1 Introduction to Bioinformatics BCH 433.ppt
KelechiChukwuemeka
 
PDF
Bioinformatics introduction
DrGopaSarma
 
PPTX
Biological Databases | Access to sequence data and related information
NahalMalik1
 
PDF
Group 5 DNA Tech - Ecology & Envt
Jessica Kabigting
 
PPTX
Bioinformatics- An overwiew..................
SoumitraNath9
 
PPTX
Biological databases
Tamanna Syeda
 
PPT
Intro bioinfo
Vinitha Nair
 
PPT
Intro bioinfo
Vinitha Nair
 
DOC
Protein databases
bansalaman80
 
PPTX
Lesson2_1_FundamentalGenomics in Genomics.pptx
Joseph Asamoah-Asare MPhil
 
PPTX
BioInformatics Tools -Genomics , Proteomics and metablomics
AyeshaYousaf20
 
PPTX
Biological database
Iqbal college Peringammala TVM
 
PDF
Dn abarcode
vp1221210130
 
PPT
RML NCBI Resources
Jackie Wirz, PhD
 
PPT
Bioinformatics - Discovering the Bio Logic Of Nature
Robert Cormia
 
Role of bioinformatics in life sciences research
Anshika Bansal
 
Biological database ppt(1).pptx Introuction
RAJESHKUMAR428748
 
Biological database ppt(1).pptx Introuction
RAJESHKUMAR428748
 
Introduction to databases.pptx
sworna kumari chithiraivelu
 
Lecture 1 Introduction to Bioinformatics BCH 433.ppt
KelechiChukwuemeka
 
Bioinformatics introduction
DrGopaSarma
 
Biological Databases | Access to sequence data and related information
NahalMalik1
 
Group 5 DNA Tech - Ecology & Envt
Jessica Kabigting
 
Bioinformatics- An overwiew..................
SoumitraNath9
 
Biological databases
Tamanna Syeda
 
Intro bioinfo
Vinitha Nair
 
Intro bioinfo
Vinitha Nair
 
Protein databases
bansalaman80
 
Lesson2_1_FundamentalGenomics in Genomics.pptx
Joseph Asamoah-Asare MPhil
 
BioInformatics Tools -Genomics , Proteomics and metablomics
AyeshaYousaf20
 
Biological database
Iqbal college Peringammala TVM
 
Dn abarcode
vp1221210130
 
RML NCBI Resources
Jackie Wirz, PhD
 
Bioinformatics - Discovering the Bio Logic Of Nature
Robert Cormia
 
Ad

More from Rainu Rajeev (6)

PDF
Jsir 59(2) 87 101
Rainu Rajeev
 
PDF
Areps siddiqui etal 2013
Rainu Rajeev
 
PDF
Marine microbiology ecology & applications colin munn
Rainu Rajeev
 
PDF
Agricultural biotechnology
Rainu Rajeev
 
PDF
Marine board pp17_microcean
Rainu Rajeev
 
PPTX
Protein structure 2
Rainu Rajeev
 
Jsir 59(2) 87 101
Rainu Rajeev
 
Areps siddiqui etal 2013
Rainu Rajeev
 
Marine microbiology ecology & applications colin munn
Rainu Rajeev
 
Agricultural biotechnology
Rainu Rajeev
 
Marine board pp17_microcean
Rainu Rajeev
 
Protein structure 2
Rainu Rajeev
 
Ad

Recently uploaded (20)

PPTX
LESSON 2 PSYCHOSOCIAL DEVELOPMENT.pptx L
JeanCarolColico1
 
PDF
A young gas giant and hidden substructures in a protoplanetary disk
Sérgio Sacani
 
PPT
Introduction of animal physiology in vertebrates
S.B.P.G. COLLEGE BARAGAON VARANASI
 
PDF
2025-06-10 TWDB Agency Updates & Legislative Outcomes
tagdpa
 
PPTX
mode_of_action_of_fungicides_final[1] (2).pptx
MrRABIRANJAN
 
PDF
WUCHERIA BANCROFTI-converted-compressed.pdf
S.B.P.G. COLLEGE BARAGAON VARANASI
 
PPTX
Quarter 4 - Module 4A -Plate Tectonics-Seismic waves in Earth's Mechanism.pptx
JunimarAggabao
 
PDF
Unit-3 ppt.pdf organic chemistry unit 3 heterocyclic
visionshukla007
 
PPTX
MICROBIOLOGY PART-1 INTRODUCTION .pptx
Mohit Kumar
 
PPTX
Hypothalamus_nuclei_ structure_functions.pptx
muralinath2
 
PPTX
GB1 Q1 04 Life in a Cell (1).pptx GRADE 11
JADE ACOSTA
 
PPTX
Clinical trial monitoring &safety monitoring in clinical trials.2nd Sem m pha...
Mijamadhu
 
PDF
Unit-3 ppt.pdf organic chemistry - 3 unit 3
visionshukla007
 
PDF
Continuous Model-Based Engineering of Software-Intensive Systems: Approaches,...
Hugo Bruneliere
 
PPTX
Akshay tunneling .pptx_20250331_165945_0000.pptx
akshaythaker18
 
PPTX
Anatomy and physiology of digestive system.pptx
Ashwini I Chuncha
 
PDF
Primordial Black Holes and the First Stars
Sérgio Sacani
 
PPTX
Pratik inorganic chemistry silicon based ppt
akshaythaker18
 
PPTX
Lamarckism is one of the earliest theories of evolution, proposed before Darw...
Laxman Khatal
 
PDF
Carbon-richDustInjectedintotheInterstellarMediumbyGalacticWCBinaries Survives...
Sérgio Sacani
 
LESSON 2 PSYCHOSOCIAL DEVELOPMENT.pptx L
JeanCarolColico1
 
A young gas giant and hidden substructures in a protoplanetary disk
Sérgio Sacani
 
Introduction of animal physiology in vertebrates
S.B.P.G. COLLEGE BARAGAON VARANASI
 
2025-06-10 TWDB Agency Updates & Legislative Outcomes
tagdpa
 
mode_of_action_of_fungicides_final[1] (2).pptx
MrRABIRANJAN
 
WUCHERIA BANCROFTI-converted-compressed.pdf
S.B.P.G. COLLEGE BARAGAON VARANASI
 
Quarter 4 - Module 4A -Plate Tectonics-Seismic waves in Earth's Mechanism.pptx
JunimarAggabao
 
Unit-3 ppt.pdf organic chemistry unit 3 heterocyclic
visionshukla007
 
MICROBIOLOGY PART-1 INTRODUCTION .pptx
Mohit Kumar
 
Hypothalamus_nuclei_ structure_functions.pptx
muralinath2
 
GB1 Q1 04 Life in a Cell (1).pptx GRADE 11
JADE ACOSTA
 
Clinical trial monitoring &safety monitoring in clinical trials.2nd Sem m pha...
Mijamadhu
 
Unit-3 ppt.pdf organic chemistry - 3 unit 3
visionshukla007
 
Continuous Model-Based Engineering of Software-Intensive Systems: Approaches,...
Hugo Bruneliere
 
Akshay tunneling .pptx_20250331_165945_0000.pptx
akshaythaker18
 
Anatomy and physiology of digestive system.pptx
Ashwini I Chuncha
 
Primordial Black Holes and the First Stars
Sérgio Sacani
 
Pratik inorganic chemistry silicon based ppt
akshaythaker18
 
Lamarckism is one of the earliest theories of evolution, proposed before Darw...
Laxman Khatal
 
Carbon-richDustInjectedintotheInterstellarMediumbyGalacticWCBinaries Survives...
Sérgio Sacani
 

Bioinformatics final

  • 2. Bioinformatics  The analysis of biological information using computers and statistical techniques.  The science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research  Marriage of Computer Science and Biology  Organize, store, analyze, visualize genomic data  Utilizes methods from Computer Science, Mathematics, Statistics and Biology  Margaret Oakley Dayhoff is the father & mother of Bioinformatics
  • 3.  At the convergence of two revolutions: the ultra-fast growth of biological data, and the information revolution Bioinformatics
  • 4. Example 1  Compare proteins with similar sequences (for instance –kinases) and understand what the similarities and differences mean.
  • 5. Example 2  Look at the genome and predict where genes are (promoters; transcription binding sites; introns; exons)
  • 6.  Predict the 3-dimensional structure of a protein from its primary sequence Example 3
  • 7.  Correlate between gene expression and disease Example 4 A gene chip – quantifying gene expression in different tissues under different conditions May be used for personalized medicine
  • 8. Genomics Study of sequences, gene organization & mutations at the DNA level the study of information flow within a cell
  • 9. The Human Genome Project 3 billion bases 30,000 genes http://www.genome.gov/
  • 10. Goals Established for the Human Genome Project  Began in 1990  Identify all of the genes in human DNA.  Determine the sequence of the 3 billion chemical nucleotide bases that make up human DNA.  Store this information in data bases.  Develop faster, more efficient sequencing technologies.  Develop tools for data analysis.  Address the ethical, legal, and social issues (ELSI) that may arise form the project.
  • 11. Published •The International Human Genome Sequencing Consortium published their results in Nature, 409 (6822): 860-921, 2001.”Initial Sequencing and Analysis of the Human Genome” •Celera Genomics published their results in Science, Vol 291(5507): 1304-1351, 2001.“The Sequence of the Human Genome”
  • 12. • How to characterize new diseases? • What new treatments can be discovered? • How do we treat individual patients? Tailoring treatments? Impact of Genomics on Medicine
  • 13. Implications for Biomedicine • Physicians will use genetic information to diagnose and treat disease. • Virtually all medical conditions have a genetic component • Faster drug development research: (pharmacogenomics) • Individualized drugs • All Biologists/Doctors will use gene sequence information in their daily work
  • 14. Proteomics • Uses information determined by biochemical/crystal structure methods • Visualization of protein structure • Make protein-protein comparisons • Used to determine: conformation/folding antibody binding sites protein-protein interactions computer aided drug design
  • 15. Transcriptomics The term can be applied to the total set of transcripts in a given organism, or to the specific subset of transcripts present in a particular cell type The transcriptome reflects the genes that are being actively expressed at any given time Eg. DNA microarray technology
  • 16. Database  A collection of data that needs to be:  Structured  Searchable  Updated (periodically)  Cross referenced  Challenge:  To change “meaningless” data into useful information that can be accessed and analysed the best way possible.
  • 17. Biological databases  Like any other database  Data organization for optimal analysis  Data is of different types  Raw data (DNA, RNA, protein sequences)  Curated data (DNA, RNA and protein annotated sequences and structures, expression data)
  • 19. Raw Biological data Amino acid residues (proteins)
  • 20. Curated Biological Data DNA, nucleotide sequences Gene boundaries, topology Gene structure Introns, exons, ORFs, splicing Expression data Mass spectometry
  • 21. Curated Biological data 3D Structures, folds
  • 23. Primary Seq db’s is collaborative- data exchanged daily
  • 24. Databases 1: nucleotide sequence  The main DNA sequence db are EMBL (Europe)/GenBank (USA) /DDBJ (Japan)  There are also specialized databases for the different types of RNAs (i.e. tRNA, rRNA, mi RNA…)  3D structure (DNA and RNA)  Others: Aberrant splicing db; Eucaryotic promoter db (EPD); RNA editing sites, Multimedia Telomere Resource ……
  • 25. EMBL/GenBank/DDJB  These 3 db contain mainly the same informations within 2-3 days (few differences in the format and syntax)  Serve as archives containing all sequences (single genes, ESTs, complete genomes, etc.) derived from:  Genome projects and sequencing centers  Individual scientists  Patent offices (i.e. European Patent Office, EPO)  Non-confidential data are exchanged daily  Currently: 8.3 x106 sequences, over 9.7 x109 bp;  Sequences from > 50’000 different species;
  • 26. GenBank entry: example LOCUS HSERPG 3398 bp DNA PRI 22-JUN-1993 DEFINITION Human gene for erythropoietin. ACCESSION X02158 VERSION X02158.1 GI:31224 KEYWORDS erythropoietin; glycoprotein hormone; hormone; signal peptide. SOURCE human. ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Vertebrata; Mammalia; Eutheria; Primates; Catarrhini; Hominidae; Homo. REFERENCE 1 (bases 1 to 3398) AUTHORS Jacobs,K., Shoemaker,C., Rudersdorf,R., Neill,S.D., Kaufman,R.J., Mufson,A., Seehra,J., Jones,S.S., Hewick,R., Fritsch,E.F., Kawakita,M., Shimizu,T. and Miyake,T. TITLE Isolation and characterization of genomic and cDNA clones of human erythropoietin JOURNAL Nature 313 (6005), 806-810 (1985) MEDLINE 85137899 COMMENT Data kindly reviewed (24-FEB-1986) by K. Jacobs. FEATURES Location/Qualifiers source 1..3398 /organism="Homo sapiens" /db_xref="taxon:9606" mRNA join(397..627,1194..1339,1596..1682,2294..2473,2608..3327) exon 397..627 /number=1 sig_peptide join(615..627,1194..1261) CDS join(615..627,1194..1339,1596..1682,2294..2473,2608..2763) /codon_start=1 /product="erythropoietin" /protein_id="CAA26095.1" /db_xref="GI:312304" /db_xref="SWISS-PROT:P01588" /translation="MGVHECPAWLWLLLSLLSLPLGLPVLGAPPRLICDSRVLQRYLL EAKEAENITTGCAEHCSLNENITVPDTKVNFYAWKRMEVGQQAVEVWQGLALLSEAVL RGQALLVNSSQPWEPLQLHVDKAVSGLRSLTTLLRALGAQKEAISPPDAASAAPLRTI …
  • 27. GenBank entry (cont.) TADTFRKLFRVYSNFLRGKLKLYTGEACRTGDR" intron 628..1193 /number=1 exon 1194..1339 /number=2 mat_peptide join(1262..1339,1596..1682,2294..2473,2608..2760) /product="erythropoietin" intron 1340..1595 /number=2 exon 1596..1682 /number=3 intron 1683..2293 /number=3 exon 2294..2473 /number=4 intron 2474..2607 /number=4 exon 2608..3327 /note="3' untranslated region" /number=5 BASE COUNT 698 a 1034 c 991 g 675 t ORIGIN 1 agcttctggg cttccagacc cagctacttt gcggaactca gcaacccagg catctctgag 61 tctccgccca agaccgggat gccccccagg aggtgtccgg gagcccagcc tttcccagat 121 agcagctccg ccagtcccaa gggtgcgcaa ccggctgcac tcccctcccg cgacccaggg 181 cccgggagca gcccccatga cccacacgca cgtctgcagc agccccgtca gccccggagc 241 ctcaacccag gcgtcctgcc cctgctctga ccccgggtgg cccctacccc tggcgacccc
  • 28. Database 2: protein sequences  SWISS-PROT: created in 1986 (A.Bairoch) http://www.expasy.org/sprot/  TrEMBL: created in 1996; complement to SWISS-PROT; derived from EMBL CDS translations (« proteomic » version of EMBL)  PIR-PSD: Protein Information Resources http://pir.georgetown.edu/  Genpept: « proteomic » version of GenBank  Many specialized protein databases for specific families or groups of proteins.  Examples: AMSDb (antibacterial peptides), GPCRDB (7 TM receptors), IMGT (immune system) YPD (Yeast) etc.
  • 29. SWISS-PROT  Collaboration between the SIB (CH) and EMBL/EBI (UK)  Fully annotated (manually), non-redundant, cross- referenced, documented protein sequence database.  ~113 ’000 sequences from more than 6’800 different species; 70 ’000 references (publications); 550 ’000 cross-references (databases); ~200 Mb of annotations.  Weekly releases; available from about 50 servers across the world, the main source being ExPASy
  • 30. TrEMBL (Translation of EMBL)  It is impossible to cope with the quantity of newly generated data AND to maintain the high quality of SWISS-PROT -> TrEMBL, created in 1996.  TrEMBL is automatically generated (from annotated EMBL coding sequences (CDS)) and annotated using software tools.  Contains all what is not in SWISS-PROT. SWISS-PROT + TrEMBL = all known protein sequences.  Well-structured SWISS-PROT-like resource.
  • 32. TrEMBL: example Original TrEMBL entry which has been integrated into the SWISS-PROT EPO_HUMAN entry and thus which is not found in TrEMBL anymore.
  • 34. PDB  Protein Data Bank, managed by RCSB  Currently there are ~13’000 structures for about 4’000 different molecules, but far less protein family !  There are also databases that contain data derived from PDB. Examples: HSSP (homology-derived secondary structure of proteins), SWISS-3DIMAGE (images)… Restriction enzyme
  • 35. PDB: example HEADER LYASE(OXO-ACID) 01-OCT-91 12CA 12CA 2 COMPND CARBONIC ANHYDRASE /II (CARBONATE DEHYDRATASE) (/HCA II) 12CA 3 COMPND 2 (E.C.4.2.1.1) MUTANT WITH VAL 121 REPLACED BY ALA (/V121A) 12CA 4 SOURCE HUMAN (HOMO SAPIENS) RECOMBINANT PROTEIN 12CA 5 AUTHOR S.K.NAIR,D.W.CHRISTIANSON 12CA 6 REVDAT 1 15-OCT-92 12CA 0 12CA 7 JRNL AUTH S.K.NAIR,T.L.CALDERONE,D.W.CHRISTIANSON,C.A.FIERKE 12CA 8 JRNL TITL ALTERING THE MOUTH OF A HYDROPHOBIC POCKET. 12CA 9 JRNL TITL 2 STRUCTURE AND KINETICS OF HUMAN CARBONIC ANHYDRASE 12CA 10 JRNL TITL 3 /II$ MUTANTS AT RESIDUE VAL-121 12CA 11 JRNL REF J.BIOL.CHEM. V. 266 17320 1991 12CA 12 JRNL REFN ASTM JBCHA3 US ISSN 0021-9258 071 12CA 13 REMARK 1 12CA 14 REMARK 2 12CA 15 REMARK 2 RESOLUTION. 2.4 ANGSTROMS. 12CA 16 REMARK 3 12CA 17 REMARK 3 REFINEMENT. 12CA 18 REMARK 3 PROGRAM PROLSQ 12CA 19 REMARK 3 AUTHORS HENDRICKSON,KONNERT 12CA 20 REMARK 3 R VALUE 0.170 12CA 21 REMARK 3 RMSD BOND DISTANCES 0.011 ANGSTROMS 12CA 22 REMARK 3 RMSD BOND ANGLES 1.3 DEGREES 12CA 23 REMARK 4 12CA 24 REMARK 4 N-TERMINAL RESIDUES SER 2, HIS 3, HIS 4 AND C-TERMINAL 12CA 25 REMARK 4 RESIDUE LYS 260 WERE NOT LOCATED IN THE DENSITY MAPS AND, 12CA 26 REMARK 4 THEREFORE, NO COORDINATES ARE INCLUDED FOR THESE RESIDUES. 12CA 27 ………
  • 36. ATOM 1 N TRP 5 8.519 -0.751 10.738 1.00 13.37 12CA 89 ATOM 2 CA TRP 5 7.743 -1.668 11.585 1.00 13.42 12CA 90 ATOM 3 C TRP 5 6.786 -2.502 10.667 1.00 13.47 12CA 91 ATOM 4 O TRP 5 6.422 -2.085 9.607 1.00 13.57 12CA 92 ATOM 5 CB TRP 5 6.997 -0.917 12.645 1.00 13.34 12CA 93 ATOM 6 CG TRP 5 5.784 -0.209 12.221 1.00 13.40 12CA 94 ATOM 7 CD1 TRP 5 5.681 1.084 11.797 1.00 13.29 12CA 95 ATOM 8 CD2 TRP 5 4.417 -0.667 12.221 1.00 13.34 12CA 96 ATOM 9 NE1 TRP 5 4.388 1.418 11.515 1.00 13.30 12CA 97 ATOM 10 CE2 TRP 5 3.588 0.375 11.797 1.00 13.35 12CA 98 ATOM 11 CE3 TRP 5 3.837 -1.877 12.645 1.00 13.39 12CA 99 ATOM 12 CZ2 TRP 5 2.216 0.208 11.656 1.00 13.39 12CA 100 ATOM 13 CZ3 TRP 5 2.465 -2.043 12.504 1.00 13.33 12CA 101 ATOM 14 CH2 TRP 5 1.654 -1.001 12.009 1.00 13.34 12CA 102 …….
  • 39. Species-oriented databases  SGD: Saccharomyces Genome Database; molecular biology and genetics of the yeast Saccharomyces cerevisiae (baker's or budding yeast)  FlyBase: A Database of the Drosophila Genome  MGI: Mouse Genome Informatics  RGD: Rat Genome Database  WormBase - database for the nematode Caenorhabditis elegans  The Arabidopsis Information Resource (TAIR) - database for the brassica family plant Arabidopsis thaliana Mouse Genome Informatics
  • 41. Sequence Alignment Tools Database Searching: BLAST: NCBI, Web Interface: http://www.ncbi.nlm.nih.gov/BLAST/ WuBLAST http://blast.wustl.edu FASTA: http://www.ebi.ac.uk/fasta3/ Smith-Waterman Par-Align: http://dna.uio.no/search/ Multiple Sequence Alignment: CLUSTALW: http://www-igbmc.u-strasbg.fr/BioInfo/ClustalX/Top.html DiAlign, Web Interface: http://genomatix.gsf.de/cgi-bin/dialign/dialign.pl MSA:http://www.ncbi.nlm.nih.gov/CBBresearch/Schaffer/msa.html Web Interface: http://bioweb.pasteur.fr/seqanal/interfaces/msa-simple.html
  • 43. Uses of BLAST Query a database for sequences similar to an input sequence
  • 44. Uses of BLAST  Identify previously characterized sequences Query a database for sequences similar to an input sequence
  • 45. Uses of BLAST  Identify previously characterized sequences.  Find phylogenetically related sequences. Query a database for sequences similar to an input sequence
  • 46. Uses of BLAST  Identify previously characterized sequences.  Find phylogenetically related sequences.  Identify possible functions based on similarities to known sequences. Query a database for sequences similar to an input sequence
  • 47. Sequence Homology Software  NCBI-BLAST  Run by the National Center for Biotechnology Information  BLAST uses a heuristic algorithm based on the Smith-Waterman algorithm  Algorithm searches database for a small string within the query (default 11 for nucleotide searches), then when it detects a match, searches for shared nucleotides at each end of the seed to extend the match  Gaps are taken into account, then the matches are presented in order of statistical significance  http://www.ncbi.nlm.nih.gov/BLAST/
  • 48. Different Types of BLAST  Nucleotide-nucleotide BLAST (BLASTN):  Basic nucleutide sequence searches  The BLAST that you used for your sequences  Protein-protein BLAST (BLASTP):  Similar technology used to search amino acid sequences  Position-Specific Iterative BLAST (PSI- BLAST):  A more advance protein BLAST useful for analyzing relationships between divergently evolved proteins.
  • 49. Different Types of BLAST  BLASTX and tBLASTN variants:  Use six-frame translation for proteins and nucleotides, respectively, in the search  MegaBLAST:  Used for BLASTing several sequences at once to cut down on processing load and server reporting- time
  • 50. Types of BLAST: Graphic courtesy of Joel Graber.
  • 51. Interpreting BLAST Results  Query Coverage  The percent of the query sequence matched by the database entry  Max Ident  The percent identity, i.e. the percent that the genes match up within the limits of the full match (e.g. deletions or additions reduce this value)
  • 52. FASTA  Another method for local sequence alignment.  Maintained by Dr. William Pearson at the University of Virginia.  http://www.infobiogen.fr/doc/Fasta/docfasta.html
  • 53. FASTA Versions FASTA-nucleotide or protein sequence searching FASTx/ FASTy compares a translated DNA query sequence to a protein sequence database (forward or backward translation of the query) tFASTx/ tFASTy -compares protein query sequence to DNA sequence database that has been translated into three forward and three reverse reading frames
  • 54. Multiple Sequence Alignment Software  Clustal (free)  ClustalX – Software  ClustalW – Web  DNAStar ($$$)  Functionality is similar, but difference is in interface, tools, and speed of algorithms  http://www.ebi.ac.uk/clustalw/
  • 55. Multiple Sequence Alignment (MSA) • Multiple sequence alignment (MSA) can be seen as a generalization of Pairwise Sequence Alignment - instead of aligning two sequences, n sequences are aligned simultaneously, where n is > 2 • Definition: A multiple sequence alignment is an alignment of n > 2 sequences obtained by inserting gaps (“-”) into sequences
  • 56. FastA Format  A sequence in FASTA format begins with a single- line description, followed by lines of sequence data.  The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column.  It is recommended that all lines of text be shorter than 80 characters in length.
  • 57. Other Completed Genomes  Haemophilus influenzae  Escherichia coli  Bacillus subtilus  Helicobacter pylori  Borrelia burgdorferi  Streptococcus pneumoniae  Saccharomyces cerevisiae
  • 58.  Caenorhabditis elegans  Arabidopsis thaliana  Archaeoglobus fulgidus  Methanobacterium thermoautotrophicum  Methanococcus jannaschii  Mycoplasma pneumoniae  Mycoplasm genitaliu  Rickettsia prowazekii  Mycobacterium tuberculosis
  • 59.  Treponema pallidum  Staphylococcus aureus  And more!
  • 60. Completed Plant Genomes  Arabidopsis thaliana Completed Insect Genomes  Drosophila melanogaster Completed Rodent Genomes  Mus musculus
  • 61. Protien- Three Structure Levels Beta Sheet Helix Loop PDB ID: 12as Primary structure: sequence of amino acids – e.g., DRVYIHPF Secondary structure: local folding patterns – e.g., alpha-helix, beta-sheet, loop Tertiary structure: complete 3D fold
  • 62. Different Levels of Protein Structure Prediction
  • 63. Protein Structure Prediction  Regular Secondary Structure Prediction (-helix -sheet)  APSSP2: Highly accurate method for secondary structure prediction  Combines memory based reasoning ( MBR) and ANN methods  Irregular secondary structure prediction methods (Tight turns)  Betatpred: Consensus method for -turns prediction  Statistical methods combined  Kaur and Raghava (2001) Bioinformatics  Bteval : Benchmarking of -turns prediction  Kaur and Raghava (2002) J. Bioinformatics and Computational Biology, 1:495:504  BetaTpred2: Highly accurate method for predicting -turns (ANN)  Multiple alignment and secondary structure information  Kaur and Raghava (2003) Protein Sci 12:627-34  BetaTurns: Prediction of -turn types in proteins  Evolutionary information  Kaur and Raghava (2004) Bioinformatics 20:2751-8.  AlphaPred: Prediction of -turns in proteins  Kaur and Raghava (2004) Proteins: Structure, Function, and Genetics 55:83-90  GammaPred: Prediction of -turns in proteins  Kaur and Raghava (2004) Protein Science; 12:923-929.
  • 64. Protein Structure Prediction  BhairPred: Prediction of Super secondary structure prediction  Prediction of Beta Hairpins  Secondary structure and surface accessibility used as input  Manish et al. (2005) Nucleic Acids Research  TBBpred: Prediction of outer membrane proteins  Prediction of trans membrane beta barrel proteins  Prediction of beta barrel regions  Application of ANN and SVM + Evolutionary information  Natt et al. (2004) Proteins: 56:11-8  ARNHpred: Analysis and prediction side chain, backbone interactions  Prediction of aromatic NH interactions  Kaur and Raghava (2004) FEBS Letters 564:47-57 .  SARpred: Prediction of surface accessibility  Multiple alignment (PSIBLAST) and Secondary structure information  ANN: Two layered network (sequence-structure-structure)  Garg et al., (2005) Proteins  PepStr: Prediction of tertiary structure of Bioactive peptides
  • 65. Homology Modeling The Best Match DRVYIHPFADRVYIHPFAQuery Sequence: Protein sequence classification database • PSI-BLAST • HMM • Smith-Waterman algorithm
  • 66. Phylogenetic analysis  Evolution studies  Systematic biology  Medical research and epidemiology  Ecology
  • 69. Advantages of molecular phylogenetic analysis  Analogous features (share common function, but NOT common ancestry) can be misleading  DNA sequences more simple to model, we only have the four states A, C, G, T  DNA samples for sequence analysis easy to prepare  Softwares used :  Phylip  Paup
  • 70. Modern drug discovery process Target identification Target validation Lead identification Lead optimization Preclinical phase Drug discovery 2-5 years • Drug discovery is an expensive process involving high R & D cost and extensive clinical testing • A typical development time is estimated to be 10-15 years. 6-9 years
  • 71. Design  Comparison of Sequences: Identify targets  Homology modelling: active site prediction  Systems Biology: Identify targets  Databases: Manage information  In silico screening (Ligand based, receptor based): Iterative steps of Molecular docking.  Pharmacogenomic databases: assist safety related issues
  • 72. Molecular Docking RL • Docking is the computational determination of binding affinity between molecules (protein structure and ligand). • Given a protein and a ligand find out the binding free energy of the complex formed by docking them. L R
  • 73. Molecular Docking: classification  Docking or Computer aided drug designing can be broadly classified  Receptor based methods- make use of the structure of the target protein.  Ligand based methods- based on the known inhibitors
  • 74. Receptor based methods  Uses the 3D structure of the target receptor to search for the potential candidate compounds that can modulate the target function.  These involve molecular docking of each compound in the chemical database into the binding site of the target and predicting the electrostatic fit between them.  The compounds are ranked using an appropriate scoring function such that the scores correlate with the binding affinity.  Receptor based method has been successfully applied in many targets
  • 75. Ligand based strategy  In the absence of the structural information of the target, ligand based method make use of the information provided by known inhibitors for the target receptor.  Structures similar to the known inhibitors are identified from chemical databases.
  • 76. Ligand based strategy Search for similar compounds database known actives structures found
  • 77. Example: use of Bioinformatics in Drug discovery Identification of novel drug targets against human malaria Docking Software : Dock Autodock Gold
  • 78. Top 100 solutions Out of top 40 only 10 compounds available for purchase Drug-like compound library (1,000,00) Molecular docking Ligand docked into protein’s active site Insilico identification of novel inhibitors against PfClpQ , a novel drug target of P.falciparum by high throughput docking PfclpQ
  • 79. Rasmol RasMol2 is a molecular graphics program intended for the visualization of proteins, nucleic acids and small molecules The program is aimed at display, teaching and generation of publication quality images The program reads in molecular co-ordinate files and interactively displays the molecule the screen in a variety of representations and color schemes Supported input file formats include Brookhaven Protein Databank (PDB), Sybyl Mol2 formats, Molecular Design Limited's (MDL) Mol file format, Minnesota Supercomputer Centre's (MSC) XYZ (XMol) format CHARMm format files
  • 81. Stephen James Department of Bioinformatics Mar Athanasios College for Advanced Studies(MACFAST) Email : [email protected] Phone: 9746935363