02.databases slides

Biomolecular databases
Bioinformatics
Jacques van Helden
FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgique
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/
NEW ADDRESS (since Nov 1st,2011)
Jacques.van-Helden@univ-amu.fr
Université d’Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics
(TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/

Contents
 Examples of biological databases
 Nucleic sequences: Genbank, EMBL, and DDBJ
 Protein sequences: UniProt
 The Gene Ontology (GO) project
 Issues and perspectives for biological databases

Examples of biomolecular databases
Biomolecular Databases
Jacques.van.Helden@ulb.ac.be
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)

Examples of biomolecular databases
 Sequence and structure databases
 Protein sequences (UniProt)
 DNA sequences (EMBL, Genbank, DDBJ)
 3D structures (PDB)
 Structural motifs (CATH)
 Sequence motifs (PROSITE, PRODOM)
 Genome sequences and annotations
 Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
 Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
 Molecular functions
 Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
 Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
 Transport (YTPdb)
 Biological processes
 Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
 Signal transduction pathways (CSNdb, Transpath)
 Protein-protein interactions (DIP, BIND, MINT)
 Gene networks (GeneNet, FlyNets)

Databases of databases
 There are hundreds of databases related to molecular biology and biochemistry.
New databases are created every year.
 Every year, the first issue of Nucleic Acids Research is dedicated to biological
databases
 http://nar.oupjournals.org/
 2011 Issue: http://nar.oxfordjournals.org/content/39/suppl_1
 The same journal maintains a database of databases: the Molecular Biology
Database Collection
 http://www.oxfordjournals.org/nar/database/c/
 Some bioinformatics centres maintain multiple databases, with cross-links
between them. The SRS server at EBI holds an impressive collection of
databases.
 http://srs.ebi.ac.uk/

Nucleic sequence databases:
GenBank, EMBL, and DDBJ

Okubo et al. (2006) NAR 34: D6-D9
Nucleic sequence databases
 Scientific journals require that any sequence discussed in an article be
deposited in a reference database before publication.
 There are 3 main repositories for nucleic acid sequences: GenBank, EMBL, and DDBJ.
 Sequences deposited in any of these 3 databases are automatically
synchronized with the other 2.

Adapted from Didier Gonze
The sequencing pace
 Nucleic sequences
 Genbank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
• 126,551,501,141 bases in 135,440,924 sequence records in the
traditional GenBank divisions
• 191,401,393,188 bases in 62,715,288 sequence records in the
Whole Genome Shotgun (WGS) division
 Entire genomes
 GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced
genomes.
 http://www.genomesonline.org/gold_statistics.htm
 Protein sequences
 Essentially obtained by translation of putative genes in nucleic
sequences (almost no direct protein sequencing).
 UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
 http://www.ebi.ac.uk/swissprot/sptr_stats/index.html

Size of the nucleotide database
EMBL Nucleotide Sequence Database: Release Notes - Release 113 September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html
Class                                     entries      nucleotides
------------------------------------------------------------------
CON:Constructed                         7,236,371  359,112,791,043
EST:Expressed Sequence Tag             73,715,376   40,997,082,803
GSS:Genome Sequence Scan               34,528,104   21,985,922,905
HTC:High Throughput CDNA sequencing       491,770      594,229,662
HTG:High Throughput Genome sequencing     152,599   25,159,746,658
PAT:Patents                            24,364,832   12,117,896,594
STD:Standard                           13,920,617   37,665,112,606
STS:Sequence Tagged Site                1,322,570      636,037,867
TSA:Transcriptome Shotgun Assembly      8,085,693    5,663,938,279
WGS:Whole Genome Shotgun               88,288,431  305,661,696,545
                                      -----------  ---------------
Total                                 252,106,363  450,481,663,919
Division                                  entries      nucleotides
------------------------------------------------------------------
ENV:Environmental Samples              30,908,230   14,420,391,278
FUN:Fungi                               6,522,586   11,614,472,226
HUM:Human                              32,094,500   38,072,362,804
INV:Invertebrates                      31,907,138   52,527,673,643
MAM:Other Mammals                      40,012,731  145,678,620,711
MUS:Mus musculus                       11,745,671   19,701,637,499
PHG:Bacteriophage                           8,511       85,549,111
PLN:Plants                             52,428,994   55,570,452,118
PRO:Prokaryotes                         2,808,489   28,807,572,238
ROD:Rodents                             6,554,012   33,326,106,733
SYN:Synthetic                           4,045,013      782,174,055
TGN:Transgenic                            285,307      849,743,891
UNC:Unclassified                        8,617,225    4,957,442,673
VRL:Viruses                             1,358,528    1,518,575,082
VRT:Other Vertebrates                  22,809,428   42,568,889,857
                                      -----------  ---------------
Total                                 252,106,363  450,481,663,919

Genbank (NCBI - USA)
http://www.ncbi.nlm.nih.gov/Genbank/

The EMBL Nucleotide Sequence Database (EBI - UK)
http://www.ebi.ac.uk/embl/

DDBJ - DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/

Database  URL                                                   Sequences  Bases (without shotgun)  Bases (including shotgun)  Organisms
DDBJ      http://www.ddbj.nig.ac.jp/    2.0E+06    1.7E+09
EMBL      http://www.ebi.ac.uk/embl/                                           1.0E+11                    2.0E+05
GenBank   http://www.ncbi.nlm.nih.gov/  4.6E+07    5.1E+10                  1.0E+11                    2.1E+05
Size of the nucleic sequence databases
 Summary of database contents for the 3 main databases of nucleic sequences.
 Source: NAR database issue January 2006.

UniProt : protein sequences
and functional annotations

UniProt - the Universal Protein Resource
http://www.uniprot.org/
 Database content (Sept 2012)
 UniProtKB:
• 24,532,088 entries
• Translation of EMBL coding sequences
(non-redundant with Swiss-Prot)
 UniProtKB/Swiss-Prot section (reviewed):
• 537,505 entries
• annotation by experts
• high information content
• many references to the literature
• good reliability of the information
 The rest (about 98% of the entries)
• Automatic annotation by sequence
similarity.
 Features
 The most comprehensive protein database in
the world.
 A huge team: >100 annotators + developers.
 Annotation by experts: annotators are
specialized for different types of proteins or
organisms.
 Recognized worldwide as an essential
resource.
 References
 Bairoch et al. The SWISS-PROT protein
sequence data bank. Nucleic Acids Res (1991)
vol. 19 Suppl pp. 2247-9
 The UniProt Consortium. The Universal Protein
Resource (UniProt) 2009. Nucleic Acids Res
(2008). Database Issue.
Number of entries (polypeptides) in Swiss-Prot
http://www.expasy.org/sprot/relnotes/relstat.html
Taxonomic distribution of the sequences
Within Eukaryotes

UniProt example - Human Pax-6 protein
Header : name and synonyms

Human-based annotation by specialists

Structured annotation : keywords and Gene Ontology terms

Protein interactions; Alternative products

Detailed description of regions, variations, and secondary structure

Peptidic sequence

References to original publications

Cross-references to many databases (fragment shown)

3D Structure of macromolecules

PDB - The Protein Data Bank
http://www.rcsb.org/pdb/

Genome browsers

EnsEMBL Genome Browser (Sanger Institute + EBI)
http://www.ensembl.org/

UCSC Genome Browser (University California Santa Cruz - USA)
http://genome.ucsc.edu/
Human gene Pax6 aligned with Vertebrate genomes

Drosophila gene eyeless (homolog to Pax6) aligned with Insect genomes

Drosophila 120kb chromosomal region covering the Achaete-Scute Complex

ECR Browser
http://ecrbrowser.dcode.org/

EnsEMBL - Example: Drosophila gene Pax6
http://www.ensembl.org/

Comparative genomics

Integr8 - access to complete genomes and proteomes
http://www.ebi.ac.uk/integr8/

Integr8 - genome summaries

Integr8 - clusters of orthologous genes (COGs)

Integr8 - clusters of paralogous genes

Databases of protein domains

Prosite - protein domains, families and functional sites
http://www.expasy.ch/prosite/

Prosite - aligned sequences and logo
 Some of the sequences that were
used to build the Prosite profile for
the Zn(2)-C6 fungal-type DNA-binding
domain (ZN2_CY6_FUNGAL_2,
PS50048).
 The Sequence Logo (below)
indicates the level of conservation
of each residue in each column of
the alignment.
 Note the 6 cysteines,
characteristic of this domain.

Prosite - Example of profile matrix

Prosite - Example of sequence logo

Prosite - Example of domain signature
 The domain signature is a string-based pattern representing the residues that
are characteristic of a domain.
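
A string-based signature of this kind can be matched against sequences by translating it into a regular expression. The sketch below assumes the usual PROSITE pattern conventions (elements separated by "-", "x" for any residue, "[...]" for allowed residues, "{...}" for excluded residues, "(n)" or "(n,m)" for repeats, "<" and ">" for sequence ends). The pattern and the sequence are made up for illustration; they are not the actual Zn(2)-C6 signature.

```python
import re

# One pattern element: a residue specification followed by an optional repeat.
ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = "^" if element.startswith("<") else ""
        suffix = "$" if element.endswith(">") else ""
        match = ELEMENT.fullmatch(element.strip("<>"))
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":                 # any residue
            residue = "."
        elif residue.startswith("{"):      # excluded residues
            residue = "[^" + residue[1:-1] + "]"
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        parts.append(prefix + residue + repeat + suffix)
    return "".join(parts)


# Hypothetical signature and sequence, for illustration only.
signature = "C-x(2)-C-x(6)-C"
sequence = "MAACKRCRLKKIKCSG"

regex = prosite_to_regex(signature)
hit = re.search(regex, sequence)
print(regex)         # C.{2}C.{6}C
print(hit.group())   # CKRCRLKKIKC
```

A pattern only says whether a match exists, whereas a profile matrix scores each residue at each position, which is why the two representations coexist for the same domain.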

PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/
Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)

CATH - Protein Structure Classification
http://www.cathdb.info/
 CATH is a hierarchical classification of
protein domain structures, which clusters
proteins at four major levels:
 Class (C),
 Architecture (A),
 Topology (T)
 Homologous superfamily (H).
 The boundaries and assignments for
each protein domain are determined
using a combination of automated and
manual procedures which include
computational techniques, empirical and
statistical evidence, literature review and
expert analysis.
 References
 Orengo et al. The CATH Database
provides insights into protein structure/
function relationships. Nucleic Acids Res
(1999) vol. 27 (1) pp. 275-9
 Cuff et al. The CATH classification
revisited--architectures reviewed and new
ways to characterize structural divergence
in superfamilies. Nucleic Acids Res (2008)
pp.
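
The four-level hierarchy described above can be handled programmatically by splitting a CATH identifier into its Class, Architecture, Topology and Homologous superfamily fields. The sketch below assumes identifiers written as four dot-separated numbers in C.A.T.H order; the two identifiers are used only as examples, and the helper names are hypothetical.

```python
from typing import NamedTuple


class CathCode(NamedTuple):
    """The four major CATH levels, from most general to most specific."""
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int


def parse_cath_code(code):
    """Split a dotted identifier such as '3.40.50.300' into its four levels."""
    fields = code.split(".")
    if len(fields) != 4:
        raise ValueError(f"expected 4 dot-separated levels, got {code!r}")
    return CathCode(*(int(field) for field in fields))


def shared_levels(a, b):
    """Count how many leading levels (C, A, T, H) two domains have in common."""
    count = 0
    for level_a, level_b in zip(a, b):
        if level_a != level_b:
            break
        count += 1
    return count


first = parse_cath_code("3.40.50.300")
second = parse_cath_code("3.40.50.720")
print(first.topology)                # 50
print(shared_levels(first, second))  # 3: same class, architecture and topology
```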

CATH - Protein Structure Classification
http://www.cathdb.info/

InterPro (EBI - UK)
http://www.ebi.ac.uk/interpro/

InterPro (EBI - UK)
Antennapedia-like Homeobox (entry IPR001827)

The Gene Ontology (GO) database

Ontology definition
 Ontology: the part of metaphysics that deals with being as being, independently of
its particular determinations.
Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993 (translated from French)

The "bio-ontologies"
 Answer to the problem of inconsistencies in the annotations
 Controlled vocabulary
 Hierarchical classification between the terms of the controlled vocabulary
 E.g.: The Gene Ontology
 molecular function ontology
 process ontology
 cellular component ontology
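
As a rough illustration of what a controlled vocabulary with a hierarchical classification buys, the sketch below maps synonyms to a preferred term and then retrieves every gene annotated to a term or to any of its more specific terms. The hierarchy is a toy example, not copied from the Gene Ontology, and the gene names are hypothetical.

```python
# Toy hierarchy: each term lists its more general parent terms.
PARENTS = {
    "cellular component": [],
    "membrane": ["cellular component"],
    "plasma membrane": ["membrane"],
    "organelle membrane": ["membrane"],
    "nuclear membrane": ["organelle membrane"],
}

# Controlled vocabulary: synonyms are mapped to one preferred term.
SYNONYMS = {
    "cell membrane": "plasma membrane",
    "plasmalemma": "plasma membrane",
}

# Hypothetical annotations, some of them using synonyms.
ANNOTATIONS = {
    "geneA": "cell membrane",
    "geneB": "nuclear membrane",
    "geneC": "cellular component",
}


def preferred_term(name):
    """Return the preferred term for a name or one of its synonyms."""
    name = name.strip().lower()
    return SYNONYMS.get(name, name)


def ancestors(term):
    """Collect all terms that are more general than the given term."""
    found = set()
    stack = list(PARENTS[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(PARENTS[parent])
    return found


def genes_annotated_to(term):
    """Genes annotated to the term itself or to any more specific term."""
    term = preferred_term(term)
    hits = []
    for gene, raw_name in sorted(ANNOTATIONS.items()):
        annotated = preferred_term(raw_name)
        if annotated == term or term in ancestors(annotated):
            hits.append(gene)
    return hits


print(genes_annotated_to("membrane"))  # ['geneA', 'geneB']
```

With free-text annotations, the query for "membrane" would miss geneA (annotated under a synonym) unless every spelling variant were searched explicitly.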

Gene ontology: molecular functions

Gene ontology: cellular components

Gene Ontology Database
http://www.geneontology.org/

Gene Ontology Database (http://www.geneontology.org/)
Example: methionine biosynthetic process

Status of GO annotations (NAR DB issue 2006)
 Term definitions
 Biological process terms: 9,805
 Molecular function terms: 7,076
 Cellular component terms: 1,574
 Sequence Ontology terms: 963
 Genomes with annotation: 30
 Excludes annotations from UniProt, which represent 261 annotated proteomes.
 Annotated gene products
 Total: 1,618,739
 Electronic only: 1,460,632
 Manually curated: 158,107

QuickGO (http://www.ebi.ac.uk/QuickGO/)
 Web site
http://www.ebi.ac.uk/QuickGO/
 A user-friendly Web interface to
the Gene Ontology.
 Graphical display of the
hierarchical relationships
between terms.
 Convenient browsing between
classes.

Remarks on "bio-ontologies"
 Improvement compared to free text
 controlled vocabulary (choice among synonyms)
 hierarchical relationships between the concepts
 Nothing to do with the philosophical concept of ontology
 A "bio-ontology" is usually nothing more than a taxonomic classification of
the terms of a controlled vocabulary
 Multiple possibilities of classification criteria
 e.g. compartment subtypes (plasma membrane is a membrane)
 e.g. compartment locations (the nucleus is inside the cytoplasm, which is inside
the plasma membrane)
 To be useful, should remain purpose-based
 each biologist might wish to define his/her own classification based on his/her
needs and scope of interest
 impossible to define a unifying standard for all biologists
 No representation of molecular interactions
 relationships between objects are only hierarchical, not horizontal or cyclic
 e.g. does not describe which genes are the targets of a given transcription
factor

What is biological function ?
 A general definition
 Function: the characteristic action or role of an element or organ within a whole
(often opposed to structure). Source: Le Petit Robert - dictionnaire alphabétique et
analogique de la langue française. 1982 (translated from French).
 Function and gene ontology
 Understanding a function requires linking the molecular activity
to the context in which it takes place (the process).
 Multifunctionality
• The same activity can play different roles in different processes.
 Example: scute gene in Drosophila melanogaster: a transcription factor
(activity) involved in sex determination, determination of neural precursors
and Malpighian tubules (3 processes).
• Multiple activities of the same protein in a given process
 Example: the PutA protein in Escherichia coli contains 2 enzymatic
domains (enzymatic activities) + a DNA-binding domain (DNA-binding
transcription factor) -> 3 molecular activities in the same process (proline
utilization).
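
One way to picture multifunctionality is to store each annotation as a (gene product, activity, process) triple and count distinct activities and processes per gene product. The sketch below uses the two examples from this slide; the labels "enzymatic activity 1" and "enzymatic activity 2" are placeholders, since the slide does not name the two enzymatic activities.

```python
from collections import defaultdict

# (gene product, molecular activity, biological process) triples.
ANNOTATIONS = [
    ("scute", "transcription factor", "sex determination"),
    ("scute", "transcription factor", "determination of neural precursors"),
    ("scute", "transcription factor", "determination of Malpighian tubules"),
    ("PutA", "enzymatic activity 1", "proline utilization"),
    ("PutA", "enzymatic activity 2", "proline utilization"),
    ("PutA", "DNA-binding transcription factor", "proline utilization"),
]


def summarize(annotations):
    """Return {gene product: (number of activities, number of processes)}."""
    activities = defaultdict(set)
    processes = defaultdict(set)
    for product, activity, process in annotations:
        activities[product].add(activity)
        processes[product].add(process)
    return {
        product: (len(activities[product]), len(processes[product]))
        for product in activities
    }


print(summarize(ANNOTATIONS))
# {'scute': (1, 3), 'PutA': (3, 1)}
```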

Small compounds, reactions
and metabolic pathways

LIGAND - Small compounds and metabolic reactions

KEGG - Kyoto Encyclopedia of Genes and Genomes

EcoCyc, BioCyc and MetaCyc - Metabolic pathways

Protein interaction networks
and transduction pathways

Microarray databases

Human genome resources

HapMap
http://www.hapmap.org/
 The International HapMap
Project is a multi-country effort to
identify and catalog genetic
similarities and differences in
human beings.
 Associations between genetic
variations (SNPs, ...) and
diseases + response to
pharmaceuticals.

Issues for
biomolecular databases

Issues for biological databases
 Dealing with biological complexity
 Data content
 Coverage
 Information content
 Data quality
 Data structure
 Consistency
 Query capabilities
 Interfaces
 User interfaces
 Programmatic interfaces
 Annotation
 Funding

Towards biological complexity
 The main databases currently available are focussed on one type of molecular
entity : nucleic sequences, proteins, compounds, …
 This type of organization is very convenient as long as the information to be
represented is simple (e.g. DNA sequences, structures of small molecules and
macromolecules).
 It becomes more difficult if we want to represent
 the interactions between biological objects,
 the integration of various elements in a biological process (metabolic pathways, protein
interaction networks, regulatory networks, …)
 complex concepts such as "biological function"

Data content
 Scope of the database
 types of biological objects represented
 Number of entries
 coverage of the current knowledge
 Information content
 Level of detail in the description of the biological objects
 References to the source of information

Data quality
 Data consistency
 always use the same name to indicate the same object
 (this seems trivial, but it is unfortunately still not always the case)
 even better: define an ID for each object, and allow retrieving it by any of its
synonyms
 avoid spelling mistakes
 Data structure
 distinct fields for distinct attributes of the biological objects
 Reliability
 Evidence? Level of confidence?
 Assignment of function by similarity
• recursive process → propagation of errors
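
The recommendation to give each object one ID and to make it retrievable by any synonym can be sketched as a small in-memory registry. The class below is a hypothetical illustration, not the design of any existing database; the identifier "OBJ0001" is invented, and name matching is assumed to ignore case and extra spaces.

```python
class Registry:
    """Stores one record per identifier and indexes it by all of its names."""

    def __init__(self):
        self._records = {}      # identifier -> record
        self._name_to_id = {}   # normalized name -> identifier

    @staticmethod
    def _key(name):
        # Normalize case and whitespace so trivial variants map to the same key.
        return " ".join(name.split()).casefold()

    def add(self, identifier, preferred_name, synonyms=()):
        if identifier in self._records:
            raise ValueError(f"duplicate identifier: {identifier}")
        names = [identifier, preferred_name, *synonyms]
        # Refuse a name that already designates another object (consistency).
        for name in names:
            owner = self._name_to_id.get(self._key(name))
            if owner is not None:
                raise ValueError(f"{name!r} already refers to {owner}")
        self._records[identifier] = {
            "id": identifier,
            "name": preferred_name,
            "synonyms": list(synonyms),
        }
        for name in names:
            self._name_to_id[self._key(name)] = identifier

    def get(self, name):
        """Return the record for an ID, a name or a synonym, or None."""
        identifier = self._name_to_id.get(self._key(name))
        return self._records.get(identifier)


registry = Registry()
registry.add("OBJ0001", "PAX6", ["Pax-6", "paired box protein Pax-6"])
print(registry.get("pax-6")["id"])                          # OBJ0001
print(registry.get("Paired box protein PAX-6")["name"])    # PAX6
```

Rejecting a name that already points to another object is one possible policy; real resources often have to accept ambiguous synonyms and return several candidates instead.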

Query capabilities
 Browsing (click and read)
 Simple search
 select records with some constraints
 More elaborate search
 select specific fields of some records with constraints on some fields (~SQL
SELECT)
 Complex querying
 ability to return an answer that results from a "live" computation, and was not part
of any record of the database
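
The three levels of querying listed above can be contrasted on a toy relational table. The sketch below uses Python's built-in sqlite3 module with an in-memory database; the table layout, accession numbers, protein names and lengths are all hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein ("
    "accession TEXT PRIMARY KEY, name TEXT, organism TEXT, length INTEGER)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("X0001", "protA", "Homo sapiens", 422),
        ("X0002", "protB", "Homo sapiens", 150),
        ("X0003", "protC", "Drosophila melanogaster", 898),
    ],
)

# Simple search: whole records that satisfy a constraint.
simple = conn.execute(
    "SELECT * FROM protein WHERE organism = ?", ("Homo sapiens",)
).fetchall()

# More elaborate search: specific fields, with a constraint on another field.
fields = conn.execute(
    "SELECT name, length FROM protein WHERE length > ? ORDER BY length", (200,)
).fetchall()

# Complex query: the answer is computed and is not stored in any record.
computed = conn.execute(
    "SELECT organism, AVG(length) FROM protein "
    "GROUP BY organism ORDER BY organism"
).fetchall()

print(len(simple))   # 2
print(fields)        # [('protA', 422), ('protC', 898)]
print(computed)      # [('Drosophila melanogaster', 898.0), ('Homo sapiens', 286.0)]
```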

Interfaces
 User interfaces
 user-friendly
 convenient browsing
 intuitive query forms
 visualization (graphical output)
 Programmatic interfaces
 communication with external programs:
• other databases (concept of distributed database)
• analysis tools

Annotation
 Problem
 The flow of available data is increasing exponentially
 Strategies
 internal curators
 selected external experts
 public submission
 computer-based extraction of information from biological texts

Funding
 Public funding
 Problem: easier to obtain public funds for creating a new database than for
maintaining or expanding existing resources
 Private funding
 Industrial companies are
• ready to invest in good data and good query capabilities
• interested in academic expertise
 Solutions
 All users pay (per query for example)
• Note: academic users are funded by public money anyway
 Hybrid solution
• access is free for academic users, not for companies
• companies can buy the whole database and install it in-house
(+ add their own private data)
• academia-industry interface is often ensured by a spinoff company
