Biomolecular databases
Bioinformatics
Jacques van Helden
FORMER ADDRESS (1999-2011)
Université Libre de Bruxelles, Belgium
Bioinformatique des Génomes et des Réseaux (BiGRe lab)
http://www.bigre.ulb.ac.be/
NEW ADDRESS (since Nov 1st, 2011)
Jacques.van-Helden@univ-amu.fr
Université d’Aix-Marseille, France
Lab. Technological Advances for Genomics and Clinics
(TAGC, INSERM Unit U1090)
http://tagc.univ-mrs.fr/
Contents
 Examples of biological databases
 Nucleic sequences: GenBank, EMBL, and DDBJ
 Protein sequences: UniProt
 The Gene Ontology (GO) project
 Issues and perspectives for biological databases
Examples of biomolecular databases
Biomolecular Databases
Jacques.van.Helden@ulb.ac.be
Université Libre de Bruxelles, Belgique
Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe)
http://www.bigre.ulb.ac.be/
Examples of biomolecular databases
 Sequence and structure databases
   Protein sequences (UniProt)
   DNA sequences (EMBL, GenBank, DDBJ)
   3D structures (PDB)
   Structural motifs (CATH)
   Sequence motifs (PROSITE, PRODOM)
 Genome sequences and annotations
   Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)
   Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)
 Molecular functions
   Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)
   Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)
   Transport (YTPdb)
 Biological processes
   Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)
   Signal transduction pathways (CSNdb, Transpath)
   Protein-protein interactions (DIP, BIND, MINT)
   Gene networks (GeneNet, FlyNets)
Databases of databases
 There are hundreds of databases related to molecular biology and biochemistry. New databases are created every year.
 Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
   http://nar.oupjournals.org/
   2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
 The same journal maintains a database of databases: the Molecular Biology Database Collection.
   http://www.oxfordjournals.org/nar/database/c/
 Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
   http://srs.ebi.ac.uk/
Nucleic sequence databases:
GenBank, EMBL, and DDBJ
Okubo et al. (2006) NAR 34: D6-D9
Nucleic sequence databases
 Before publishing an article that deals with a sequence, scientific journals require the sequence to be deposited in a reference database.
 There are 3 main repositories for nucleic acid sequences: GenBank, EMBL, and DDBJ.
 Sequences deposited in any of these 3 databases are automatically synchronized with the 2 others.
Adapted from Didier Gonze
The sequencing pace
 Nucleic sequences
   GenBank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
    • 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
    • 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
 Entire genomes
   GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
   http://www.genomesonline.org/gold_statistics.htm
 Protein sequences
   Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
   UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
   http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
Size of the nucleotide database
EMBL Nucleotide Sequence Database: Release Notes - Release 113 September 2012
http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

Class                                      entries      nucleotides
-------------------------------------------------------------------
CON:Constructed                          7,236,371  359,112,791,043
EST:Expressed Sequence Tag              73,715,376   40,997,082,803
GSS:Genome Sequence Scan                34,528,104   21,985,922,905
HTC:High Throughput CDNA sequencing        491,770      594,229,662
HTG:High Throughput Genome sequencing      152,599   25,159,746,658
PAT:Patents                             24,364,832   12,117,896,594
STD:Standard                            13,920,617   37,665,112,606
STS:Sequence Tagged Site                 1,322,570      636,037,867
TSA:Transcriptome Shotgun Assembly       8,085,693    5,663,938,279
WGS:Whole Genome Shotgun                88,288,431  305,661,696,545
-------------------------------------------------------------------
Total                                  252,106,363  450,481,663,919

Division                                   entries      nucleotides
-------------------------------------------------------------------
ENV:Environmental Samples               30,908,230   14,420,391,278
FUN:Fungi                                6,522,586   11,614,472,226
HUM:Human                               32,094,500   38,072,362,804
INV:Invertebrates                       31,907,138   52,527,673,643
MAM:Other Mammals                       40,012,731  145,678,620,711
MUS:Mus musculus                        11,745,671   19,701,637,499
PHG:Bacteriophage                            8,511       85,549,111
PLN:Plants                              52,428,994   55,570,452,118
PRO:Prokaryotes                          2,808,489   28,807,572,238
ROD:Rodents                              6,554,012   33,326,106,733
SYN:Synthetic                            4,045,013      782,174,055
TGN:Transgenic                             285,307      849,743,891
UNC:Unclassified                         8,617,225    4,957,442,673
VRL:Viruses                              1,358,528    1,518,575,082
VRT:Other Vertebrates                   22,809,428   42,568,889,857
-------------------------------------------------------------------
Total                                  252,106,363  450,481,663,919
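
The per-class figures in the release notes can be cross-checked against the totals with a few lines of Python. This is a minimal sketch: the numbers are copied from the class table of this slide, while the variable and function names are ours. The check suggests that the entry total covers all ten classes, whereas the nucleotide total is reached only when the CON (constructed) class is left out. Our reading, not a statement from the release notes, is that constructed entries reassemble sequence that is already counted in other classes.

```python
# Entries and nucleotides per data class, copied from EMBL Release 113.
CLASSES = {
    "CON": (7_236_371, 359_112_791_043),
    "EST": (73_715_376, 40_997_082_803),
    "GSS": (34_528_104, 21_985_922_905),
    "HTC": (491_770, 594_229_662),
    "HTG": (152_599, 25_159_746_658),
    "PAT": (24_364_832, 12_117_896_594),
    "STD": (13_920_617, 37_665_112_606),
    "STS": (1_322_570, 636_037_867),
    "TSA": (8_085_693, 5_663_938_279),
    "WGS": (88_288_431, 305_661_696_545),
}
TOTAL_ENTRIES = 252_106_363
TOTAL_NUCLEOTIDES = 450_481_663_919


def sum_entries(classes):
    """Total number of entries over all classes."""
    return sum(entries for entries, _ in classes.values())


def sum_nucleotides(classes, exclude=()):
    """Total number of nucleotides, skipping the classes named in `exclude`."""
    return sum(nt for name, (_, nt) in classes.items() if name not in exclude)


if __name__ == "__main__":
    print("entries match:", sum_entries(CLASSES) == TOTAL_ENTRIES)
    print("nucleotides match without CON:",
          sum_nucleotides(CLASSES, exclude=("CON",)) == TOTAL_NUCLEOTIDES)
    wgs_share = CLASSES["WGS"][1] / TOTAL_NUCLEOTIDES
    print(f"WGS share of counted nucleotides: {wgs_share:.1%}")
```

The same two functions can be reused for the division table by swapping in its rows.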
GenBank (NCBI - USA)
http://www.ncbi.nlm.nih.gov/Genbank/
The EMBL Nucleotide Sequence Database (EBI - UK)
http://www.ebi.ac.uk/embl/
DDBJ - DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/
URL  Sequences  Bases (without shotgun)  Bases (including shotgun)  Organisms
DDBJ http://www.ddbj.nig.ac.jp/ 2.0E+06 1.7E+09
EMBL
GenBank
http://www.ebi.ac.uk/embl/
http://www.ncbi.nlm.nih.gov/ 4.6E+07 5.1E+10
1.0E+11
1.0E+11
2.0E+05
2.1E+05
Size of the nucleic sequence databases
 Summary of database contents for the 3 main databases of nucleic sequences.
 Source: NAR database issue January 2006.
UniProt : protein sequences
and functional annotations
UniProt - the Universal Protein Resource
http://www.uniprot.org/
 Database content (Sept 2012)
   UniProtKB:
    • 24,532,088 entries
    • Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
   UniProtKB/Swiss-Prot section (reviewed):
    • 537,505 entries
    • annotation by experts
    • high information content
    • many references to the literature
    • good reliability of the information
   The rest (about 98% of the entries, by the counts above):
    • Automatic annotation by sequence similarity.
 Features
   The most comprehensive protein database in the world.
   A huge team: >100 annotators + developers.
   Annotation by experts: annotators are specialized in different types of proteins or organisms.
   Recognized worldwide as an essential resource.
 References
   Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
   The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.
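
The share of expert-reviewed entries follows directly from the two counts given under "Database content". The sketch below assumes that 24,532,088 is the UniProtKB total; if that figure instead counts only the automatically annotated section, the result changes only marginally. Either way the reviewed section comes out at roughly 2% of the entries, so the automatically annotated remainder is well above 90%.

```python
# Entry counts quoted on the slide (Sept 2012).
UNIPROTKB_ENTRIES = 24_532_088   # assumed here to be the UniProtKB total
SWISSPROT_ENTRIES = 537_505      # reviewed (Swiss-Prot) section


def reviewed_fraction(reviewed, total):
    """Fraction of entries that were reviewed by an expert."""
    if total <= 0:
        raise ValueError("total must be positive")
    return reviewed / total


if __name__ == "__main__":
    frac = reviewed_fraction(SWISSPROT_ENTRIES, UNIPROTKB_ENTRIES)
    print(f"reviewed: {frac:.1%}, automatically annotated: {1 - frac:.1%}")
```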
Number of entries (polypeptides) in Swiss-Prot
http://www.expasy.org/sprot/relnotes/relstat.html
Taxonomic distribution of the sequences
Within Eukaryotes
UniProt example - Human Pax-6 protein
 Header: name and synonyms
 Human-based annotation by specialists
 Structured annotation: keywords and Gene Ontology terms
 Protein interactions; alternative products
 Detailed description of regions, variations, and secondary structure
 Peptidic sequence
 References to original publications
 Cross-references to many databases (fragment shown)
3D Structure of macromolecules
PDB - The Protein Data Bank
http://www.rcsb.org/pdb/
Genome browsers
EnsEMBL Genome Browser (Sanger Institute + EBI)
http://www.ensembl.org/
UCSC Genome Browser (University of California, Santa Cruz - USA)
http://genome.ucsc.edu/
 Human gene Pax6 aligned with vertebrate genomes
 Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes
 Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex
ECR Browser
http://ecrbrowser.dcode.org/
EnsEMBL - Example: Drosophila gene Pax6
http://www.ensembl.org/
Comparative genomics
Integr8 - access to complete genomes and proteomes
http://www.ebi.ac.uk/integr8/
 Genome summaries
 Clusters of orthologous genes (COGs)
 Clusters of paralogous genes
Databases of protein domains
Prosite - protein domains, families and functional sites
http://www.expasy.ch/prosite/
Prosite - aligned sequences and logo
 Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
 The sequence logo indicates the level of conservation of each residue in each column of the alignment.
 Note the 6 cysteines, characteristic of this domain.
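
The "level of conservation" that a sequence logo displays is commonly measured as the information content of each alignment column. The sketch below is one plain way to compute it, assuming a 20-letter amino acid alphabet and no small-sample correction. The toy alignment is hypothetical and is not taken from the Prosite entry; function names are ours.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column.

    Computed as log2(alphabet_size) minus the Shannon entropy of the
    observed residue frequencies. Gaps ('-') are ignored and no
    small-sample correction is applied.
    """
    residues = [r for r in column if r != "-"]
    if not residues:
        return 0.0
    counts = Counter(residues)
    n = len(residues)
    entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return math.log2(alphabet_size) - entropy


def logo_heights(alignment):
    """Information content for each column of an equal-length alignment."""
    if len({len(seq) for seq in alignment}) > 1:
        raise ValueError("all sequences must have the same length")
    return [column_information(col) for col in zip(*alignment)]


if __name__ == "__main__":
    # Hypothetical toy alignment: columns 1 and 4 are fully conserved.
    toy = ["CAKC", "CSRC", "CTKC", "CGRC"]
    for pos, bits in enumerate(logo_heights(toy), start=1):
        print(pos, round(bits, 2))
```

A fully conserved column, such as each of the cysteine positions, reaches the maximum of log2(20) bits, while a column in which every sequence carries a different residue scores much lower.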
Prosite - Example of profile matrix
Prosite - Example of sequence logo
Prosite - Example of domain signature
 The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
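
Because a signature is a string-based pattern, it can be translated into a regular expression and scanned against sequences. The sketch below handles the basic PROSITE pattern elements (`x`, `[...]`, `{...}`, repetition counts, and the `<` / `>` terminal anchors). The pattern used in the demo is a made-up six-cysteine pattern chosen for illustration; it is not the signature stored in Prosite for this domain.

```python
import re

# One pattern element: a residue spec followed by an optional (n) or (n,m).
_ELEMENT = re.compile(r"(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    pattern = pattern.rstrip(".")
    parts = []
    for element in pattern.split("-"):
        prefix = suffix = ""
        if element.startswith("<"):      # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):        # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        if residue == "x":               # any residue
            core = "."
        elif residue.startswith("{"):    # any residue except those listed
            core = "[^" + residue[1:-1] + "]"
        else:                            # one residue or a [set] of residues
            core = residue
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + suffix)
    return "".join(parts)


if __name__ == "__main__":
    # Hypothetical pattern with six cysteines, for illustration only.
    toy_pattern = "C-x(2)-C-x(6)-C-x(5,12)-C-x(2)-C-x(6,8)-C"
    regex = re.compile(prosite_to_regex(toy_pattern))
    toy_sequence = ("MK" + "C" + "A" * 2 + "C" + "A" * 6 + "C" + "A" * 5
                    + "C" + "A" * 2 + "C" + "A" * 6 + "C" + "LV")
    hit = regex.search(toy_sequence)
    print(regex.pattern, hit.span() if hit else None)
```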
PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/
Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)
CATH - Protein Structure Classification
http://www.cathdb.info/
 CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
   Class (C),
   Architecture (A),
   Topology (T),
   Homologous superfamily (H).
 The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures, which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
 References
   Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
   Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
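
The four-level hierarchy described above lends itself to a dotted numeric code with one number per level. The sketch below parses such a code and compares two domains level by level. The code format, the example values and the class names in `CLASS_NAMES` are assumptions made for illustration and should be checked against the CATH documentation; all function names are ours.

```python
from typing import NamedTuple

# Class names as commonly listed for CATH; treated here as an assumption.
CLASS_NAMES = {
    1: "Mainly Alpha",
    2: "Mainly Beta",
    3: "Alpha Beta",
    4: "Few Secondary Structures",
}


class CathCode(NamedTuple):
    class_: int
    architecture: int
    topology: int
    homologous_superfamily: int

    def prefix(self, depth):
        """Dotted code truncated to the first `depth` levels (1 to 4)."""
        if not 1 <= depth <= 4:
            raise ValueError("depth must be between 1 and 4")
        return ".".join(str(level) for level in self[:depth])


def parse_cath_code(text):
    """Parse a dotted four-level code such as '3.40.50.300'."""
    fields = text.strip().split(".")
    if len(fields) != 4 or not all(f.isdigit() for f in fields):
        raise ValueError(f"expected four dot-separated integers, got {text!r}")
    return CathCode(*(int(f) for f in fields))


def shared_depth(a, b):
    """Number of leading levels (0 to 4) on which two codes agree."""
    depth = 0
    for x, y in zip(a, b):
        if x != y:
            break
        depth += 1
    return depth


if __name__ == "__main__":
    first = parse_cath_code("3.40.50.300")   # example values only
    second = parse_cath_code("3.40.47.10")   # example values only
    print(CLASS_NAMES[first.class_], first.prefix(2), shared_depth(first, second))
```

Two domains that agree on the first two levels share class and architecture but differ in topology, which is the kind of question a hierarchical classification answers cheaply.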
InterPro (EBI - UK)
http://www.ebi.ac.uk/interpro/
 Example: Antennapedia-like Homeobox (entry IPR001827)
The Gene Ontology (GO) database
Ontology definition
 Ontology: the part of metaphysics concerned with being as such, independently of its particular determinations.
Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993 (translated from French)
The "bio-ontologies"
 Answer to the problem of inconsistencies in the annotations
   Controlled vocabulary
   Hierarchical classification of the terms of the controlled vocabulary
 E.g.: The Gene Ontology
   molecular function ontology
   process ontology
   cellular component ontology
Gene ontology: processes
Gene ontology: molecular functions
Gene ontology: cellular components
Gene Ontology Database
http://www.geneontology.org/
 Example: methionine biosynthetic process
Status of GO annotations (NAR DB issue 2006)
 Term definitions
   Biological process terms      9,805
   Molecular function terms      7,076
   Cellular component terms      1,574
   Sequence Ontology terms         963
 Genomes with annotation: 30
   Excludes annotations from UniProt, which represent 261 annotated proteomes.
 Annotated gene products
   Total               1,618,739
   Electronic only     1,460,632
   Manually curated      158,107
QuickGO (http://www.ebi.ac.uk/QuickGO/)
 A user-friendly Web interface to the Gene Ontology.
 Graphical display of the hierarchical relationships between terms.
 Convenient browsing between classes.
Remarks on "bio-ontologies"
 Improvement compared to free text
   controlled vocabulary (choice among synonyms)
   hierarchical relationships between the concepts
 Nothing to do with the philosophical concept of ontology
   A "bio-ontology" is usually nothing more than a taxonomical classification of the terms of a controlled vocabulary
 Multiple possibilities of classification criteria
   e.g. compartment subtypes (plasma membrane is a membrane)
   e.g. compartment locations (nucleus is inside cytoplasm, which is inside plasma membrane)
 To be useful, should remain purpose-based
   each biologist might wish to define his/her own classification based on his/her needs and scope of interest
   impossible to define a unifying standard for all biologists
 No representation of molecular interactions
   relationships between objects are only hierarchical, not horizontal or cyclic
   e.g. does not describe which genes are the targets of a given transcription factor
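
The two classification criteria mentioned above (subtype and location) can live in the same structure if every link carries a relation type. The sketch below is a toy model built only from the examples on this slide. The relation names `is_a` and `inside` are labels chosen for this sketch, and the class is hypothetical, not part of any Gene Ontology tool.

```python
from collections import defaultdict


class ToyOntology:
    """Terms linked by typed, directed relations (child -> parent)."""

    def __init__(self):
        self._parents = defaultdict(set)   # term -> {(relation, parent)}

    def add(self, child, relation, parent):
        self._parents[child].add((relation, parent))

    def ancestors(self, term, relations=("is_a", "inside")):
        """All terms reachable from `term` by following the given relations."""
        seen = set()
        stack = [term]
        while stack:
            current = stack.pop()
            for relation, parent in self._parents.get(current, ()):
                if relation in relations and parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return seen


def build_example():
    onto = ToyOntology()
    # Subtype criterion, from the slide: plasma membrane is a membrane.
    onto.add("plasma membrane", "is_a", "membrane")
    # Location criterion, from the slide: nucleus inside cytoplasm
    # inside plasma membrane.
    onto.add("nucleus", "inside", "cytoplasm")
    onto.add("cytoplasm", "inside", "plasma membrane")
    return onto


if __name__ == "__main__":
    onto = build_example()
    print(sorted(onto.ancestors("nucleus", relations=("inside",))))
    print(sorted(onto.ancestors("nucleus")))
```

Restricting the traversal to one relation type gives one classification; following both mixes the criteria. A link such as "transcription factor regulates target gene" would need yet another relation type, and it could introduce cycles that a purely hierarchical classification does not represent.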
What is biological function ?
 A general definition
   Function: characteristic action or role of an element (e.g. an organ) within a whole (often opposed to structure). Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982 (translated from French).
 Function and gene ontology
   Understanding function requires establishing the link between a molecular activity and the context in which it takes place (process).
   Multifunctionality
    • The same activity can play different roles in different processes.
       Example: the scute gene in Drosophila melanogaster encodes a transcription factor (activity) involved in sex determination, determination of neural precursors, and Malpighian tubules (3 processes).
    • Multiple activities of the same protein in a given process
       Example: PutA in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
Small compounds, reactions
and metabolic pathways
LIGAND - Small compounds and metabolic reactions
KEGG - Kyoto Encyclopedia of Genes and Genomes
EcoCyc, BioCyc and MetaCyc - Metabolic pathways
Protein interaction networks
and transduction pathways
Microarray databases
Human genome resources
HapMap
http://www.hapmap.org/
 The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
 Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.
Issues for
biomolecular databases
Issues for biological databases
 Dealing with biological complexity
 Data content
   Coverage
   Information content
 Data quality
   Data structure
   Consistency
 Query capabilities
 Interfaces
   User interfaces
   Programmatic interfaces
 Annotation
 Funding
Towards biological complexity
 The main databases currently available each focus on one type of molecular entity: nucleic sequences, proteins, compounds, …
 This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
 It becomes more difficult if we want to represent
   the interactions between biological objects,
   the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, …),
   complex concepts such as "biological function".
Data content
 Scope of the database
   types of biological objects represented
 Number of entries
   coverage of the current knowledge
 Information content
   level of detail in the description of the biological objects
   references to the source of information
Data quality
 Data consistency
   always use the same name to indicate the same object (this seems trivial, but it is unfortunately still not always the case)
   even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
   spelling mistakes
 Data structure
   distinct fields for distinct attributes of the biological objects
 Reliability
   Evidence? Level of confidence?
   Assignment of function by similarity
    • recursive process -> propagation of errors
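
The consistency recommendation above (one ID per object, retrievable by any synonym) can be sketched as a small index. Everything here is hypothetical: the class, the normalization rule, and the identifiers `OBJ:0001` and `OBJ:0002` are invented for illustration and do not come from any of the databases discussed.

```python
class SynonymIndex:
    """Map every known name of an object to one stable identifier."""

    def __init__(self):
        self._id_by_name = {}
        self._names_by_id = {}

    @staticmethod
    def _normalize(name):
        # Ignore case, hyphens and spaces so 'Pax-6' and 'PAX6' collide.
        return "".join(ch for ch in name.lower() if ch.isalnum())

    def register(self, object_id, names):
        """Attach names to an ID; refuse a name already owned by another ID."""
        for name in names:
            key = self._normalize(name)
            owner = self._id_by_name.setdefault(key, object_id)
            if owner != object_id:
                raise ValueError(f"{name!r} already refers to {owner}")
            self._names_by_id.setdefault(object_id, []).append(name)

    def lookup(self, name):
        """Return the identifier for any synonym, or None if unknown."""
        return self._id_by_name.get(self._normalize(name))

    def names(self, object_id):
        return list(self._names_by_id.get(object_id, []))


if __name__ == "__main__":
    index = SynonymIndex()
    index.register("OBJ:0001", ["Pax6", "Pax-6", "PAX6"])   # hypothetical ID
    index.register("OBJ:0002", ["eyeless"])                 # hypothetical ID
    print(index.lookup("pax 6"), index.lookup("Eyeless"), index.lookup("scute"))
```

Keeping the ID separate from the names means a spelling variant becomes one more key in the index rather than a second, inconsistent record.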
Query capabilities
 Browsing (click and read)
 Simple search
   select records with some constraints
 More elaborate search
   select specific fields of some records with constraints on some fields (~SQL SELECT)
 Complex querying
   ability to return an answer that results from a "live" computation, and was not part of any record of the database
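
The three search levels above map naturally onto SQL. The sketch below uses Python's built-in `sqlite3` module with an in-memory database; the `protein` table, its columns and its three records are invented for illustration.

```python
import sqlite3


def build_demo_db():
    """In-memory database with a hypothetical `protein` table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE protein "
        "(id TEXT PRIMARY KEY, name TEXT, organism TEXT, sequence TEXT)"
    )
    conn.executemany(
        "INSERT INTO protein VALUES (?, ?, ?, ?)",
        [
            ("P1", "toy kinase", "Escherichia coli", "MKTAYIAK"),
            ("P2", "toy factor", "Drosophila melanogaster", "MCCKRC"),
            ("P3", "toy transporter", "Escherichia coli", "MLLAVC"),
        ],
    )
    return conn


def simple_search(conn, organism):
    """Simple search: whole records that satisfy a constraint."""
    return conn.execute(
        "SELECT * FROM protein WHERE organism = ? ORDER BY id", (organism,)
    ).fetchall()


def field_search(conn, organism):
    """More elaborate search: specific fields of the matching records."""
    return conn.execute(
        "SELECT id, name FROM protein WHERE organism = ? ORDER BY id", (organism,)
    ).fetchall()


def computed_answer(conn):
    """Complex query: values computed at query time, stored in no record."""
    return conn.execute(
        "SELECT organism, COUNT(*), AVG(LENGTH(sequence)) "
        "FROM protein GROUP BY organism ORDER BY organism"
    ).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    print(simple_search(conn, "Escherichia coli"))
    print(field_search(conn, "Escherichia coli"))
    print(computed_answer(conn))
```

The last query returns a count and a mean sequence length per organism, neither of which is a field of any stored record.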
Interfaces
 User interfaces
   user-friendly
   convenient browsing
   intuitive query forms
   visualization (graphical output)
 Programmatic interfaces
   communication with external programs:
    • other databases (concept of distributed database)
    • analysis tools
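
A programmatic interface lets an external program retrieve records without going through the web pages. The sketch below shows the usual shape of such a client: build a URL, download the record, parse it. The endpoint pattern in `UNIPROT_FASTA_URL` is an assumption about the UniProt REST service and should be checked against the current UniProt documentation; the function names are ours.

```python
import urllib.request

# Assumed endpoint pattern for the UniProt REST service; verify before use.
UNIPROT_FASTA_URL = "https://rest.uniprot.org/uniprotkb/{accession}.fasta"


def fasta_url(accession):
    """Build the URL that should return one entry in FASTA format."""
    if not accession.isalnum():
        raise ValueError(f"unexpected accession: {accession!r}")
    return UNIPROT_FASTA_URL.format(accession=accession)


def parse_fasta(text):
    """Parse FASTA text into a list of (header, sequence) tuples."""
    records, header, chunks = [], None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def fetch_entry(accession, timeout=30):
    """Download and parse one entry (requires network access)."""
    with urllib.request.urlopen(fasta_url(accession), timeout=timeout) as response:
        return parse_fasta(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Offline demonstration on a hand-written record.
    print(parse_fasta(">toy entry\nMKT\nAYI\n"))
```

The parsed records can be handed directly to an analysis tool, which is the kind of communication with external programs listed above.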
Annotation
 Problem
   The flow of available data is increasing exponentially
 Strategies
   internal curators
   selected external experts
   public submission
   computer-based extraction of information from biological texts
Funding
 Public funding
   Problem: it is easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
 Private funding
   Industrial companies are
    • ready to invest in good data and good query capabilities
    • interested in academic expertise
 Solutions
   All users pay (per query for example)
    • Note: academic users are in any case funded by public money
   Hybrid solution
    • access is free for academic users, not for companies
    • companies can buy the whole database and install it in-house (+ add their own private data)
    • the academia-industry interface is often ensured by a spinoff company

More Related Content

What's hot (20)

PPTX
Biological database
Iqbal college Peringammala TVM
 
PDF
Ensembl Browser Workshop
Denise Carvalho-Silva, PhD
 
PDF
Tools and database of NCBI
Santosh Kumar Sahoo
 
PDF
The ensembl database
Ashfaq Ahmad
 
PDF
Genome resources at EMBL-EBI: Ensembl and Ensembl Genomes
EBI
 
PPT
Biological databases
Sarfaraz Nasri
 
PPTX
Primary Databases.pptx
Swarup Malakar
 
PPT
Intro to databases
bhargvi sharma
 
PPT
Bioinformatic databases 2
Razzaqe
 
PPT
Biodatabases 101220022654-phpapp02
Sreekanth Gali
 
PPTX
European Molecular Biology Laboratory (EMBL)- European Bioinformatics Institu...
ExternalEvents
 
DOCX
Major biological nucleotide databases
Vidya Kalaivani Rajkumar
 
PPTX
Protein database ..... of NCBI
Alagppa University
 
PPTX
Presentation on Biological database By Elufer Akram @ University Of Science ...
Elufer Akram
 
PPTX
Emerging challenges in data-intensive genomics
mikaelhuss
 
PPTX
Genomic databases
DrSatyabrataSahoo
 
PPT
UNL UCARE Summer Symposium Poster
Nichole Leacock
 
PPTX
Rishi
Narayan Awasthi
 
PPT
Ensembl genome
Amer T. Wazwaz
 
Biological database
Iqbal college Peringammala TVM
 
Ensembl Browser Workshop
Denise Carvalho-Silva, PhD
 
Tools and database of NCBI
Santosh Kumar Sahoo
 
The ensembl database
Ashfaq Ahmad
 
Genome resources at EMBL-EBI: Ensembl and Ensembl Genomes
EBI
 
Biological databases
Sarfaraz Nasri
 
Primary Databases.pptx
Swarup Malakar
 
Intro to databases
bhargvi sharma
 
Bioinformatic databases 2
Razzaqe
 
Biodatabases 101220022654-phpapp02
Sreekanth Gali
 
European Molecular Biology Laboratory (EMBL)- European Bioinformatics Institu...
ExternalEvents
 
Major biological nucleotide databases
Vidya Kalaivani Rajkumar
 
Protein database ..... of NCBI
Alagppa University
 
Presentation on Biological database By Elufer Akram @ University Of Science ...
Elufer Akram
 
Emerging challenges in data-intensive genomics
mikaelhuss
 
Genomic databases
DrSatyabrataSahoo
 
UNL UCARE Summer Symposium Poster
Nichole Leacock
 
Ensembl genome
Amer T. Wazwaz
 

Similar to 02.databases slides (20)

PPTX
Bioinformatics final
Rainu Rajeev
 
PPTX
Sequence and Structural Databases of DNA and Protein, and its significance in...
BibiQuinah
 
PPTX
Sequence and Structural Databases of DNA and Protein, and its significance in...
SBituila
 
PPT
Databases
afzamalik
 
PPTX
Introduction to databases.pptx
sworna kumari chithiraivelu
 
PDF
BITS: Overview of important biological databases beyond sequences
BITS
 
PPT
Data Base in Bioinformatics.ppt
Bangaluru
 
PDF
Bioinformatics biological databases
Sangeeta Das
 
PPTX
Nucleic acid database
Esakkiammal S
 
PDF
Bioinformatics introduction
DrGopaSarma
 
PPTX
Biological databases
SEKHARREDDYAMBATI
 
PPTX
Biological databasesBiological databases
KrittikaChandran
 
PPTX
TheUniProtKBpptx__2022_03_30_13_07_41.pptx
PRIYANKAZALA9
 
PPTX
Biological databases
Tamanna Syeda
 
PPTX
DATABASES...............................pptx
Cherry
 
PPTX
biological databases.pptx
science lover
 
PPTX
Nucleic acid and protein databanks
NithyaNandapal
 
PPTX
Biological database ppt(1).pptx Introuction
RAJESHKUMAR428748
 
PPT
Nucleic_Acid_Databases, Bioinformatics, genome
MohamedHasan816582
 
PPTX
Biological databases.pptx
PagudalaSangeetha
 
Bioinformatics final
Rainu Rajeev
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
BibiQuinah
 
Sequence and Structural Databases of DNA and Protein, and its significance in...
SBituila
 
Databases
afzamalik
 
Introduction to databases.pptx
sworna kumari chithiraivelu
 
BITS: Overview of important biological databases beyond sequences
BITS
 
Data Base in Bioinformatics.ppt
Bangaluru
 
Bioinformatics biological databases
Sangeeta Das
 
Nucleic acid database
Esakkiammal S
 
Bioinformatics introduction
DrGopaSarma
 
Biological databases
SEKHARREDDYAMBATI
 
Biological databasesBiological databases
KrittikaChandran
 
TheUniProtKBpptx__2022_03_30_13_07_41.pptx
PRIYANKAZALA9
 
Biological databases
Tamanna Syeda
 
DATABASES...............................pptx
Cherry
 
biological databases.pptx
science lover
 
Nucleic acid and protein databanks
NithyaNandapal
 
Biological database ppt(1).pptx Introuction
RAJESHKUMAR428748
 
Nucleic_Acid_Databases, Bioinformatics, genome
MohamedHasan816582
 
Biological databases.pptx
PagudalaSangeetha
 
Ad

Recently uploaded (20)

PDF
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
DOC
MATRIX_AMAN IRAWAN_20227479046.docbbbnnb
vanitafiani1
 
PDF
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
PDF
apidays Helsinki & North 2025 - REST in Peace? Hunting the Dominant Design fo...
apidays
 
PPTX
Resmed Rady Landis May 4th - analytics.pptx
Adrian Limanto
 
PPTX
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
PDF
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
PDF
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
PDF
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
PDF
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
PPTX
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
PDF
Incident Response and Digital Forensics Certificate
VICTOR MAESTRE RAMIREZ
 
PDF
WEF_Future_of_Global_Fintech_Second_Edition_2025.pdf
AproximacionAlFuturo
 
PPT
Lecture 2-1.ppt at a higher learning institution such as the university of Za...
rachealhantukumane52
 
PPTX
The _Operations_on_Functions_Addition subtruction Multiplication and Division...
mdregaspi24
 
PDF
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
PPTX
Hadoop_EcoSystem slide by CIDAC India.pptx
migbaruget
 
PDF
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
PPTX
Numbers of a nation: how we estimate population statistics | Accessible slides
Office for National Statistics
 
PDF
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
Avatar for apidays apidays PRO June 07, 2025 0 5 apidays Helsinki & North 2...
apidays
 
MATRIX_AMAN IRAWAN_20227479046.docbbbnnb
vanitafiani1
 
Data Chunking Strategies for RAG in 2025.pdf
Tamanna
 
apidays Helsinki & North 2025 - REST in Peace? Hunting the Dominant Design fo...
apidays
 
Resmed Rady Landis May 4th - analytics.pptx
Adrian Limanto
 
AI Presentation Tool Pitch Deck Presentation.pptx
ShyamPanthavoor1
 
apidays Helsinki & North 2025 - How (not) to run a Graphql Stewardship Group,...
apidays
 
How to Connect Your On-Premises Site to AWS Using Site-to-Site VPN.pdf
Tamanna
 
Building Production-Ready AI Agents with LangGraph.pdf
Tamanna
 
Web Scraping with Google Gemini 2.0 .pdf
Tamanna
 
apidays Munich 2025 - Building an AWS Serverless Application with Terraform, ...
apidays
 
Incident Response and Digital Forensics Certificate
VICTOR MAESTRE RAMIREZ
 
WEF_Future_of_Global_Fintech_Second_Edition_2025.pdf
AproximacionAlFuturo
 
Lecture 2-1.ppt at a higher learning institution such as the university of Za...
rachealhantukumane52
 
The _Operations_on_Functions_Addition subtruction Multiplication and Division...
mdregaspi24
 
Product Management in HealthTech (Case Studies from SnappDoctor)
Hamed Shams
 
Hadoop_EcoSystem slide by CIDAC India.pptx
migbaruget
 
apidays Helsinki & North 2025 - API-Powered Journeys: Mobility in an API-Driv...
apidays
 
Numbers of a nation: how we estimate population statistics | Accessible slides
Office for National Statistics
 
apidays Helsinki & North 2025 - APIs in the healthcare sector: hospitals inte...
apidays
 
Ad

02.databases slides

  • 1. Biomolecular databases Bioinformatics Jacques van HeldenFORMER ADDRESS (1999-2011) Université Libre de Bruxelles, Belgique Bioinformatique des Génomes et des Réseaux (BiGRe lab) http://www.bigre.ulb.ac.be/ NEW ADDRESS (since Nov 1st,2011) [email protected] Université d’Aix-Marseille, France Lab. Technological Advances for Genomics and Clinics (TAGC, INSERM Unit U1090) http://tagc.univ-mrs.fr/ B!GRe Bioinformatique des Génomes et Réseaux !"#$%&'&()#*' *,-*%#". /&0("%&1)#. *%, #')%)# ! "#$ Inserm U1090
  • 2. Contents  Examples of biological databases  Nucleic sequences: Genbank, EMBL, and DDBJ  Protein sequences: UniProt  The Gene Ontology (GO) project  Issues and perspectives for biological databases
  • 3. Examples of biomolecular databases Biomolecular Databases [email protected] Université Libre de Bruxelles, Belgique Laboratoire de Bioinformatique des Génomes et des Réseaux (BiGRe) http://www.bigre.ulb.ac.be/
  • 4. Examples of biomolecular databases  Sequence and structure databases  Protein sequences (UniProt)  DNA sequences (EMBL, Genbank, DDBJ)  3D structures (PDB)  Structural motifs (CATH)  Sequence motifs (PROSITE, PRODOM)  Genome sequences and annotations  Genome-specific databases (SGD, FlyBase, AceDB, PlasmoDB, …)  Multiple genomes (Integr8, NCBI, KEGG, TIGR, …)  Molecular functions  Transcriptional regulation (TRANSFAC, RegulonDB, InteractDB)  Enzymatic catalysis (Expasy, LIGAND/KEGG, BRENDA)  Transport (YTPdb)  Biological processes  Metabolic pathways (EcoCyc, LIGAND/KEGG, Biocatalysis/biodegradation)  Signal transduction pathways (CSNdb, Transpath)  Protein-protein interactions (DIP, BIND, MINT)  Gene networks (GeneNet, FlyNets)
  • 5. Databases of databases
      • There are hundreds of databases related to molecular biology and biochemistry. New databases are created every year.
      • Every year, the first issue of Nucleic Acids Research is dedicated to biological databases.
          • http://nar.oupjournals.org/
          • 2011 issue: http://nar.oxfordjournals.org/content/39/suppl_1
      • The same journal maintains a database of databases: the Molecular Biology Database Collection.
          • http://www.oxfordjournals.org/nar/database/c/
      • Some bioinformatics centres maintain multiple databases, with cross-links between them. The SRS server at EBI holds an impressive collection of databases.
          • http://srs.ebi.ac.uk/
  • 6. Nucleic sequence databases: GenBank, EMBL, and DDBJ
  • 7. Nucleic sequence databases
      • Before publishing an article that deals with a sequence, scientific journals require that the sequence first be deposited in a reference database.
      • There are 3 main repositories for nucleic acid sequences.
      • Sequences deposited in any of these 3 databases are automatically synchronized with the other two.
      • Source: Okubo et al. (2006) NAR 34: D6-D9
  • 8. The sequencing pace (adapted from Didier Gonze)
      • Nucleic sequences
          • Genbank (April 2011) http://www.ncbi.nlm.nih.gov/genbank/
              • 126,551,501,141 bases in 135,440,924 sequence records in the traditional GenBank divisions
              • 191,401,393,188 bases in 62,715,288 sequence records in the Whole Genome Shotgun (WGS) division
      • Entire genomes
          • GOLD Release V.2 (Oct 2011) contains ~2000 completely sequenced genomes.
          • http://www.genomesonline.org/gold_statistics.htm
      • Protein sequences
          • Essentially obtained by translation of putative genes in nucleic sequences (almost no direct protein sequencing).
          • UniProtKB/TrEMBL (2011) contains 17 million protein sequences.
          • http://www.ebi.ac.uk/swissprot/sptr_stats/index.html
  • 9. Size of the nucleotide database
      EMBL Nucleotide Sequence Database: Release Notes - Release 113, September 2012
      http://www.ebi.ac.uk/embl/Documentation/Release_notes/current/relnotes.html

      Class                                          entries        nucleotides
      -------------------------------------------------------------------------
      CON: Constructed                             7,236,371    359,112,791,043
      EST: Expressed Sequence Tag                 73,715,376     40,997,082,803
      GSS: Genome Sequence Scan                   34,528,104     21,985,922,905
      HTC: High Throughput CDNA sequencing           491,770        594,229,662
      HTG: High Throughput Genome sequencing         152,599     25,159,746,658
      PAT: Patents                                24,364,832     12,117,896,594
      STD: Standard                               13,920,617     37,665,112,606
      STS: Sequence Tagged Site                    1,322,570        636,037,867
      TSA: Transcriptome Shotgun Assembly          8,085,693      5,663,938,279
      WGS: Whole Genome Shotgun                   88,288,431    305,661,696,545
      -------------------------------------------------------------------------
      Total                                      252,106,363    450,481,663,919

      Division                                       entries        nucleotides
      -------------------------------------------------------------------------
      ENV: Environmental Samples                  30,908,230     14,420,391,278
      FUN: Fungi                                   6,522,586     11,614,472,226
      HUM: Human                                  32,094,500     38,072,362,804
      INV: Invertebrates                          31,907,138     52,527,673,643
      MAM: Other Mammals                          40,012,731    145,678,620,711
      MUS: Mus musculus                           11,745,671     19,701,637,499
      PHG: Bacteriophage                               8,511         85,549,111
      PLN: Plants                                 52,428,994     55,570,452,118
      PRO: Prokaryotes                             2,808,489     28,807,572,238
      ROD: Rodents                                 6,554,012     33,326,106,733
      SYN: Synthetic                               4,045,013        782,174,055
      TGN: Transgenic                                285,307        849,743,891
      UNC: Unclassified                            8,617,225      4,957,442,673
      VRL: Viruses                                 1,358,528      1,518,575,082
      VRT: Other Vertebrates                      22,809,428     42,568,889,857
      -------------------------------------------------------------------------
      Total                                      252,106,363    450,481,663,919
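As a quick sanity check on the per-class figures in this slide, the sketch below re-adds the columns in Python. The numbers are copied from the slide; the variable and function names are made up for illustration. Working through the sums suggests that the reported nucleotide total matches the classes only when the CON (Constructed) row is left out, while the entry total includes it. This is an observation from the arithmetic, not a statement taken from the release notes.

```python
# Entry and nucleotide counts per data class, copied from the slide (release 113).
CLASSES = {
    "CON": (7_236_371, 359_112_791_043),
    "EST": (73_715_376, 40_997_082_803),
    "GSS": (34_528_104, 21_985_922_905),
    "HTC": (491_770, 594_229_662),
    "HTG": (152_599, 25_159_746_658),
    "PAT": (24_364_832, 12_117_896_594),
    "STD": (13_920_617, 37_665_112_606),
    "STS": (1_322_570, 636_037_867),
    "TSA": (8_085_693, 5_663_938_279),
    "WGS": (88_288_431, 305_661_696_545),
}
REPORTED_ENTRIES = 252_106_363
REPORTED_NUCLEOTIDES = 450_481_663_919


def totals(classes, skip=()):
    """Sum entries and nucleotides over all classes not listed in `skip`."""
    kept = [counts for name, counts in classes.items() if name not in skip]
    return sum(e for e, _ in kept), sum(n for _, n in kept)


all_entries, all_nucleotides = totals(CLASSES)
_, nucleotides_without_con = totals(CLASSES, skip=("CON",))

print("entries match reported total:", all_entries == REPORTED_ENTRIES)
print("nucleotides match when CON is skipped:",
      nucleotides_without_con == REPORTED_NUCLEOTIDES)
print(f"WGS share of reported nucleotides: "
      f"{CLASSES['WGS'][1] / REPORTED_NUCLEOTIDES:.1%}")
```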
  • 10. Genbank (NCBI - USA) http://www.ncbi.nlm.nih.gov/Genbank/
  • 11. The EMBL Nucleotide Sequence Database (EBI - UK) http://www.ebi.ac.uk/embl/
  • 12. DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/
  • 13. Size of the nucleic sequence databases
      • Summary of database contents for the 3 main databases of nucleic sequences.
      • Source: NAR database issue January 2006.
      • Table columns: URL, Sequences, Bases (without shotgun), Bases (including shotgun), Organisms
      • Table rows as extracted: DDBJ http://www.ddbj.nig.ac.jp/ 2.0E+06 1.7E+09 EMBL GenBank http://www.ebi.ac.uk/embl/ http://www.ncbi.nlm.nih.gov/ 4.6E+07 5.1E+10 1.0E+11 1.0E+11 2.0E+05 2.1E+05
  • 14. UniProt: protein sequences and functional annotations
  • 15. UniProt - the Universal Protein Resource http://www.uniprot.org/
      • Database content (Sept 2012)
          • UniProtKB:
              • 24,532,088 entries
              • Translation of EMBL coding sequences (non-redundant with Swiss-Prot)
          • UniProtKB/Swiss-Prot section (reviewed):
              • 537,505 entries
              • annotation by experts
              • high information content
              • many references to the literature
              • good reliability of the information
          • The rest (about 98% of the entries, going by the counts above)
              • Automatic annotation by sequence similarity.
      • Features
          • The most comprehensive protein database in the world.
          • A huge team: >100 annotators + developers.
          • Annotation by experts: annotators specialize in different types of proteins or organisms.
          • Recognized worldwide as an essential resource.
      • References
          • Bairoch et al. The SWISS-PROT protein sequence data bank. Nucleic Acids Res (1991) vol. 19 Suppl pp. 2247-9
          • The UniProt Consortium. The Universal Protein Resource (UniProt) 2009. Nucleic Acids Res (2008). Database Issue.
      • Figures: Number of entries (polypeptides) in Swiss-Prot (http://www.expasy.org/sprot/relnotes/relstat.html); Taxonomic distribution of the sequences; Within Eukaryotes
  • 16. UniProt example - Human Pax-6 protein Header : name and synonyms
  • 17. UniProt example - Human Pax-6 protein Human-based annotation by specialists
  • 18. UniProt example - Human Pax-6 protein Structured annotation : keywords and Gene Ontology terms
  • 19. UniProt example - Human Pax-6 protein Protein interactions; Alternative products
  • 20. UniProt example - Human Pax-6 protein Detailed description of regions, variations, and secondary structure
  • 21. UniProt example - Human Pax-6 protein Peptidic sequence
  • 22. UniProt example - Human Pax-6 protein References to original publications
  • 23. UniProt example - Human Pax-6 protein Cross-references to many databases (fragment shown)
  • 24. 3D Structure of macromolecules
  • 25. PDB - The Protein Data Bank http://www.rcsb.org/pdb/
  • 26. Genome browsers
  • 27. EnsEMBL Genome Browser (Sanger Institute + EBI) http://www.ensembl.org/
  • 28. UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/ Human gene Pax6 aligned with vertebrate genomes
  • 29. UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/ Drosophila gene eyeless (homolog of Pax6) aligned with insect genomes
  • 30. UCSC Genome Browser (University of California, Santa Cruz - USA) http://genome.ucsc.edu/ Drosophila 120 kb chromosomal region covering the Achaete-Scute Complex
  • 32. EnsEMBL - Example: Drosophila gene Pax6 http://www.ensembl.org/
  • 33. Comparative genomics
  • 34. Integr8 - access to complete genomes and proteomes http://www.ebi.ac.uk/integr8/
  • 35. Integr8 - genome summaries http://www.ebi.ac.uk/integr8/
  • 36. Integr8 - clusters of orthologous genes (COGs) http://www.ebi.ac.uk/integr8/
  • 37. Integr8 - clusters of paralogous genes http://www.ebi.ac.uk/integr8/
  • 38. Databases of protein domains
  • 39. Prosite - protein domains, families and functional sites http://www.expasy.ch/prosite/
  • 40. Prosite - aligned sequences and logo http://www.expasy.ch/prosite/
      • Some of the sequences that were used to build the Prosite profile for the Zn(2)-C6 fungal-type DNA-binding domain (ZN2_CY6_FUNGAL_2, PS50048).
      • The Sequence Logo (below) indicates the level of conservation of each residue in each column of the alignment.
      • Note the 6 cysteines, characteristic of this domain.
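The "level of conservation" shown by a sequence logo is commonly measured as the information content of each alignment column, in bits. The sketch below is a minimal illustration under stated assumptions: a 20-letter amino-acid alphabet, uniform background frequencies, no small-sample correction, and a made-up toy alignment (not the Prosite alignment from the slide). The function name is hypothetical.

```python
import math
from collections import Counter


def column_information(alignment, alphabet_size=20):
    """Information content (bits) of each column: log2(alphabet) - Shannon entropy.

    Gap characters ('-') are ignored; a column made only of gaps scores 0.
    """
    scores = []
    for column in zip(*alignment):
        counts = Counter(residue for residue in column if residue != "-")
        total = sum(counts.values())
        if total == 0:
            scores.append(0.0)
            continue
        entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
        scores.append(math.log2(alphabet_size) - entropy)
    return scores


# Toy alignment (invented): columns 1 and 4 are invariant cysteines.
TOY_ALIGNMENT = ["CAKC", "CGRC", "CSKC", "CTRC"]
INFO = column_information(TOY_ALIGNMENT)

for position, bits in enumerate(INFO, start=1):
    print(f"column {position}: {bits:.2f} bits")
```

A fully conserved column reaches the maximum of log2(20) bits, which is why the invariant cysteines stand out as the tallest letters in a logo.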
  • 41. Prosite - Example of profile matrix http://www.expasy.ch/prosite/
  • 42. Prosite - Example of sequence logo http://www.expasy.ch/prosite/
  • 43. Prosite - Example of domain signature http://www.expasy.ch/prosite/
      • The domain signature is a string-based pattern representing the residues that are characteristic of a domain.
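To make the idea of a string-based signature concrete, the sketch below converts a pattern written in PROSITE-style syntax into a Python regular expression and scans a sequence with it. It handles only the common elements (`x` for any residue, `[..]` for alternatives, `{..}` for exclusions, `(n)` or `(n,m)` for repeats, `<` and `>` as terminus anchors). The six-cysteine pattern and the toy sequence are illustrative assumptions, not the signature stored in the Prosite entry, and the function name is hypothetical.

```python
import re

# One pattern element: a core (letter, x, [..] or {..}) plus an optional repeat count.
_ELEMENT = re.compile(r"(.+?)(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern: str) -> str:
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        prefix = suffix = ""
        if element.startswith("<"):          # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):            # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        match = _ELEMENT.fullmatch(element)
        if match is None:
            raise ValueError(f"cannot parse pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":                      # any residue
            core = "."
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"   # excluded residues
        repeat = ""
        if low is not None:
            repeat = "{" + low + ("," + high if high else "") + "}"
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


# Illustrative six-cysteine spacing pattern (an assumption for this example).
SIGNATURE = "C-x(2)-C-x(6)-C-x(5,12)-C-x(2)-C-x(6,8)-C"
REGEX = prosite_to_regex(SIGNATURE)

# Invented sequence containing one occurrence of the pattern.
TOY_SEQUENCE = "MA" + "C" + "AA" + "C" + "A" * 6 + "C" + "A" * 7 + "C" + "AA" + "C" + "A" * 7 + "C" + "KK"
HIT = re.search(REGEX, TOY_SEQUENCE)
print(REGEX)
print("match at", HIT.start() + 1, "-", HIT.end())
```

A pattern like this matches or fails outright; the profile matrices of the previous slides instead assign a score to every residue at every position, which tolerates divergent family members better.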
  • 44. PFAM (Sanger Institute - UK) http://pfam.sanger.ac.uk/ Protein families represented by multiple sequence alignments and hidden Markov models (HMMs)
  • 45. CATH - Protein Structure Classification http://www.cathdb.info/
      • CATH is a hierarchical classification of protein domain structures, which clusters proteins at four major levels:
          • Class (C)
          • Architecture (A)
          • Topology (T)
          • Homologous superfamily (H)
      • The boundaries and assignments for each protein domain are determined using a combination of automated and manual procedures which include computational techniques, empirical and statistical evidence, literature review and expert analysis.
      • References
          • Orengo et al. The CATH Database provides insights into protein structure/function relationships. Nucleic Acids Res (1999) vol. 27 (1) pp. 275-9
          • Cuff et al. The CATH classification revisited--architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res (2008) pp.
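One way to picture the four-level hierarchy: a CATH classification is usually written as four dot-separated numbers, one per level (C.A.T.H). The sketch below, with hypothetical function names and example codes chosen only for illustration, parses such codes and reports the deepest level two domains share.

```python
LEVELS = ("C", "A", "T", "H")  # Class, Architecture, Topology, Homologous superfamily


def parse_cath(code: str) -> tuple:
    """Split a four-part code such as '3.40.50.300' into one integer per level."""
    parts = code.split(".")
    if len(parts) != 4 or not all(part.isdigit() for part in parts):
        raise ValueError(f"expected four dot-separated numbers, got {code!r}")
    return tuple(int(part) for part in parts)


def deepest_shared_level(code_a: str, code_b: str):
    """Return the deepest level (C, A, T or H) at which two codes still agree."""
    shared = None
    for level, a, b in zip(LEVELS, parse_cath(code_a), parse_cath(code_b)):
        if a != b:
            break
        shared = level
    return shared


# Example codes used purely to exercise the functions.
print(deepest_shared_level("3.40.50.300", "3.40.50.720"))  # same C, A and T
print(deepest_shared_level("3.40.50.300", "1.10.8.10"))    # different class
```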
  • 46. CATH - Protein Structure Classification http://www.cathdb.info/
  • 47. InterPro (EBI - UK) http://www.ebi.ac.uk/interpro/
  • 48. InterPro (EBI - UK) Antennapedia-like Homeobox (entry IPR001827)
  • 49. The Gene Ontology (GO) database
  • 50. Ontology definition
      • Ontologie: partie de la métaphysique qui s'intéresse à l'être en tant qu'être, indépendamment de ses déterminations particulières
      • Ontology: the part of metaphysics that focuses on being as being, independently of its particular determinations
      • Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1993
  • 51. The "bio-ontologies"
      • Answer to the problem of inconsistencies in the annotations
          • Controlled vocabulary
          • Hierarchical classification between the terms of the controlled vocabulary
      • E.g.: The Gene Ontology
          • molecular function ontology
          • process ontology
          • cellular component ontology
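The two ingredients listed on this slide, a controlled vocabulary and a hierarchical classification of its terms, can be sketched with two small dictionaries. Everything here is a toy: the term names, synonyms and parent links are invented for illustration and are not taken from the Gene Ontology files.

```python
# Hierarchy: each preferred term maps to its direct parents ("is a" links).
IS_A = {
    "cellular component": [],
    "membrane": ["cellular component"],
    "plasma membrane": ["membrane"],
    "organelle membrane": ["membrane"],
    "nuclear membrane": ["organelle membrane"],
}

# Controlled vocabulary: free-text synonyms map to one preferred term.
SYNONYMS = {
    "cell membrane": "plasma membrane",
    "plasmalemma": "plasma membrane",
}


def preferred_term(name: str) -> str:
    """Resolve a name or synonym to the preferred term of the vocabulary."""
    key = name.strip().lower()
    key = SYNONYMS.get(key, key)
    if key not in IS_A:
        raise KeyError(f"not in the controlled vocabulary: {name!r}")
    return key


def ancestors(term: str) -> set:
    """Collect every term reachable by following 'is a' links upward."""
    found = set()
    stack = list(IS_A[term])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(IS_A[parent])
    return found


print(sorted(ancestors(preferred_term("Cell membrane"))))
```

With such a structure, an annotation to a specific term can be retrieved by a query on any of its ancestors, which free-text annotation cannot guarantee.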
  • 56. Gene Ontology Database (http://www.geneontology.org/) Example: methionine biosynthetic process
  • 57. Status of GO annotations (NAR DB issue 2006)
      • Term definitions
          • Biological process terms: 9,805
          • Molecular function terms: 7,076
          • Cellular component terms: 1,574
          • Sequence Ontology terms: 963
      • Genomes with annotation: 30
          • Excludes annotations from UniProt, which represent 261 annotated proteomes.
      • Annotated gene products
          • Total: 1,618,739
          • Electronic only: 1,460,632
          • Manually curated: 158,107
  • 58. QuickGO (http://www.ebi.ac.uk/QuickGO/)
      • A user-friendly Web interface to the Gene Ontology.
      • Graphical display of the hierarchical relationships between terms.
      • Convenient browsing between classes.
  • 59. Remarks on "bio-ontologies"
      • Improvement compared to free text
          • controlled vocabulary (choice among synonyms)
          • hierarchical relationships between the concepts
      • Nothing to do with the philosophical concept of ontology
          • A "bio-ontology" is usually nothing more than a taxonomical classification of the terms of a controlled vocabulary
      • Multiple possible classification criteria
          • e.g. compartment subtypes (plasma membrane is a membrane)
          • e.g. compartment locations (nucleus is inside cytoplasm, which is inside plasma membrane)
      • To be useful, a classification should remain purpose-based
          • each biologist might wish to define his/her own classification based on his/her needs and scope of interest
          • impossible to define a unifying standard for all biologists
      • No representation of molecular interactions
          • relationships between objects are only hierarchical, not horizontal or cyclic
          • e.g. does not describe which genes are the targets of a given transcription factor
  • 60. What is biological function?
      • A general definition
          • Fonction: action, rôle caractéristique d'un élément, d'un organe, dans un ensemble (souvent opposé à structure). Source: Le Petit Robert - dictionnaire alphabétique et analogique de la langue française. 1982.
          • Function: characteristic action (role) of an element (organ) within a whole (often opposed to structure)
      • Function and gene ontology
          • Understanding the function requires establishing the link between the molecular activity and the context in which it takes place (process).
          • Multifunctionality
              • The same activity can play different roles in different processes.
                  • Example: scute gene in Drosophila melanogaster: a transcription factor (activity) involved in sex determination, determination of neural precursors and malpighian tubules (3 processes).
              • Multiple activities of the same protein in a given process
                  • Example: the proline utilization protein PutA in Escherichia coli contains 2 enzymatic domains (enzymatic activities) + a DNA-binding domain (DNA-binding transcription factor) -> 3 molecular activities in the same process (proline utilization).
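The distinction between activity and process can be captured by storing each annotation as a (gene product, activity, process) triple. The sketch below encodes the two examples from this slide that way; the labels "enzymatic activity 1" and "enzymatic activity 2" are placeholders, since the slide does not name the two enzymatic activities of PutA, and the function name is hypothetical.

```python
from collections import defaultdict

# (gene product, molecular activity, biological process), following the slide.
ANNOTATIONS = [
    ("scute", "transcription factor", "sex determination"),
    ("scute", "transcription factor", "determination of neural precursors"),
    ("scute", "transcription factor", "malpighian tubules"),
    ("PutA", "enzymatic activity 1", "proline utilization"),   # placeholder label
    ("PutA", "enzymatic activity 2", "proline utilization"),   # placeholder label
    ("PutA", "DNA-binding transcription factor", "proline utilization"),
]


def summarize(annotations):
    """Group the distinct activities and processes recorded for each gene product."""
    summary = defaultdict(lambda: {"activities": set(), "processes": set()})
    for gene, activity, process in annotations:
        summary[gene]["activities"].add(activity)
        summary[gene]["processes"].add(process)
    return dict(summary)


SUMMARY = summarize(ANNOTATIONS)
for gene, info in SUMMARY.items():
    print(f"{gene}: {len(info['activities'])} activity(ies), "
          f"{len(info['processes'])} process(es)")
```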
  • 61. Small compounds, reactions and metabolic pathways
  • 62. LIGAND - Small compounds and metabolic reactions
  • 63. KEGG - Kyoto Encyclopedia of Genes and Genomes
  • 64. Ecocyc, BioCyc and Metacyc - Metabolic pathways
  • 65. Protein interaction networks and transduction pathways
  • 66. Microarray databases
  • 67. Human genome resources
  • 68. HapMap http://www.hapmap.org/
      • The International HapMap Project is a multi-country effort to identify and catalog genetic similarities and differences in human beings.
      • Associations between genetic variations (SNPs, ...) and diseases + response to pharmaceuticals.
  • 69. Issues for biomolecular databases
  • 70. Issues for biological databases
      • Dealing with biological complexity
      • Data content
          • Coverage
          • Information content
      • Data quality
          • Data structure
          • Consistency
      • Query capabilities
      • Interfaces
          • User interfaces
          • Programmatic interfaces
      • Annotation
      • Funding
  • 71. Towards biological complexity
      • The main databases currently available each focus on one type of molecular entity: nucleic sequences, proteins, compounds, …
      • This type of organization is very convenient as long as the information to be represented is simple (e.g. DNA sequences, structures of small molecules and macromolecules).
      • It becomes more difficult if we want to represent
          • the interactions between biological objects,
          • the integration of various elements in a biological process (metabolic pathways, protein interaction networks, regulatory networks, …),
          • complex concepts such as "biological function".
  • 72. Data content
      • Scope of the database
          • types of biological objects represented
      • Number of entries
          • coverage of the current knowledge
      • Information content
          • Level of detail in the description of the biological objects
          • References to the source of information
  • 73. Data quality
      • Data consistency
          • always use the same name to indicate the same object
          • (this seems trivial, but it is unfortunately still not always the case)
          • even better: define an ID for each object, and allow it to be retrieved by any of its synonyms
          • spelling mistakes
      • Data structure
          • distinct fields for distinct attributes of the biological objects
      • Reliability
          • Evidence? Level of confidence?
          • Assignment of function by similarity
              • recursive process → propagation of errors
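The recommendation to give each object one ID and let users retrieve it by any synonym can be sketched as a small lookup table. The class name, the `OBJ:` identifiers and the synonym lists below are all made up for illustration; a real database would use its own accession scheme.

```python
class SynonymRegistry:
    """Map every name, synonym and ID of an object to one stable identifier."""

    def __init__(self):
        self._preferred_name = {}   # object ID -> preferred name
        self._index = {}            # case-folded label -> object ID

    def add(self, object_id, name, synonyms=()):
        self._preferred_name[object_id] = name
        for label in (object_id, name, *synonyms):
            key = label.strip().casefold()
            owner = self._index.setdefault(key, object_id)
            if owner != object_id:
                # The same label must never point to two different objects.
                raise ValueError(f"{label!r} already refers to {owner}")

    def lookup(self, label):
        """Return the object ID for any known label, or None if unknown."""
        return self._index.get(label.strip().casefold())

    def name_of(self, object_id):
        return self._preferred_name[object_id]


REGISTRY = SynonymRegistry()
REGISTRY.add("OBJ:0001", "Pax-6", synonyms=["PAX6", "paired box protein 6"])
REGISTRY.add("OBJ:0002", "eyeless", synonyms=["ey"])

print(REGISTRY.lookup("pax6"))                 # found through a synonym
print(REGISTRY.name_of(REGISTRY.lookup("EY")))
```

Rejecting a label that already belongs to another object is one simple way to catch the naming inconsistencies described above at insertion time rather than at query time.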
  • 74. Query capabilities
      • Browsing (click and read)
      • Simple search
          • select records with some constraints
      • More elaborate search
          • select specific fields of some records with constraints on some fields (~SQL SELECT)
      • Complex querying
          • ability to return an answer that results from a "live" computation, and was not part of any record of the database
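The three levels of search on this slide can be illustrated with Python's built-in `sqlite3` module and an in-memory table. The table layout, identifiers and sequences below are invented placeholders (the sequences are not real protein sequences); only the query shapes matter.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE protein (id TEXT PRIMARY KEY, name TEXT, organism TEXT, sequence TEXT)"
)
conn.executemany(
    "INSERT INTO protein VALUES (?, ?, ?, ?)",
    [
        ("P1", "Pax-6", "Homo sapiens", "MAAAAAAA"),                  # placeholder sequence
        ("P2", "eyeless", "Drosophila melanogaster", "MAAAAAAAAA"),   # placeholder sequence
        ("P3", "scute", "Drosophila melanogaster", "MAAAAA"),         # placeholder sequence
    ],
)

# Simple search: whole records that satisfy a constraint.
SIMPLE = conn.execute(
    "SELECT * FROM protein WHERE organism = ?", ("Drosophila melanogaster",)
).fetchall()

# More elaborate search: only some fields, with a constraint on another field.
FIELDS = conn.execute(
    "SELECT id, name FROM protein WHERE organism = ? ORDER BY id",
    ("Drosophila melanogaster",),
).fetchall()

# Complex query: the answer is computed live and is stored in no record.
COMPUTED = conn.execute(
    "SELECT organism, COUNT(*), AVG(LENGTH(sequence)) "
    "FROM protein GROUP BY organism ORDER BY organism"
).fetchall()

print(FIELDS)
print(COMPUTED)
```

The last query returns per-organism counts and mean sequence lengths, values that exist nowhere in the table and are derived at query time.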
  • 75. Interfaces
      • User interfaces
          • user-friendly
          • convenient browsing
          • intuitive query forms
          • visualization (graphical output)
      • Programmatic interfaces
          • communication with external programs:
              • other databases (concept of distributed database)
              • analysis tools
  • 76. Annotation
      • Problem
          • The flow of available data is increasing exponentially
      • Strategies
          • internal curators
          • selected external experts
          • public submission
          • computer-based extraction of information from biological texts
  • 77. Funding
      • Public funding
          • Problem: it is easier to obtain public funds for creating a new database than for maintaining or expanding existing resources
      • Private funding
          • Industrial companies are
              • ready to invest in good data and good query capabilities
              • interested in academic expertise
      • Solutions
          • All users pay (per query for example)
              • Note: academic users are funded by public funds anyway
          • Hybrid solution
              • access is free for academic users, not for companies
              • companies can buy the whole database and install it in-house (+ add their own private data)
              • the academia-industry interface is often ensured by a spinoff company