WHAT'S IN A NAME?
Better vocabulary = better bioinformatics???

From flickr user giantginkgo
# Author: Keith Bradnam, Genome Center, UC Davis
# This work is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike
3.0 Unported License.
http://biomickwatson.wordpress.com

Most of the interesting 'stuff' that I discover about bioinformatics and genomics comes from
a) Twitter, b) blogs, and c) papers (in that order). Mick Watson has a fun and engaging blog
about bioinformatics and today he raised an important point: the lack of standardization in
scientific databases leads to frustration (and frustration leads to...suffering).
http://biomickwatson.wordpress.com

These are some terms that appear in the same database. You can code solutions for some of
this variation (e.g. British/American English differences or presence/absence of underscore vs
space character), but who wants to waste time doing that? Shouldn't these databases be
using controlled vocabularies?
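
For what it's worth, the kind of "code solution" mentioned above might look like the minimal sketch below. The vocabulary and its variants are invented for illustration (they are not taken from the database in question), and the function names are hypothetical.

```python
import re
from typing import Optional

# Hypothetical controlled vocabulary: preferred term -> known spelling variants.
# These terms are illustrative only.
CONTROLLED_VOCABULARY = {
    "tumor": {"tumor", "tumour"},
    "whole blood": {"whole blood", "whole_blood", "wholeblood"},
}


def normalize(term: str) -> str:
    """Lowercase the term and turn runs of underscores, hyphens or spaces into one space."""
    return re.sub(r"[_\-\s]+", " ", term.strip().lower())


# Reverse lookup: normalized variant -> preferred term.
VARIANT_TO_PREFERRED = {
    normalize(variant): preferred
    for preferred, variants in CONTROLLED_VOCABULARY.items()
    for variant in variants
}


def to_preferred(term: str) -> Optional[str]:
    """Return the preferred term, or None if the term is not in the vocabulary."""
    return VARIANT_TO_PREFERRED.get(normalize(term))
```

Every new variant still has to be added by hand, which is exactly the time sink that a controlled vocabulary at the database end would avoid.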
This infamous paper from 2004 reveals how easy it is to introduce errors into biological
databases.
First highlighted column = actual gene name.
Second highlighted column = what Excel will automatically assume you mean.
RIKEN ID: 2310009E13

Happens for other identifiers as well. This RIKEN ID will change if it ever ends up in Excel...
RIKEN ID: 2.31E+13

...now it appears as a number in scientific notation.
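
One defensive habit is to screen identifiers before they go anywhere near a spreadsheet. The sketch below flags names that look like a month plus a day, or like digits-E-digits. The patterns and the function name are my own assumptions and will certainly not catch every case.

```python
import re

# Month-like prefixes that appear in gene symbols (an assumed, incomplete list).
MONTHS = "JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC"

# Names such as DEC1 or SEPT2: a month-like prefix followed by a day-like number.
DATE_LIKE = re.compile(rf"^(?:{MONTHS})[-_ ]?\d{{1,2}}$", re.IGNORECASE)

# Identifiers such as 2310009E13: digits, a single E, then more digits.
EXPONENT_LIKE = re.compile(r"^\d+E\d+$", re.IGNORECASE)


def spreadsheet_risk(identifier: str):
    """Return 'date', 'number', or None depending on how the identifier might be misread."""
    if DATE_LIKE.match(identifier):
        return "date"
    if EXPONENT_LIKE.match(identifier):
        return "number"
    return None
```

Anything that is flagged is a candidate for importing as explicit text rather than letting the spreadsheet guess the type.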
The paper shows that these 'dates-as-gene-names' ended up propagating to other
databases.
I searched today for '2-Sep' at GenBank and this was the only hit. It's possible that this is an
intended gene-name variant, but Septin 2 is usually referred to as sep2/sept2/sep-2 etc. So
this is possibly another Excel-based error.
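
The same idea works in reverse: you can scan an existing list of gene names for values that look as if a spreadsheet has already mangled them. This is only a rough sketch with assumed patterns (day-month strings and scientific notation). A hit is a reason to look more closely, not proof of an error.

```python
import re

MONTH_ABBREVIATIONS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
                       "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")

# Day-month strings such as 2-Sep or 1-Dec.
DAY_MONTH = re.compile(r"^\d{1,2}-(?:%s)$" % "|".join(MONTH_ABBREVIATIONS))

# Scientific notation such as 2.31E+13.
SCIENTIFIC = re.compile(r"^\d(?:\.\d+)?E\+\d+$")


def suspicious_names(names):
    """Return the names that resemble spreadsheet-converted dates or numbers."""
    return [n for n in names if DAY_MONTH.match(n) or SCIENTIFIC.match(n)]
```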
Sometimes people make assumptions that gene names are unique to a specific function.
DEC1 (one of the Excel-ified gene names mentioned in the earlier paper) can mean one thing
to people working on many vertebrate species...
...but something else if you work on fruit flies. Dangerous to make any assumptions when it
comes to gene names.
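
In code, the safest assumption is that a gene symbol on its own identifies nothing. A minimal sketch of keying records on species plus symbol is shown below. The records and descriptions are placeholders, not real annotations.

```python
from collections import defaultdict

# Hypothetical records: (species, gene symbol, description).
# The descriptions are placeholders and say nothing about the real genes.
RECORDS = [
    ("Homo sapiens", "DEC1", "placeholder description A"),
    ("Drosophila melanogaster", "DEC1", "placeholder description B"),
]

species_by_symbol = defaultdict(list)
description_by_key = {}
for species, symbol, description in RECORDS:
    species_by_symbol[symbol.upper()].append(species)
    # The lookup key is the pair (species, symbol), never the symbol alone.
    description_by_key[(species, symbol.upper())] = description


def is_ambiguous(symbol: str) -> bool:
    """True if the same symbol is used in more than one species."""
    return len(species_by_symbol.get(symbol.upper(), [])) > 1
```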
Consider one worm gene...

Here is one Caenorhabditis elegans gene (abu-11) in WormBase. There is the official gene
name, a sequence name, 'other' names, the WormBase gene ID, plus other identifiers for
external databases which also describe the gene (there's also a protein ID, not shown here).
In C. elegans, gene names have a central naming authority (the CGC) but genes often get
renamed. Just look at these pqn genes which have been renamed or merged with other
genes.
This is the current view of the twk-43 gene in C. elegans (aka F32H5.7[abc]).
WormBase allows you to see the history behind genes. This gene started out as just F32H5.2,
a gene with no splice isoforms.
Then at some point it was split into 3 genes...
...before being converted into the current one gene (with four splice isoforms). Genes are
split and merged and renamed all the time. Relying on the common gene name (e.g. twk-43)
or the sequence identifier (F32H5.7) can get you into trouble.
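
One way to stay out of trouble is to resolve every name to a stable database ID and to fail loudly when a name is unknown or points at more than one record. The alias table below is entirely made up: the `WBGene_EXAMPLE_*` values are placeholders rather than real WormBase accessions, and `retired-name` is an invented name.

```python
# Hypothetical alias table: any name a gene has carried -> stable gene ID(s).
ALIASES = {
    "twk-43": {"WBGene_EXAMPLE_1"},
    "F32H5.7": {"WBGene_EXAMPLE_1"},
    # An invented retired name that two records lay claim to after a split.
    "retired-name": {"WBGene_EXAMPLE_2", "WBGene_EXAMPLE_3"},
}


def resolve(name: str) -> str:
    """Return the single stable ID for a name, or raise if it is unknown or ambiguous."""
    ids = ALIASES.get(name, set())
    if not ids:
        raise KeyError(f"unknown gene name: {name}")
    if len(ids) > 1:
        raise ValueError(f"ambiguous gene name {name}: {sorted(ids)}")
    return next(iter(ids))
```

Storing the stable ID alongside the human-friendly name means a later rename, split or merge can at least be detected.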
SOLUTIONS

What can be done to help with these sorts of problems?
Use ontologies and understand what those ontologies do.
Three main parts to a Gene Ontology term (GO term):
1) The name
2) The accession
3) The definition (which can change)
A fourth major part of a GO term is its relationships: it has ancestors and children. A single
term can be 'part of' other terms and can also be an example of ('is a') other terms. E.g. a
nuclear outer membrane *is* a nuclear membrane and is *part of* the cell.
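
The four parts described above can be sketched as a small data structure. The accessions and definitions here are placeholders, and the relationships are simplified down to the one example given, so this is not how the real ontology is stored.

```python
from dataclasses import dataclass, field


@dataclass
class Term:
    name: str
    accession: str          # placeholder values below, not real GO accessions
    definition: str         # can change over time
    is_a: list = field(default_factory=list)
    part_of: list = field(default_factory=list)


TERMS = {
    "nuclear outer membrane": Term(
        "nuclear outer membrane", "EX:0000003", "placeholder definition",
        is_a=["nuclear membrane"]),
    "nuclear membrane": Term(
        "nuclear membrane", "EX:0000002", "placeholder definition",
        part_of=["cell"]),
    "cell": Term("cell", "EX:0000001", "placeholder definition"),
}


def ancestors(name, terms):
    """Collect every term reachable by following is_a and part_of links upwards."""
    seen = set()
    stack = [name]
    while stack:
        current = terms[stack.pop()]
        for parent in current.is_a + current.part_of:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

Walking the links is what lets a search for a broad term also pick up genes annotated with its more specific children.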
Most model organism databases are loaded up with GO terms. E.g. you can search GO terms
from the 'front door' of FlyBase.
In WormBase, the same GO term search takes you directly to a gene page.
Scroll down on that gene page and we see the specified GO term...but what is an 'evidence
code', and what does 'IDA' mean?
Sadly the majority of people who use GO terms (as part of 'DAVID' analyses etc.) have no
knowledge of evidence codes.
All GO terms should be connected to genes (or other database entries) with evidence codes.
The evidence code gives you an idea of how robust the assignment is. Databases like WormBase
have curators that scan papers (by eye, but also with software) to find suitable GO terms that
can be added to genes on the basis of experiments described in the paper.
Most of the GO terms you will ever see have this evidence code: IEA (Inferred from Electronic
Annotation). It is among the weakest of all evidence types (also avoid any evidence which is
'non-traceable author statement'). It could simply mean that a human protein (with some known
information) was BLASTed against a yeast genome and the resulting yeast match acquired the
human meta-information as GO terms. IEA codes should be treated with some suspicion.
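
If you work with GO annotation (GAF) files directly, the evidence code sits in the seventh tab-separated column, so setting the automatic annotations aside takes only a few lines. The sketch below assumes GAF-style input. The sample rows, gene names and GO IDs are invented.

```python
AUTOMATIC_CODES = {"IEA"}


def parse_gaf(lines):
    """Yield gene symbol, GO ID and evidence code from GAF-style lines."""
    for line in lines:
        if line.startswith("!") or not line.strip():
            continue  # skip header comments and blank lines
        cols = line.rstrip("\n").split("\t")
        # GAF columns (1-based): 3 = object symbol, 5 = GO ID, 7 = evidence code.
        yield {"gene": cols[2], "go_id": cols[4], "evidence": cols[6]}


def drop_automatic(annotations):
    """Keep only annotations whose evidence code is not electronically inferred."""
    return [a for a in annotations if a["evidence"] not in AUTOMATIC_CODES]


# Invented example rows; none of these IDs refer to real records.
SAMPLE = [
    "!gaf-version: 2.2",
    "\t".join(["WB", "WBGene_EXAMPLE_1", "gene-a", "", "GO:9999991", "REF:1", "IDA", "", "C"]),
    "\t".join(["WB", "WBGene_EXAMPLE_2", "gene-b", "", "GO:9999992", "REF:2", "IEA", "", "F"]),
]

curated = drop_automatic(parse_gaf(SAMPLE))
```

Whether to discard IEA annotations or just weight them differently depends on the analysis. The point is to make the choice deliberately.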
48.2% of GO annotations
— in one of the best annotated eukaryotic animal genomes —
are generated automatically
The Gene Ontology website shows how many GO terms are attached to genes in different
organisms. Even in C. elegans (with >15 years of gene annotation), about half of the GO
annotations are in the IEA category.
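
You can run the same tally on your own annotation set. A minimal sketch, assuming you already have a list of evidence codes (the function name is hypothetical and the input here is made up):

```python
from collections import Counter


def evidence_percentages(codes):
    """Return the percentage of annotations carrying each evidence code."""
    counts = Counter(codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {code: 100 * n / total for code, n in counts.items()}


# Made-up input: four annotations, two of them electronically inferred.
example = evidence_percentages(["IEA", "IEA", "IDA", "IMP"])
```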
Gene Ontology is not the only game in town. Sequence Ontology (SO) is widely used and a
subset of SO terms are used in GFF files to describe features (or at least they should be!).
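
A quick sanity check on a GFF file is to compare its third column (the feature type) against the SO terms you expect to see. The allowed set below is a small hand-picked subset, not the full Sequence Ontology, and the sample lines are invented.

```python
# A small, assumed subset of SO feature types. Extend it for real use.
EXPECTED_SO_TYPES = {
    "gene", "mRNA", "exon", "CDS", "intron",
    "five_prime_UTR", "three_prime_UTR",
}


def unknown_feature_types(gff_lines, allowed=EXPECTED_SO_TYPES):
    """Return the feature types in column 3 that are not in the allowed set."""
    unknown = set()
    for line in gff_lines:
        if line.startswith("#") or not line.strip():
            continue  # skip directives, comments and blank lines
        columns = line.rstrip("\n").split("\t")
        if len(columns) >= 3 and columns[2] not in allowed:
            unknown.add(columns[2])
    return unknown


# Invented example lines.
SAMPLE_GFF = [
    "##gff-version 3",
    "chrI\texample\tgene\t1\t100\t.\t+\t.\tID=g1",
    "chrI\texample\tGene_model\t1\t100\t.\t+\t.\tID=g2",
]
```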
GO and SO are part of OBO (Open Biological Ontologies: http://www.obofoundry.org). There
may be a community developing an ontology for your field of interest. This site lists them all.
Some get very specific.
SUMMARY
Use ontologies whenever possible
Don't assume that identifiers in existing databases are the correct (or only) identifiers
Be careful when inflicting new database identifiers on to the world!

On the last point, check that your identifiers (even if they end up buried in supplementary
material somewhere) don't conflict with other databases out there. Long and boring
identifiers are usually the most stable and the most easily parsed by scripts (although they
are the least human-friendly). But no spaces or asterisks in identifiers please!
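
As a rough sketch of what "long and boring" could mean in practice: a fixed prefix, an underscore and a zero-padded number. Restricting the character set to letters, digits and underscores is my own assumption (it rules out spaces and asterisks, among other things), and the function names are hypothetical.

```python
import re

# Assumed rule: letters, digits and underscores only.
VALID_ID = re.compile(r"^[A-Za-z0-9_]+$")


def make_identifier(prefix: str, number: int, width: int = 8) -> str:
    """Build an identifier such as PREFIX_00000315 and reject awkward characters."""
    identifier = f"{prefix}_{number:0{width}d}"
    if not VALID_ID.match(identifier):
        raise ValueError(f"identifier contains unsupported characters: {identifier!r}")
    return identifier


def conflicts(identifier: str, existing_identifiers) -> bool:
    """True if the identifier is already used in the given collection."""
    return identifier in set(existing_identifiers)
```

The list passed to `conflicts` would need to come from whichever databases you are worried about clashing with.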
This talk is KORF_labtalk_00000315
