iPhylo: taxonomy

Roderic D. M. Page

Sunday, January 02, 2022

Large graph viewer experiments

I keep returning to the problem of viewing large graphs and trees, which means my hard drive has accumulated lots of failed prototypes. Inspired by some recent discussions on comparing taxonomic classifications I decided to package one of these (wildly incomplete) prototypes up so that I can document the idea and put the code somewhere safe.

Very cool, thanks for sharing this-- the tree diff is similar to what J Rees has been cooking up lately with his 'cl diff' tool. I'll tag @beckettws in here too so he can see potential crossover. The goal is autogenerate diffs like this as 1st step to mapping taxo name-to concept
— Nate Upham (@n8_upham) December 28, 2021

Google Maps-like viewer

I've created a simple viewer that uses a tiled map viewer (like Google Maps) to display a large graph. The idea is to draw the entire graph scaled to a 256 x 256 pixel tile. The graph is stored in a database that supports geospatial queries, which means the queries to retrieve the individual tiles needed to display the graph at different levels of resolution are simply bounding box queries to a database. I realise that this description is cryptic at best. The GitHub repository https://github.com/rdmpage/gml-viewer has more details and the code itself. There's a lot to do, especially adding support for labels(!) which presents some interesting challenges (levels of detail and generalization). The code doesn't do any layout of the graph itself, instead I've used the yEd tool to compute the x,y coordinates of the graph.
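To make the tile idea a little less cryptic, here is a minimal sketch of how a tile address could be turned into a bounding box. It assumes the graph's coordinates have been scaled into the 256 x 256 space of the zoom 0 tile and that each zoom level doubles the number of tiles per side, as in most slippy maps. The function name `tile_bbox` is hypothetical and is not taken from the gml-viewer code.

```python
TILE_SIZE = 256  # assumption: the whole graph fits one 256 x 256 tile at zoom 0


def tile_bbox(x: int, y: int, zoom: int) -> tuple[float, float, float, float]:
    """Return (min_x, min_y, max_x, max_y) of tile (x, y) in zoom-0 graph units."""
    tiles_per_side = 2 ** zoom
    if not (0 <= x < tiles_per_side and 0 <= y < tiles_per_side):
        raise ValueError(f"tile ({x}, {y}) does not exist at zoom {zoom}")
    span = TILE_SIZE / tiles_per_side  # width of one tile in graph units
    return (x * span, y * span, (x + 1) * span, (y + 1) * span)
```

The four numbers returned are what a bounding box query against the geospatial database would need in order to fetch the nodes and edges that fall inside one tile.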

Since this exercise was inspired by a discussion of the ASM Mammal Diversity Database, the graph I've used for the demonstration above is the ASM classification of extant mammals. I guess I need to solve the labelling issue fairly quickly!

Tuesday, April 06, 2021

It's been a while...

It's been a while since I've blogged here. The last few months have been, um, interesting for so many reasons. Meanwhile in my little corner of the world there's been the constant challenge of rethinking how to teach online-only, whilst also messing about with a bunch of things in a rather unfocused way (and spending way too much time populating Wikidata). So here I'll touch on a few rather random topics that have come up in the last few months, and elsewhere on this blog I'll try and focus on some of the things that I'm working on. In many ways this blog post is really to serve as a series of bookmarks for things I'd like to think about a bit more.

Taxonomic precision and the "one tree"

One thing that had been bugging me for a while was my inability to find the source of a quote about taxonomic precision that I remembered as a grad student. I was pretty sure that David Penny and Mike Hendy had said it, but where? Found it at last:

Biologists seem to seek “The One Tree” and appear not to be satisfied by a range of options. However, there is no logical difficulty with having a range of trees. There are 34,459,425 possible trees for 11 taxa (Penny et al. 1982), and to reduce this to the order of 10-50 trees is analogous to an accuracy of measurement of approximately one part in 10⁶.

Many measurements in biology are only accurate to one or two significant figures and pale when compared to physical measurements that may be accurate to 10 significant figures. To be able to estimate an accuracy of one tree in 10⁶ reflects the increasing sophistication of tree reconstruction methods. (Note that, on this argument, to identify an organism to a species is also analogous to a measurement with an accuracy of approximately one in 10⁶.). — "Estimating the reliability of evolutionary trees" p.414 doi:10.1093/oxfordjournals.molbev.a040407
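The figure of 34,459,425 in the quote matches the standard count of unrooted bifurcating trees for n labelled taxa, (2n - 5)!! = 3 x 5 x ... x (2n - 5). A small sketch that reproduces it (the function name is just for illustration):

```python
def num_unrooted_trees(n_taxa: int) -> int:
    """Number of unrooted bifurcating trees on n labelled taxa: (2n - 5)!!"""
    if n_taxa < 3:
        raise ValueError("need at least three taxa")
    count = 1
    # multiply the odd numbers 3, 5, ..., 2n - 5
    for k in range(3, 2 * n_taxa - 5 + 1, 2):
        count *= k
    return count
```

Calling `num_unrooted_trees(11)` gives 34459425, and narrowing that to somewhere between 10 and 50 trees is picking roughly one in 10⁶, which is the comparison the quote is making.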

I think this quote helps put taxonomy and phylogeny in the broader context of quantitative biology. Building trees that accurately place taxa is a computationally challenging task that yields some of the most precise measurements in biology.

Barcodes for everyone

DNA for 95 specimens in 40 min at negligible cost (not included: beverage consumption during 20 min break): this is the 1st video demonstrating the methods used in “MinION barcodes: biodiversity discovery and identification by everyone, for everyone”. https://t.co/W2LuaJ6SYk pic.twitter.com/eNDk6eU2xa
— Rudolf Meier (@RudolfMeier15) March 22, 2021

This is yet another exciting paper from Rudolf Meier's lab (see earlier blog post Signals from Singapore: NGS barcoding, generous interfaces, the return of faunas, and taxonomic burden). The preprint doi:10.1101/2021.03.09.434692 is on bioRxiv. It feels like we are getting ever-closer to the biodiversity tricorder.

Barcodes for Australia

#ArabaBioscan week 23 https://t.co/n9zvE2lTlF Some of the local #Ichneumonoidea diversity from this week's Malaise sample #BIOSCAN #entomology #DNABarcoding @iBOLConsortium pic.twitter.com/yR7yg0B7Ys
— Donald Hobern (@dhobern) April 5, 2021

Donald Hobern (@dhobern) has been blogging about insects collected in Malaise traps in Aranda, Australian Capital Territory (ACT). The insects are being photographed (see stream on Flickr) and will be barcoded.

No barcodes please we're taxonomists!

A paper with a title like "Minimalist revision and description of 403 new species in 11 subfamilies of Costa Rican braconid parasitoid wasps, including host records for 219 species" (Sharkey et al. doi:10.3897/zookeys.1013.55600) was always likely to cause problems, and sure enough some taxonomists had a meltdown. A lot of the arguments centered around whether DNA sequences counted as words, which seems surreal. DNA sequences are strings of characters, just like natural language. Unlike English, not all languages have word breaks. Consider Chinese for example, where search engines can't break text up into words for indexing, but instead use n-grams. I mention this simply because n-grams are a useful way to index DNA sequences and to compute sequence similarity without performing a costly sequence alignment. I used this technique in my DNA barcode browser. If we move beyond arguments about whether a picture and a DNA sequence is enough to describe a species (if all species ever discovered were described this way we'd arguably be much better off than we are now) I think there is a core issue here, namely the relative size of the intersection between taxa that have been described classically (i.e., with words) and those described almost entirely by DNA (e.g., barcodes) will likely drop as more and more barcoding is done, and this has implications for how we do biology (see Dark taxa: GenBank in a post-taxonomic world).
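As a rough illustration of the n-gram idea, the sketch below breaks two sequences into overlapping n-grams and compares the resulting sets with a Jaccard index, so no alignment is needed. The choice of Jaccard and of n = 5 are assumptions made for the example; this is not necessarily how the DNA barcode browser scores similarity.

```python
def ngrams(sequence: str, n: int = 5) -> set[str]:
    """All overlapping substrings of length n in the sequence."""
    seq = sequence.upper()
    return {seq[i:i + n] for i in range(len(seq) - n + 1)}


def ngram_similarity(a: str, b: str, n: int = 5) -> float:
    """Jaccard similarity of the two n-gram sets (0.0 to 1.0)."""
    grams_a, grams_b = ngrams(a, n), ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # both sequences are shorter than n
    return len(grams_a & grams_b) / len(union)
```

Because the n-gram sets can be stored in an ordinary text index, the same trick that lets a search engine index text without word breaks also lets it find similar sequences.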

Bioschema

The dream of linked data rumbles on. Schema.org is having a big impact on standardising basic metadata encoded in web sites, so much so that anyone building a web site now needs to be familiar with schema.org if they want their site to do well in search engine rankings. I made extensive use of schema.org to model bibliographic data on Australian animals for my Ozymandias project.

Bioschemas aims to provide a biology-specific extension to schema.org, and is starting to take off. For example, GBIF pages for species now have schema.org embedded as JSON-LD, e.g. the page for Chrysochloris visagiei Broom, 1950 has this JSON-LD embedded in a <script type="application/ld+json"> tag:

{
  "@context": [
    "https://schema.org/",
    {
      "dwc": "http://rs.tdwg.org/dwc/terms/",
      "dwc:vernacularName": {
        "@container": "@language"
      }
    }
  ],
  "@type": "Taxon",
  "additionalType": [
    "dwc:Taxon",
    "http://rs.tdwg.org/ontology/voc/TaxonConcept#TaxonConcept"
  ],
  "identifier": [
    {
      "@type": "PropertyValue",
      "name": "GBIF taxonKey",
      "propertyID": "http://www.wikidata.org/prop/direct/P846",
      "value": 2432181
    },
    {
      "@type": "PropertyValue",
      "name": "dwc:taxonID",
      "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
      "value": 2432181
    }
  ],
  "name": "Chrysochloris visagiei Broom, 1950",
  "scientificName": {
    "@type": "TaxonName",
    "name": "Chrysochloris visagiei",
    "author": "Broom, 1950",
    "taxonRank": "SPECIES",
    "isBasedOn": {
      "@type": "ScholarlyArticle",
      "name": "Ann. Transvaal Mus. vol.21 p.238"
    }
  },
  "taxonRank": [
    "http://rs.gbif.org/vocabulary/gbif/rank/species",
    "species"
  ],
  "dwc:vernacularName": [
    {
      "@language": "eng",
      "@value": "Visagie s Golden Mole"
    },
    {
      "@language": "eng",
      "@value": "Visagie's Golden Mole"
    },
    {
      "@language": "eng",
      "@value": "Visagie's Golden Mole"
    },
    {
      "@language": "eng",
      "@value": "Visagie's Golden Mole"
    },
    {
      "@language": "",
      "@value": "Visagie's golden mole"
    },
    {
      "@language": "eng",
      "@value": "Visagie's Golden Mole"
    },
    {
      "@language": "deu",
      "@value": "Visagie-Goldmull"
    }
  ],
  "parentTaxon": {
    "@type": "Taxon",
    "name": "Chrysochloris Lacépède, 1799",
    "scientificName": {
      "@type": "TaxonName",
      "name": "Chrysochloris",
      "author": "Lacépède, 1799",
      "taxonRank": "GENUS",
      "isBasedOn": {
        "@type": "ScholarlyArticle",
        "name": "Tabl. Mamm. p.7"
      }
    },
    "identifier": [
      {
        "@type": "PropertyValue",
        "name": "GBIF taxonKey",
        "propertyID": "http://www.wikidata.org/prop/direct/P846",
        "value": 2432177
      },
      {
        "@type": "PropertyValue",
        "name": "dwc:taxonID",
        "propertyID": "http://rs.tdwg.org/dwc/terms/taxonID",
        "value": 2432177
      }
    ],
    "taxonRank": [
      "http://rs.gbif.org/vocabulary/gbif/rank/genus",
      "genus"
    ]
  }
}
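For anyone wanting to get at this JSON-LD programmatically, here is a minimal sketch that pulls every `application/ld+json` script block out of an HTML string using only the Python standard library. The class and function names are made up for the example, and fetching the page itself is left out.

```python
import json
from html.parser import HTMLParser


class JsonLdExtractor(HTMLParser):
    """Collects the parsed contents of <script type="application/ld+json"> tags."""

    def __init__(self):
        super().__init__()
        self._in_jsonld = False
        self._chunks = []
        self.documents = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._chunks = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._chunks.append(data)  # script text may arrive in several pieces

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self.documents.append(json.loads("".join(self._chunks)))
            self._in_jsonld = False


def extract_jsonld(html: str) -> list:
    parser = JsonLdExtractor()
    parser.feed(html)
    parser.close()
    return parser.documents
```

Given a document like the one above, `extract_jsonld(html)[0]["parentTaxon"]["name"]` would be one way to read the parent taxon's name.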

For more details on the potential of Bioschemas see Franck Michel's TDWG Webinar.

OCR correction

Just a placeholder to remind me to revisit OCR correction and the dream of a workflow to correct text for BHL. I came across hOCR-Proofreader (which has a GitHub repo). Internet Archive now provides hOCR files as one of its default outputs, so we're getting closer to a semi-automated workflow for OCR correction. For example, imagine having all this set up on GitHub so that people can correct text and push those corrections to GitHub. So close...

Roger Hyam keeps being awesome

Roger just keeps doing cool things that I keep learning from. In the last few months he's been working on a nice interface to the World Flora Online (WFO) which, let's face it, is horrifically ugly and does unspeakable things to the data. Roger is developing a nicer interface and is doing some cool things under the hood with identifiers that inspired me to revisit LSIDs (see below).

But the other thing Roger has been doing is using GraphQL to provide a clean API for the designer working with him to use. I have avoided GraphQL because I couldn't see what problem it solved. It's not a graph query language (despite the name), it breaks HTTP caching, it just seemed that it was the SOAP of today. But, if Roger's using it, I figured there must be something good here (and yes, I'm aware that GraphQL has a huge chunk of developer mindshare). As I was playing with yet another knowledge graph project I kept running into the challenge of converting a bunch of SPARQL queries into something that could be easily rendered in a web page, which is when the utility of GraphQL dawned on me. The "graph" in this case is really a structured set of results that correspond to the information you want to render on a web page. This may be the result of quite a complex series of queries (in my case using SPARQL on a triple store) that nobody wants to actually see. The other motivator was seeing DataCite's use of GraphQL to query the "PID Graph". So, I think I get it now, in the sense that I see why it is useful.
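To give a flavour of the reshaping involved, the sketch below takes the flat rows of a SPARQL JSON result (the standard `results.bindings` layout) and nests them into the kind of structure a web page, or a GraphQL resolver, would want to hand back. The variable names `work`, `title` and `author` are hypothetical, and this is plain Python rather than GraphQL itself.

```python
def nest_bindings(bindings: list) -> list:
    """Group flat SPARQL rows (one per work/author pair) into one record per work."""
    works = {}
    for row in bindings:
        work_id = row["work"]["value"]
        work = works.setdefault(
            work_id,
            {"id": work_id, "title": row["title"]["value"], "authors": []},
        )
        if "author" in row:  # optional variables are simply absent from a row
            name = row["author"]["value"]
            if name not in work["authors"]:
                work["authors"].append(name)
    return list(works.values())
```

The caller never sees the SPARQL, only a list of works with their authors, which is roughly the separation that makes GraphQL attractive here.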

LSIDs back from the dead

LSIDs are back baby! https://t.co/gWoBoY1wgn Persistent identifiers should, you know, persist #PID pic.twitter.com/RFW723DnVV
— Roderic Page (@rdmpage) March 9, 2021

In part inspired by Roger Hyam's work on WFO I released a Life Science Identifier (LSID) Resolver to make LSIDs resolvable. I'll spare you the gory details, but you can think of LSIDs as DOIs for taxonomic names. They came complete with a decentralised resolution mechanism (based on Internet domain names) and standards for what information they return (RDF as XML), and millions were minted for animal, fungi, and plant names. For various reasons they didn't really take off (they were technically tricky to use and didn't return information in a form people could read, so what were the chances?). Still, they contain a lot of valuable information for those of us interested in having lists of names linked to the primary literature. Over the years I have been collecting them and wanted a way to make them available. I've chosen a super-simple approach based on storing them in compressed form in GitHub and wrapping that repo in a simple web site. Lots of limitations, but I like the idea that LSIDs actually, you know, persist.
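For readers who haven't met one, an LSID is a URN of the form `urn:lsid:authority:namespace:object`, with an optional trailing revision. A minimal sketch of pulling one apart follows; the names are invented for the example and this is not the code behind the resolver.

```python
from typing import NamedTuple, Optional


class Lsid(NamedTuple):
    authority: str   # usually an Internet domain name
    namespace: str
    object_id: str
    revision: Optional[str]


def parse_lsid(lsid: str) -> Lsid:
    """Split an LSID into its parts, raising ValueError if it is malformed."""
    parts = lsid.strip().split(":")
    if len(parts) not in (5, 6) or parts[0].lower() != "urn" or parts[1].lower() != "lsid":
        raise ValueError(f"not an LSID: {lsid!r}")
    authority, namespace, object_id = parts[2:5]
    revision = parts[5] if len(parts) == 6 else None
    return Lsid(authority, namespace, object_id, revision)
```

For example, `parse_lsid("urn:lsid:example.org:names:12345")` (a made-up identifier) yields the authority `example.org`, which is the part the decentralised resolution mechanism hangs off.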

DOIs for Biodiversity Heritage Library

Six months ago we started BHL's Persistent Identifier Working Group, so it's time for a huge shout out to the amazing efforts of @mlichtenberg @SusanWLynch @fauxbrarian @missellb @BHLProgramMgr & @rdmpage to mint DOIs for the historic literature on @BioDivLibrary. #RetroPIDs pic.twitter.com/s4jJBUEQg9
— Nicole Kearney (@nicolekearney) April 1, 2021

In between everything else I've been working with BHL to add DOIs to the literature that they have scanned. Some of this literature is old and of limited scientific value (but sure looks pretty - Nicole Kearney is going to take me to task for saying that), but a lot of it is recent, rich, and scientifically valuable. I'm hoping that the coming months will see a lot of this literature emerge from relative obscurity and become a first class citizen of the taxonomic and biodiversity literature.

Summary

I guess something meaningful and deep should go here... nope, I'm done.

Thursday, July 09, 2020

Lists of species don't matter: thoughts on "Principles for creating a single authoritative list of the world’s species"

Garnett et al. recently published a paper in PLoS Biology that starts with the sentence "Lists of species matter":

Garnett, S. T., Christidis, L., Conix, S., Costello, M. J., Zachos, F. E., Bánki, O. S., … Thiele, K. R. (2020). Principles for creating a single authoritative list of the world’s species. PLOS Biology, 18(7), e3000736. doi:10.1371/journal.pbio.3000736

This paper (one of a forthcoming series) is pretty much the kind of paper I try and avoid reading. It has lots of authors so it is a paper by committee, those authors all have a stake in particular projects, and it is an acronym soup of organisations the paper is pitched at. It's a well-worn strategy: write one or more papers making the case that there is a problem, then get funding based on the notion that clearly there's a problem (you've published papers saying so) and that you and your co-applicants are best placed to solve it (clearly, because you wrote the papers identifying the problem in the first place). I'm not criticising the strategy, it's how you get things done in science. It just makes for a rather uninspiring read.

From my perspective focussing on "lists" is a mistake. Lists don't really matter, it is what is on the list that counts. And I think this is where the real prize is. As I play with Wikidata I'm becoming increasingly aware of the ~~clusterfuck~~ mess the taxonomic database community has created by conflating taxonomic names with taxa, and by having multiple identifiers for the same things. We survive this mess by relying on taxonomic names as somewhat fuzzy identifiers, and the hope that we can communicate successfully with other people despite this ambiguity (I guess this is pretty much the basis of all communication). As Roger Hyam notes:

These taxon names we are dealing with are really just social tags that need to be organised in a central place.

Having lots of names (tags) is fine, and Wikidata is busy harvesting all these taxonomic names and their identifiers (ITIS, IPNI, NCBI, EOL, iNaturalist, eBird, etc., etc., etc.). For most of these names all we have is a mapping to other identifiers for the same name, a link to a parent taxon, and sometimes a link to a reference for the name. But what happens if we want to attach data to a taxon? Take, for example, the African Piculet Verreauxia africana. This bird has at least two scientific names, each with a separate entry in Wikidata: Verreauxia africana Q28123873 and Sasia africana Q1266812. These are the same species yet it has two entries in Wikidata. If I want to add, say, body weight, or population size, or longevity, which Wikidata item do I add that data to?

What we need is an identifier for the species, an identifier that remains unchanged even if the name changes, or if that species moves in the taxonomic hierarchy. Some databases do this already. For example the eBird identifier for Verreauxia africana/Sasia africana is afrpic1. Because the identifier remains unchanged we can do things such as "diffs" between successive classifications showing how the species has moved between different genera (see Taxonomic publications as patch files and the notion of taxonomic concepts):

Ironically it seems that for birds the common name (in this case "African Piculet") is a more stable identifier than the scientific name (although that may well change). By having stable taxon identifiers we can then decide what entity to attach biological data to. Taxonomic names have failed to do this, but are still vital as well known tags. The actual taxon identifiers should be opaque identifiers (like "afrpic1" - not really opaque but close enough - or Avibase's C4DFB5E31495AE94). Make each opaque identifier a DOI, use existing taxonomic names as formalised tags so we aren't disconnected from the literature, use timestamped versions to track changes in species classification over time, and we have something useful.

This, I think, is the real prize. Rather than frame the task as making a list of species so that organisations can have a checklist they can all share, why not frame it as providing a framework that we can hang trait data on? We have vast quantities of data residing in siloed databases, spreadsheets, and centuries of biological literature. The argument shouldn't be about what is on a list, it should be how we group that information together and enable people to do their science. By providing stable identifiers that are resistant to name changes we can confidently associate trait data with taxa. Taxonomy could then actually be what it should be, the organisational framework for biological information (see Taxonomy as Information Science).

Monday, June 08, 2020

Towards visualising classifications from Wikidata

These are simply notes to myself about taxonomic classifications in Wikidata.

Classifications in Wikidata can be complex and are often not trees. For example, if we trace the parents of the frog family Leptodactylidae back we get a graph like this:

Each oval represents a taxon in Wikidata, and each arrow connects a taxon to its parent(s) in Wikidata. Likewise, if we do the same for the albatross genus Diomedea we get a similarly complex diagram:

The presence of multiple classifications likely reflects several factors. If you deal with just extant species you are likely to have fairly shallow classifications, for example, the kingdom, phylum, class, order, family, genus ranks used by GBIF may be enough. Some taxonomic groups may routinely use ranks such as subfamily, and in well-studied groups there may be additional taxa based on phylogenetic research (e.g., the RTA clade in spiders). And of course, different Wikidata editors may favour different classifications.

Anecdotally (certainly for vertebrates), many of the additional levels in the classifications in Wikidata come from fossil taxa. In the case of birds, extant Aves (birds) are a fairly isolated group in the tree of life, but as we go down the tree towards their common ancestor with the crocodilians we encounter dinosaurs and other taxa. So if you are a palaeontologist the jump from, say, Aves to Tetrapoda skips over a fairly significant part of the tree!

Faced with this complexity, how do we display a Wikidata classification in a simple way? One approach may be to display only a classification from a particular source, for example Mammal Species of the World. This requires that Wikidata has that classification, and enough information for you to extract it by a SPARQL query (for example if each node in the classification that is in MSW has a reference to MSW attached to that node).

Another approach is to extract a simplified classification from the sort of graphs shown above. Technically, these graphs are DAGs (Directed acyclic graphs). An obvious way to simplify a DAG is to find the shortest path in that DAG. For example, the path (Eukaryota, Animalia, Bilateria, Deuterostomia, Chordata, Olfactores, Gnathostomata, Tetrapoda, Amphibia, Anura, Leptodactylidae) is a path through the DAG shown above. Shortest paths are reasonably easy to find once you have a topological sorting of the graph (see e.g. Shortest Path in Directed Acyclic Graph). At the moment this looks the best bet for displaying classifications from Wikidata.
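A minimal sketch of the shortest-path idea, assuming the classification is held as a dictionary mapping each taxon to its list of parents and that every edge counts as one step. It uses `graphlib.TopologicalSorter` from the Python standard library (3.9 or later). The function name is made up for the example.

```python
from graphlib import TopologicalSorter


def shortest_lineage(parents: dict, taxon: str, root: str) -> list:
    """Shortest root-to-taxon path through a DAG given as {child: [parents]}."""
    # Treating parents as predecessors means static_order() yields each
    # parent before any of its children.
    order = TopologicalSorter(parents).static_order()
    dist = {root: 0}
    best_parent = {}
    for node in order:
        for parent in parents.get(node, []):
            if parent in dist and (node not in dist or dist[parent] + 1 < dist[node]):
                dist[node] = dist[parent] + 1
                best_parent[node] = parent
    if taxon not in dist:
        raise ValueError(f"{taxon} is not a descendant of {root}")
    path = [taxon]
    while path[-1] != root:
        path.append(best_parent[path[-1]])
    return path[::-1]
```

With unit-length edges a breadth-first search would do the same job. The topological ordering earns its keep if the edges are later given weights, say to penalise parents that lack a reference.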

Preferred classifications

In some cases the classification in Wikidata is complicated, but this complexity isn’t reflected in SPARQL results because parts of that classification have different “ranks”. For example, for the plant order Fagales there are currently seven parents:

fabids
Rosanae
Hamamelididae
eurosids I
Monochlamydeae
Archichlamydeae
Juglandanae

One of these is flagged “Preferred rank” (fabids) and the others are “Normal rank”. As a result only the fabids appear in the list of parents.

Monday, April 20, 2020

Making sense of how Wikidata models taxonomy

Given my renewed enthusiasm for Wikidata, I'm trying to get my head around the way that Wikidata models biological taxonomy. As a first pass, here's a diagram of the properties linked to a taxonomic name. The model is fairly comprehensive, it includes relationships between names (e.g., basionym, protonym, replacement), between taxa (e.g., parent taxon), and links to the literature. It's also a complex model to query, given that a lot of information is expressed using qualifiers. Hence there's a bit of head scratching while I figure out the relationship between properties, statements, etc.

Links to the literature are one of my interests, and in cases where Wikidata has this information we can start to enhance the way we display publications, e.g.

Wow, that’s great! Hope it wasn’t too tedious a slog. Oh, and I saw this list of cicadas linked to a Fauna of NZ publication https://t.co/goIG9Lr7Gs - I’m assuming you made those links? Nice example of the potential to enhance publications on @Wikidata pic.twitter.com/iDsv4YgnkF
— Roderic Page (@rdmpage) April 19, 2020

The Wikidata model is very like that used in Darwin Core, where everything is a taxon and every taxon has a name, which means that relationships that are notionally between names and not taxa (e.g., basionym) are all treated as relationships between taxa.

One big challenge is how to interpret Wikidata as a classification, given that we expect classifications to be trees. The taxonomic classification in Wikidata is clearly not a tree, for example:

Hmmm, so @wikidata has a rather *complicated* biological taxonomy that is certainly not a tree. Here is the parent - child structure for the frog family Leptodactylidae. Instead of a single path from tip to root, we have all sorts of detours #crowdsourced pic.twitter.com/zu03KvLgnG
— Roderic Page (@rdmpage) April 4, 2020

What I think is happening here is that different people are adding different parent taxa, depending on which classification they follow. Some classifications (e.g., that used by GBIF) are "shallow" with only a few levels (e.g., kingdom, phylum, class, order, family, genus), other classifications are deep (e.g., NCBI). So the idea of simply being able to do a SPARQL query and get a tree (e.g. Displaying taxonomic classifications from Wikidata using d3js and SPARQL) runs into problems. But this could also be a strength, particularly if we had a reference or source for each parent-child pair. That way we could (a) store multiple classifications in Wikidata, and (b) have queries that retrieve classifications according to a particular source (e.g., GBIF).
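A toy sketch of what "a source for each parent-child pair" buys us, assuming the edges have already been pulled out of Wikidata as (child, parent, source) triples. The function name and the source labels in the usage note are invented for illustration.

```python
def classification_from(edges: list, source: str) -> dict:
    """Return the {child: parent} tree asserted by a single source."""
    tree = {}
    for child, parent, edge_source in edges:
        if edge_source != source:
            continue
        if child in tree and tree[child] != parent:
            # one source giving a taxon two parents means it isn't a tree
            raise ValueError(f"{source} gives {child} two parents")
        tree[child] = parent
    return tree
```

Given edges such as `("Leptodactylidae", "Anura", "source A")` and `("Leptodactylidae", "Hyloidea", "source B")`, asking for "source A" returns the shallow classification and asking for "source B" returns the deep one, each as a proper tree.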

So, lots of potential, but lots I've still to learn.

Thursday, October 25, 2018

Taxonomic publications as patch files and the notion of taxonomic concepts

There's a slow-burning discussion on taxonomic concepts on GitHub that I am half participating in. As seems inevitable in any discussion of taxonomy, there's a lot of floundering about given that there's lots of jargon - much of it used in different ways by different people - and people are coming at the problem from different perspectives.

In one sense, taxonomy is pretty straightforward. We have taxonomic names (labels), we have taxa (sets) that we apply those labels to, and a classification (typically a set of nested sets, i.e., a tree) of those taxa. So, if we download, say, GenBank, or GBIF, or BOLD we can pretty easily model names (e.g., a list of strings), the taxonomic tree (e.g., a parent-child hierarchy), and we have a straightforward definition of the terminal taxa (leaves) of the tree: they comprise the specimens and observations (GBIF), or sequences (GenBank and BOLD) assigned to that taxon (i.e., for each specimen or sequence we have a pointer to the taxon to which it belongs).

Given this, one response to the taxonomic concept discussion is to simply ignore it as irrelevant, and we can demonstrably do a lot of science without it. I suspect most people dealing with GBIF and GenBank data aren't aware of the taxonomic concept issue. Which raises the question, why the ongoing discussion about concepts?

Perhaps the fundamental issue is that taxonomic classification changes over time, and hence the interpretation of a taxon can change over time. In other words, the problem is one of versioning. Once again, the simplest strategy to deal with this is simply use the latest version. In much the same way that most of us probably just read the latest version of a Wikipedia page, and many of us are happy to have our phone apps update automatically, I suspect most are happy to just grab the latest version and do some, you know, science.

I think taxonomic concepts really become relevant when we are aggregating data from sources where the data may not be current. In other words, where data is associated with a particular taxonomic name and the interpretation of that name has changed since the last time the data was curated. If the relationships of a taxon or specimen can be computed on the fly, e.g. if the data is a DNA barcode, then this issue is less relevant because we can simply re-cluster the sequences and discover where the specimen with that sequence belongs in a new classification. But for many specimens we don't have sufficient information to do this computation (this is one reason DNA barcodes are so useful, everything needed to determine a barcode's relationship is contained in the sequence itself).

To make this concrete, consider the genus Brookesia in GBIF (GBIF:2449310).

According to Wikipedia Brookesia is endemic to Madagascar, so why does it appear on the African mainland? There are two records from Africa, Brookesia brookesia ionidesi collected in 1957 and Brookesia temporalis collected in 1926. Both represent taxa that were in the genus Brookesia at one point, but are now in different genera. So our notion of Brookesia has changed over time, but curation of these records has yet to catch up with that.

So, what would be ideal would be if we have a timestamped series of classifications so that we could go back in time and see what a given taxon meant at a given time, and then go forward to see the status of that taxon today. Having such a timestamped series is not a trivial task, indeed it may only be available in well studied groups. Birds are one such group, where each year eBird updates the current bird classification based on taxonomic activity over the previous year. As part of the GitHub discussion I posted a visual "diff" between two bird classifications:

You can see the complete diff here, and the blog post Visualising the difference between two taxonomic classifications for details on the method. The illustration above shows the movement of one species from Sasia to Verreauxia.

So, given two classifications we can compute the difference between them, and represent that difference as an "edit script" or operations to convert one tree into another. These edits are essentially what taxonomists do when they revise a group, they do things such as move species from one genus to another, merge some taxa, sink others into synonymy, and so on. So, taxonomy is essentially creating a series of edit files ("patches") to a classification. At a recent workshop in Ottawa Karen Cranston pointed out that the Open Tree of Life has been accumulating amendments to their classification and that these are essentially patch files.
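A deliberately simplified sketch of an edit script, assuming each classification is a dictionary from a stable taxon identifier to its parent. It only knows about add, remove and move operations, so it is a long way from a full tree-edit algorithm, and the operation names are invented for the example.

```python
def diff_classifications(old: dict, new: dict) -> list:
    """List the operations that turn the old {taxon: parent} map into the new one."""
    ops = []
    for taxon in sorted(old.keys() | new.keys()):
        before, after = old.get(taxon), new.get(taxon)
        if before == after:
            continue
        if before is None:
            ops.append(("add", taxon, after))
        elif after is None:
            ops.append(("remove", taxon, before))
        else:
            ops.append(("move", taxon, before, after))
    return ops


def apply_patch(classification: dict, ops: list) -> dict:
    """Apply an edit script and return the patched classification."""
    result = dict(classification)
    for op in ops:
        kind, taxon = op[0], op[1]
        if kind == "remove":
            result.pop(taxon, None)
        else:
            result[taxon] = op[-1]  # the new parent for both "add" and "move"
    return result
```

Using the eBird identifier as the key, the African Piculet example comes out as a single `("move", "afrpic1", "Sasia", "Verreauxia")` operation, and applying the patch to the old classification reproduces the new one.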

Hence, we could have a markup language for taxonomic work that described that work in terms of edit operations that can then be automatically applied to an existing classification. We could imagine encoding all the bird taxonomy for a year in this way, applying those patches to the previous year's tree, and out pops the new classification. The classification becomes an evolving document under version control (think GitHub for trees). Of course, we'd need something to detect whether two different papers were proposing incompatible changes, but that's essentially a tree compatibility problem.

One way to store version information would be to use time-based versioned graphs. Essentially, we start with each node in the classification tree having a start date (e.g., 2017) and an open-ended end date. A taxonomic work post 2017 that, say, moved a species from one genus to another would set the end date for the parent-child link between genus and species, and create a new timestamped node linking the species to its new genus. To generate the 2018 classification we simply extract all links in the tree whose date range includes 2018 (which means the old generic assignment for the species is not included). This approach gives us a mechanism for automating the updating of a classification, as well as time-based versioning.
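The time-based versioning described above can be sketched in a few lines, assuming whole-year granularity and that each parent-child link carries a start year and an optional end year. The class and function names, and the years in the usage note, are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Link:
    child: str
    parent: str
    start: int                  # first year the link is valid
    end: Optional[int] = None   # last year it is valid; None means still current


def move(links: list, child: str, new_parent: str, year: int) -> None:
    """Close the child's current link and open a new one starting in `year`."""
    for link in links:
        if link.child == child and link.end is None:
            link.end = year - 1
    links.append(Link(child, new_parent, start=year))


def classification_in(links: list, year: int) -> dict:
    """Extract the {child: parent} links whose date range includes `year`."""
    return {
        link.child: link.parent
        for link in links
        if link.start <= year and (link.end is None or year <= link.end)
    }
```

Starting from `Link("afrpic1", "Sasia", 2017)` and calling `move(links, "afrpic1", "Verreauxia", 2018)`, the 2017 classification still places the species in Sasia while the 2018 one places it in Verreauxia, with nothing deleted along the way.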

I think something along these lines would create something useful, and focus the taxonomic discussion on solving a specific problem.

Thursday, December 22, 2016

DNA barcoding taxonomy now in GBIF

Following on from adding DNA barcodes to GBIF I've now uploaded a taxonomic classification of DNA barcode BINs (Barcode Index Numbers). Each BIN is a cluster of similar DNA barcodes that is essentially equivalent to a species. For more details see:

Ratnasingham, S., & Hebert, P. D. N. (2013, July 8). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.), PLoS ONE. Public Library of Science (PLoS). https://doi.org/10.1371/journal.pone.0066213

The data I've uploaded was obtained by screen scraping the BOLD web site for each BIN in the DNA barcode dataset (BOLD's API doesn't let me get all the information I want). In addition to the taxonomic hierarchy associated with each BIN I've also extracted any publications mentioned on the BIN page, and subsequently tried to link those to the corresponding DOI, if the publication has one. The code for all this is available on GitHub https://github.com/rdmpage/bold-bins, which also serves as the host for the Darwin Core Archive for this dataset. There's a neat trick where you can use a .gitattributes file to tell GitHub not to store certain files in the zip file it creates for the repository (see Excluding files from git archive exports using gitattributes by @fmarier).

Having done this, I've a few thoughts.

Please, please use DOIs for articles

BOLD pages for BINs often include one or more papers that published the barcodes included in that BIN. This is great, but often links to these papers are pretty strange:

Ever wonder why DOIs are nice? Take a look at this URL for an article :O This is why you want to store DOIs in databases, not links... pic.twitter.com/QzthcL8f7k
— Roderic Page (@rdmpage) December 20, 2016

If you are going to store literature in a database treat links to articles with great care. They are often full of extraneous stuff that depends on how the user reached that article online. DOIs greatly simplify this process. Instead of a URL like http://onlinelibrary.wiley.com/store/10.1111/j.1755-0998.2009.02650.x/asset/j.1755-0998.2009.02650.x.pdf?v=1&t=hellc54c&s=e14bbc4146b66a051ad5cd1f5361ac2e16dc5831&systemMessage=Pay+Per+View+will+be+unavailable+for+upto+3+hours+from+06%3A00+EST+March+23rd+ (I kid you not) you should use the DOI 10.1111/j.1755-0998.2009.02650.x.
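If all you have is a messy URL, it is sometimes possible to recover the DOI from it. The Python sketch below is only an illustration: the pattern and the extract_doi function are my own assumptions, and the pattern assumes the DOI suffix stops at the next slash, question mark, hash or space. Real DOIs can contain slashes, so this will not work for every publisher.

```python
import re

# Assumption: a DOI starts with "10.", has a 4-9 digit registrant code, and the
# suffix ends at the next "/", "?", "#" or whitespace. DOIs may legitimately
# contain "/" in the suffix, so this is a heuristic, not a general parser.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s/?#]+")


def extract_doi(url):
    """Return the first DOI-like string found in `url`, or None."""
    match = DOI_PATTERN.search(url)
    return match.group(0) if match else None


# A shortened form of the Wiley URL quoted above
url = (
    "http://onlinelibrary.wiley.com/store/10.1111/j.1755-0998.2009.02650.x"
    "/asset/j.1755-0998.2009.02650.x.pdf?v=1&t=hellc54c"
)
doi = extract_doi(url)
```

For this URL the sketch recovers 10.1111/j.1755-0998.2009.02650.x, which is what should have been stored in the first place.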

Adding DOIs to these articles means GBIF will display them on the corresponding species page, for example Centromerus sylvaticus (Blackwall, 1841) has links to these two papers:

Telfer, A., deWaard, J., Young, M., Quinn, J., Perez, K., Sobel, C., … Hebert, P. (2015, August 30). Biodiversity inventories in high gear: DNA barcoding facilitates a rapid biotic survey of a temperate nature reserve. Biodiversity Data Journal. Pensoft Publishers. https://doi.org/10.3897/bdj.3.e6313

Blagoev, G. A., deWaard, J. R., Ratnasingham, S., deWaard, S. L., Lu, L., Robertson, J., … Hebert, P. D. N. (2015, July 26). Untangling taxonomy: a DNA barcode reference library for Canadian spiders. Molecular Ecology Resources. Wiley-Blackwell. https://doi.org/10.1111/1755-0998.12444

Now GBIF users can easily explore what we know about barcodes from this species by going directly to the primary literature.

Dark taxa

In an earlier post I discussed dark taxa, which are taxa that lack formal scientific names. BOLD is full of these, so many of the taxa I've added to GBIF don't have Linnean names. Instead I've used a combination of higher taxon name and the BIN itself.

Composite taxa

Having said that BINs are essentially the same as species, this need not imply that there's a one-to-one match between BINs and currently recognised species (indeed, this is one of the things that makes barcoding so interesting, its ability to discover hidden variation within taxa currently considered to be a single species). This means that some BINs will have the same name (significant variation within a species), and some BINs will have multiple names (more than one species name assigned to the same BIN). For example, BOLD:AAA2525 is a cluster of DNA barcodes with the following names attached:

Icaricia lupini
Icaricia acmon
Icaricia neurona
Plebejus lupini
Aricia sp. RV-2009
Aricia acmon
Plebejus acmon
Plebejus elvira
Icaricia lupini texanus
Icaricia lupini monticola
Icaricia lupini chlorina
Icaricia lupini lupini
Icaricia lupini alpicola

This cluster of names includes subspecies and synonyms. Looking at the phylogeny for this BIN (PDF-only), some of these names are intermingled, suggesting that some specimens might be misidentified; apparently Icaricia lupini and I. acmon are very similar:

Coutsis, J. G. (2011). The male genitalia of N American Icaricia lupini and I. acmon; how they differ from each other and how they compare to those of the other two members of the group, I. neurona and I. shasta (Lepidoptera: Lycaenidae, Polyommatiti). Phegea, 39(4), 144-151. Retrieved from http://biostor.org/reference/160269

Summary

This is a first attempt to integrate DNA barcode taxonomy into GBIF, so there are going to be some issues to explore. GBIF currently assumes taxa can be easily mapped to a Linnean hierarchy. While this is ultimately likely to be true for animal COI barcodes, getting there is going to be messy while we have numerous dark taxa and/or BINs which don't match the current identifications of the voucher specimens.

Perhaps it's worth asking whether attempting to fit the results of DNA barcoding into a classical taxonomy is the best way forward. In doing so we lose much of what makes barcoding so powerful, namely a specimen-level phylogenetic tree. Maybe what we should be really thinking about is ways to explore barcoding data natively. See Notes on next steps for the million DNA barcodes map for some thoughts on how to do that.

Image from Wikimedia Commons The Face of a Lupine Blue by Ingrid Taylar.

Friday, August 07, 2015

Testing the GBIF taxonomy with the graph database Neo4J

I've been playing with the graph database Neo4J to investigate aspects of the classification of taxa in GBIF's backbone classification. A number of people in biodiversity informatics have been playing with Neo4J. Nicky Nicolson at Kew has a nice presentation on using graph databases to handle names (Building a names backbone), and the Open Tree of Life project use it in their tree machine.

One of the striking things about Neo4J is how much effort has gone in to making it easy to play with. In particular, you can create GraphGists, which are simple text documents that are transformed into interactive graphs that you can query. This is fun, and I think it's also a great lesson in how to publicise a technology (compare this with RDF and SPARQL, which is in no way fun to work with).

I created some GraphGists that explore various problems with the current GBIF taxonomy. The goal is to find ways to quickly test the classifications for logical errors, and wherever possible I want to use just the information in the GBIF classification itself.

The first example is a version of the "papaya plots" that I played with in an earlier post (see also an unfinished manuscript Taxonomy as impediment: synonymy and its impact on the Global Biodiversity Information Facility's database). For various reasons, GBIF has ended up with the same species occurring more than once in its backbone classification, usually because none of its source databases has enough information on synonymy to prevent this happening.

As an example, I've grabbed the classification for the bat family Molossidae, converted it to a Neo4J graph, and then tested for the existence of species in different genera that have the same specific epithet. This is a useful (but not foolproof) test of whether there are undetected synonyms, especially if the generic placement of a set of species has been in flux (this is certainly true for these bats). If you visit the gist you will see a list of species that are potential synonyms.
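The GraphGist does this with a query over the Neo4J graph. To show the logic of the test without a database, here is a small Python sketch. The shared_epithets function is hypothetical, and it assumes the input is a flat list of binomials from a single family.

```python
from collections import defaultdict


def shared_epithets(species_names):
    """Map each specific epithet to the genera it occurs in, keeping only
    epithets found in more than one genus (potential undetected synonyms)."""
    genera_by_epithet = defaultdict(set)
    for name in species_names:
        parts = name.split()
        if len(parts) != 2:
            continue  # this sketch only handles plain binomials
        genus, epithet = parts
        genera_by_epithet[epithet].add(genus)
    return {
        epithet: sorted(genera)
        for epithet, genera in genera_by_epithet.items()
        if len(genera) > 1
    }


# A few Molossidae names as they appear in GBIF
names = [
    "Molossops aequatorianus",
    "Chaerephon ansorgei",
    "Tadarida ansorgei",
    "Mormopterus petrophilus",
    "Sauromys petrophilus",
    "Molossus ater",
]
candidates = shared_epithets(names)
```

Anything returned is only a candidate for closer inspection, because the same epithet can legitimately be used for different species in different genera.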

A related test catches cases where one classification treats a taxon as a subspecies whereas another treats it as a full species, and GBIF has ended up with both interpretations in the same classification (e.g., the butterfly species Heliopyrgus margarita and the subspecies Heliopyrgus domicella margarita).

Another GraphGist tests that the genus name for a species matches the genus it is assigned to. This seems obvious (the species Homo sapiens belongs in the genus Homo) but there are cases where GBIF's classification fails this test, such as the genus Forsterinaria. Typically this test fails due to problematic generic names (e.g., homonyms), incorrect spellings, etc.
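The same check is easy to express outside a graph database. This is a minimal Python sketch, with a hypothetical genus_mismatches function and made-up records ("Aus bus" and "Cus" are placeholder names, not real GBIF data).

```python
def genus_mismatches(records):
    """Given (species name, assigned genus) pairs, return the pairs where the
    first word of the species name differs from the genus it is placed in."""
    return [
        (species, genus)
        for species, genus in records
        if species.split()[0] != genus
    ]


records = [
    ("Homo sapiens", "Homo"),  # passes the test
    ("Aus bus", "Cus"),        # hypothetical failure: name says Aus, parent is Cus
]
problems = genus_mismatches(records)
```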

The last test is slightly more pedantic, but revealing nevertheless. It relies on the convention in zoology that when you write the authorship of a species name, if the name is not in the original genus then you enclose the authorship in parentheses. For example, it's Homo sapiens Linnaeus, but Homo erectus (Dubois, 1894) because Dubois originally called this species Pithecanthropus erectus.

Because you can only move a species to a genus that has been named, it follows that if a species is described before the genus name was published, then if the species is in that newer genus the authorship must be in parentheses. For example, the lepidopteran genus Heliopyrgus was published in 1957, and includes the species willi Plötz, 1884. Since this species was described before 1957, it must have been originally placed in a different genus, and so the species name should be Heliopyrgus willi (Plötz, 1884). However, GBIF has this as Heliopyrgus willi Plötz, 1884 (no parentheses). The GraphGist tests for this, and finds several Heliopyrgus species names that are incorrectly formed. This may seem pedantic, but it has practical consequences. Anyone searching for the original description of Heliopyrgus willi Plötz, 1884 might think that they should be looking for the text string "Heliopyrgus willi" in literature from 1884, but the name didn't exist then and so the search will be fruitless.
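Here is the logic of that test as a Python sketch rather than a graph query. The needs_parentheses function is hypothetical, and it assumes the year can be pulled out of the authorship string as the first four-digit number between 1500 and 2099.

```python
import re

# Assumption: the first four-digit number in the range 1500-2099 is the year
YEAR_PATTERN = re.compile(r"\b(1[5-9]\d{2}|20\d{2})\b")


def needs_parentheses(authorship, genus_year):
    """Return True if the species was described before its current genus was
    published, yet the authorship is not enclosed in parentheses."""
    text = authorship.strip()
    has_parentheses = text.startswith("(") and text.endswith(")")
    match = YEAR_PATTERN.search(text)
    if match is None:
        return False  # no year available, so the test cannot be applied
    species_year = int(match.group(0))
    return species_year < genus_year and not has_parentheses


# Heliopyrgus was published in 1957
gbif_version_flagged = needs_parentheses("Plötz, 1884", 1957)
corrected_version_flagged = needs_parentheses("(Plötz, 1884)", 1957)
```

The name as GBIF has it is flagged, and the version with parentheses is not.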

I think there's a lot of scope for developing tests like these, including some that make use of external data as well. In an earlier post (A use case for RDF in taxonomy) I mused about using RDF to perform tests like this. However Neo4J is so much easier to work with I suspect that it makes better sense to develop standard queries in its query language (CYPHER) and use those.

Tuesday, August 04, 2015

Possible project: extract taxonomic classification from tags (folksonomy)

Note to self about a possible project. This PLoS ONE paper:

Tibély, G., Pollner, P., Vicsek, T., & Palla, G. (2013, December 31). Extracting Tag Hierarchies. (P. Csermely, Ed.) PLoS ONE. Public Library of Science (PLoS). http://doi.org/10.1371/journal.pone.0084133

describes a method for inferring a hierarchy from a set of tags (and cites related work that is of interest). I've grabbed the code and data from http://hiertags-beta.elte.hu/home/ and put it on GitHub.

Possible project

Use the Tibély et al. method (or others) on taxonomic names extracted from BHL text (or other) and see if we can reconstruct taxonomic classifications. How do classifications compare to those in databases? Can we enhance existing databases using this technique (e.g., extract classifications from literature for groups poorly represented in existing databases)? Could be part of larger study of what we can learn from co-occurrence of taxonomic names, e.g. Automatically extracting possible taxonomic synonyms from the literature.

Note to anyone reading this: if this project sounds interesting, by all means feel free to do it. These are just notes about things that I think would be fun/interesting/useful to do.

Wednesday, November 20, 2013

Reaction to taxonomic reactionaries

There is a fairly scathing editorial in Nature [The new zoo. (2013). Nature, 503(7476), 311–312. doi:10.1038/503311b ] that reacts to a recent paper by Dubois et al.:

Dubois, A., Crochet, P.-A., Dickinson, E. C., Nemésio, A., Aescht, E., Bauer, A. M., Blagoderov, V., et al. (2013). Nomenclatural and taxonomic problems related to the electronic publication of new nomina and nomenclatural acts in zoology, with brief comments on optical discs and on the situation in botany. Zootaxa, 3735(1), 1. doi:10.11646/zootaxa.3735.1.1

To quote the editorial:

...there might be more than a disinterested concern for scientific integrity at work here. A typical reader of the Zootaxa paper (not that there are typical readers of a 94-page work on the minutiae of nomenclature protocol) might reasonably conclude that the authors have axes to grind. Exhibits A–E: the high degree of autocitation in the Zootaxa paper; the admission that some of the authors were against the ICZN amendments; that they clearly feel that their opinions regarding the amendments have been disregarded; the ad hominem attacks on ‘wealthy’ publishers as opposed to straitened natural-history societies; and the use of emotive and occasionally intemperate language that one does not associate with the usually dry and legalistic tone of debate on this subject. (The online publisher BioMed Central, based in London, gets a particular pasting, to which it has responded; see http://blogs.biomedcentral.com/bmcblog/2013/11/15/the-devil-may-be-in-the-detail-but-the-longview-is-also-worth-a-look/.)

One of many recommendations made in the diatribe is that journals should routinely have on their review boards those expert in the business of nomenclature — in other words, a cadre of people who are, unlike ordinary mortals, qualified to interpret the mystic strictures of the code. A typical reader is again entitled to ask whom, apart from themselves, the authors think might be suitable candidates.

Ouch! But Dubois et al.'s paper pretty much deserves this reaction - it's a reactionary rant that is breathtaking in its lack of perspective. From the abstract:

As shown by several examples discussed here, an electronic document can be modified while keeping the same DOI and publication date, which is not compatible with the requirements of zoological nomenclature. Therefore, another system of registration of electronic documents as permanent and inalterable will have to be devised.

So, we have an identifier system for publications which currently has 63,793,212 registered DOIs (see CrossRef), includes key journals such as Zootaxa and ZooKeys, and which has tools to support versioning of papers (see CrossMark) but hey, let's have our own unique system. After all, zoological nomenclature is special, and our community has such a good track record of maintaining our own identifier system (LSIDs anyone?).

Now that the financial crisis faced by the ICZN has been averted by a three-year bail-out by the National University of Singapore (for three years at least), maybe the guardians of scientific names can focus on providing tools and services of value to the broader scientific community (or, indeed, taxonomists). As it stands, the ICZN can say little about the majority of animal names. Much better to focus on that than trying to rail against the practices of modern publishing.

Thursday, August 15, 2013

BioNames update - taxonomic name timelines

One feature I've always wanted to have in BioNames is a timeline of taxonomic names. ION has one (see here), but I wanted a way to go from the timeline to the actual publications. In other words, if, say, there were approximately 99 bird names published in 2012, I want to see the papers that published those names.

As an example, you can go to http://bionames.org/timeline/Animalia/Chordata/Vertebrata/Aves and get a timeline of bird names:
Birds

The data is incomplete (I'm still processing and indexing the data) but you get a sense that the number of bird names being coined each year is fairly small. Actually, I was surprised it was as high as it is, but remember these are not the number of new species described each year. It does include new species (many of them are fossils in this case), but also higher taxa and nomenclatural changes (e.g., replacement names for homonyms, etc.). The timeline also only shows names that are "new" (i.e., not the new combinations that result when a species gets moved to a new genus), and only those names linked to a publication.

The timeline graphs are clickable, so you can click on a year and get a list of publications for that taxon for that year (sometimes this can take a while). You can click on the publications for more details, and sometimes you can also view the full text.

The timeline page also shows a treemap of the taxonomic groups recognised by the ION database (the example below is for birds):

Treemap

Browsing different taxa shows some interesting patterns. For example, here are snakes:

Snakes

That huge spike on the far right? That's due to hundreds of names published by "Snake Man" Raymond Hoser (his activities have been the subject of an impassioned debate on TAXACOM).

The timeline for insects shows a major dip in new names that corresponds to the Second World War, followed by a big jump in the late sixties.

Insects

Smaller taxa, such as Teuthida, show a more episodic pattern where a single monograph can result in a prominent spike in the numbers in any one year (again, you can click on the spikes to see the actual publications):

Squid

Still a daunting amount of cleaning and linking to do, but it's one more way to explore the efforts of generations of taxonomists to discover and make sense of the diversity of animal life on the planet.

Wednesday, August 14, 2013

Cluster maps, papaya plots, and the trouble with GBIF taxonomy

Continuing the theme of the failings of the GBIF classification I've been playing further with cluster maps to visualise the problem (see this earlier post for an introduction).

Browsing through bats in GBIF I keep finding the same species appearing more than once, albeit in different genera. As discussed in the gibbon example, GBIF merges several competing classifications for mammals, and these often don't agree on the "accepted name" for a species. In the absence of a decent database of taxonomic synonyms, GBIF ends up duplicating species, and each duplicate is often associated with different occurrence data. If you are trying to get the distribution for a species this can be a disaster.

To get a sense of the scale of the problem I put together a simple tool to create cluster maps. The code is on GitHub and there is a live service at http://iphylo.org/~rpage/cluster-map/. The service takes a simple tab-delimited file that lists sets and their members, computes the overlap between the sets, calls Graphviz to lay out a graph in SVG, then draws in the members of each cluster (phew).

The input file looks something like this:


Molossops	aequatorianus
Chaerephon	aloysiisabaudiae
Tadarida	aloysiisabaudiae
Chaerephon	ansorgei
Tadarida	ansorgei
Molossus	ater
Mormopterus	petrophilus
Sauromys	petrophilus

What can we do with this tool? Well, I created a quick list of all the species of bat in the family Molossidae according to GBIF. The sets are the bat genera, the members are the species (you can see the file here). I then ran this through the cluster map, and got something like this (this is only part of the cluster map):

Bats

(now can you see why I call these "papaya plots"?). Note that there are species names (i.e., specific epithets) in common to more than one genus. Some of these may be perfectly OK (it's not unusual for the same epithet to be used in different species, e.g. "major", etc.). But in many cases these bat species turn out to be the same species, just in different genera in different classifications. For example, GBIF has both Cynomops greenhalli and Molossops greenhalli. These are the same thing. Species in the genus Mormopterus may also occur in other genera. In some cases the issue is competing classifications, sometimes it is conflict over whether a species is a species or merely a subspecies, and some generic conflicts are because some genera are relegated to subgeneric status in some classifications. In short, it's an unholy mess.

Does this matter? Well, consider Mormopterus petrophilus and Sauromys petrophilus, both of which GBIF regards as valid species (they're the same thing). Here are the distributions for the two different names in GBIF:

Depending on which name you use you'll get a very different picture of the distribution of this bat.

The next step is to figure out how to fix this. Is there a way we can automate fixing the GBIF classification so that it is not riddled with spurious duplicates like these?

Thursday, August 01, 2013

A use case for RDF in taxonomy

Readers of this blog will know that I'm sceptical about the current value of linked data and RDF in biodiversity informatics. But I came across an interesting paper on RDF and biocuration that suggests a good "use case" for RDF in constructing and curating taxonomic databases.

The paper is "Catching inconsistencies with the semantic web: a biocuration case study" (PDF here) by Jerven Bolleman and Sebastien Gehant. The basic idea is that errors in databases (in this case, UniProt) can be flagged by constructing queries in SPARQL that return results if there is a problem (for example if a sequence annotation is contradictory).

In recent posts I've been complaining about errors in the GBIF taxonomy, notably duplicate taxa that are synonyms. One way to tackle this would be to develop a set of SPARQL queries that we could use to flag potential problems. For example, if two names are objective synonyms then only one of them should be a node in the GBIF classification. If both exist then we have a problem. If we know a name is a homonym of an older name, but that name exists in the GBIF classification, then we could flag that as an issue. We could also construct queries that flag possible problems, even if we don't have precise information on synonymy. For example, in this post I noted that several frog species appear twice in the GBIF classification because GBIF has aggregated classifications that put these frogs in different genera. We could catch such cases by constructing a query to check whether the same species name (specific epithet) appeared in different genera within the same family.

The advantage of using RDF and SPARQL in this context is that the queries are portable. Assuming everyone uses the same vocabulary (e.g., the TDWG LSID vocabularies) then queries can be constructed by one person (e.g., me) and then used by anyone who has their data in a triple store. We could develop a set of "taxonomy tests" that anyone could apply to their database.

This idea needs some more work, but it would be fun to play with some data and see how many kinds of errors or issues we can catch in this way.

Wednesday, April 10, 2013

Time to put taxonomy into GitHub

Donald Hobern drew my attention to the nice way iNaturalist displays taxonomic splits:

In this example, observations identified as Rhipidura fuliginosa are being split into Rhipidura fuliginosa and Rhipidura albiscapa. This immediately reminds me of the idea which keeps circulating around, namely using version control tools to manage taxonomic classification. Some years ago David Shorthouse proposed managing taxonomic classifications using version control, see Taxonomic Consensus as Software Creation. I discussed this in Taxonomy on a hard disk, and Pierre Lindenbaum has an interesting post on treating the NCBI taxonomy as a file system A FUSE-based filesystem reproducing the NCBI Taxonomy hierarchy.

The idea is that a taxonomy, such as the GBIF backbone taxonomy, could be placed in GitHub where people could clone it, annotate, correct, edit, or otherwise mess with it, then GBIF could pull in those edits and release an updated, cleaner taxonomy. If software version control seems a bit esoteric, it's worth noting that use of GitHub is rapidly becoming much more mainstream in science, and not just for software development. People are using it to store versions of data analysis (e.g., https://github.com/dwinter/Fungal-Foray) and collaboratively write manuscripts (e.g., https://github.com/weecology/data-sharing-paper). The journal eLIFE is depositing articles there (e.g., https://github.com/elifesciences/elife-articles). In addition to all the infrastructure GitHub provides (the ability to identify who did what and when, to roll back changes, to fork classifications, etc.) there is also the attraction of not creating yet more software, but simply editing a classification by moving folders around on your local filesystem. The idea seems irresistible…

Friday, March 15, 2013

BioNames: yet another taxonomic database

Yet another taxonomic database, this time I can't blame anyone else because I'm the one building it (with some help, as I'll explain below).

BioNames was my entry in EOL's Computable Data Challenge (you can see the proposal here: http://dx.doi.org/10.6084/m9.figshare.92091). In that proposal I outlined my goal:

BioNames aims to create a biodiversity “dashboard” where at a glance we can see a summary of the taxonomic and phylogenetic information we have for a given taxon, and that information is seamlessly linked together in one place. It combines classifications from EOL with animal taxonomic names from ION, and bibliographic data from multiple sources including BHL, CrossRef, and Mendeley. The goal is to create a database where the user can drill down from a taxonomic name to see the original description, track the fate of that name through successive revisions, and see other related literature. Publications that are freely available will displayed in situ. If the taxon has been sequenced, the user can see one or more phylogenetic trees for those sequences, where each sequence is in turn linked to the publication that made those sequences available. For a biologist the site provides a quick answer to the basic question “what is this taxon?”, coupled with with graphical displays of the relevant bibliographic and genomic information.

The bulk of the funding from EOL is going into interface work by Ryan Schenk (@ryanschenk), author of synynyms among other cool things. EOL's Chief Scientist Cyndy Parr (@cydparr) is providing adult supervision ("Chief Scientist", why can't I have a title like that?).

Development of BioNames is taking place in the open as much as we can, so there are some places you can see things unfold:

Key features and milestones are on Trello
Design details are on GitHub
Database is hosted by Cloudant
There is a (currently private) design document in Google Docs. I've posted a snapshot on FigShare (http://dx.doi.org/10.6084/m9.figshare.652203).

I've lots of terrible code scattered around which I am in the process of organising into something usable, which I'll then post on GitHub. Working with Ryan is forcing me to be a lot more thoughtful about coding this project, which is a good thing. Currently I'm focussing on building an API that will support the kinds of things we want to do. I'm hoping to make this public shortly.

The original proposal was a tad ambitious (no, really). Most of what I hope to do exists in one form or another, but making it robust and usable is a whole other matter.

As the project takes shape I hope to post updates here. If you have any suggestions feel free to make them. The current target is to have this "out the door" by the end of May.

Thursday, February 14, 2013

Rate of description of new animal species and that Taxatoy graph

As part of the discussion on whether legacy biodiversity literature matters a graph from the following paper came up:

Sarkar, I., Schenk, R., & Norton, C. N. (2008). Exploring historical trends using taxonomic name metadata. BMC Evolutionary Biology, 8(1), 144. doi:10.1186/1471-2148-8-144

.@rmounce @caseybergman Sarkar et al. graph bogus dx.doi.org/10.1186/1471-2… see organismnames.com/metrics.htm?pa… Also we need to define "legacy"
— Roderic Page (@rdmpage) February 14, 2013

So, why is the Sarkar et al. graph bogus? Here is their graph (Fig. 3) for animals:

Taxatoy

This is the number of new animal species described each year, estimated by parsing taxonomic names and extracting the date in the taxonomic authority. There are two prominent "spikes" which are worrying. Sarkar et al. discuss the peak in 1994:

For example, the analyzed data indicate that a significant portion of the 1994 peak is due to an increase in descriptions of the family Cerambycidae, a large group of beetles.

So, 1994 was a bumper year for describing new species of Cerambycidae? Not quite. Taxatoy is based on names in uBio, and I have a local copy of most of these names. The Cerambycidae names contain lots of duplicate names that differ only in taxon authority. For example, searching the name Ancylocera macrotela on uBio finds:


Ancylocera macrotela	
Ancylocera macrotela Aurivillius, 1912	
Ancylocera macrotela BATES Henry Walter, 1880	
Ancylocera macrotela Bates, 1880	
Ancylocera macrotela Bates, 1885	
Ancylocera macrotela Blackwelder, 1946	
Ancylocera macrotela Chemsak & Linsley, 1970	
Ancylocera macrotela Chemsak, 1963	
Ancylocera macrotela Chemsak, 1964	
Ancylocera macrotela Chemsak, Linsley & Mankins, 1980
Ancylocera macrotela Chemsak, Linsley & Noguera, 1992
Ancylocera macrotela Lameere, 1883	
Ancylocera macrotela Maes & al., 1994	
Ancylocera macrotela Monné & Giesbert, 1994	
Ancylocera macrotela Monné, 1994	
Ancylocera macrotela Noguera & Chemsak, 1996	
Ancylocera macrotela Viana, 1971

These names are chresonyms. The original name is Ancylocera macrotela Bates, 1880 (you can see first publication of this name in BHL), the rest are subsequent citations of that name (gotta love taxonomy...).

Why the spike in 1994? I suspect that this is due to the publication in 1994 of "Checklist of the Cerambycidae and Disteniidae (Coleoptera) of the Western Hemisphere" by Miguel A Monné and Edmund F Giesbert. At least 8552 names from that checklist seem to have ended up in uBio, all with the date "1994". So the spike is an artefact. Similarly, the other peak (1912) corresponds to the publication of a checklist by Per Olof Christopher Aurivillius, which contributes over 3000 names.
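To see how chresonyms inflate the counts, here is a rough Python sketch. This is not how Taxatoy works. The split_name and earliest_year functions are hypothetical, the sketch assumes every name is a binomial followed by an authority ending in a year, and taking the earliest year as the date of the original description is itself only a heuristic.

```python
import re
from collections import Counter

# Assumption: the year is the last thing in the name string
YEAR_AT_END = re.compile(r"\b(1[7-9]\d{2}|20\d{2})\s*$")


def split_name(name_string):
    """Split 'Genus species Authority, year' into (binomial, year or None)."""
    binomial = " ".join(name_string.split()[:2])
    match = YEAR_AT_END.search(name_string)
    return binomial, (int(match.group(1)) if match else None)


def earliest_year(name_strings):
    """Map each binomial to the earliest year seen among its name strings."""
    earliest = {}
    for name_string in name_strings:
        binomial, year = split_name(name_string)
        if year is None:
            continue
        if binomial not in earliest or year < earliest[binomial]:
            earliest[binomial] = year
    return earliest


# A subset of the uBio name strings listed above
names = [
    "Ancylocera macrotela",
    "Ancylocera macrotela Aurivillius, 1912",
    "Ancylocera macrotela Bates, 1880",
    "Ancylocera macrotela Maes & al., 1994",
    "Ancylocera macrotela Monné & Giesbert, 1994",
    "Ancylocera macrotela Monné, 1994",
]

# Naive count: every name string with a year counts as a name for that year
naive = Counter(year for _, year in map(split_name, names) if year is not None)

# Deduplicated count: one entry per binomial, dated by its earliest year
deduplicated = Counter(earliest_year(names).values())
```

Counted naively, this one species contributes three "new names" to 1994 and one to 1912. After deduplication it contributes a single name in 1880.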

One reason I was suspicious of the Taxatoy graph is that it doesn't look anything like the equivalent graph from the Index of Organism Names. After a bit of fussing I've grabbed data from the ION site, and from Taxatoy's Google Code repository and created the following chart:

Taxatoy version2

The data for this chart is on figshare http://dx.doi.org/10.6084/m9.figshare.156862. ION is an index of all new animal names, based on Zoological Record. I place more confidence in its data than data derived from uBio, but ION clearly has its own issues (such as the gap after 1850, and the uneven sampling of the early years of taxonomy). The key point is that arguments on the temporal distribution of taxonomic descriptions (and the value of legacy literature) need to be aware that the data used is in pretty poor shape.

Update 2013-02-23
Jose Antonio Gonzalez Oreja pointed out in an email that the values for ION that I used were a little higher than those that appear on the ION web site. My script for retrieving those values hadn't quite worked. I've uploaded the corrected data to Figshare http://dx.doi.org/10.6084/m9.figshare.156862, updated the diagram above, and put the web calls I used to fetch the data on GitHub https://gist.github.com/rdmpage/5019153. The story doesn't change, but it helps to have the correct data.

Wednesday, November 21, 2012

Species wait 21 years to be described - show me the data

Benoît Fontaine et al. recently published a study concluding that the average lag time between a species being discovered and subsequently described is 21 years.

Fontaine, B., Perrard, A., & Bouchet, P. (2012). 21 years of shelf life between discovery and description of new species. Current Biology, 22(22), R943–R944. doi:10.1016/j.cub.2012.10.029

The paper concludes:

With a biodiversity crisis that predicts massive extinctions and a shelf life that will continue to reach several decades, taxonomists will increasingly be describing from museum collections species that are already extinct in the wild, just as astronomers observe stars that vanished thousands of years ago.

This is a conclusion that merits more investigation, especially as the title of the paper suggests there is an appalling lack of efficiency (or resources) in the way we describe biodiversity. So, with interest I looked at the Supplemental Information for the data:

I was hoping to see the list of the 600 species chosen at random, the publication containing their original description, and the date of their first collection. Instead, all we have is a description of the methods for data collection and analysis. Where is the data? Without the data I have no way of exploring the conclusions or asking additional questions. For example, what is the distribution of date of specimen collection in each species? One could imagine situations where a number of specimens are recently collected, prompting recognition and description of a new species, and as part of that process rummaging through the collections turns up older, unrecognised members of that species. Indeed, if it takes a certain number of specimens to describe a species (people tend to frown upon descriptions based on single specimens) perhaps what we are seeing is the outcome of a sampling process where specimens of new species are rare, they take a while to accumulate in collections, and the distribution of collection dates will have a long tail.
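To make that last idea concrete, here is a toy simulation of such a sampling process. It is purely a sketch under made-up assumptions: specimens of a rare species arrive at random at a constant rate, and the species gets described once a fixed number of specimens has accumulated. The rate (0.5 specimens per year), the threshold (5 specimens) and the function names are arbitrary choices for illustration, not estimates from the paper.

```python
import random
import statistics

def simulate_lag(rate_per_year, specimens_needed, rng):
    """Years between the first specimen and the one that completes the series."""
    # Gaps between successive specimens are exponential (a Poisson process),
    # so the lag is the sum of (specimens_needed - 1) such gaps.
    return sum(rng.expovariate(rate_per_year)
               for _ in range(specimens_needed - 1))

def simulate_lags(n_species=20000, rate_per_year=0.5, specimens_needed=5, seed=1):
    """Simulate the lag for many hypothetical species with a fixed random seed."""
    rng = random.Random(seed)
    return [simulate_lag(rate_per_year, specimens_needed, rng)
            for _ in range(n_species)]

lags = simulate_lags()
mean_lag = statistics.mean(lags)
median_lag = statistics.median(lags)
print(f"mean lag: {mean_lag:.1f} years, median lag: {median_lag:.1f} years")
```

Under these assumptions the expected lag is (5 − 1) / 0.5 = 8 years, and the median comes out below the mean because the distribution is right-skewed. Whether anything like this applies to the real specimens is exactly the kind of question that needs the actual data.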

These are the sort of questions we could ask if we had the data, but the authors don't provide that. The worrying thing is that we are seeing a number of high-visibility papers that potentially have major implications for how we view the field of taxonomy but which don't publish their data. Another recent example is:

Joppa, L. N., Roberts, D. L., & Pimm, S. L. (2011). The population ecology and social behaviour of taxonomists. Trends in Ecology & Evolution, 26(11), 551–553. doi:10.1016/j.tree.2011.07.010

Biodiversity is a big data science; it's time we insisted on that data being made available.

Saturday, September 22, 2012

Touching the tree of life

Prompted by a conversation with Vince Smith at the recent Online Taxonomy meeting at the Linnean Society in London I've been revisiting touch-based displays of large trees. There are a couple of really impressive examples of what can be done.

Perceptive Pixel

I've blogged about this before, but came across another video that better captures the excitement of touch-based navigation of a taxonomy. Jeff Han of Perceptive Pixel (recently acquired by Microsoft) demos browsing an animal classification. The underlying visualisation is fairly straightforward, but the speed and ease with which you can interact with it clearly make it fun to use.

DeepTree

DeepTree comes from the Life on Earth lab, and there's a paper coming out by @blockflorian and colleagues (I was reminded of this project by @treevisproject):

Technique added: DeepTree (2012); Florian Block, Michael Horn, Brenda Phillips, Judy Diamond, Margaret Evans, Chia Shen researchgate.net/publication/23…
— treevis.net Project (@treevisproject) September 21, 2012

For technical details on the layout algorithm see https://lifeonearth.seas.harvard.edu/downloads/DeepTree.pdf. Below is a video of it in use:

Both of these are really nice, but what I really want is to have this on my iPad…

Friday, July 13, 2012

Sometimes the mess taxonomy creates drives me nuts

Playing with some sequence data I found numerous Plasmodium sequences from the following paper:

Werner, E. B., Taylor, W. R., & Holder, A. A. (1998). A Plasmodium chabaudi protein contains a repetitive region with a predicted spectrin-like structure. Molecular and Biochemical Parasitology, 94(2), 185–196. doi:10.1016/S0166-6851(98)00067-X (Note: Nucleotide sequence data reported in this paper are available in the EMBL, GenBank™ and DDBJ databases under the accession number U43145.)

These sequences (e.g., U43145) give the host as Thamnomys rutilans. You'd think it would be fairly easy to learn more about this animal, given that it hosts a relative of the cause of malaria in humans, and indeed there are a number of biomedical papers that come up in Google, e.g.:

Landau, I., & Chabaud, A. (1994). Advances in Parasitology (Vol. 33, pp. 49–90). Elsevier BV. doi:10.1016/S0065-308X(08)60411-X

Killick-Kendrick, R. (1968). Malaria parasites of Thamnomys rutilans (Rodentia, Muridae) in Nigeria. Bull World Health Organ. 1968; 38(5): 822–824. PMC2554675

Google also tells me that Thamnomys rutilans is an African rodent (e.g., 6.1.6. Rodent malaria), but NCBI has no sequences for "Thamnomys rutilans", and GBIF has no data on its distribution. If I search Mammal Species of the World I get (literally) "nothing found ...".

So, this is an African rodent, host to Plasmodium, and we know nothing about it? A bit of Googling, a trip to Wikipedia and Google Books reveals that Thamnomys rutilans is a synonym of Grammomys rutilans, but it is now called Grammomys poensis because the original name (Mus rutilans Peters 1876) is a junior ~~synonym~~ homonym of Mus rutilans Olfers, 1818 (simples). You can see the original description of Mus rutilans Peters 1876 in BioStor http://biostor.org/reference/105261 (this took some tracking down, but that's another story):

[Image: original description of Mus rutilans Peters 1876]

The original description of Mus rutilans Olfers, 1818 is cited by the paper "The description of a new species of South American hocicudo, or long-nose mouse, genus Oxymycterus (Sigmodontinae, Muroidea), with a critical review of the generic content" as:

Olfers, I. 1818. Bemerkungen zu Illiger's Ueberblick der Saugethiere nach ihrer Betheilung über die Welttheile rüchsichtlich der Südamerikanischen Arten (Species). In Eschwege, W. L., ed., Journal von Brasilien, Weimar, 15(2): 192-237.

This reference doesn't seem to be online.

The upshot of all this is that information about the host of Plasmodium chabaudi is hidden behind taxonomic name changes, and databases that one might expect to help simply don't. If names are the glue that links biodiversity data together then we need to get a lot better at making basic information about name changes accessible, otherwise we are creating islands of disconnected data.

Tuesday, May 15, 2012

EOL challenge draft proposal

In the spirit of the Would you give me a grant experiment? [1] here's the draft of a proposal I'm working on for the Computable Data Challenge. It's an attempt to merge taxonomic names, the primary literature, and phylogenetics into one all-singing, all-dancing website that makes it easy to browse names, see the publications relevant to those names, and see what, if anything, we know about the phylogeny of those taxa. It builds on a number of other projects I've been working on, most recently my efforts to link names to the primary literature. Comments welcome (the proposal deadline is next week).

The proposal is embedded below using Google's PDF viewer. If you can't see it, try logging into your Google account, or click here.

1. The answer from NERC was a resounding "no".