iPhylo: linking

Roderic D. M. Page

Showing posts with label linking. Show all posts

Friday, June 21, 2019

Messages from Melbourne: Towards linking all the things

I'm doing some work with Nicole Kearney (@nicolekearney) at the Melbourne Museum on the general theme of "linking all the things". It's the end of the first full week we've had, so here's a quick update of what we've been up to.

Brainstorming

The things we want to do are being captured as a project on GitHub. This is where we come up with ideas, comment on then, then try to figure out which ones can be done. So far there are three things we've made a serious start on.

Unpaywall

Unpaywall is a project by Impactstory. It is sort of a Sci-Hub without the legal issues (for the record, I think Alexandra Elbakyan's work on Sci-Hub is nothing short of heroic). Unpaywall scans open access archives for legal, freely available versions of articles and makes them easy to find. If you have Firefox or Chrome you can get a plugin that lights up if the paywall article you're looking at has a free version somewhere else.
Nicole has long wanted the BHL to provide data to Unpaywall, because BHL has open access versions of many papers relevant to taxonomy and biodiversity more broadly defined. After a bit of digging we figured out that Unpaywall didn't have access to BHL's data, so we've set about fixing that. We've got the data harvested, but we're still waiting for Unpaywall to process that data. So, for now, we're still waiting for the little green light to appear on pages such as this one: https://doi.org/10.1080/00222932208632640.

Adding taxonomic literature to Atlas of Living Australia

Part of "linking all the things" is making the taxonomic literature a first class citizen of biodiversity databases. It is frankly embarrassing to see how much better the scientific literature is handled by projects such as Wikipedia than scientific databases such as GBIF and the ALA. We've decided to try and do something about this by showing how easily the literature could be embedded into the existing ALA web site. Nicole crafted a mockup of the ALA names tab, and I wrote some code to make it "live". For example, if you click on this link you will see a list of publications for Pauropsalta herveyensis Owen & Moulds, 2016. Note that we have DOIs and links to BHL where ever possible (and we use Unpaywall's API to flag whether an article with a DOI is freely available). We want this literature (the primary evidence for what we know about a species) to be visible and accessible. The demo is powered by my Ozymandias project, but we hope to work out a mechanism for delivering the mapping between taxa and literature to ALA (and, indeed, anyone else) as a dataset.
Because Ozymandias only has data for animals, we've had to exclude plants from this demo. I'm frantically trying to figure out how to work with data in Australia's plant name databases to resolve this. I'm discovering that never mind having more than one name for the same species, taxonomists also delight in having many different ways of representing taxonomic information in their databases. So, plants will be a challenge.

Mapping taxonomists to ORCID and Wikidata

One reason for adding literature to taxonomic databases is to make the work of taxonomists more visible. One way to do this is to move beyond using only "dumb strings" as people names and linking taxonomists to their ORCIDs and to entries in Wikidata (this is something I touched on in Ozymandias, and David Shorthouse is doing on an epic scale in Bloodhound). We're playing with the idea of being able to generate a list of active taxonomists in Australia, linked to their identifiers and publications, solely based on querying Wikidata. The first step is to try and automate the initial mapping between taxonomists and Wikidata as much as possible, we've only just started looking at this.

Summary

It is early days, and we're still identifying things we could work on. As always, there are so manythings which could be done, we're hoping we can make progress on at least some of these in the next few weeks.

Friday, August 15, 2014

Some design notes on modelling links between specimens and other kinds of data

If we view biodiversity data as part of the "biodiversity knowledge graph" then specimens are a fairly central feature of that graph. I'm looking at ways to link specimens to sequences, taxa, publications, etc., and doing this across multiple data providers. Here are some rough notes on trying to model this in a simple way.

For simplicity let's suppose that we have this basic model:

Core

A specimen comes from a locality (ideally we have the latitude and longitude of that locality), it is assigned to a taxon, we have data derived from that specimen (e.g., one or more DNA sequences), and we have one or more publications about that specimen (e.g., a paper that publishes a taxon name for which the specimen is a type, or a paper that publishes a sequence for which the specimen is a voucher).

Ncbi

NCBI

In GenBank we have sequences that have accession numbers, and these are linked to taxa (identified by NCBI tax ids). A nice feature of sequence databases is that taxa are explicitly defined by extension, that is, a taxon is the set of sequences assigned to a given taxon. Most (but not all, see Miller et al. doi:10.1186/1756-0500-2-101) sequences are also linked to a publication, which will usually have a PubMed id (PMID), and sometimes a DOI. Many sequences are also georeferenced (see Guest post: response to "Putting GenBank Data on the Map"). Most sequences aren't linked to a voucher specimen, but there is the implict notion of a source (in RDF-speak, many specimens are "blank nodes" Blank nodes for specimens without URI). Some sequences are associated with a specimen that has a museum code, and some are explicitly linked to the specimen by a URL.

DNA barcodes

Barcodes, as represented in BOLD are similar to sequences in GenBank. We have explicit taxa ("BINs") each of which has a URL, some also having DOIs. Most barcodes are georeferenced. There's some ambiguity about whether the URL for a barcode record identifies the barcode sequence, the specimen, or both. There may be a voucher code for the specimen. Some barcodes are linked to publications, but not (as far as I can see) in the data obtained from the API. Some barcodes are linked to the corresponding record in GenBank (which may or may not be supressed, see Dark taxa even darker: NCBI pulls (some) DNA barcodes from GenBank (updated)).

Gbif

GBIF

At it's core GBIF has occurrence records (many of these are specimen-based, but the majority of data in GBIF is actually observation-based), each of which has a unique id, and which is linked to a taxon, also with a unique id. As with the sequence databases, a taxon is a set of occurrences that have been assigned to that taxon. Many records in GBIF are georeferenced. There are limited cross links to other database - some occurrences list associated GenBank sequences. Some GBIF occurrences actually are sequences (e.g., the European Molecular Biology Laboratory Australian Mirror and the soon to be indexed Geographically tagged INSDC sequences), and barcodes are also making their way into GBIF (e.g., Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data). Links to publications are limited.

Museum

Museums and herbaria

Some individual natural history collections which are online provide specimen-level web pages and URLs (some even have DOIs, see DOIs for specimens are here, but we're not quite there yet), and some museums list associated GenBank sequences. In the diagram I've not linked the specimens to a taxon, because most specimens are tagged by a name, not an explicit taxon concept (unlike NCBI, BOLD, or GBIF).

Literature

Literature databases (represented here by BioStor, but could be other sources such as ZooKeys, for example) may contain articles that mention specimen codes. These articles may also mention taxon names, and geographic localities (including coordinates) (see, for example, Linking GBIF and the Biodiversity Heritage Library. Mining text for names, specimens, and localities is fairly easy, but linking these together is harder (i.e., this specimen is of this taxon, and was found at this locality).

Linking together

If we have these separate sources and this trivial model, then we can imagine trying to tie information about the same specimen together across the different databases. Why might we want to do this. Here are three reasons:

Augmentation Combining information can enhance our understanding of a specimen. Perhaps a specimen in GBIF is a geographic outlier. A publication that mentions the specimen includes it in a new taxon, perhaps discovered by sequencing DNA extarcted from that specimen. Linking this information together resolves the problematic distribution.
Provenance What is the evidence that a particular specimen belongs to a particualr taxon, or was collected at a particular locality? If we connect specimens to the literature we we can review the evidence for ourselves. If we have sequences we can run BLAST, build a tree, and see if we should rethink our classification of that sequence. Imagine being able to browse GBIF and see the evidence for each dot on the map?
Citation Mentions in the literature, use as vouchers for DNA barcoding or other forms of sequencing can be thought of a "citation" of that specimen. Museums hosting that material could use metrics base don this to demonstrate the value of their collection (see also The impact of museum collections: one collection ≈ one Nobel Prize).

Making the links

All this is well and good, the trick is to actually make the links. Here things get horribly messy very quickly. Museum specimens are cited in inconsistent ways, we don't have widely used unique, resolvable specimen identifiers, and even if we did have these identifiers we don't have a global discovery mechanism for matching voucher codes to identifiers. GBIF would be an obvious part of a "global discovery mechanism" (bit like CrossRef but for specimens), GBIF can have multiple records for the same specimen. Sometimes this is because GBIF not only aggregates data from primary sources (such as museums) but also other aggregations which may themselves already include specimens harvested from primary sources. GBIF can also have multiple records because museums keep messing with their databases, try new variants of the Darwin Core triple, etc., resulting in records that look "new" to GBIF. Whole collections can be duplicate din this way.

One way to tackle this multiplicity of specimen records is to think in terms of "clusters" of specimens that are, in some sense, the same thing across multiple databases. For example, clustering a set of duplicated GBIF records together with the sequences derived from those specimens, perhaps including a DNA barcode, and a list of papers that mention that specimen. This is represented by the yellow bar through the diagram, it connects all the different pieces of information about a specimen into a single cluster. More *cough* later.

Thursday, January 17, 2013

Tight versus loose coupling

Following on from my previous post bemoaning the lack of links between biodiversity data sets, it's worth looking at different ways we can build these links. Specifically, data can be tightly or loosely coupled.

Tight coupling

Tight coupling uses identifiers. A good example is bibliographic citation, where we state that one reference cites another by linking DOIs. This makes it easy to store these links in a database, such as the Open Citations project which is exploring citation networks base don data from PubMed Cenral. Tight coupling also makes it easy to aggregate information from multiple sources. For example, one database may record citations of a paper, another may record citations of GenBank sequences, a third may record publication of taxonomic names. If all three databases use the same identifiers for the same publications (e.g., DOIs) we can combine them and potentially discover new things (for example, we could answer the question "how many descriptions of new species include sequence data?").

Loose coupling

In part this post has been prompted by a discussion I've been having with Paul Murray (@PaulMurrayCbr on his blog. Paul has added COinS to pages in the Australian Faunal Directory (AFD). These are snippets of HTML that encode a bibliographic reference as an OpenURL, and which browser extensions such as OpenURL Referrer for Firefox and COinS 2 OpenURL for Chrome can convert into links.

I've mapped many of the references in AFD to standard identifiers such as DOIs, or to digital libraries such as BioStor, and this tightly-coupled mapping is available in AFD on CouchDB. To date these mappings haven't been imported into AFD itself, which means that users of the original site don't have easy access to the literature that appears on that site (basically they'll have to Google each reference). However, if they have a browser extension (or the Javascript bookmarklet available from http://iphylo.org/~rpage/afd/openurl) that supports COinS, they will now see a clickable link that, in many cases, will take them to the online version of the corresponding reference.

This is an example of loose linking. The AFD site provides OpenURL links which can be resolved "just in time". Users of the AFD site can get some of the benefits of the tight linking stored in my CouchDB version of AFD, but the maintainers of AFD itself don't need to add code to handle these identifiers.

A lot of linking of biodiversity data shares this pattern. Instead of linking identifiers, one site links to another through a query. For example, NCBI taxonomy links to GBIF using URLs of the form "http://data.gbif.org/search/<taxon name>". Linking by query is potentially more robust than simply linking by URLs, especially if the target of the link doesn't ensure its identifiers are stable (GBIF, I'm looking at you). But there may be multiple ways to construct the same search query, which makes them poor candidates for use as identifiers. COinS are perhaps an extreme example, where there are at least two versions of the OpenURL standard in the wild, and the key-value pairs that make up the query can be in any order.

If the goal is to integrate data then having the same identifiers for the same thing make life a lot simpler, and means that we can switch from endless data cleaning and matching ("is this citation the same as that one?") to building systems that can tackle some of the scientific questions we are interested in. But in their absence we are left a kind of defensive programming where we expect the links to fail. Loose linking creates "soft links" that may work for humans (we get to click on a link and, with luck, see a web page) but they are less useful for mechanised tools trying to aggregate data.

When tight=loose

Although I've distinguished between tight and loose coupling, the distinction is not absolute. Indeed, one could argue that the best "tight" coupling is a form of "loose" coupling. For example, the most obvious form of tight linking is to use URLs for the things of interest. This is simple and direct, but has draw backs for both publisher and consumer. For the consumer, we are now at the mercy of the publisher's ability to keep the URLs stable. If they change (for example, publishing firm is bought by another firm, or adopts new publishing platform which generates different URLs) then the links break (not to mention that URLs for some resources, such as articles, are often conditional on how you are accessing the article, and may contain extraneous cruff such as session ids, etc.).

Likewise, the publisher is now constrained by a decision it made at the time of publication. If it decides to adopt better technology, or if circumstances otherwise change, it may find itself having to break existing identifiers. Some of this can be avoided if we designed clean URLs, such as this example http://data.rbge.org.uk/herb/E00001195 given by Roger Hyam. However, I wonder how persistent the ".uk" part of this URL will be if the Royal Botanic Garden Edinburgh finds itself in a Scotland that is no longer part of the United Kingdom.

One solution is our old friend indirection, where we put an identifier in between the consumer and the actual URL of the resource, and the consumer uses that identifier. This is the rationale for DOIs. The user gets an identifier that is unlikely to change, and hence can build systems upon that identifier. The publisher knows that they can change how they serve the corresponding data without disrupting their users, so long as they update the URL that the DOI points to. Indirection gives users the appearance of tight coupling without imposing the constraints of tight coupling on publishers.

Tuesday, January 15, 2013

iDigBio: You are putting identifiers on the wrong thing

The Integrated Digitized Biocollections (iDigBio) project aims to advance digitising US biodiversity collections. They recently published a GUID Guide for Data Providers. In the PDF document I read this:

It has been agreed by the iDigBio community that the identifier represents the digital record (database record) of the specimen not the specimen itself. Unlike the barcode that would be on the physical specimen, for instance, the GUID uniquely represents the digital record only. (emphasis added)

My heart sank. There's nothing wrong with having identifiers for metadata (apart from inviting the death spiral that is metadata about metadata), but surely the key to integrating specimens with other biodiversity data is to have globally unique identifiers for the specimens.

Now, identifiers for metadata can be useful. For example, there is a specimen of Parathemisto japonica in the National Museum of Natural History, Smithsonian Institution with the label "USNM 100988". The NMNH web site has a picture of the index card for this specimen:

Search php

This is an image of the metadata, not the specimen itself. We could link the metadata to this image, but of course we also want to link it to the actual specimen.

Specimens are the things we collect, preserve, dissect, measure, sequence, photograph, and so on. I want to link a specimen to the sequences that have been obtains from that specimen, I want to list the publications that cite that specimen, I want to be able to aggregate data on a specimen from multiple sources, I want to be able to add annotations including misidentifications, simple typos, or missing georeferencing.

Key to this is having identifiers for specimens. Identifiers for metadata about those specimens is not good enough. By analogy with bibliographic citation, one of the important decisions CrossRef made was that DOIs for articles identify the article, not the metadata about the article, or any of the different formats (HTML, PDF, print) and article may occur in. This means we can build databases about things and relationships (this article cites that one, these articles were authored by this person, etc.).

As it stands, if we don't have identifiers for specimens then we can't link data together. For example, the frog specimen "USNM 195785" is depicted in the image below (from EOL):

89351 orig

It is also listed in various papers in BioStor. In the absence of a globally unique identifier for this specimen how do I make these links? "USNM 195785" won't do because there are at least four specimens in the USNM with the catalogue number "195785". The GBIF occurrence id for this specimen (http://data.gbif.org/occurrences/244405570) would be an obvious candidate, were it not for the fact that GBIF has no concept of stable identifiers and its occurrence ids regularly change.

I confess I'm flabbergasted that iDigBio has avoid tackling the issue of specimen identifiers. If any museum wants to discover how its collection is being used to support science it will want to find the citations of its specimens in scientific papers and databases. This requires identifiers for specimens.

Tuesday, July 24, 2012

Dear GBIF, please stop changing occurrenceIDs!

If we are ever going to link biodiversity data together we need to have some way of ensuring persistent links between digital records. This isn't going to happen unless people take persistent identifiers seriously.

I've been trying to link specimen codes in publications to GBIF, with some success, so imagine my horror when it started to fall apart. For example, I recent added this paper to BioStor:

A remarkable new asterophryine microhylid frog from the mountains of New Guinea. Memoirs of The Queensland Museum 37: 281-286 (1994) http://biostor.org/reference/105389

This paper describes a new frog (i>Asterophrys leucopus) from New Guinea, and BioStor has extracted the specimen code QM J58650 (where "QM" is the abbreviation for Queensland Museum), which according to the local copy of GBIF data that I have, corresponds to http://data.gbif.org/occurrences/363089399/. Unfortunately, if you click on that link GBIF denies all knowledge (you get bounced to the search page). After a bit of digging I discover that specimen is now in GBIF as http://data.gbif.org/occurrences/478001337/. At some point GBIF has updated its data and the old occurrenceID for QM J58650 (363089399) has been deleted. Noooo!

Looking at the old record I have there is an additional identifier:

urn:catalog:QM: Herpetology:J58650

This is a URN, and it's (a) unresolvable and (b) invalid as it contains a space. This is why URNs are useless. There's no expectation they will be resolvable hence there's no incentive to make sure they are correct. It's as much use as writing software code but not bothering to run it (because surely it will work, no?).

The GBIF record http://data.gbif.org/occurrences/478001337/ contains a UUID as an alternative identifier:

bc58ce6b-3cc3-459a-9f5b-4a70a026afbe

If you Google this you discover a record in the Atlas of Living Australia http://biocache.ala.org.au/occurrences/bc58ce6b-3cc3-459a-9f5b-4a70a026afbe, which also lists the URN from the now deleted GBIF record http://data.gbif.org/occurrences/363089399/.

I'm guessing that at some point the OZCAM data provided to GBIF was updated and instead of updating data for existing occurrenceIDs the old ones were deleted and new ones created (possibly because OZCAM switched from URNs to UUIDs as alternative identifiers). Whatever the reason, I will now need to get a new copy of GBIF occurrence data and repeat the linking process. Sigh.

If we are ever going to deliver on the promise of linking biodiversity data together we need to take identifiers seriously. Meantime I need to think about mechanisms to handle links that disappear on a whim.

Friday, July 13, 2012

Sometimes the mess taxonomy creates drives me nuts

Playing with some sequence data I found numerous Plasmodium sequences from the following paper:

Werner, E. B. ., Taylor, W. R., & Holder, A. A. (1998). A Plasmodium chabaudi protein contains a repetitive region with a predicted spectrin-like structure1Note: Nucleotide sequence data reported in this paper are available in the EMBL, GenBank™ and DDJB databases under the accession number U43145.1. Molecular and Biochemical Parasitology, 94(2), 185–196. doi:10.1016/S0166-6851(98)00067-X

These sequences (e.g., U43145) give the host as Thamnomys rutilans. You'd think it would be fairly easy to learn more about this animal, given that it hosts a relative of the cause of malaria in humans, and indeed there are a number of biomedical papers that come up in Google, e.g.:

Landau, I., & Chabaud, A. (1994). Advances in Parasitology (Vol. 33, pp. 49–90). Elsevier BV. doi:10.1016/S0065-308X(08)60411-X

Killick-Kendrick, R. (1968). Malaria parasites of Thamnomys rutilans (Rodentia, Muridae) in Nigeria. Bull World Health Organ. 1968; 38(5): 822–824. PMC2554675

Google also tells me that Thamnomys rutilans is an African rodent (e.g., 6.1.6. Rodent malaria, but NCBI has no sequences for "Thamnomys rutilans", and GBIF has no data on its distribution. If I search Mammal Species of the World I get (literally) "nothing found ...".

So, this is an African rodent, host to Plasmodium, and we know nothing about it? A bit of Googling, a trip to Wikipedia and Google Books reveals that Thamnomys rutilans is a synonym of Grammomys rutilans, but it is now called Grammomys poensis because the original name (Mus rutilans Peters 1876) is a junior ~~synonym~~ homonym of Mus rutilans Olfers, 1818 (simples). You can see the original description of Mus rutilans Peters 1876 in BioStor http://biostor.org/reference/105261 (this took some tracking down, but that's another story):

4ca1a4521753bde9a091661c7694f8ae

The original description of Mus rutilans Olfers, 1818 is given by The description of a new species of South American hocicudo, or long-nose mouse, genus Oxymycterus (Sigmodontinae, Muroidea), with a critical review of the generic content as:

Olfers, I. 1818. Bemerkungen zu Illiger's Ueberblick der Saugethiere nach ihrer Betheilung über die Welttheile rüchsichtlich der Südamerikanischen Arten (Species). In Eschwege, W. L., ed., Journal von Brasilien, Weimar, 15(2): 192-237.

This reference doesn't seem to be online.

The upshot of all this information about the host of Plasmodium chabaudi is hidden behind taxonomic name changes, and databases that one might expect to help simply don't. If names are the glue that link biodiversity data together then we need to get a lot better at making basic information about name changes accessible, otherwise we are creating islands of disconnected data.

Saturday, June 02, 2012

Linking NCBI taxonomy to GBIF

@rdmpage, given an NCBI taxon ID, how to get GBIF occurrence records via web service?
— Rutger Vos (@rvosa) May 30, 2012

In response to Rutger Vos's question I've started to add GBIF taxon ids to the iPhylo Linkout website. If you've not come across iPhylo Linkout, it's a Semantic Mediawiki-based site were I maintain links between the NCBI taxonomy and other resources, such as Wikipedia and the BBC Nature Wildlife finder. For more background see

Page, R. D. M. (2011). Linking NCBI to Wikipedia: a wiki-based approach. PLoS Currents, 3, RRN1228. doi:10.1371/currents.RRN1228

I'm now starting to add GBIF ids to this site. This is potentially fraught with difficulties. There's no guarantee that the GBIF taxonomy ids are stable, unlike NCBI tax_ids which are fairly persistent (NCBI publish deletion/merge lists when they make changes). Then there are the obvious problems with the GBIF taxonomy itself. But, if you want a way to generate a distribution map for a taxon in the NCBI taxonomy, the quickest way is going to be via GBIF.

The mapping is being made automatically, with some crude checks to try and avoid too many erroneous links (e.g., due to homonyms). It will probably take a few days to complete (the mapping is quick, uploading to the wiki is a bit slower). Using a wiki to manage the mapping makes it easy to correct any spurious matches.

As an example, the page http://iphylo.org/linkout/Ncbi:109175 is for the frog Hyla japonica (NCBI tax_id 109175) and shows links to Wikipedia (http://en.wikipedia.org/wiki/Japanese_Tree_Frog, and to GBIF (http://data.gbif.org/species/2427601/). There's even a link to TreeBASE. I display a GBIF map so you can see what data GBIF currently has for that taxon.

Hyla

So, we have a wiki page, how do we answer Rutger's original question: how to get GBIF occurrence records via web service?

To do this we can use the RDF output by the Semantic Mediawiki software that underpins the Wiki. You can gte this by clicking on the RDF icon near the bottom of the page, or go to http://iphylo.org/linkout/Special:ExportRDF/Ncbi:109175. The RDF this produces is really, really ugly (and people wonder why the Semantic Web has been slow to take off...). In this RDF you will see the statement:


<rdfs:seeAlso rdf:resource="http://data.gbif.org/species/2427601/"/>

So, arm yourself with XPath, a regular expression, or if you are a serious RDF geek break out the SPARQL, and you can extract the GBIF taxon id for a NCBI taxon. Given that id you can query the GBIF web services. One service that I like is the occurrence density service, which you can use to recreate the 1°×1° density maps shown by GBIF. For example, http://data.gbif.org/ws/rest/density/list?taxonconceptkey=2427601 will get you the squares shown in the screen shot above.

Of course, I have glossed over several issues, such as the errors and redundancy in the GBIF classification, the mismatch between NCBI and GBIF classifications (NCBI has many more ranks than GBIF), and whether the taxon concepts used by the two databases are equivalent (this is likely to be more of an issue for higher taxa). But it's a start.

Tuesday, March 27, 2012

BHL and GBIF as biomedical databases

When I think of the Biodiversity Heritage Library (BHL) or GBIF I tend to think of taxonomy and biodiversity. Folk wisdom has it that BHL is full of old books, mostly pre-1923. Great for finding old taxonomic names, or nice artwork, but not exactly "modern" biology. GBIF is mainly about displaying organism distributions based on museum specimens, the primary data of taxonomic research. Again, great stuff, but aren't museums simply full of dead stuff that people have collected and forgotten about?

But BHL has a lot more post-1923 content than I suspect most people realise (several museum or society journals have 21st century issues in BHL's archives, for example). Continuing the theme of linking BHL and GBIF content, as part of a forthcoming project on taxonomic names (to be made available "real soon now") I stumbled across this 1976 paper in BHL (now in BioStor):

Monograph on "Lithoglyphopsis" aperta, the snail host of Mekong River Schistosomiasis by Davis et al..

Malacologia157576inst 0263

This paper has been indexed in PubMed (PMID:948206, but as far as I'm aware, BHL (and BioStor) has the only digital copy of this paper. (As a side note, wouldn't it be great if PubMed could link to BHL content?).

The article page in BioStor shows a map derived from the OCR text, showing a two localities:

Mekong

Below the map are the specimen codes I've automatically extracted from the OCR text, linked to the corresponding records in GBIF, which are georeferenced (e.g., ANSP Malacology 330925).

If we joined these things up just a little more, we could do some useful things. For example, what if a researcher searching in PubMed for schistosomiasis in South East Asia could find the Davis et al. paper, and then go to BHL or BioStor to read it? What if a researcher looking at gastropod distributions in the Mekong River in the GBIF portal could see that BHL had publications on diseases associated with these organisms (as well as their taxonomy and biology). We could also traverse the link from GBIF to BHL to PubMed and provide a direct route from distribution maps to biomedical literature.

It seems there's scope for trying to connect BHL, GBIF, and PubMed, and that BHL and GBIF may have important roles to play in providing access to basic information about organisms that have a serious impact on human populations.

Monday, February 27, 2012

Linking GBIF and the Biodiversity Heritage Library

Following on from exploring links between GBIF and GenBank here I'm going to look at links between GBIF and the primary literature, in this case articles scanned by the Biodiversity Heritage Library (BHL). The OCR text in BHL can be mined for a variety of entities. BHL itself has used uBio's tools to identity taxonomic names in the OCR text, and in my BioStor project I've extracted article-level metadata and geographic co-ordinates. Given that many articles in BioStor list museum specimens I wrote some code to extract these (see Extracting museum specimen codes from text) and applied this to the OCR text for those articles.

Having a list of specimens is nice, but in this digital age I want to be able to find out more about these specimens. An obvious solution is try and match these specimen codes to the specimen records held by GBIF. Linking to GBIF is complicated by the fact that museum codes are not unique. For example, "FMNH 147942" could refer to a bird, an amphibian, or a mammal. To tackle the non uniqueness I use the taxonomic names extracted from each page by BHL to work out what taxon an article is mainly "about". To do this I use the Catalogue of Life classification to get "paths" for each name (i.e., the lineage of each taxon down to the root of the classification) and then find the majority-rule path. You can see these paths in the "Taxonomic classification" displayed on a page for a BioStor article. If there are multiple GBIF specimens for the same code I test whether the taxon or rank "class" in the GBIF record is in the majority-rule path for the article. If so, I accept that specimen as the match to the code.

There are also issues where the specimen codes in GBIF have been modified during input (e.g., USNM 730715 has become USNM 730715.457409). There are also the inevitable OCR errors that may cause museum codes to be missed or otherwise corrupted. Bearing all this in mind, BioStor now has specimen pages (these are still being generated as I write this). For example, the page for FMNH 147942 lists the three articles in BioStor that cite this specimen code:

Fmnh147942

All three specimens have been mapped on to GBIF occurrence http://data.gbif.org/occurrences/61846037/. When BioStor displays the articles it now lists the specimen codes that have been extracted from the article, together with the GBIF logo if the specimen has been matched to a GBIF record. For example, here is a screenshot from Deep-water octopods (Mollusca: Cephalopoda) of the northeastern Pacific:
Deepwater

The map has been extracted from the OCR text (an obvious next step would be to add localities associated with the specimen records). Below the map are the specimen codes. The lack of some USNM specimens is probably due to misinterpreted specimen codes, whereas the CAS specimens don't seem to be online (the California Academy of Sciences has some of its collections in GBIF, but not its molluscs).

Where next?
Once these links between BioStor (and hence, BHL) and GBIF are created then we can do some interesting things. If you visit BioStor and want to learn more about a specimen you can click on the link an view the record in GBIF. We could also envisage doing the reverse. GBIF could augment the information it displays about a specimen by displaying a link to the content in BioStor (e.g., "this specimen is cited by these articles"). Those articles may contain further information about that specimen (for example, the habitat it was collected from, how secure is its identification, and so on).

We could also start to compute the "impact" of different museum collections based on the number of citations of specimens from their collections (this idea is explored further in this paper: http://dx.doi.org/10.1093/bib/bbn022, free preprint available here: hdl:10101/npre.2008.1760.1).

All of this works because we are linking objects (in this case articles and specimens) via their identifiers. Consequently, the links are as stable as their identifiers, which is why I've been pursuing the issue of specimen identifiers recently (see here, here, and here). If GBIF maintains the URLs for the specimens I've linked to, then links I've created could persist. If these URLs are likely to change (e.g., because the metadata from the host institution has changed) then the links (and any associated value we get from them) disappear. This is why I want globally unique, resolvable, persistent identifiers for specimens.

Tuesday, February 21, 2012

Linking GBIF and Genbank

As part of my mantra that it's not about the data, it's all about the links between the data, I've started exploring matching GenBank sequences to GBIF occurrences using the specimen_voucher codes recorded in GenBank sequences. It's quickly becoming apparent that this is not going to be easy. Specimen codes are not unique, are written in all sorts of ways, there are multiple codes for the same specimen (GenBank sequences may be associated with museum catalogue entries, or which field or collector numbers).

So why undertake what is fast looking like a hopeless task? There are several reasons:

GBIF occurrences have a unique URL which we could potentially use as a unique, resolvable identifier for the corresponding specimen.
Linking GenBank to GBIF would make it possible for GBIF to list sequences associated with a specimen, as well as the associated publication, which means we could demonstrate the "impact" of a specimen. In the simplest terms this could be the number of sequences and publications that use data from the specimen, more sophisticated approaches could use PageRank-like measures, see hdl:10101/npre.2008.1760.1.
Having a unique identifier that is shared across different databases makes it easier to combine data from different sources. For example, if a sequence in GenBank lacks geographic coordinates but the voucher specimen in GBIF is georeferenced, we can use that information to locate the sequence in geographic space (and hence build geophylogenies or add spatial indexes to databases such as TreeBASE). Conversely, if the GenBank sequence is georeferenced but the GBIF record isn't we can update the GBIF record and possibly expand the range of the corresponding taxon (this was part of the motivation behind hdl:10101/npre.2009.3173.1.

As an example, below is the GBIF 1° density map for the frog Pristimantis ridens from GBIF, with the phylogeny from Wang et al.Phylogeography of the Pygmy Rain Frog (Pristimantis ridens) across the lowland wet forests of isthmian Central Americahttp://dx.doi.org/10.1016/j.ympev.2008.02.021 layered over it. I created the KML tree from the corresponding tree in TreeBASE using the tool I described earlier. You can grab the KML for the tree here.

Density

As we'd expect, there is a lot of overlap in the two sources of data. If we investigate further, there are records that are in fact based on the same specimen. For example, if we download the GBIF KML file with individual placemarks we see that in the northern part of the range their are 15 GBIF occurrences that map onto the same point as one of the terminal taxa in the tree.

Gbif

One of these 15 GBIF records (http://data.gbif.org/occurrences/244335848) is for specimen USNM 514547, which is the voucher specimen for EU443175. This gives us a link between the record in GBIF and the record in GenBank. It also gives us a URI we can use for the specimen http://data.gbif.org/occurrences/244335848 instead of the unresolvable and potentially ambiguous USNM 514547.

If we view the geophylogeny from a different vantage point we see numerous localities that don't have occurrences in GBIF.

Nogbif

Close inspection reveals that some of the specimens listed in the Wang et al. paper are actually in GBIF, but lack geographic coordinates. For example the OTU "Pristimantis ridens Nusagandi AJC 0211" has the voucher specimen FMNH 257697. This specimen is in GBIF as http://data.gbif.org/occurrences/57919777/, but without coordinates, so it doesn't appear on the GBIF map. However, both the Wang et al. paper and the GenBank record for the sequence from this specimen EU443164 give the latitude and longitude. In this example, GBIF gives us a unique identifier for the specimen, and GenBank provides data on location that GBIF lacks.

Part of GBIFs success is due to the relative ease of integrating data by taxonomic names (despite the problems caused by synonyms, homonyms, misspellings, etc.) or using spatial coordinates (which immediately enables integration with environmental data. But if we want to integrate at deeper levels then specimen records are the glue that connects GBIF (and its contributing data sources) to sequence databases, phylogenies, and the taxonomic literature (via lists of material exampled). This will not be easy, certainly for legacy data that cites ambiguous specimen codes, but I would argue that the potential rewards are great.

Thursday, November 24, 2011

BHL needs to engage with publishers (and EOL needs to link to primary literature)

Browsing EOL I stumbled upon the recently described fish Protoanguilla palau, shown below in an image by rairaiken2011:

Two things struck me, the first is that the EOL page for this fish gives absolutely no clue as to where you would to find out more about this fish (apart from an unclickable link to the Wikipedia page http://en.wikipedia.org/wiki/Protoanguilla - seriously, a link that isn't clickable?), despite the fact this fish has been recently described in an Open Access publication ("A 'living fossil eel (Anguilliformes: Protanguillidae, fam. nov.) from an undersea cave in Palau", http://dx.doi.org/10.1098/rspb.2011.1289).

Now that I've got my customary grumble about EOL out of the way, let's look at the article itself. On the first page of the PDF it states:

This article cites 29 articles, 7 of which can be accessed free
http://rspb.royalsocietypublishing.org/content/early/2011/09/16/rspb.2011.1289.full.html#ref-list-1

So 22 of the articles or books cited in this paper are, apparently, not freely available. However, looking at the list of literature cited it becomes obvious that rather more of these citations are available online than we might think. For example, there are articles that are in the Biodiversity Heritage Library (BHL), e.g.

Regan C. T. 1912 The osteology and classification of the teleostean fishes of the order Apodes. Ann. Mag. Nat. Hist. Ser. 8, 377–387 http://biostor.org/reference/98316, http://www.biodiversitylibrary.org/page/15618586, http://dx.doi.org/10.1080/00222931208693250
McCosker J. E. 1977 The osteology, classification, and relationships of the eel family Ophichthidae. Proc. Calif. Acad. Sci. 41, 1–123 http://biostor.org/reference/59597

http://www.biodiversitylibrary.org/page/15691453

Then there are articles that are available in other digitising projects

Hay O. P. 1903 On a collection of Upper Cretaceous fishes from Mount Lebanon, Syria, with descriptions of four new genera and nineteen new species. Bull. Am. Mus. Nat. Hist. N. Y. 19, 395–452. http://hdl.handle.net/2246/1500
Nelson G. J. 1966 Gill arches of fishes of the order Anguilliformes. Pac. Sci. 20, 391–408. http://hdl.handle.net/10125/7805

Furthermore, there are articles that aren't necessarily free, but which have been digitised and have DOIs that have been missed by the publisher, such as the Regan paper above, and

Trewavas E. 1932 A contribution to the classification of the fishes of the order Apodes, based on the osteology of some rare eels. Proc. Zool. Soc. Lond. 1932, 639–659. http://dx.doi.org/10.1111/j.1096-3642.1932.tb01089.x

So, the Proceedings of the Royal Society has underestimated just how many citations the reader can view online. The problem, of course, is how does a publisher discover these additional citations? Some have been missed because of sloppy bibliographic data. The missing DOIs are probably because the Regan citation lacks a volume number, and the Trewavas paper uses a different volume number to that used by Wiley (who digitised Proc. Zool. Soc. Lond.). But the content in BHL and other digital archives will be missed because finding these is not part of a publisher's normal workflow. Typically citations are matched by using services ultimately provided by CrossRef, and the bulk of BHL content is not in CrossRef.

So it seems there's an opportunity here for someone to provide a service for publishers that adds value to their content in at least three ways:

Add missing DOIs due to problematic citations for older literature
Add links to BHL content
Add links to content in additional digitisation projects, such as journal archives in DSpace respositories

For readers this would enhance their experience (more of the literature becomes accessible to them), and for BHL and the repositories it will drive more readers to those repositories (how many people reading the paper on Protoanguilla palau have even heard of BHL?). I've said most of this before, but I really think there's an opportunity here to provide services to the publishing industry, and we don't seem to be grasping it yet.

Thursday, October 27, 2011

Linking taxonomic names to literature: beyond digitised 5×3 index cards

Tomorrow is the Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond meeting. It should be an interesting gathering, albeit overshadowed by the sudden death of Frank Bisby.

I'm giving a talk entitled "Open Taxonomy", in which I argue that most taxonomic databases are little more than digitised collections of 5×3 index cards, where literature is treated as dumb citation strings rather than as resources with digital identifiers. To make the discussion concrete I've created a mapping between the Index to Organism Names (ION) database and a range of bibliographic sources, such as CrossRef (for DOIs), BioStor, JSTOR, etc.

This mapping is online at http://iphylo.org/~rpage/itaxon/.

So far I've managed to link some 200,000 animal names to a literature identifier, and a good fraction of these articles are freely available, either as images in BioStor and Gallica (for I've created a simple viewer) or as PDFs (which are displayed using Google Docs.

Some examples are:

The site is obviously a work in progress, and there's a lot to be done to the interface, but I hope it conveys the key point: a significant fraction of the primary taxonomic literature is online, and we should be linking to this. The days of digitised 5×3 index cards are past.

Wednesday, October 19, 2011

TDWG Challenge - what is RDF good for?

Last month, feeling particularly grumpy, I fired off an email to the TDWG-TAG mailing list with the subject Lobbing grenades: a challenge. Here's the email:

It's morning and the coffee hasn't quite kicked in yet, but reading through recent TDWG TAG posts, and mindful of the upcoming meeting in New Orleans (which sadly I won't be attending) I'm seeing a mismatch between the amount of effort being expended on discussions of vocabularies, ontologies, etc. and the concrete results we can point to.

Hence, a challenge:

"What new things have we learnt about biodiversity by converting biodiversity data into RDF?"

I'm not saying we can't learn new things, I'm simply asking what have we learnt so far?

Since around 2006 we have had literally millions of triples in the wild (uBio, ION, Index Fungorum, IPNI, Catalogue of Life, more recently Biodiversity Collections Index, Atlas of Living Australia, World Register of Marine Species, etc.), most of these using the same vocabulary. What new inferences have we made?

Let's make the challenge more concrete. Load all these data sources into a triple store (subchallenge - is this actually possible?). Perhaps add other RDF sources (DBpedia, Bio2RDF, CrossRef). What novel inferences can we make?

I may, of course, simply be in "grumpy old arse" mode, but we have millions of triples in the wild and nothing to show for it. I hope I'm not alone in wondering why...

In the context of the TDWG meeting (happening as we speak and which I'm following via Twitter, hashtag #tdwg) Joel Sachs asked me whether I had any specific data in mind that could form the basis of a discussion. So, here goes. I've assembled some small RDF data sets that it might be fun to play with. Each data set is for frogs, and I've divided them into two sets.

Primary data
These data sets are essentially unmodified RDF fetched from data providers:

uniprot.rdf Uniprot RDF for frogs in GenBank
ion.rdf Index of Organism Names (ION) RDF for taxonomic names for frogs (filtered to just those names that are also in GenBank, the RDF comes from ION LSIDs)
crossref.rdf CrossRef RDF for DOIs for publications that published new frog names (obtaining using CrossRef's support for Linked Data for DOIs)
dbpedia.rdf Dbpedia RDF for frogs in GenBank (Update 2011-10-20: the dbpedia.rdf file is a bit big, so here is subset.rdf which has just the conservation status and thumbnail image)

These sources give us information on genomics (at least, they tell us which taxa have been sequenced), where and when the original taxonomic description was published, and by whom, as well as some information on conservation status and what the frog looks like (via Dbpedia). Ideally we just load these files into a triple store and then ask a bunch of questions, such as what is the conservation status of frogs sequenced in Genbank?, is there correlation between the conservation status of a frog and the date it was discovered?, who has described the most frog species?, etc.

My contention is that actually we can't do any of this because the data is siloed due to the lack of shared identifiers and vocabularies (I suspect that there is not a single identifier any of these files share). The only way we can currently link these data sets together is by shared string literals (e.g., taxonomic names), in which case why bother with RDF? So my first challenge is to see whether any of the questions I've just listed can actually be tackled using this data.

Glue
In a slightly more constructive mode, to see if we can make progress I'm providing some additional RDF files, based on projects I'm working on to link data together. These files may help provide some of the missing "glue" to connect these data sets.

linkout.rdf The list of links between NCBI and Dbpedia (based on mapping in iPhylo LinkOut)
ion_doi.rdf A subset of publications listed in ION have DOIs, this file links the corresponding ION LSIDs to those DOIs (this file is from an ongoing project mapping names to primary literature)

The first file links the ION and CrossRef RDF, so we could start to ask questions about dates of discovery, who described what species, etc.. The second file links NCBI taxon ids (in this case in the form of UniProt URIs) to Wikipedia (in the form of Dbpedia URIs). Dbpedia has information on conservation status, and some frogs will also have pictures, so we can start to join genomics to conservation, as well as make some visualisations.

Update
I've now added another RDF file for 1000 georeferenced GenBank sequences for frogs. The file is genbank.rdf. This file is generated from a local, processed version of EMBL, and uses a mixture of Dublin Core and TDWG vocabularies. Here's an example of a single record:


<?xml version="1.0"?>
<rdf:RDF xmlns:dcterms="http://purl.org/dc/terms/" 
xmlns:geo="http://www.w3.org/2003/01/geo/wgs84_pos#" 
xmlns:owl="http://www.w3.org/2002/07/owl#" 
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" 
xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" 
xmlns:tcommon="http://rs.tdwg.org/ontology/voc/Common#" 
xmlns:toccurrence="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#" 
xmlns:uniprot="http://purl.uniprot.org/core/">
  <uniprot:Molecule rdf:about="http://bio2rdf.org/genbank:EU566842">
    <dcterms:created>2008-07-06</dcterms:created>
    <dcterms:modified>2010-12-23</dcterms:modified>
    <dcterms:title>EU566842</dcterms:title>
    <dcterms:description>Xenopus borealis voucher MHNG:Herp:2644.64 
cytochrome oxidase subunit I (COI) gene, partial cds; mitochondrial.</dcterms:description>
    <dcterms:subject rdf:resource="http://purl.uniprot.org/taxonomy/8354"/>
    <dcterms:relation rdf:parseType="Resource">
      <rdf:type rdf:resource="http://rs.tdwg.org/ontology/voc/TaxonOccurrence#TaxonOccurrence"/>
      <toccurrence:identifiedToString>Xenopus borealis</toccurrence:identifiedToString>
      <toccurrence:decimalLatitude>0.66</toccurrence:decimalLatitude>
      <geo:lat>0.66</geo:lat>
      <toccurrence:decimalLongitude>37.5</toccurrence:decimalLongitude>
      <geo:long>37.5</geo:long>
      <toccurrence:verbatimCoordinates>0.66 N 37.5 E</toccurrence:verbatimCoordinates>
      <toccurrence:country>Kenya</toccurrence:country>
      <dcterms:identifier>MHNG:Herp:2644.64</dcterms:identifier>
    </dcterms:relation>
  </uniprot:Molecule>
</rdf:RDF>

I've added this simply so one could do some geographical queries.

Missing links
There are still lots of missing links here (for example, there's no explicit link between NCBI and ION, so we'd need to create this using taxonomic names), and we could add further links to the literature via sequences for taxa. Then there's the lack of geographic data. We could get some of this via georeferenced sequences in GenBank, but there's no RDF for this (Bio2RDF does have RDF for sequences but it ignores the bulk of the organismal metadata such as voucher specimens and latitude and longitude).

In many ways it's this lack of links that was point of my original email. The reality is that "linked data" isn't linked to anything like the extent that makes it useful. Simply pumping out RDF won't get us very far until we tackle this problem (see also my earlier post Linked data that isn't: the failings of RDF).

So, if you think RDF is the way to go, please tell me what you can learn from these data files.