iPhylo: specimen codes

Roderic D. M. Page

Showing posts with label specimen codes. Show all posts

Friday, May 27, 2022

Round trip from identifiers to citations and back again

Note to self (basically rewriting last year's Finding citations of specimens).

Bibliographic data supports going from identifier to citation string and back again, so we can do a "round trip."

1.

Given a DOI we can get structured data with a simple HTTP fetch, then use a tool such as citation.js to convert that data into a human-readable string in a variety of formats.

Identifier	⟶	Structured data	⟶	Human readable string
10.7717/peerj-cs.214	HTTP with content-negotiation	CSL-JSON	CSL templates	Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214

2.

Going in the reverse direction (string to identifier) is a little more challenging. In the "old days" a typical strategy was to attempt to parse the citation string into structured data (see AnyStyle for a nice example of this), then we could extract a truple of (journal, volume, starting page) and use that to query CrossRef to find if there was an article with that tuple, which gave us the DOI.

Identifier	⟵	Structured data	⟵	Human readable string
10.7717/peerj-cs.214	OpenURL query	journal, volume, start page	Citation parser	Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214

3.

Another strategy is to take all the citations strings for each DOI, index those in a search engine, then just use a simple search to find the best match to your citation string, and hence the DOI. This is what https://search.crossref.org does.

Identifier	⟵	Human readable string
10.7717/peerj-cs.214	search	Willighagen, L. G. (2019). Citation.js: a format-independent, modular bibliography tool for the browser and command line. PeerJ Computer Science, 5, e214. https://doi.org/10.7717/peerj-cs.214

At the moment my work on material citations (i.e., lists of specimens in taxonomic papers) is focussing on 1 (generating citations from specimen data in GBIF) and 2 (parsing citations into structured data).

Thursday, November 15, 2018

Geocoding genomic databases using GBIF

LwyH1HFe 400x400 I've put a short note up on bioRxiv about ways to geocode nucleotide sequences in databases such as GenBank. The preprint is "Geocoding genomic databases using GBIF" https://doi.org/10.1101/469650.

It briefly discusses using GBIF as a gazetteer (see https://lyrical-money.glitch.me for a demo) to geocode sequences, as well as other approaches such as specimen matching (see also Nicky Nicolson's cool work "Specimens as Research Objects: Reconciliation across Distributed Repositories to Enable Metadata Propagation" https://doi.org/10.6084/m9.figshare.7327325.v1).

Hope to revisit this topic at some point, for now this preprint is a bit of a placeholder to remind me of what needs to be done.

Tuesday, May 19, 2015

Text mining for museum specimen identifiers

Text mining for museum specimen identifiers #TDM http://t.co/BsFXSAZJpK cc @rdmpage @robgural thoughts?
— Ross Mounce (@rmounce) May 19, 2015

Class EMuMedia php This post is a response to Ross Mounce's post Text mining for museum specimen identifiers. As Ross notes in that post, mining literature for specimen codes is something I've been interested in for a while (search for specimen codes on iPhylo), and @Aime Rankin (formerly an undergraduate student at Glasgow) did some work on this as well. It's great to see progress in this area.

Here are some thoughts on Ross's post (I'm posting here rather than as a comment on Ross's blog because this is going to be long).

What questions to ask?

Obviously there's a lot of scope for metrics, such as numbers of citations for individual specimens, and league tables for collections (see GBIF specimens in BioStor: who are the top ten museums with citable specimens?). As Ross notes, there's also scope for updating out of date museum metadata with information from the literature (e.g., Linking data from the NHM portal with content in BHL), but even more interesting is the potential to cross-link databases in a way that permits novel queries. For example, if we have a paper on a disease that includes data we can link to a georeferenced specimen, then we can enable spatial queries for diseases (e.g., BHL and GBIF as biomedical databases).

Materials for mining

From my perspective the obvious corpus to mine is the Biodiversity Heritage Library (BHL). Ross repeats the erroneous view that BHL is just "legacy" literature. Apart from the obvious point that everything not published right not is, by definition, legacy, BHL has a lot of modern content (including papers published in the last couple of years).

Furthermore, there are journals that cite Natural History Museum specimens, including "in house" journals (e.g., Bulletin of the British Museum (Natural History) Zoology and Bulletin of the Natural History Museum. Zoology series), as well as the Bulletin of the British Ornithologists' Club which has published lots of new bird names for which the type specimen is often in the NHM.

I guess one issue is accessibility. Ross notes that:

The PMC OA subset is fantastic & really facilitates this kind of research – I wish ALL of the biodiversity literature was aggregated like (some) of the open access biomedical literature is. You can literally just download a million papers, click, and go do your research. It facilitates rigorous research by allowing full machine access to full texts.

So, how we can make BHL content as accessible? For each article I've extracted from BHL and stored in BioStor you can get full text by simply appending ".text" to the BioStor URL, but this isn't quite the same as grabbing a big dump of text.

The other source of mining is GenBank, which has a lot of sequences that have NHM vouchers, but also a weird and wonderful array of ways of recording those specimens. This is one reason I'm building "Material examined", to cope with these codes. For example sequence KF281084 has voucher "TRING 1877111743" which more traditionally would be written as "BMNH 1877.11.17.43", which is "NHMUK 1877.11.17.43" in the NHM database. This is just one example of the horrors of matching specimen codes (for more see the code for Material examined).

One reason GenBank is useful is that the sequences are often linked to the literature, which means you get to make the link between specimen and literature without actually needing to mine the text itself (handy if access is problematic).

Bonus question: How should I publish this annotation data?

But if I wanted to publish something a little better & a little more formal, what kind of RDF vocabulary can I use to describe “occurs in” or “is mentioned in”. What would be the most useful format to publish this data in so that it can be re-used and extended to become part of the biodiversity knowledge graph and have lasting value?

Personally I'd avoid RDF because that way lies madness (or at least endless detours haggling about ontologies).

But making the output useful is an important question. Despite the fact that it is a bit clunky, I suspect Darwin Core Archives are the way to go. The core data is a CSV table, so it's easy to generate, and also easy to use. Lets say you analysed a particular corpus (e.g., PLoS ONE), you could then output the data in Darwin Core (making sure both specimen and publication had stable identifiers), then package it up and upload to Zenodo or Figshare and get a DOI. For bonus points, it would be great to see this data on GBIF, but this would require (a) mapping NHM specimen codes to GBIF ids (the NHM has this), and (b) GBIF being able to recognise that the data you're adding is not new specimens but rather annotations of existing specimens.

Things to think about

Here are a couple of additional things to think about.

Specimen finding as a service

In the same way that we have taxonomic name-finding services, it would be great if we had a specimen code-finding service. I have code that I use in BioStor, but it would be great to have something that is robust, stable, and generalisable across multiple specimen codes. My tool Material examined focusses on parsing a single string rather than parsing a block of text, but adding that functionality is an obvious thing to do.

Markup as output

One concern I have with work that involves mining text is that we hardly ever store the intermediate step of text + located elements. Instead we get to see sumamry output (e.g., this page has these three scientific names, and these 10 specimen codes). As Terry Catapano (@catapanoth) once wisely pointed out "indexing is markup", in that if you find a substring in some text, you have in effect marked up the text. Can we preserved the marked up text so that we go back and look at it and improve our text mining methods, or make that markup available to others to build upon it? There are all sorts of things which could be built upon this information, for example, imaging if the results where given to BHL so that people could search by specimen code.

Tuesday, April 21, 2015

Looking up specimen codes in GBIF using Google Spreadsheet

Playing with the my "material examined" tool I've been working on, I wondered whether I could make use of it in, say, a spreadsheet. Imagine that I have a spreadsheet of museum codes and want to look those up in GBIF. I could create a service for Open Refine but Open Refine is a bit big and clunky, you have to fire up a Java application and point your browser at it, and Open Refine isn't as intuitive or as flexible as a spreadsheet.

It turns that Google Spreadsheets supports custom functions, including importing JSDON from a remote data source. Following How to import JSON data into Google Spreadsheets in less than 5 minutes here's what to do:

Create a new Google Spreadsheet.
Click on Tools -> Script Editor.
Click Create script for Spreadsheet.
Delete the placeholder content and paste the code from this script.
Rename the script to ImportJSON.gs and click the save button.
Back in the spreadsheet, in a cell, you can type “=ImportJSON()” and begin filling out it’s parameters.

Lets imagine we have a spreadsheet with a specimen code in cell A1, e.g. "FMNH 187122".

To call the material examined service, we need a function like this:

=ImportJSON(CONCATENATE("http://bionames.org/~rpage/material-examined/service/api.php?code=",A1,"&match&extend=10"), "/hits/key,/hits/scientificName", "noHeaders")

Paste this into cell B1 (i.e., just to the right of the specimen code) and after a short delay you should see something like this:

The three parameters supplied to ImportJSON are are the query URL, written as a spreadsheet function that grabs the specimen code from cell A1, a list of the bits of data we want to extract from the result (expressed as JSON paths), and some options (in this case, don't show the headers). ImportJSON will grab the specimen code in cell A1, add it to the query URL, then output the results. You should see something like this:

The first column is the GBIF occurrence ID, the second is the scientific name (you can add more JSON paths to get more fields).

Note that we have multiple rows as there is more than one specimen with the code "FMNH 187122" in GBIF. Now, we can ask the material examined service to return only certain taxa (such as mammals) by adding the "scientificName" parameter:

=ImportJSON(CONCATENATE("http://bionames.org/~rpage/material-examined/service/api.php?code=",A10,"&scientificName=",B10,"&match&extend=10"), "/hits/key,/hits/scientificName", "noHeaders")

If you put the specimen code in cell A10, and the higher taxon "Mammalia" in cell B10, and paste the function above into cell C10, then you should see something like this:

Note that now we have a single row with the mammal specimen.

It's a little bit fussy (you need to get the ImportJSON script, and mess a bit with the parameters but it's quick and flexible, and you get all the power of a spreadsheet to help clean the data before trying to match it to GBIF. Plus you can do it all in your browser.

Wednesday, April 15, 2015

Linking specimen codes to GBIF

I've put together a working demo of some code I've been working on to discover GBIF records that correspond to museum specimen codes. The live demo is at http://bionames.org/~rpage/material-examined/ and code is on GitHub.

To use the demo, simply paste in a specimen code (e.g., "MCZ 24351") and click Find and it will do it's best to parse the code, then go off to GBIF and see what it can find. Some examples that are fun include MCZ 24351, KU:IT:00312, MNHN 2003-1054, and AMS I33708-051

It's proof of concept at this stage, and the search is "live", I'm not (yet) storing any results. For now I simply want to explore how well if can find matches in GBIF.

By itself this isn't terribly exciting, but it's a key step towards some of the things I want to do. For example, the NCBI is interested in flagging sequences from type specimens (see http://dx.doi.org/10.1093/nar/gku1127 ), so we could imagine taking lists of type specimens from GBIF and trying to match those to voucher codes in GenBank. I've played a little with this, unfortunately there seem to be lots of cases where GBIF doesn’t know that a specimen is, in fact, a type.

Another thing I’m interested in is cases where GBIF has a georeferenced specimen but GenBank doesn’t (or visa versa), as a stepping stone towards creating geophylogenies. For example, in order to create a geophylogeny for Agnotecous crickets in New Caledonia (see GeoJSON and geophylogenies ) I needed to combine sequence data from NCBI with locality data from GBIF.

It’s becoming increasingly clear to me that the data supplied to GBIF is often horribly out of date compared to what is in the literature. Often all GBIF gets is what has been scribbled in a collection catalogue. By linking GBIF records to specimen codes cited that are cited in the literature we could imagine giving GBIF users enhanced information on a given occurrence (and at the same time get citation counts for specimens The impact of museum collections: one collection ≈ one Nobel Prize).

Lastly, if we can link specimens to sequences and the literature, then we can populate more of the biodiversity knowledge graph

Thursday, December 18, 2014

Linking data from the NHM portal with content in BHL

One reason I'm excited by the launch of the NHM data portal is that it opens up opportunities to link publications about specimens i the NHM to the record of the specimens themselves. For example, consider specimen 1977.3097, which is in the new portal as http://data.nhm.ac.uk/dataset/collection-specimens/resource/05ff2255-c38a-40c9-b657-4ccb55ab2feb/record/2336568 (possibly the ugliest URL ever).

1977 3097

This specimen is of the bat Pteralopex acrodonta, shown in the image to the right (by William N. Beckon, taken from the EOL page for this species). This species was described in the following paper:

Hill JE, Beckon WN (1978) A new species of Pteralopex Thomas, 1888 (Chiroptera: Pteropodidae) from the Fiji Islands. Bulletin of the British Museum (Natural History) Zoology 34(2): 65–82. http://biostor.org/reference/8

This paper is in my BioStor project, and if you visit BioStor you'll see see that BioStor has extracted a specimen code (BM(NH) 77.3097) and also has a map of localities extracted from the paper.

Map

Looking at the paper we discover that BM(NH) 77.3097 is the type specimen of Pteralopex acrodonta:

HOLOTYPE. BM(NH) 77.3097. Adult . Ridge about 300 m NE of the Des Voeux Peak Radio Telephone Antenna Tower, Taveuni Island, Fiji Islands, 16° 50½' S, 179° 58' W, c. 3840ft (1170 m). Collected 3 May 1977 by W. N. Beckon, died 6-7 May 1977. Caught in mist net on ridge summit : bulldozed land with secondary scrubby growth, adjacent to primary forest. Original number 104. Skin and skull.

Note that the NHM data portal doesn't know that 1977.3097 is the holotype, nor does it have the latitude and longitude. Hence, if we can link 1977.3097 to BM(NH) 77.3097 we can augment the information in the NHM portal.

This specimen has also been cited in a subsequent paper:

Helgen, K. M. (2005, November). Systematics of the Pacific monkey‐faced bats (Chiroptera: Pteropodidae), with a new species of Pteralopex and a new Fijian genus . Systematics and Biodiversity. Informa UK Limited. doi:10.1017/s1477200005001702

You can read this paper in BioNames. In this paper Helgen creates a new genus, Mirimiri for Pteralopex acrodonta, and cites the holotype (as BMNH 1977.3097). Hence, if we could extract that specimen code from the text and link it to the NHM record we could have two citations for this specimen, and note that the taxon the specimen belongs to is also known as Mirimiri acrodonta.

Imagine being able to do this across the whole NHM data portal. The original description of this bat was published in a journal published by the NHM (and part of a volume contributed by the NHM to the Biodiversity Heritage Library). With a *cough* little work we could join up these two NHM digital resources (specimen and paper) to provide a more detailed view what we know about this specimen. From my perspective this cross-linking between the different digital assets of an institution such as the NHM (as well as linking to external data such as other publications, GenBank sequences, etc.) is where the real value of digitisation lies. It has the potential to be much more than simply moving paper catalogues and publications online.

Friday, April 20, 2012

Quick thoughts on specimen identifiers

Based on recent discussions my sense is that our community will continue to thrash the issue of identifiers to death, repeating many of the debates that have gone on (and will go on) in other areas. To be trite, it seems to me we have three criteria: cheap, resolvable, and persistent. We get to pick two.

Cheap and resolvable means URLs, which everybody is nervous about because they break. They don't have to break, but for a bunch of reasons they do.

Cheap and persistent means things like Darwin Triplet Core or URNs. You can write things on paper and they will persist (the Biodiversity Heritage Library shows us that), but how in the digital era do we do anything with this? If it's not resolvable what, exactly, is the point? We tried URNs — even ones that were resolvable (LSIDs) — and that was a disaster (we learnt a lot, but what a mess).

Resolvable and persistent. This is where technologies such as DOIs reside. If every specimen had a DOI would we still be having this discussion? We'd have a resolvable identifier that is resistant to change (including loss of museum domain names, specimens moving to new institutions, etc.), and one that is already in use by CrossRef and DataCite, and will also play ball with linked data folks.

In practical terms, what if we had a convention that each collection gets it's own DOI prefix "10.nnnn", after which it appends whatever specimen identifier makes sense (and is unique within that collection).

The bulk of specimen identifiers in the wild are of the form "Institution" "Catalogue number", e.g. ANSP 332467 (from the example I discussed in BHL and GBIF as biomedical databases).

If we wrote this as a DOI of the form <doi prefix>/Collection/InstitutionCatalogue number then we'd have identifiers that (in part) matched what most people would expect to see. In the example above we would have something like:

10.nnnnn/MAL/ANSP332467

where "MAL" is the acronym for the Malacology collection. This is pretty close to "ANSP 332467", is human friendly, but would also be resolvable. It also carries limited branding, so if the specimen was moved from it's current collection to a new institution, people wouldn't get too upset by the presence of "ANSP"). It would also help make the links between specimen codes and DOIs. We couldn't rely on 10.nnnnn/MAL/ANSP332467 being a specimen in the Academy of Natural Sciences's malacological collection, but it would be a good place to start looking.

As I've argued before, we could centralise the minting of these identifiers using GBIF, but do it in a such a way that host institutions could assume responsibility for it if and when they are able (i.e., initially GBIF is responsible for managing the DOI prefixes for each institution, with the option for institutions to do this). The beauty of identifiers like DOIs is that from the user's perspective the identifier is unchanged.

I'm hoping we'll make some progress on this in the coming months...

Thursday, February 23, 2012

How many specimens does GBIF really have?

Duplicate records are the bane of any project that aggregates data from multiple sources. Mendeley, for example, has numerous copies of the same article, as documented by Duncan Hull (How many unique papers are there in Mendeley?). In their defence, Mendeley is aggregating data from lots of personal reference libraries and hence they will often encounter the same article with slightly differing metadata (we all have our own quirks when we store bibliographic details of papers). It's a challenging problem to identify and merge records which are not identical, but which are clearly the same thing.

What I'm finding rather more alarming is that GBIF has duplicate records for the same specimen from the same data provider. For example, the specimen USNM 547844 is present twice:

As far as I can tell this is the same specimen, but the catalogue numbers differ (547844 versus 547844.6544573). Apart from this the only difference is when the two records were indexed. The source for 547844 was last indexed August 9, 2009, the source for 547844.6544573 was first indexed August 22, 2010. So it would appear that some time between these two dates the US National Museum of Natural History (NMNH) changed the catalogue codes (by appending another number), so GBIF has treated them as two distinct specimens. Browsing other GBIF records from the NMNH shows the same pattern. I've not quantified the extent of this problem, but it's probably a safe bet that every NMNH herp specimen occurs twice in GBIF.

Then there are the records from Harvard's Museum of Comparative Zoology that are duplicates, such as http://data.gbif.org/occurrences/33400333/ and http://data.gbif.org/occurrences/328478233/ (both for specimen MCZ A-4092, in this case the collectionCode is either "Herp" or "HERPAMPH"). These are records that have been loaded at different times, and because the metadata has changed GBIF hasn't recognised that these are the same thing.

At the root of this problem is the lack of globally unique identifiers for specimens, or even identifiers that are unique and stable within a dataset. The Darwin Core wiki lists a field for occurrenceID for which it states:

The occurrenceID is supposed to (globally) uniquely identify an occurrence record, whether it is a specimen-based occurrence, a one-time observation of a species at a location, or one of many occurrences of an individual who is being tracked, monitored, or recaptured. Making it globally unique is quite a trick, one for which we don't really have good solutions in place yet, but one which ontologists insist is essential.

Well, now we see the side effect of not tackling this problem - our flagship aggregator of biodiversity data has duplicate records. Note that this has nothing to do with "ontologists" (whatever they are), it's simple data management. Assign a unique id (a primary key in a database will do fine) that can be used to track the identity of an object even as its metadata changes. Otherwise you are reduced to matching based on metadata, and if that is changeable then you have a problem.

Now, just imagine the potential chaos if we start changing institution and collection codes to conform to the Darwin Core triplet. In the absence of unique identifiers (again, these can be local to the data set) GBIF is going to be faced with a massive data reconciliation task to try and match old and new specimen records.

The other problem, of course, is that my plan to use GBIF occurrence URLs as globally unique identifiers for specimens is looking pretty shaky because they are unique (the same specimen can have more than one) and if GBIF cleans up the duplicates a number of these URLs will disappear. Bugger.

Thursday, January 26, 2012

Extracting museum specimen codes from text

Quick note about a tool I've cobbled together as part of the phyloinformatics course, which addresses a long standing need I and others have to extract specimen codes from text. I've had this code kicking around for a while (as part of various never-finished data mining projects), but never got around to releasing it, until now. It is very crude (basically a bunch of regular expressions), and there's a lot which could be done to improve it (not least starting with a complete list of museum specimen codes, rather than just those I've come across in, say Zootaxa and BioStor).

You can try the tool at http://iphylo.org/~rpage/phyloinformatics/services/specimenparser.php. Paste in some text and it will try and extract museum codes. The tool tries to handle ranges of specimens (e.g., MHNSM 1808-09), and some of the more common specimen numbering schemes.

Comments welcome. If you are looking for a source of text, papers in Zookeys or Zootaxa are a good place to start (especially papers on vertebrates where specimen numbers are often used). BioStor is also a good source: if you're looking at a paper in BioStor click on the "Text" link to get the OCR text for an article and paste that into the form at . For example, the text for Systematics of the Bufo coccifer complex (Anura: Bufonidae) of Mesoamerica is available at http://biostor.org/reference/97426.text.

The extraction tool can also be called as a web service using POST to get back the results in JSON.