Showing posts with label DOI. Show all posts

Wednesday, August 03, 2022

Papers citing data that cite papers: CrossRef, DataCite, and the Catalogue of Life

Quick notes to self following on from a conversation about linking taxonomic names to the literature. There are different sorts of citation:
  1. Paper cites another paper
  2. Paper cites a dataset
  3. Dataset cites a paper
Citation type (1) is largely a solved problem (although there are issues of the ownership and use of this data, see e.g. Zootaxa has no impact factor). Citation type (2) is becoming more widespread (but is not perfect, as GBIF's #citethedoi campaign demonstrates). But the idea is well accepted and there are guides to how to do it, e.g.:
Cousijn, H., Kenall, A., Ganley, E. et al. A data citation roadmap for scientific publishers. Sci Data 5, 180259 (2018). https://doi.org/10.1038/sdata.2018.259
However, things do get problematic because most (but not all) DOIs for publications are managed by CrossRef, which has an extensive citation database linking papers to other papers. Most datasets have DataCite DOIs, and DataCite manages its own citation links, but as far as I'm aware these two systems don't really talk to each other.
Citation type (3) is the case where a database is largely based on the literature, which applies to taxonomy. Taxonomic databases are essentially collections of literature that have opinions on taxa, and the database may simply compile those (e.g., a nomenclator), or come to some view on the applicability of each name. In an ideal world, each reference included in a taxonomic database would gain a citation, which would help better reflect the value of that work (a long-standing bone of contention for taxonomists).
It would be interesting to explore these issues further. CrossRef and DataCite do share Event Data (see also DataCite Event Data). Can this track citations of papers by a dataset? My take on Wayne's question:
Is there a way to turn those links into countable citations (even if just one per database) for Google Scholar?
is that he is after type 3 citations, which I don't think we have a way to handle just yet (but I'd need to look at Event Data a bit more). Google Scholar is a black box, and the academic community's reliance on it for metrics is troubling. But it would be interesting to try and figure out if there is a way to get Google Scholar to index the citations of taxonomic papers by databases. For instance, the Catalogue of Life has an ISSN (2405-884X), so it can be treated as a publication. At the moment its web pages have lots of identifiers for people managing data and their organisations (lots of ORCIDs and RORs), and DOIs for individual datasets (e.g., checklistbank.org), but precious little in the way of DOIs for publications (or, indeed, ORCIDs for taxonomists). What would it take for taxonomic publications in the Catalogue of Life to be treated as first class citations?

Tuesday, February 08, 2022

Duplicate DOIs (again)

This blog post provides some background to a recent tweet where I expressed my frustration about the duplication of DOIs for the same article. I'm going to document the details here.

The DOI that alerted me to this problem is https://doi.org/10.2307/2436688 which is for the article

Snyder, W. C., & Hansen, H. N. (1940). THE SPECIES CONCEPT IN FUSARIUM. American Journal of Botany, 27(2), 64–67.

This article is hosted by JSTOR at https://www.jstor.org/stable/2436688 which displays the DOI https://doi.org/10.2307/2436688 .

This same article is also hosted by Wiley at https://bsapubs.onlinelibrary.wiley.com/doi/abs/10.1002/j.1537-2197.1940.tb14217.x with the DOI https://doi.org/10.1002/j.1537-2197.1940.tb14217.x.

Expected behaviour

What should happen is that, if Wiley is going to be the publisher of this content (taking over from JSTOR), the DOI 10.2307/2436688 is redirected to the Wiley page, and the Wiley page displays this DOI (i.e., 10.2307/2436688). If I want to get metadata for this DOI, I should be able to use CrossRef's API to retrieve that metadata, e.g. https://api.crossref.org/v1/works/10.2307/2436688 should return metadata for the article.

What actually happens

Wiley display the same article on their web site with the DOI 10.1002/j.1537-2197.1940.tb14217.x. They have minted a new DOI for the same article! The original JSTOR DOI now resolves to the Wiley page (you can see this using the Handle Resolver), which is what is supposed to happen. However, Wiley should have reused the original DOI rather than mint their own.

Furthermore, while the original DOI still resolves in a web browser, I can't retrieve metadata about that DOI from CrossRef, so any attempt to build upon that DOI fails. However, I can retrieve metadata for the Wiley DOI, i.e. https://api.crossref.org/v1/works/10.1002/j.1537-2197.1940.tb14217.x works, but https://api.crossref.org/v1/works/10.2307/2436688 doesn't.
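The discrepancy is easy to check programmatically. What follows is a minimal sketch rather than a definitive tool: it asks the CrossRef endpoint quoted above for each DOI and records whether metadata comes back. The function names are made up for illustration, and the `get_status` parameter is there only so that the network call can be swapped out.

```python
import urllib.error
import urllib.parse
import urllib.request

# Endpoint as quoted in the text above.
CROSSREF_WORKS = "https://api.crossref.org/v1/works/"


def crossref_url(doi):
    """Build the CrossRef works URL for a DOI, keeping '/' unescaped."""
    return CROSSREF_WORKS + urllib.parse.quote(doi, safe="/")


def http_status(url):
    """Return the HTTP status code for a URL (this makes a live request)."""
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # A 404 and similar arrive as exceptions; report the code instead.
        return err.code


def compare_dois(dois, get_status=http_status):
    """Map each DOI to True if CrossRef returns metadata (HTTP 200) for it."""
    return {doi: get_status(crossref_url(doi)) == 200 for doi in dois}
```

Calling `compare_dois(["10.2307/2436688", "10.1002/j.1537-2197.1940.tb14217.x"])` would, given the behaviour described above, be expected to flag the JSTOR DOI as missing and the Wiley DOI as present, although that may change if CrossRef fixes the record.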

Why does this matter?

For anyone using DOIs as stable links to the literature the persistence of DOIs is something you should be able to rely upon, both for people clicking on links in web browsers and developers getting metadata from those DOIs. The whole rationale of the DOI system is a single, globally unique identifier for each article, and that these DOIs persist even when the publisher of the content changes. If this property doesn't hold, then why would a developer such as myself invest effort in linking using DOIs?

Just for the record, I think CrossRef is great and is a hugely important part of the scholarly landscape. There are lots of things that I do that would be nearly impossible without CrossRef and its tools. But cases like this, where we get massive duplication of DOIs when a publisher takes over an existing journal, fundamentally break the underlying model of stable, persistent identifiers.

Friday, March 18, 2016

The Plant List, GBIF, and the primary literature

TL;DR; The Plant List is now in GBIF http://doi.org/10.15468/btkum2.

Readers of this blog may recall that I've had a somewhat jaundiced view of The Plant List. The first version was released with a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license, which allowed copying so long as you didn't create a derived work (The Plant List: nice data, shame it's not open). This is frankly about the silliest possible license for a data set as, from my perspective, the whole reason for releasing data is so that it can be combined and enhanced with other data.

The second release (version 1.1) dropped an explicit CC license in favour of almost the reverse position (!). You can't copy the list "as is" without permission, but you can make derivative works "without prior written permission from us" (see Terms of Use for The Plant List). Progress, of a sort.

So, for the last week I've been working on getting a version of The Plant List into GBIF, and I've finally managed to achieve this. There isn't a single place you can grab the whole Plant List, so you have to scrape the web site for CSV files, then glue them together. I could argue that converting the data into a Darwin Core Archive is a derived work, but in case this seems not derivative enough (of course, nobody seems ready to define just what "derived" actually means) I started to augment the list of names by adding bibliographic identifiers. I've long argued (see e.g. Surfacing the deep data of taxonomy) that a fundamental limitation of existing taxonomic databases is that they don't explicitly link to the primary literature. This is why I built BioNames, and why I've been working to link the "micro citations" in IPNI to identifiers such as DOIs, JSTOR links, BioStor URLs and BHL page links (see project on github). So, I've added about 120,000 DOIs and JSTOR links to names in the Plant List. This is a subset of the links I've found for IPNI, but for this first release I've tried to keep things simple. I've also made the link between Plant List name and DOI/JSTOR via the IPNI identifier for a name, and the Plant List has omitted quite a few IPNI ids for reasons which aren't clear.

The Plant List version I've created is available in GBIF (http://doi.org/10.15468/btkum2 and http://www.gbif.org/dataset/d9a4eedb-e985-4456-ad46-3df8472e00e8). Having another list of plant names will be a useful addition to the checklists that GBIF already has, even if the Plant List is already somewhat out of date.

DOIs

One feature of the enhanced Plant List in GBIF is that for a subset of names (currently about 10%) there are direct links to the original publication of that name. For example, the record for Haniffia albiflora in the Plant List has a fairly cryptic bibliographic citation, Nordic J. Bot. 20: 287 2000, and no link to that publication. In the version I've uploaded to GBIF the name Haniffia albiflora looks like this:

[screenshot: GBIF record for Haniffia albiflora]

Note the full citation. But more importantly, the Publisher record link is the DOI http://doi.org/10.1111/j.1756-1051.2000.tb00745.x, so clicking on it takes you to the original description of this species:

[screenshot: publisher page for the original description]

There is a lot of plant taxonomic literature available in JSTOR, sadly most of it (along with specimen images) behind a paywall (see Why are botanists locking away their data in JSTOR Plant Science?). Some of the links from GBIF take you to JSTOR:

[screenshot: JSTOR page]

The DOI landscape is evolving, and there are now multiple DOI registration agencies minting DOIs for scientific papers. CrossRef provides easily the best services for discovery and metadata harvesting. Other agencies often have no equivalent, which makes it hard to discover DOIs for those papers. I've spent some time getting this information for Chinese and Taiwanese articles, e.g. http://dx.doi.org/10.6165/tai.1985.30.5:

[screenshot: article page for doi:10.6165/tai.1985.30.5]

and http://dx.doi.org/10.3969/j.issn.2095-0845.2005.04.002:

[screenshot: article page for doi:10.3969/j.issn.2095-0845.2005.04.002]

to give two examples of articles that are now linked to from the corresponding species page in GBIF.

It's all about the links

To reiterate, I believe that one of the key challenges facing biodiversity informatics is cross linking between disparate types of data and sources of information. At the moment most of our data resides in disconnected silos. The links I'm adding to plant names are a small step, but they can lead to all sorts of possibilities. For example, users of GBIF can click on a link and see the original paper. If, for example, GBIF doesn't have a map for the species discussed in that paper, it's likely that the paper may have some information (e.g., the type locality). If users click on the links, then that is going to drive more traffic to the original literature, thus increasing its visibility. Furthermore, now that we have a taxon identifier (from GBIF) linked to a bibliographic identifier, we can go in the opposite direction. Earlier I proposed a Javascript bookmarklet as a way to augment the information on a web page (see Rethinking annotating biodiversity data). We could have a popup on an article web page that can tell the user about the taxa mentioned in that paper. If GBIF has a map for those taxa, we can immediately place that paper in a geospatial context (e.g., Africa). This is barely scratching the surface of what is possible once we start breaking out of silos and share deeply linked data.

Thursday, September 17, 2015

On having multiple DOI registration agencies for the same journal

On Friday I discovered that BHL has started issuing CrossRef DOIs for articles, starting with the journal Revue Suisse de Zoologie. The metadata for these articles comes from BioStor. After a WTF and WWIC moment, I tweeted about this, and something of a Twitter storm (and email storm) ensued:

To be clear, I'm very happy that BHL is finally assigning article-level DOIs, and that it is doing this via CrossRef. Readers of this blog may recall an earlier discussion about the relative merits of different types of DOIs, especially in the context of identifiers for articles. The bulk of the academic literature has DOIs issued by CrossRef, and these come with lots of nice services that make them a joy to use if you are a data aggregator, like me. There are other DOI registration agencies minting DOIs for articles, such as Airiti Library in Taiwan (e.g., doi:10.6165/tai.1998.43(2).150) and ISTIC (中文DOI) in China (e.g., doi:10.3969/j.issn.1000-7083.2014.05.020) (pro tip, if you want to find out the registration agency for a DOI, simply append it to http://doi.crossref.org/doiRA/, e.g. http://doi.crossref.org/doiRA/10.6165/tai.1998.43(2).150). These provide stable identifiers, but not the services needed to match existing bibliographic data to the corresponding DOI (as I discovered to my cost while working with IPNI).
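The "pro tip" above is simple enough to script. Here is a rough sketch, under the assumption (which I haven't verified for every agency) that the doiRA service answers with a JSON list of the form `[{"DOI": "...", "RA": "..."}]`. The function names are invented for this example.

```python
import json
import urllib.parse
import urllib.request

# Service URL as given in the text above.
DOI_RA_ENDPOINT = "http://doi.crossref.org/doiRA/"


def doi_ra_url(doi):
    """Append the DOI to the doiRA endpoint, leaving '/', '(' and ')' as-is."""
    return DOI_RA_ENDPOINT + urllib.parse.quote(doi, safe="/()")


def parse_agency(payload):
    """Pull the agency name out of a doiRA response body.

    ASSUMPTION: the body is a JSON list such as [{"DOI": "...", "RA": "..."}].
    Returns None if the payload does not have that shape.
    """
    records = json.loads(payload)
    if isinstance(records, list) and records and isinstance(records[0], dict):
        return records[0].get("RA")
    return None


def registration_agency(doi):
    """Look up the registration agency for a DOI (this makes a live request)."""
    with urllib.request.urlopen(doi_ra_url(doi), timeout=30) as resp:
        return parse_agency(resp.read().decode("utf-8"))
```

Separating URL construction and parsing from the network call means the parsing assumption can be checked against a saved response before trusting it in bulk.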

However, now things get a little messy. From 2015 PDFs for Revue Suisse de Zoologie are being uploaded to Zenodo, and are getting DataCite DOIs there (e.g., doi:10.5281/zenodo.30012). This means that the most recent articles for this journal will not have CrossRef DOIs. From my perspective, this is a disappointing move. It removes the journal from the CrossRef ecosystem at a time when the uptake of CrossRef DOIs for taxonomic journals is at an all time high (both ZooKeys and Zootaxa have CrossRef DOIs), and now BHL is starting to issue CrossRef DOIs for the "legacy" literature (bear in mind that "legacy" in this context can mean articles published last year).

I've rehearsed the reasons why I think CrossRef DOIs are best elsewhere, but the key points are that articles are much easier to discover (e.g., using http://search.crossref.org), and are automatically first class citizens of the academic literature. However, not everybody buys these arguments.

Maybe a way forward is to treat the two types of DOI as identifying two different things. The CrossRef DOI identifies the article, not a particular representation. The Zenodo DOI (or any DataCite DOI) for a PDF identifies that representation (i.e., the PDF), not the article.

Having CrossRef and Zenodo (DataCite) DOIs coexist

This would enable CrossRef and Zenodo DOIs to coexist, provided we have some way of describing the relationship between the two kinds of DOI (e.g., CrossRef DOI - hasRepresentation -> Zenodo DOI).
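To make the idea concrete, here is a toy sketch of how such a relationship could be stored and queried in both directions. It is purely illustrative: the class and method names are invented, and no existing CrossRef or DataCite service is implied.

```python
from collections import defaultdict


class RepresentationLinks:
    """Toy store for 'article DOI - hasRepresentation -> file DOI' links."""

    def __init__(self):
        self._representations = defaultdict(set)  # article DOI -> file DOIs
        self._article = {}                        # file DOI -> article DOI

    def add(self, article_doi, representation_doi):
        """Record that representation_doi is one rendering of article_doi."""
        self._representations[article_doi].add(representation_doi)
        self._article[representation_doi] = article_doi

    def representations_of(self, article_doi):
        """All known representations of an article, sorted for stable output."""
        return sorted(self._representations.get(article_doi, set()))

    def article_for(self, representation_doi):
        """The article a representation belongs to, or None if unknown."""
        return self._article.get(representation_doi)


links = RepresentationLinks()
# The article DOI below is a placeholder; the Zenodo DOI is the one cited above.
links.add("10.xxxx/article-placeholder", "10.5281/zenodo.30012")
```

With this in place, `links.article_for("10.5281/zenodo.30012")` leads from the PDF back to the article, so an aggregator that encounters either DOI can treat them as referring to the same work.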

This would give those who want the biodiversity literature to be part of the wider CrossRef community the freedom to mint CrossRef DOIs. It gives those articles the benefits that come with CrossRef DOIs (findability, being included in lists of literature cited, citation statistics, customer support when DOIs break, altmetrics, etc.).

It would also enable those who want to ensure stable access to the contents of the biodiversity literature to use archives such as Zenodo, and have the benefits of those DOIs (stability, altmetrics, free file storage and free DOIs).

Having multiple DOIs for the same thing is, I'd argue, at the very least, unhelpful. But if we tease apart the notion of what we are identifying, maybe they can coexist. Otherwise I think we are in danger of making choices that, while they seem locally optimal (e.g., free storage and minting of DOIs), may in the long run cause problems and run counter to the goal of making the taxonomic literature as findable as the wider literature.

Wednesday, June 24, 2015

Thoughts on ReCon 15: DOIs, GitHub, ORCID, altmetric, and transitive credit

I spent last Friday and Saturday at ReCon 15 (Research in the 21st Century: Data, Analytics and Impact, hashtag #ReCon_15) in Edinburgh. Friday 19th was conference day, followed by a hackday at CodeBase. There's a Storify archive of the tweets so you can get a sense of the meeting.

Sitting in the audience a few things struck me.

  1. No identifier wars, DOIs have won and are everywhere.
  2. GitHub is influencing the way we do science, but we've much still to learn.
  3. ORCIDs are gaining traction.
  4. Nobody really understands "impact".

GitHub

GitHub is becoming more and more important, not only as a repository of scientific code and data, but as a useful model of the sorts of things we need to be doing. Arfon Smith gave a fascinating talk on GitHub. Apart from the obvious things such as version control, Arfon discussed the tools and mindset of open source programmers, and how that could be applied to scientific data. For example, software on GitHub is often automatically tested for bugs (and GitHub displays a badge saying whether things are OK). Imagine doing this for a data set, having it automatically checked for errors and/or internal consistency. Reproducibility is a big topic in science, but open source software has to be reproducible by default in the sense that it has to be able to be downloaded and compiled on a user's computer. These are just a couple of the things Arfon covered, see his slides for more.

Transitive Credit

One idea which particularly struck me was that of "transitive credit":

Katz, D. S. (2014, February 10). Transitive Credit as a Means to Address Social and Technological Concerns Stemming from Citation and Attribution of Digital Products. JORS. Ubiquity Press, Ltd. http://doi.org/10.5334/jors.be

From the above paper:

The idea of transitive credit is as follows: The credit map for product A, which is used by product B, feeds into the credit map for product B. For example, product A is a software package equally written by two authors and its credit map is that 50 percent of the credit for this should go the lead developer, 20 percent to the second developer, and 10 percent to the third developer. In addition, 5 percent should go to each of the four libraries that are needed to run the code. When this product is created and registered, this credit map is registered along with it. Product B is a paper that obtains new science results, and it depended on Product A. The person who registers the publication also registers its credit map, in this case 75 percent to her/himself, and 25 percent to the software code previous mentioned. Credit is now transitive, in that the lead software developer of the code can be given credit for 12.5 percent of the paper. If another paper is later written that extends the product B paper and gives 10% credit to that paper, the lead software package developer will also have 1.25% credit for the new paper.
The idea of being able to track credit across derived products is interesting, and is especially relevant to projects such as GBIF, where users can download large datasets that are themselves aggregations of data from numerous different providers (making it hard to calculate the relative contributions of each provider). If we then track citations of that data (and citations of those citations) we could give data providers a better estimate of the actual impact of their data.
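The arithmetic in the quoted example is just multiplication of shares along the chain of dependencies. As a minimal sketch (my own illustration, not code from Katz's paper), credit maps can be stored as dictionaries and flattened recursively. The contributor labels are invented, and the 90% share for the author of the third paper is an assumption, since the quote only specifies the 10% passed on to product B. It also assumes the credit maps contain no cycles.

```python
from fractions import Fraction


def pct(n):
    """Express a percentage as an exact fraction."""
    return Fraction(n, 100)


# Credit maps from the quoted example; names are illustrative labels only.
CREDIT_MAPS = {
    "software_A": {
        "lead_dev": pct(50), "second_dev": pct(20), "third_dev": pct(10),
        "library_1": pct(5), "library_2": pct(5),
        "library_3": pct(5), "library_4": pct(5),
    },
    "paper_B": {"author_B": pct(75), "software_A": pct(25)},
    # ASSUMPTION: the remaining 90% goes to the new paper's own author.
    "paper_C": {"author_C": pct(90), "paper_B": pct(10)},
}


def flatten_credit(product, credit_maps):
    """Resolve a product's credit map down to contributors with no map of their own."""
    totals = {}
    for contributor, share in credit_maps[product].items():
        if contributor in credit_maps:
            # The contributor is itself a product: pass its share down the chain.
            nested = flatten_credit(contributor, credit_maps)
            for leaf, sub_share in nested.items():
                totals[leaf] = totals.get(leaf, 0) + share * sub_share
        else:
            totals[contributor] = totals.get(contributor, 0) + share
    return totals
```

Here `flatten_credit("paper_B", CREDIT_MAPS)["lead_dev"]` is 1/8 (12.5%) and `flatten_credit("paper_C", CREDIT_MAPS)["lead_dev"]` is 1/80 (1.25%), matching the figures in the quote. For GBIF the same recursion would run from a citing paper, through a download, to the individual data providers.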

Impact

Euan Adie of Altmetric talked about "impact", and remarked on an example of a paper being cited in a policy document and this being picked up by Altmetric and seen by the authors of the paper, who had no idea that their work had influenced a policy document. This raises some intriguing possibilities, related to the idea of "transitive credit" above.

In building BioNames I've added the ability to show Altmetric "donuts" and I'm struck by examples like this one (see also reference in BioNames):

JENKINS, P. D., & ROBINSON, M. F. (2002, June). Another variation on the gymnure theme: description of a new species of Hylomys (Lipotyphla, Erinaceidae, Galericinae). Bulletin of The Natural History Museum. Zoology Series. Cambridge University Press (CUP) doi:10.1017/S0968047002000018

This paper has no recent "buzz" (e.g., Twitter, Facebook, Mendeley) but is cited on three Wikipedia pages. So, this paper has impact, albeit in social media. Many papers like this will slip below the social media radar but will be used by various databases and may contribute to subsequent work. Perhaps we could expand altmetrics sources of information to include some of those databases. For example, if a paper has been aggregated/cited by a major database (such as GBIF) then it would be nice to see that on the Altmetric donut. For authors this gives them another example of the impact of their work, but for the databases it's also an opportunity to increase engagement (if people have relevant work that doesn't appear in the donut they can take steps to have that work included in the aggregation). Obviously there are issues about what databases to count as providing signal for altmetrics, but there's scope here to broaden and quantify our notion of impact.

Hackday

The ReCon hackday was a pretty informal event held at CodeBase just down from Edinburgh Castle, and apparently the largest start-up incubator in the European tech scene. It was a pretty amazing place, and a great venue for a hackday. I spent the day looking at the ORCID API and seeing if I could create some mashups with Journal Map and my own BioNames. One goal was to see if we could generate a map of a researcher's study sites starting with their ORCID, using ORCID's API to retrieve a list of their publications, then talking to the Journal Map API to get point localities for those papers. The code worked, but the results were a little disappointing because Jim Caryl and I were focussing on University of Glasgow researchers, and they had few papers in Journal Map. The code, such as it is, is in GitHub.

My original idea was to focus on BioNames, and see how many authors of taxonomic papers had ORCIDs. Initial experiments seemed promising (see GitHub for code and data). Time was limited, so I got as far as building lists of DOIs from BioNames and discovering the associated ORCIDs. The next steps would be (a) providing ORCID login to BioNames, and (b) using ORCID to help cluster author name strings in BioNames. Still much to do.

I've not been to many hackdays/hackathons, but I find them much more rewarding than simply sitting in a lecture theatre and listening to people talk. Combining both types of meeting is great, and I look forward to similar events in the future.

Wednesday, January 28, 2015

Annotating GBIF, from datasets to nanopublications

Below I sketch what I believe is a straightforward way GBIF could tackle the issue of annotating and cleaning its data. It continues a series of posts Annotating GBIF: some thoughts, Rethinking annotating biodiversity data, and More on annotating biodiversity data: beyond sticky notes and wikis on this topic.

Let's simplify things a little and state that GBIF at present is essentially an aggregation of Darwin Core Archive files. These are for the most part simply CSV tables (spreadsheets) with some associated administrivia (AKA metadata). GBIF consumes Darwin Core Archives, does some post-processing to clean things up a little, then indexes the contents on key fields such as catalogue number, taxon name, and geographic coordinates.

What I'm proposing is that we make use of this infrastructure, in that any annotation is itself a Darwin Core Archive file that GBIF ingests. I envisage three typical use cases:

  1. A user downloads some GBIF data, cleans it for their purposes (e.g., by updating taxonomic names, adding some georeferencing, etc.) then uploads the edited data to GBIF as a Darwin Core Archive. This edited file gets a DOI (unless the user has got one already, say by storing the data in a digital archive like Zenodo).
  2. A user takes some GBIF data and enhances it by adding links to, for example, sequences in GenBank for which the GBIF occurrences are voucher specimens, or references which cite those occurrences. The enhanced data set is uploaded to GBIF as a Darwin Core Archive and, as above, gets a DOI.
  3. A user edits an individual GBIF record, say using an interface like this. The result is stored as a Darwin Core Archive with a single row (corresponding to the edited occurrence), and gets a DOI (this is a nanopublication, of which more later).

Note that I'm ignoring the other type of annotation, which is to simply say "there is a problem with this record". This annotation doesn't add data, but instead flags an issue. GBIF has a mechanism for doing this already, albeit one that is deeply unsatisfactory and isn't integrated with the portal (you can't tell whether anyone has raised an issue for a record).

Note also that at this stage we've done nothing that GBIF doesn't already do, or isn't about to do (e.g., minting DOIs for datasets). Now, there is one inevitable consequence of this approach, namely that we will have more than one record for the same occurrence, the original one in GBIF, and the edited record. But, we are in this situation already. GBIF has duplicate records, lots of them.

Duplication

As an example, consider the following two occurrences for Psilogramma menephron:

occurrence | taxon                              | longitude | latitude | catalogue number | sequence
887386322  | Psilogramma menephron Cramer, 1780 | 145.86301 | -17.44   | BC ZSM Lep 01337 |
1009633027 | Psilogramma menephron Cramer, 1780 | 145.86    | -17.44   | KJ168695         | KJ168695

These two occurrences come from the Zoologische Staatssammlung Muenchen - International Barcode of Life (iBOL) - Barcode of Life Project Specimen Data and Geographically tagged INSDC sequences data sets, respectively. They are for the same occurrence (you can verify this by looking at the metadata for the sequence KJ168695, where the specimen_voucher field is "BC ZSM Lep 01337").

What do we do about this? One approach would be to group all such occurrences into clusters that represent the same thing. We are then in a position to do some interesting things, such as compare different estimates of the same values. In the example above, there is clearly a difference in precision of geographic locality between the two datasets. There are some nice techniques available for synthesising multiple estimates of the same value (e.g., Bayesian belief networks), so we could provide for each cluster a summary of the possible values for each field. We can also use these methods to build up a picture of the reliability of different sources of annotation.

In a sense, we can regard one record (1009633027) as adding an annotation to the other (887386322), namely adding the DNA sequence KJ168695 (in Darwin Core parlance, "associatedSequences=[KJ168695]").
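A rough sketch of the clustering step, using the two records above, might look like the following. The dictionary keys are informal column names of my own choosing (only `associatedSequences` is taken from Darwin Core as mentioned above), and the voucher lookup is hard-coded from the GenBank metadata rather than fetched. It simply lists the distinct values per field; it does not attempt the Bayesian synthesis mentioned earlier.

```python
OCCURRENCES = [
    {"occurrence": "887386322",
     "taxon": "Psilogramma menephron Cramer, 1780",
     "longitude": "145.86301", "latitude": "-17.44",
     "catalogNumber": "BC ZSM Lep 01337", "associatedSequences": []},
    {"occurrence": "1009633027",
     "taxon": "Psilogramma menephron Cramer, 1780",
     "longitude": "145.86", "latitude": "-17.44",
     "catalogNumber": "KJ168695", "associatedSequences": ["KJ168695"]},
]

# specimen_voucher value taken from the metadata for sequence KJ168695.
SPECIMEN_VOUCHER = {"KJ168695": "BC ZSM Lep 01337"}


def cluster_key(record, vouchers):
    """Use the voucher of a linked sequence if known, else the catalogue number."""
    for sequence in record["associatedSequences"]:
        if sequence in vouchers:
            return vouchers[sequence]
    return record["catalogNumber"]


def cluster_occurrences(records, vouchers):
    """Group records that refer to the same physical specimen."""
    clusters = {}
    for record in records:
        clusters.setdefault(cluster_key(record, vouchers), []).append(record)
    return clusters


def summarise(cluster):
    """List the distinct values seen for each field, in order of appearance."""
    summary = {}
    for record in cluster:
        for field, value in record.items():
            values = value if isinstance(value, list) else [value]
            seen = summary.setdefault(field, [])
            for item in values:
                if item not in seen:
                    seen.append(item)
    return summary
```

Both records end up in a single cluster keyed on "BC ZSM Lep 01337", and the summary shows two competing longitudes (145.86301 and 145.86) alongside the sequence contributed by the second record.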

But the key point here is that GBIF will have to at some point address the issue of massive duplication of data, and in doing so it will create an opportunity to solve the annotation problem as well.

Github and DOIs

In terms of practicalities, it's worth noting that we could use github to manage editing GBIF data, as I've explored in GBIF and Github: fixing broken Darwin Core Archives. Although github might not be ideal (there are some very cool alternatives being developed, such as dat, see also interview with Max Ogden) it has the nice feature that you can publish a release and get a DOI via its integration with Zenodo. So people can work on datasets and create citable identifiers at the same time.

Nanopublications

If we consider that a Darwin Core Archive is basically a set of rows of data, then the minimal unit is a single row (corresponding to a single occurrence). This is the level at which some users will operate. They will see an error in GBIF and be able to edit the record (e.g., by adding georeferencing, an identification, etc.). One challenge is how to create incentives for doing this. One approach is to think in terms of nanopublications, which are:
A nanopublication is the smallest unit of publishable information: an assertion about anything that can be uniquely identified and attributed to its author.
A nanopublication comprises three elements:
  1. The assertion: In this context the Darwin Core record would be the assertion. It might be a minimal record in that, say, it only listed the fields relevant to the annotation.
  2. The provenance: the evidence for the assertion. This might be the DOI of a publication that supports the annotation.
  3. The publication information: metadata for the nanopublication, including a way to cite the nanopublication (such as a DOI), and information on the author of the nanopublication. For example, the ORCID of the person annotating the GBIF record.

As an example, consider GBIF occurrence 668534424 for specimen FMNH 235034, which according to GBIF is a specimen of Rhacophorus reinwardtii. In a recent paper

Matsui, M., Shimada, T., & Sudin, A. (2013, August). A New Gliding Frog of the Genus Rhacophorus from Borneo . Current Herpetology. Herpetological Society of Japan. doi:10.5358/hsj.32.112
Matsui et al. assert that FMNH 235034 is actually Rhacophorus borneensis based on a phylogenetic analysis of a sequence (GQ204713) derived from that specimen. In which case, we could have something like this:
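As a minimal sketch of how the three elements might map onto a flat row for this example: the field names `gbifID`, `catalogNumber`, `scientificName`, `associatedSequences` and `identificationReferences` follow GBIF/Darwin Core usage, but the two publication-information columns are hypothetical names made up for illustration, and their values are placeholders rather than real identifiers.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class Nanopublication:
    assertion: dict    # the Darwin Core fields being asserted
    provenance: dict   # evidence supporting the assertion
    publication: dict  # how to cite the nanopublication and who made it

    def to_row(self):
        """Flatten the three elements into one Darwin Core style row."""
        row = {}
        row.update(self.assertion)
        row.update(self.provenance)
        row.update(self.publication)
        return row


def to_csv(rows):
    """Serialise rows as CSV text, taking the columns from the first row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


nanopub = Nanopublication(
    assertion={
        "gbifID": "668534424",
        "catalogNumber": "FMNH 235034",
        "scientificName": "Rhacophorus borneensis",
        "associatedSequences": "GQ204713",
    },
    provenance={"identificationReferences": "doi:10.5358/hsj.32.112"},
    publication={
        # Hypothetical column names with placeholder values.
        "nanopublicationDOI": "DOI-TO-BE-MINTED",
        "annotatorORCID": "ORCID-OF-ANNOTATOR",
    },
)
```

Calling `to_csv([nanopub.to_row()])` gives a two-line CSV (header plus one record), which is the "Darwin Core Archive with a single row" described earlier, minus the archive's metadata files.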

The nanopublication standard is evolving, and has a lot of RDF baggage that we'd need to simplify to make it fit the Darwin Core model of a flat row of data, but you could imagine having a nanopublication which is a Darwin Core Archive that includes the provenance and publication information, and gets a citable identifier so that the person who created the nanopublication (in the example above I am the author of the nanopublication) can get credit for the work involved in creating the annotation. Using citable DOIs and ORCIDs to identify the nanopublication and its author embeds the nanopublication in the wider citation graph.

Note that nanopublications are not really any different from larger datasets, indeed we can think of a dataset of, say, 1000 rows as simply an aggregation of nanopublications. However, one difference is that I think GBIF would have to set up the infrastructure to manage the creation of nanopublications (which is basically: collect the user's input, add the user id, save, and mint a DOI). Whereas users working with large datasets may well be happy to work with those on, say, github or some other data editing environment, people willing to edit single records are unlikely to want to mess with that complexity.

What about the original providers?

Under this model, the original data provider's contribution to GBIF isn't touched. A user's annotation amounts to adding a copy of the record, with some differences (corresponding to the user's edits). Now, the data provider may choose to accept those edits, in which case they can edit their own database using whatever system they have in place, and then the next time GBIF re-harvests the data, the original record in GBIF gets updated with the new data (this assumes that data providers have stable ids for their records). Under this approach we free ourselves from thinking about complicated messaging protocols between providers and aggregators, and we also free ourselves from having to wait until an edit is "approved" by a provider. Any annotation is available instantly.

Summary

My goal here is to sketch out what I think is a straightforward way to tackle annotation that makes use of what GBIF is already doing (aggregating Darwin Core Archives) or will have to do real soon now (cluster duplicates). The annotated and cleaned data can, of course, live anywhere (and I'm suggesting that it could live on github and be archived on Zenodo), so people who clean and edit data are not simply doing it for the good of GBIF, they are creating data sets that can be used independently and be cited independently. Likewise, even if somebody goes to the trouble of fixing a single record in GBIF, they get a citable unit of work that will be linked to their academic profile (via ORCID).

Another aspect of this approach is that we don't actually need to wait for GBIF to do this. If we adopt Darwin Core Archive as the format for annotations, we can create annotations, mint DOIs, and build our own database of annotated data, with a view to being able to move that work to GBIF if and when GBIF is ready.

Tuesday, October 21, 2014

On identifiers (again)

I'm going to the TDWG Identifier Workshop this weekend, so I thought I'd jot down a few notes. The biodiversity informatics community has been at this for a while, and we still haven't got identifiers sorted out.

From my perspective as both a data aggregator (e.g., BioNames) and a data provider (e.g., BioStor) there are four things I think we need to tackle in order to make significant progress.

Discoverability (strings to things)


A basic challenge is to go from strings, such as bibliographic citations, specimen codes, taxonomic names, etc., to digital identifiers for those things. Most of our data is not born digital, and so we spend a lot of time mapping strings to identifiers. For example, publishers do this a lot when they take the list of literature cited at the end of a manuscript and add DOIs. Hence, one of the first things CrossRef did was provide a discovery service for publishers. This has now morphed into a very slick search tool http://search.crossref.org. Without discoverability, nobody is going to find the identifiers in the first place.
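As a rough illustration of "strings to things", here is a minimal sketch that turns a free-text citation into a query against the Crossref REST API. It assumes the `query.bibliographic` parameter on `https://api.crossref.org/works` and that the first hit is the best candidate; the function names are my own, and in practice the top hit would need checking before being trusted.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CROSSREF_WORKS = "https://api.crossref.org/works"


def build_query_url(citation, rows=1):
    """Build a Crossref REST API URL that searches for a free-text citation."""
    params = {"query.bibliographic": citation, "rows": rows}
    return CROSSREF_WORKS + "?" + urlencode(params)


def best_doi(response):
    """Return the DOI of the top-ranked hit in a parsed API response, or None."""
    items = response.get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None


def citation_to_doi(citation):
    """Network call: look up a citation string and return the top-ranked DOI."""
    with urlopen(build_query_url(citation)) as handle:
        return best_doi(json.load(handle))
```

Calling `citation_to_doi("Randall & McCarthy 1989 Solea stanalandi Japanese Journal of Ichthyology")` would hit the network; the two helper functions can be tried offline.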

Resolvability


Given an identifier it has to be resolvable (for both people and machines), and I'd argue that at least in the early days of getting that identifier accepted, there needs to be a single point of resolution. Some people are arguing that we should separate identifiers from their resolution, partly based on arguments that "hey, we can always Google the identifier". This argument strikes me as wrong-headed for several reasons.

Firstly, Google is not a resolution service. There's no API, so it's not scalable. Secondly, if you Google an identifier (e.g., 10.7717/peerj.190) you get a bunch of hits, which one is the definitive source of information on the thing with that identifier? It's not at all obvious, and indeed this is one of the reasons publishers adopted DOIs in the first place. If you Google a paper you can get all sorts of hits and all sorts of versions (preprint, manuscripts, PDFs on multiple servers, etc.). In contrast the DOI gives you a way to access the definitive version.

Another way of thinking about this is in terms of trust. At some point down the road we might have tools that can assess the trustworthiness of a source, and we will need these if we develop decent tools to annotate data (see More on annotating biodiversity data: beyond sticky notes and wikis). But until then the simplest way to engender trust is to have a single point of resolution (like http://dx.doi.org for DOIs). Think about how people now trust DOIs. They've become a mark of respectability for journals (no DOIs, you're not a serious journal), and new ideas such as citing diagrams and data gained further credence once sites like figshare started using DOIs.

Another reason resolvability matters is that I think it's a litmus test of how serious we are. One reason LSIDs failed is that we made them too hard to resolve, and as a consequence people simply minted "fake" LSIDs, dumb strings that didn't resolve. Nobody complained (because, let's face it, nobody was using them), so LSIDs became devalued to the point of uselessness. Anybody can mint a string and call it an identifier, if it costs nothing that's a good estimate of its actual value.

Persistence


Resolvability leads to persistence. Sometimes we hear the cliche that "persistence is a social matter, not a technological one". This is a vacuous platitude. The kind of technology adopted can have a big impact on the sociology.

The easiest form of identifier is a simple HTTP URL. But let's think about what happens when we use them. If I spend a lot of time mapping my data to somebody else's URLs (e.g., links to papers or specimens) I am taking a big risk in assuming that the provider of those URLs will keep those "live". At the same time, in linking to those URLs, I constrain the provider - if they decide that their URL scheme isn't particularly good and want to change it (or their institution decides to move to new servers or a new domain), they will break resources like mine that link to them. So a decision they made about their URL structure - perhaps late one Friday afternoon in one of those meetings where everybody just wants to go to the pub - will come back to haunt them.

One way to tackle this is indirection, which is the idea behind DOIs and PURLs, for example. Instead of directly linking to a provider URL, we link to an intermediate identifier. This means that I have some confidence that all my hard work won't be undone (I have seen whole journals disappear because somebody redesigned an institutional web site), and the provider can mess with different technologies for serving their content, secure in the knowledge that external parties won't be affected (because they link to the intermediate identifier). Programmers will recognise this as encapsulation.
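To make the indirection concrete, here is a toy sketch. The `Resolver` class, the identifier `10.9999/example.1`, and the example.org URLs are all invented for illustration; real DOI resolution is handled by the Handle System, not a Python dictionary.

```python
class Resolver:
    """A toy central resolver: stable identifier -> current provider URL."""

    def __init__(self):
        self._targets = {}

    def register(self, identifier, url):
        # The provider can call this again whenever its URL scheme changes.
        self._targets[identifier] = url

    def resolve(self, identifier):
        return self._targets[identifier]


resolver = Resolver()
resolver.register("10.9999/example.1", "http://old.example.org/papers?id=1")

# My database stores only the identifier, never the provider's URL.
my_links = ["10.9999/example.1"]

# The provider moves to a new domain and updates a single record.
resolver.register("10.9999/example.1", "https://new.example.org/article/1")

# My links are untouched, yet they now lead to the new location.
current = [resolver.resolve(identifier) for identifier in my_links]
```

The point of the sketch is that the provider's change touches one entry in the resolver, and nothing in `my_links` has to be edited.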

Some have argued that we can achieve persistence by simply insisting on it. For example, we fire off a memo to the IT folks saying "don't break these links!". Really? We have that degree of power over our institutional IT policies? This also misses the great opportunity that centralised indirection provides us with. In the case of DOIs for publications, CrossRef sits in the middle, managing the DOIs (in the sense that if a DOI breaks you have a single place to go and complain). Because they also aggregate all the bibliographic metadata, they are automatically able to support discoverability (they can easily map bibliographic metadata to DOIs). So by solving persistence we also solve discoverability.

Network effects


Lastly, if we are serious about this we need to think about how to engineer the widespread adoption of the identifier. In other words, I think we need network effects. When you join a social networking site, one of the first things they do is ask permission to see your "contacts" (who you already know). If any of those people are already on the network, you can instantly see that ("hey, Jane is here, and so is Bob"). Likewise, the network can target those you know who aren't on the network and prompt them to join.

If we are going to promote the use of identifiers, then it's no use thinking about simply adding identifiers to things, we need to think about ways to grow the network, ideally by adding networks at a time (like a person's list of contacts), not single records. CrossRef does this with articles: when publishers submit an article to CrossRef, they are encouraged to submit not just that article and its DOI, but the list of all references in the list of literature cited, identified where possible by DOIs. This means CrossRef is building a citation graph, so it can quickly demonstrate value to its members (through cited-by linking).

So, we need to think of ways of demonstrating value, and growing the network of identifiers more rapidly than one identifier at a time. Otherwise, it is hard to see how it would gain critical mass. In the context of, say, specimens, I think an obvious way to do this is to have services that tell a natural history collection how many times its specimens have been cited in the primary literature, or have been used as vouchers for DNA sequences. We can then generate metrics of use (as well as start to trace the provenance of our data).
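A sketch of what such a service might compute, assuming we already had a list of (paper, specimen) links. The paper DOIs below are made up, and the way the institution is read off the front of the specimen code is a simplifying assumption.

```python
from collections import Counter

# Hypothetical (paper DOI, specimen code) pairs extracted from the literature.
citations = [
    ("10.9999/paper.a", "MVZ 246033"),
    ("10.9999/paper.b", "MVZ 246033"),
    ("10.9999/paper.b", "ANSP 332467"),
]


def citations_per_collection(pairs):
    """Count how often each institution's specimens are cited."""
    counts = Counter()
    for _paper, code in pairs:
        institution = code.split()[0]  # assumes codes look like "ACRONYM number"
        counts[institution] += 1
    return counts


counts = citations_per_collection(citations)
```

The hard part, of course, is getting the pairs in the first place, which is exactly where resolvable specimen identifiers would help.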


Summary


I've no idea what will come out of the TDWG Workshop, but my own view is that unless we tackle these issues, and have a clear sense of how they interrelate, then we won't make much progress. These things are intertwined, and locally optimal solutions ("hey, it's easy, I'll just slap a URL on everything") aren't enough ("OK, how exactly do I find your URL? What happens when it breaks?"). If we want to link stuff together as part of the infrastructure of biodiversity informatics, then we need to think strategically. The goal is not to solve the identifier problem, the goal is to build the biodiversity knowledge graph.

Thursday, May 15, 2014

DOIs are not enough

I had a long Twitter conversation with Terry Catapano (@catapanoth) today, and as can happen with a distracted stream of tweets, I think we were a little at cross purposes. This blog post is an attempt to unpack the debate.

What prompted the conversation was the following paper:
Emery, Carlo et al (1899). Formiche di Madagascar raccolte dal Sig. A. Mocquerys nei pressi della Baia di Antongil (1897-1898).. Bullettino della Società Entomologica Italiana: 31 (1899) pp. 263-290. 10.5281/zenodo.9785
Not the paper so much, as the fact that it is stored on the Zenodo repository (which I was only looking at because of the announcement that GitHub now supports DOIs through Zenodo). Given that the PDF for Emery's paper was uploaded by the Plazi project, I wondered what was the intention of assigning a Zenodo DOI to this paper, rather than one from CrossRef.

Not all DOIs are equal


As Geoffrey Bilder notes in his post DOIs unambiguously and persistently identify published, trustworthy, citable online scholarly literature. Right?
...some have adopted a cargo-cult practice of seeing the mere presence of a DOI on a publication as a putative sign of “citability” or “authority.”
There is a danger that we fall into the trap of thinking that all we need to do is slap a DOI on a paper and all the good things that we associate with DOIs will magically happen. This isn't the case. Not all DOIs are the same. Zenodo DOIs are provided by DataCite, and DataCite DOIs don't have all the features that CrossRef provides for their DOIs.

CrossRef provides some key services, one of the most important of which is discoverability. Given a bibliographic reference, CrossRef has tools that can find whether it has a DOI (e.g., http://search.crossref.org). I use this a lot to map taxonomic papers to DOIs (by a lot I mean searching for DOIs for tens of thousands of articles). Most people don't do this, but you benefit from this service every time you read an article and see the literature cited section decorated with DOIs. Publishers use CrossRef's tools to convert citations from dumb strings to useful links. This feature, which we have come to expect from any modern article, relies on CrossRef having definitive metadata for lots (millions) of articles, all of which have DOIs. When publishers submit article metadata when registering their DOIs, they usually submit lists of literature cited (and the DOIs). This means that CrossRef is building a citation database, which you can see if you visit the web page for an article and see a "cited by" link.

Then there are additional services. Given that CrossRef has high quality bibliographic metadata for articles, if you have a DOI there is no need to type in the details of a paper. Most bibliographic software such as Mendeley and Zotero can take a DOI and flesh out those details for you. If a DOI fails to resolve, you can contact CrossRef Support and have somebody investigate. Then there are the new services such as FundRef and Prospect, which provide information on who funded a paper, and what text and data mining rights are available for a paper.

Why use DOIs?


The rationale for using DOIs for articles is so that they can be unambiguously identified, which in turn means we can build a robust citation network. But this requires infrastructure, and that is what CrossRef provides through tools like citation to DOI matching. Other DOI registration agencies don't do this, and CrossRef isn't aware of other DOIs, so putting, say, a DataCite DOI (such as those used by Zenodo) on an article doesn't achieve the primary goal of a DOI (embedding it in the citation graph of academic literature).

Hence, I regard putting a Zenodo DOI as basically a wasted opportunity. If we aren't making the primary biodiversity literature discoverable, and hence linkable, then all we are doing is keeping that literature in a ghetto (and reinforcing the impression that this literature, and taxonomy itself, really doesn't matter). It is striking that if you read a recent paper that describes a new species, the bulk of the systematic or ecological literature has DOIs, but the bulk of the taxonomic literature does not. If it doesn't have a CrossRef DOI, it's effectively invisible. All academic literature should get first class DOIs. Whether it's "legacy" or not is irrelevant, the Royal Society of London has DOIs on articles going back to 1800, these are now as accessible as any paper published today.

Eyes on the prize


So, if we are going to bring the taxonomic literature into the mainstream, make it discoverable and citable, then we should focus on bringing that literature into CrossRef's infrastructure. Archives like JSTOR do it, the Biodiversity Heritage Library (BHL) does it for some of its content (and they should be doing it at article level, right now).

One response to this is to say "but doesn't this cost money?" Of course it does. Everything does, nothing is free. What frustrates me most about this is that it's the wrong question. The first question should not be "how much does this cost?". If it is, you've already lost sight of the goal. Instead, we should be asking, "What do we want? What do we need to be able to do to progress our field?". Once we articulate that, then we figure out how to pay for it. And we figure that out because we've decided this is what we need.

I think we want discoverable, citable taxonomic literature, embedded in the rest of the scientific literature and the publishing process. We don't get that by simply buying the cheapest DOIs available and slapping them on articles. To do so is to fundamentally misunderstand why DOIs matter, and to ignore the role that infrastructure plays in their success in academic publishing.

Wednesday, July 17, 2013

Augmenting ZooKeys bibliographic data to flesh out the citation graph

In a previous post (Learning from eLife: GitHub as an article repository) I discussed the advantages of an Open Access journal putting its article XML in a version-controlled repository like GitHub. In response to that post Pensoft (the publisher of ZooKeys) did exactly that, and the XML is available at https://github.com/pensoft/ZooKeys-xml.

OK, "now what?" I hear you ask. Originally I'd used the example of incorrect bibliographic data for citations as the motivation, but there are other things we can do as well. For example, when reading a ZooKeys article (say, using my eLife Lens-inspired viewer) I notice references that should have a DOI but which don't. With the XML available I could add this. This adds another link in the citation graph (in this case connecting the ZooKeys paper with the article it cites). If Pensoft were to use that XML to regenerate the HTML version of the article on their web site then the reader will be able to click on the DOI and read the cited article (instead of the "cut-and-paste-and-Google-it" dance). Furthermore, Pensoft could update the metadata they've submitted to CrossRef, so that CrossRef knows that the reference with the newly added DOI has been cited by the ZooKeys paper.

To experiment with this I've written some scripts that take ZooKeys XML, extract each citation from the list of literature cited, and look up DOIs for each reference that lacks them (using the CrossRef metadata search API). If a DOI is found then I insert it into the original XML. I then push this XML to my fork of Pensoft's repository (https://github.com/rdmpage/ZooKeys-xml). I can then ask Pensoft to update their repository (by issuing a "pull request"), and if Pensoft like what they see, they can accept my edits.
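The gist of the augmentation step looks something like the sketch below. This is not the actual script: it assumes JATS-style markup in which each reference is a `mixed-citation` element and a DOI is stored as `<pub-id pub-id-type="doi">`, which may not match the ZooKeys XML exactly, and the DOI lookup is passed in as a plain function so that the search service can be swapped or faked.

```python
import xml.etree.ElementTree as ET


def add_missing_dois(xml_text, lookup):
    """Insert a <pub-id pub-id-type="doi"> into each citation that lacks one.

    `lookup` is any callable that takes citation text and returns a DOI or None.
    Returns the augmented XML and the number of DOIs added.
    """
    root = ET.fromstring(xml_text)
    added = 0
    for citation in root.iter("mixed-citation"):
        has_doi = any(p.get("pub-id-type") == "doi" for p in citation.iter("pub-id"))
        if has_doi:
            continue
        # Flatten the citation to a single whitespace-normalised string.
        text = " ".join("".join(citation.itertext()).split())
        doi = lookup(text)
        if doi:
            pub_id = ET.SubElement(citation, "pub-id", {"pub-id-type": "doi"})
            pub_id.text = doi
            added += 1
    return ET.tostring(root, encoding="unicode"), added


sample = """<ref-list>
<ref id="B1"><mixed-citation>Randall JE, McCarthy LJ (1989) Solea stanalandi, a new sole from the Persian Gulf.</mixed-citation></ref>
<ref id="B2"><mixed-citation>Some book without a DOI.</mixed-citation></ref>
</ref-list>"""


def fake_lookup(text):
    # Stand-in for a call to a DOI search service.
    return "10.1007/BF02914322" if "Solea stanalandi" in text else None


augmented, added = add_missing_dois(sample, fake_lookup)
```

Keeping the lookup separate from the XML editing also makes it easy to add further sources of identifiers later without touching the code that rewrites the file.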

Automating the process makes this much more scalable, although manual editing will still be useful in some cases, especially where the original references haven't been correctly atomised into title, journal, etc.

So that the output is visible independently of Pensoft deciding whether to accept it, I've updated my Zookeys article viewer to fetch the XML not from the ZooKeys web site, but from my GitHub repository. This means you get the latest version of the XML, complete with additional DOIs (if any have been added).

Initial experiments are encouraging, but it's also apparent that lots of citations lack DOIs. However, this doesn't mean that they aren't online. Indeed, a growing number of articles are available through my BioStor repository, and through BioNames. Both of these sites have an API, so the next step is to add them to the script that augments the XML. This brings us a little closer to the ultimate goal of having every taxonomic paper online and linked to every paper that either cites, or is cited by, that paper.

Monday, May 27, 2013

Multiple DOIs for the same article issued by different publishers

I've stumbled on a case where two different publishers have issued different DOIs for the same article. In this case, Springer and J-Stage both publish the Japanese Journal of Ichthyology (ISSN 0021-5090). The following article:
Randall, J. E., & McCarthy, L. J. (1989). Solea stanalandi, a new sole from the Persian Gulf. Japanese Journal of Ichthyology, 36(2), 196–199. doi:10.1007/BF02914322

is published by Springer with the DOI http://dx.doi.org/10.1007/BF02914322, and this DOI is registered with CrossRef. J-Stage publish the same article, with the DOI (http://dx.doi.org/10.11369/jji1950.36.196). This DOI is not registered with CrossRef. I haven't been able to find an easy way to discover the DOI registration agency for a DOI (surely there should be a simple service that tells me this?).
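For what it's worth, the doi.org proxy does appear to offer a lookup of this kind at `https://doi.org/doiRA/<doi>`. The sketch below assumes that endpoint and assumes its response is a JSON list of objects with `DOI` and `RA` keys; the function names are mine, and the response shape should be checked against a live call before relying on it.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def agency_url(doi):
    """URL of the (assumed) doi.org registration-agency lookup for a DOI."""
    return "https://doi.org/doiRA/" + quote(doi, safe="/")


def parse_agency(payload):
    """Pull the agency name out of a decoded response.

    Assumed shape: [{"DOI": "...", "RA": "..."}]
    """
    return payload[0].get("RA") if payload else None


def registration_agency(doi):
    """Network call: return the registration agency name for a DOI, or None."""
    with urlopen(agency_url(doi)) as handle:
        return parse_agency(json.load(handle))
```

Running `registration_agency` on each of the two DOIs above would be one way to confirm which agency is responsible for which.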

This illustrates a problem with the success of DOIs and the existence of multiple registration agencies. When there was essentially a single agency for publications (CrossRef) it was relatively easy to ensure that DOIs for publications were unique. Now that there are multiple DOI registration agencies it is possible for conflicts to arise. We might expect this to be rare, after all, surely there's only one publisher for an article? However, the publishing landscape is more complicated than that, with articles being served by multiple publishers, and archiving projects like JSTOR and BHL having content that overlaps with that of existing publishers. Messy (sigh).

Thursday, May 23, 2013

DOIs for specimens are here, but we're not quite there yet


I've been banging on about having citable, persistent identifiers for specimens, so was suitably impressed when Derek Sikes posted a comment on iPhylo that Arctos already does this. For example, here is a DOI for a specimen: http://dx.doi.org/10.7299/X7VQ32SJ.

So, we're all done, right? Not quite. DOIs by themselves don't get us where we (OK, where I think we) want to be. The DOI identifies a specimen, which is great (see discussion on iDigBio: You are putting identifiers on the wrong thing for why this matters). We can also get machine-readable metadata using the DOI (by using the URL http://data.datacite.org/10.7299/X7VQ32SJ ). The metadata is limited (ideally we'd want something like Darwin Core), but it is a start. It's not clear how we get from the DOI to Darwin Core.

There are at least two issues that remain to be tackled. The first is that we now have a bunch of identifiers for the same thing, e.g.:

Most of these identifiers don't know about each other (for example, GBIF doesn't know about the DOI, nor does Arctos link to GBIF). So we have disconnected pieces of information about the same thing.

The second issue is how do we discover a specimen DOI? CrossRef supports services where you can take a bibliographic citation, e.g. Phylogeny and biogeography of ice crawlers (Insecta: Grylloblattodea) based on six molecular loci: designating conservation status for Grylloblattodea species and get back a DOI (in this case, http://dx.doi.org/10.1016/j.ympev.2006.04.013). This makes it possible for publishers to take lists of literature cited in authors' manuscripts and quickly add DOIs to those citations. We don't have an equivalent service for specimens, which is going to make our task of linking specimens to sequences and the literature something of a challenge.

We are making progress, but there is some way to go. Identifiers are only part of the solution, we also need services.

Tuesday, March 26, 2013

Towards DOIs for Biodiversity Heritage Library articles

The new look Biodiversity Heritage Library includes articles extracted from BioStor, which is a step forwards in making the "legacy" biodiversity literature more accessible. But we still have some way to go. In particular the articles lack the obvious decoration of a modern article, the DOI. Consequently these articles still live in a twilight zone where they are cited in the literature but not linked to. DOIs are becoming more common for taxonomic articles. Zookeys has them, and now Zootaxa has adopted them (and will be applying them retrospectively to thousands of already published articles). Major archives of back issues digitised by Taylor and Francis, and Wiley, for example, also have DOIs.

One obstacle to assigning CrossRef DOIs to articles in BHL is the convention that DOIs are typically managed by the publisher of the journal. But in a number of cases the publisher may no longer exist, the journal may no longer be published, or the publisher may lack the commercial resources to support DOIs. In these cases perhaps BHL could adopt the role of publisher?

Another approach is that adopted by a number of other digital archives, whereby the archive assigns DOIs to articles, but these DOIs are registered not through CrossRef but with another DOI registration agency, such as DataCite. For example the Swiss Electronic Academic Library Service (SEALS) archive assigns DOIs to individual articles, such as http://dx.doi.org/10.5169/seals-88913.

There are some limitations to not using CrossRef DOIs, in particular, you don't get the full benefits of their metadata-based services such as getting metadata from a DOI, discovering DOIs from metadata, or citation linking. But all is not lost. Some services support both CrossRef and DataCite DOIs, such as http://crosscite.org/citeproc. For example, for the DOI 10.5169/seals-88913 we get some basic formatting:

Perret, Jean-Luc. (1961). Etudes herpétologiques africaines III. Société Neuchâteloise des Sciences Naturelles. doi:10.5169/seals-88913
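A formatted citation like the one above can also be requested programmatically using DOI content negotiation. The sketch below assumes the resolver at `https://doi.org/` honours an `Accept` header of `text/x-bibliography` with a `style` parameter for this DOI; the function names are my own, and only the first function runs without a network connection.

```python
from urllib.request import Request, urlopen


def citation_request(doi, style="apa"):
    """Build a request asking the DOI resolver for a formatted citation."""
    return Request(
        "https://doi.org/" + doi,
        headers={"Accept": "text/x-bibliography; style=" + style},
    )


def formatted_citation(doi, style="apa"):
    """Network call: return the citation text for a DOI."""
    with urlopen(citation_request(doi, style)) as handle:
        return handle.read().decode("utf-8").strip()


request = citation_request("10.5169/seals-88913")
```

Because the same request works regardless of which agency registered the DOI, this is one of the few services where the CrossRef/DataCite split is hidden from the user.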

This still leaves us lacking some services, such as finding DOIs for articles cited in a manuscript. However this is a service we can provide, and will have to anyway if we want to find all the digitised literature available (e.g., archives such as SEALS as well as numerous instances of DSpace). My preference would be for CrossRef DOIs, but if that proves problematic we can still get much of the functionality we need using other DOI providers.

Wednesday, September 05, 2012

BHL is duplicating DOIs because it doesn't know about articles

Quick note that as much as I like that the Biodiversity Heritage Library is using DOIs, they are generating them for publications that already have them (or are acquiring them from other sources). For example, here are the two DOIs for the same article (formatted using the DOI Citation Formatter), one from BHL and one from the Smithsonian:

Springer, V. G. (1982). Pacific Plate biogeography, with special reference to shorefishes / Victor G. Springer. Smithsonian Institution. doi:10.5962/bhl.title.37141
Springer, V. G. (1982). Pacific Plate biogeography, with special reference to shorefishes. Smithsonian Contributions to Zoology, (367), 1–182. doi:10.5479/si.00810282.367


The BHL DOI resolves to a page in BHL, the other DOI resolves to a page in the Smithsonian Digital Repository (this article also has the handle hdl:10088/5222).

Now this is a problem, because DOIs are meant to be unique: one article, one DOI. I've encountered duplicates elsewhere, but in these cases one should be an alias of the other. In the example above, the DOIs resolve to different locations. If you are just after the content this isn't a huge problem, but if, say, you were using the DOI to uniquely identify the publication (say, in a database) you have a problem: which DOI to choose? If you and I choose differently then we will make statements about the same article but be unaware of that sameness.
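Until one DOI is formally made an alias of the other, a database that uses DOIs as keys has to do the aliasing itself. Here is a minimal sketch; the alias table is hypothetical, and picking the Smithsonian DOI as the canonical one is an arbitrary choice made purely for illustration.

```python
# Hypothetical alias table: each duplicate DOI points at the one we treat as canonical.
ALIASES = {
    "10.5962/bhl.title.37141": "10.5479/si.00810282.367",
}


def canonical(doi):
    """Follow alias links until we reach a DOI that is not itself an alias."""
    seen = set()
    doi = doi.lower()
    while doi in ALIASES and doi not in seen:  # `seen` guards against cycles
        seen.add(doi)
        doi = ALIASES[doi]
    return doi


# Two databases record the same monograph under different DOIs.
mine = canonical("10.5962/bhl.title.37141")
yours = canonical("10.5479/si.00810282.367")
```

With the alias table in place `mine` and `yours` compare equal, so statements made about either DOI end up attached to the same publication.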

Much of this problem arises because BHL has no concept of articles. Most articles are likely to reside within scanned volumes of a journal, but some articles (e.g., monographs) may be treated as a single title by BHL, and each BHL title now gets a DOI.

I know that handling articles is on BHL's radar, but because it hasn't tackled it yet we are going to have cases where BHL DOIs duplicate existing DOIs. In these cases, BHL may have to make their DOI an alias of the other DOI.

Friday, July 20, 2012

Figshare and F1000 integrate data into publication: could TreeBASE do the same?

Quick thoughts on the recent announcement by figshare and F1000 about the new journals being launched on the F1000 Research site. The articles being published have data sets embedded as figshare widgets in the body of the text, instead of being, say, a static table. For example, the article:

Oliver, G. (2012). Considerations for clinical read alignment and mutational profiling using next-generation sequencing. F1000 Research. doi:10.3410/f1000research.1-2.v1
has a widget embedded in the text. You can interact with this widget to view the data. Because the data are in figshare those data are independently citable, e.g. the dataset "Simulated Illumina BRCA1 reads in FASTQ format" has a DOI http://dx.doi.org/10.6084/m9.figshare.92338.

Now, wouldn't it be cool if TreeBASE did something similar? Imagine if uploading trees to TreeBASE were easy, and that you didn't have to have published yet, you just wanted to store the trees and make them citable. Imagine if TreeBASE had a nice tree viewer (no, not a Java applet, a nice viewer that uses SVG, for example). Imagine if you could embed that tree viewer as a widget when you published your results. It's a win all round. People have an incentive to upload trees (nice viewer, place to store them, and others can cite the trees because they'd have DOIs). TreeBASE builds its database a lot more quickly (make it dead easy to upload trees), and then as more publishers adopt this style of publishing TreeBASE is well placed to provide nice visualisations of phylogenies pre-packaged, interactive, and citable. And let's not stop there, how about a nice alignment viewer? Perhaps this is something the currently rather moribund PLoS Currents Tree of Life could think about supporting?

Wednesday, July 11, 2012

Citations, Social Media & Science

Quick note that Morgan Jackson (@BioInFocus) has written a nice blog post, Citations, Social Media & Science, inspired by the fact that the following paper:

Kwong, S., Srivathsan, A., & Meier, R. (2012). An update on DNA barcoding: low species coverage and numerous unidentified sequences. Cladistics, no–no. doi:10.1111/j.1096-0031.2012.00408.x

cites my "Dark taxa" in the body of the text but not in the list of literature cited. This prompted some discussion of DOIs and blog posts on Twitter:



Read Morgan's post for more on this topic. While I personally would prefer to see my blog posts properly cited in papers like doi:10.1111/j.1096-0031.2012.00408.x, I suspect the authors did what they could given current conventions (blogs lack DOIs, are treated differently from papers, and many publishers cite URLs in the text, not the list of references cited). If we can provide DOIs (ideally from CrossRef so we become part of the regular citation network), suitable archiving, and — most importantly — content that people consider worthy of citation then perhaps this practice will change.

Wednesday, June 27, 2012

UUIDs

Just for future reference:

Friday, April 20, 2012

Quick thoughts on specimen identifiers

Based on recent discussions my sense is that our community will continue to thrash the issue of identifiers to death, repeating many of the debates that have gone on (and will go on) in other areas. To be trite, it seems to me we have three criteria: cheap, resolvable, and persistent. We get to pick two.

Cheap and resolvable means URLs, which everybody is nervous about because they break. They don't have to break, but for a bunch of reasons they do.

Cheap and persistent means things like the Darwin Core triplet or URNs. You can write things on paper and they will persist (the Biodiversity Heritage Library shows us that), but how in the digital era do we do anything with this? If it's not resolvable what, exactly, is the point? We tried URNs — even ones that were resolvable (LSIDs) — and that was a disaster (we learnt a lot, but what a mess).

Resolvable and persistent. This is where technologies such as DOIs reside. If every specimen had a DOI would we still be having this discussion? We'd have a resolvable identifier that is resistant to change (including loss of museum domain names, specimens moving to new institutions, etc.), and one that is already in use by CrossRef and DataCite, and will also play ball with linked data folks.

In practical terms, what if we had a convention that each collection gets its own DOI prefix "10.nnnn", after which it appends whatever specimen identifier makes sense (and is unique within that collection)?

The bulk of specimen identifiers in the wild are of the form "Institution" "Catalogue number", e.g. ANSP 332467 (from the example I discussed in BHL and GBIF as biomedical databases).

If we wrote this as a DOI of the form <doi prefix>/Collection/InstitutionCatalogue number then we'd have identifiers that (in part) matched what most people would expect to see. In the example above we would have something like:

10.nnnnn/MAL/ANSP332467

where "MAL" is the acronym for the Malacology collection. This is pretty close to "ANSP 332467", is human friendly, but would also be resolvable. It also carries limited branding (so if the specimen was moved from its current collection to a new institution, people wouldn't get too upset by the presence of "ANSP"). It would also help make the links between specimen codes and DOIs. We couldn't rely on 10.nnnnn/MAL/ANSP332467 being a specimen in the Academy of Natural Sciences' malacological collection, but it would be a good place to start looking.
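The convention is simple enough to express in a few lines. This is only a sketch of the proposal: the prefix `10.nnnnn` is the same placeholder used above, not a real DOI prefix, and the function assumes specimen codes of the form "acronym, then digits", which will not cover every collection.

```python
import re


def specimen_doi(prefix, collection, specimen_code):
    """Build '<prefix>/<collection>/<institution><catalogue number>'
    from a specimen code such as 'ANSP 332467'."""
    match = re.fullmatch(r"\s*([A-Za-z]+)\s*(\d+)\s*", specimen_code)
    if match is None:
        raise ValueError("unrecognised specimen code: %r" % specimen_code)
    institution, number = match.groups()
    return "%s/%s/%s%s" % (prefix, collection, institution.upper(), number)


doi = specimen_doi("10.nnnnn", "MAL", "ANSP 332467")
```

Going the other way (from a specimen code found in a paper to a candidate DOI) uses the same function, which is the sense in which the DOI would be "a good place to start looking".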

As I've argued before, we could centralise the minting of these identifiers using GBIF, but do it in such a way that host institutions could assume responsibility for it if and when they are able (i.e., initially GBIF is responsible for managing the DOI prefixes for each institution, with the option for institutions to do this). The beauty of identifiers like DOIs is that from the user's perspective the identifier is unchanged.

I'm hoping we'll make some progress on this in the coming months...

Sunday, December 11, 2011

DNA Barcoding, the Darwin Core Triplet, and failing to learn from past mistakes

Given various discussions about identifiers, dark taxa, and DNA barcoding that have been swirling around the last few weeks, there's one notion that is starting to bug me more and more. It's the "Darwin Core triplet", which creates identifiers for voucher specimens in the form <institution-code>:<OPTIONAL collection-code>:<specimen-id>. For example,

MVZ:Herp:246033

is the identifier for specimen 246033 in the Herpetology collection of the Museum of Vertebrate Zoology (see http://arctos.database.museum/guid/MVZ:Herp:246033).
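To make the structure concrete, here is a minimal Python sketch that splits a triplet into its parts. The names `Triplet` and `parse_triplet` are hypothetical, and I'm assuming that when the optional collection code is omitted the string simply has two colon-separated fields.

```python
from typing import NamedTuple, Optional


class Triplet(NamedTuple):
    institution: str
    collection: Optional[str]  # the collection code is optional
    specimen_id: str


def parse_triplet(text: str) -> Triplet:
    """Split '<institution-code>:<collection-code>:<specimen-id>' into fields."""
    parts = text.strip().split(":")
    if len(parts) == 3:
        return Triplet(parts[0], parts[1], parts[2])
    if len(parts) == 2:
        # Assumption: a two-field string has no collection code
        return Triplet(parts[0], None, parts[1])
    raise ValueError(f"not a Darwin Core triplet: {text!r}")


print(parse_triplet("MVZ:Herp:246033"))
```

Note that the parser can only tell you what the string says. It cannot tell you whether the specimen exists or where to find it.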

On the face of it this seems a perfectly reasonable idea, and goes some way towards addressing the problem of linking GenBank sequences to vouchers (see, for example, http://dx.doi.org/10.1016/j.ympev.2009.04.016, preprint at PubMed Central). But I'd argue that this is a hack, and one which potentially will create the same sort of mess that citation linking was in before the widespread use of DOIs. In other words, it's a fudge to postpone adopting what we really need, namely persistent resolvable identifiers for specimens.

In many ways the Darwin Core triplet is analogous to an article citation of the form <journal>, <volume>:<starting page>. In order to go from this "triplet" to the digital version of the article we've ended up with OpenURL resolvers, which are basically web services that take this triple and (hopefully) return a link. In practice building OpenURL resolvers gets tricky, not least because you have to deal with ambiguities in the <journal> field. Journal names are often abbreviated, and there are various ways those abbreviations can be constructed. This leads to lists of standard abbreviations of journals and/or tools to map these to standard identifiers for journals, such as ISSNs.
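A small Python sketch of what the client side of this looks like. The resolver address, the tiny abbreviation table, and the volume and page values are all made up for illustration; `genre`, `title`, `volume` and `spage` are the usual OpenURL 0.1 style keys, though any given resolver may expect something different.

```python
from urllib.parse import urlencode

# Hypothetical lookup table: a real resolver needs a far larger list,
# or a mapping to standard identifiers such as ISSNs.
JOURNAL_ABBREVIATIONS = {
    "proc zool soc lond": "Proceedings of the Zoological Society of London",
    "proc zool soc london": "Proceedings of the Zoological Society of London",
}


def normalise_journal(journal: str) -> str:
    """Map a journal abbreviation to a full title, if we recognise it."""
    key = " ".join(journal.lower().replace(".", " ").split())
    return JOURNAL_ABBREVIATIONS.get(key, journal)


def openurl(base: str, journal: str, volume: str, spage: str) -> str:
    """Turn <journal>, <volume>:<starting page> into an OpenURL-style query."""
    query = urlencode({
        "genre": "article",
        "title": normalise_journal(journal),
        "volume": volume,
        "spage": spage,
    })
    return f"{base}?{query}"


# resolver.example.org is a placeholder, not a real service;
# volume 1, page 1 are illustrative values only
print(openurl("https://resolver.example.org/openurl", "Proc. Zool. Soc. Lond.", "1", "1"))
```

Any abbreviation that isn't in the table passes through untouched, which is exactly where the ambiguity bites.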

This should sound familiar to anybody dealing with specimens. Databases such as the Registry of Biological Repositories and the Biodiversity Collections Index have been created to provide standardised lists of collection abbreviations (such as MVZ = Museum of Vertebrate Zoology). Indeed, one could easily argue that what we need is an OpenURL for specimens (and I've done exactly that).

As much as there are advantages to OpenURL (nicely articulated in Eric Hellman's post When shall we link?), ultimately this will end in tears. Linking mechanisms that depend on metadata (such as museum acronyms and specimen codes, or journal names) are prone to break as the metadata changes. In the case of journals, publishers can rename entire back catalogues and change the corresponding metadata (see Orwellian metadata: making journals disappear), journals can be renamed, merged, or moved to new publishers. In the same way, museums can be rebranded, specimens moved to new institutions, etc. By using a metadata-based identifier we are storing up a world of hurt for someone in the future. Why don't we look at the publishing industry and learn from them? By having unique, resolvable, widely adopted identifiers (in this case DOIs) scientific publishers have created an infrastructure we now take for granted. I can read a paper online, and follow the citations by clicking on the DOIs. It's seamless and by and large it works.

One could argue that a big advantage of the Darwin Core triplet is that it can identify a specimen even if it doesn't have a web presence (which is another way of saying that maybe it doesn't have a web presence now, but it might in the future). But for me this is the crux of the matter. Why don't these specimens have a web presence? Why is it the case that biodiversity informatics has failed to tackle this? It seems crazy that in the context of digital data (DNA sequences) and digital databases (GenBank) we are constructing unresolvable text strings as identifiers.

But, of course, much of the specimen data we care about is online, in the form of aggregated records hosted by GBIF. It would be technically trivial for GBIF to assign a decent identifier to these (for example, a DOI) and we could complete the link between sequence and specimen. There are ways this could be done such that these identifiers could be passed on to the home institutions if and when they have the infrastructure to do it (see GBIF and Handles: admitting that "distributed" begets "centralized").

But for now, we seem determined to postpone having resolvable identifiers for specimens. The Darwin Core triplet may seem a pragmatic solution to the lack of specimen identifiers, but it seems to me it's simply postponing the day we actually get serious about this problem.





Tuesday, November 29, 2011

Mapping names to literature: closing in on 250,000 names

Following on from my earlier post Linking taxonomic names to literature: beyond digitised 5×3 index cards I've been slowly updating my latest toy:

http://iphylo.org/~rpage/itaxonAlpheus

This site displays a database mapping over 200,000 animal names to the primary literature, using a mix of identifiers (DOIs, Handles, PubMed, URLs) as well as links to freely available PDFs where they are available. Lots still to do as about a third of the 1.5 million names in the database have citations that my code hasn't been able to parse. There are also lots of gaps that need to be filled in, for example missing DOIs or PubMed identifiers, and a lot of the earlier names are linked by "microcitations" to names, and I'll need to handle those (using code from my earlier project Nomenclator Zoologicus meets Biodiversity Heritage Library: linking names directly to literature).

The mapping itself is stored in a database that I'm constantly editing, so this is far from production quality, but I've found it eye-opening just how much literature is available. There is a lot of scope for generating customised lists of papers, for example, primary taxonomic sources for taxa currently on the IUCN Red List, or those taxa which have sequences in GenBank (building on the mapping of NCBI taxa onto Wikipedia). Given that a lot of the relevant literature is in BHL, or available as PDFs, we could do some data mining, such as extracting geographical coordinates, taxonomic names, and citations. And if linked data is your thing, the 110,000 DOIs and nearly 9,000 CiNii URLs all serve RDF (albeit not without a few problems).

I've set a "goal" of having 250,000 names mapped to the primary literature, at which point the database interface will get some much-needed attention, but for now have a look for your favourite animal and see if its original description has been digitised.

Thursday, November 24, 2011

BHL needs to engage with publishers (and EOL needs to link to primary literature)

Browsing EOL I stumbled upon the recently described fish Protoanguilla palau, shown below in an image by rairaiken2011:
Palauan Primitive Cave Eel

Two things struck me. The first is that the EOL page for this fish gives absolutely no clue as to where you would go to find out more about this fish (apart from an unclickable link to the Wikipedia page http://en.wikipedia.org/wiki/Protoanguilla - seriously, a link that isn't clickable?), despite the fact this fish has been recently described in an Open Access publication ("A 'living fossil' eel (Anguilliformes: Protanguillidae, fam. nov.) from an undersea cave in Palau", http://dx.doi.org/10.1098/rspb.2011.1289).

Now that I've got my customary grumble about EOL out of the way, let's look at the article itself. On the first page of the PDF it states:
This article cites 29 articles, 7 of which can be accessed free
http://rspb.royalsocietypublishing.org/content/early/2011/09/16/rspb.2011.1289.full.html#ref-list-1

So 22 of the articles or books cited in this paper are, apparently, not freely available. However, looking at the list of literature cited it becomes obvious that rather more of these citations are available online than we might think. For example, there are articles that are in the Biodiversity Heritage Library (BHL), e.g.


Then there are articles that are available in other digitising projects


Furthermore, there are articles that aren't necessarily free, but which have been digitised and have DOIs that have been missed by the publisher, such as the Regan paper above, and


So, the Proceedings of the Royal Society has underestimated just how many citations the reader can view online. The problem, of course, is how does a publisher discover these additional citations? Some have been missed because of sloppy bibliographic data. The missing DOIs are probably because the Regan citation lacks a volume number, and the Trewavas paper uses a different volume number to that used by Wiley (who digitised Proc. Zool. Soc. Lond.). But the content in BHL and other digital archives will be missed because finding these is not part of a publisher's normal workflow. Typically citations are matched by using services ultimately provided by CrossRef, and the bulk of BHL content is not in CrossRef.

So it seems there's an opportunity here for someone to provide a service for publishers that adds value to their content in at least three ways:
  1. Add missing DOIs due to problematic citations for older literature
  2. Add links to BHL content
  3. Add links to content in additional digitisation projects, such as journal archives in DSpace repositories

For readers this would enhance their experience (more of the literature becomes accessible to them), and for BHL and the repositories it will drive more readers to those repositories (how many people reading the paper on Protoanguilla palau have even heard of BHL?). I've said most of this before, but I really think there's an opportunity here to provide services to the publishing industry, and we don't seem to be grasping it yet.