Showing posts with label reconciliation. Show all posts

Monday, April 22, 2013

BioNames update - reconciliation strategies

Over on Google Plus (yeah, me neither) Donat Agosti is giving me a hard time regarding the quality of some data that I am using. I've responded to Donat directly, but here I just want to quickly outline two different approaches to cleaning and reconciling bibliographic metadata.

The problem addressed by Donat is the issue of multiple strings for the same journal (e.g., the plethora of different abbreviations and permutations people use to refer to the same journal). In trying to make sense of this mess there are a couple of strategies we can use. One is to cluster the strings into sets that we think refer to the same thing, e.g.:

[Image not preserved: a cluster of variant strings for the same journal]
We could then synthesise the preferred journal name from this set. We could make some sort of consensus string, for example. There are also some quite nice Bayesian methods for combining contradictory metadata.
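As a rough illustration of the clustering idea, here is a minimal Python sketch. It is not the method used in BioNames: the matching rule (each word of one string is a prefix of the corresponding word in the other, after dropping a few stop words) and the "preferred name" rule (most frequent string, ties broken by length) are simplifying assumptions, and the function names are made up for this example.

```python
import re
from collections import Counter

# Small, illustrative stop-word list; a real one would be longer.
STOPWORDS = {"of", "the", "and", "for"}

def tokens(name):
    """Lowercase ASCII words of a journal string, minus stop words."""
    words = re.findall(r"[a-z]+", name.lower())
    return [w for w in words if w not in STOPWORDS]

def same_journal(a, b):
    """Crude test: same number of words, and each pair of words
    agrees up to abbreviation (one is a prefix of the other)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or len(ta) != len(tb):
        return False
    return all(x.startswith(y) or y.startswith(x) for x, y in zip(ta, tb))

def cluster(names):
    """Greedy clustering: put each string in the first cluster whose
    first member it matches, otherwise start a new cluster."""
    clusters = []
    for name in names:
        for group in clusters:
            if same_journal(name, group[0]):
                group.append(name)
                break
        else:
            clusters.append([name])
    return clusters

def preferred(group):
    """Pick a 'consensus' string: most frequent, then longest."""
    counts = Counter(group)
    return max(group, key=lambda s: (counts[s], len(s)))

names = [
    "Proc. Biol. Soc. Wash.",
    "Proceedings of the Biological Society of Washington",
    "Proc Biol Soc Washington",
    "Zootaxa",
]
for group in cluster(names):
    print(preferred(group), "<-", group)
```

A greedy prefix match like this will both over-merge and under-merge on real data, which is part of why choosing a "best" string gets messy quickly.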

Another approach, which I use, is to map the strings to a third party identifier, in this case an ISSN:

[Image not preserved: journal name strings mapped to an ISSN]
Once I've done this I can use the identifier to refer to the journal, hence ultimately I don't particularly care what string is best for the journal (indeed, I can defer to a third party for this decision).
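In code, the identifier approach amounts to little more than a lookup table. The sketch below is a hypothetical illustration rather than the BioNames implementation: the table contents, the `normalise` rule, and the function names are assumptions, and the ISSNs should be checked against an ISSN register before being relied on. The check-digit test uses the standard ISSN modulus 11 scheme.

```python
import re

# Illustrative mapping from normalised journal strings to ISSNs.
# In practice this table would be built up by reconciliation.
ISSN_BY_STRING = {
    "proc biol soc wash": "0006-324X",
    "proceedings of the biological society of washington": "0006-324X",
    "zootaxa": "1175-5326",
}

def normalise(name):
    """Lowercase and strip punctuation so trivial variants collapse."""
    return " ".join(re.findall(r"[a-z0-9]+", name.lower()))

def issn_is_valid(issn):
    """Check the ISSN check digit (weights 8..2, modulus 11, 10 -> 'X')."""
    m = re.fullmatch(r"(\d{4})-(\d{3})([\dX])", issn)
    if not m:
        return False
    digits = m.group(1) + m.group(2)
    total = sum(int(d) * w for d, w in zip(digits, range(8, 1, -1)))
    check = (11 - total % 11) % 11
    expected = "X" if check == 10 else str(check)
    return m.group(3) == expected

def to_issn(name):
    """Return the ISSN for a journal string, or None if unmapped."""
    return ISSN_BY_STRING.get(normalise(name))

print(to_issn("Proc. Biol. Soc. Wash."))
print(to_issn("Proceedings of the Biological Society of Washington"))
```

Both strings resolve to the same identifier, so nothing downstream needs to know which spelling is "correct".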

The point is that obsessing over clean, "correct" bibliographic metadata is something of a fool's errand. Obviously, it's nice to have clean metadata if you can get it, but in many cases there is no single right answer as to what the correct metadata is. Some journals have multiple names (e.g., in different languages), some run different volume numbering schemes in parallel, and date of publication can be rather problematic (see my Mendeley group on publication dates). If we can map a publication to a globally unique identifier, such as a DOI, then we can sidestep this issue and focus on what I think really matters: linking data together.

Wednesday, April 17, 2013

Reconciling author names using Open Refine and VIAF

In an earlier post I discussed using Open Refine (formerly Google Refine) to clean and reconcile taxon names. I've now added a service that reconciles author names using the Virtual International Authority File (VIAF) API. Using this service we can match authors to VIAF identifiers (you may have noticed these appearing on people's pages in Wikipedia, e.g. Mary J. Rathbun's Wikipedia page lists her VIAF as 61796012).

To use the service follow the instructions in the earlier post but add the service:

http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php

This service is fairly crude. In particular, I make no attempt to score the matches that VIAF returns, because this would require parsing and normalising author names. This could be added if needed. If you want some example names to try, here are some taxonomists:


George A Boulenger
G A Boulenger
Wilhelm Michaelsen
W Michaelsen
Colin Campbell Sanborn
Suzanne Hand
Philip Hershkovitz
Yehudah Leopold Werner
W B Spencer
Norman Platnick
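
If you'd rather call the service from a script than from Open Refine, something like the following Python sketch might work. It assumes the service speaks the usual Refine reconciliation protocol (a `queries` parameter holding a JSON object of `q0`, `q1`, ... queries, and a response with a `result` list per key); I haven't documented that here, so treat it as an assumption. The function names are made up for this example, the exact form of the returned `id` is not specified, and the service may no longer be online. Because the service doesn't score matches, the sketch simply takes the first result.

```python
import json
import urllib.parse
import urllib.request

# Service URL from the post (plain http://iphylo.org form).
SERVICE_URL = (
    "http://iphylo.org/~rpage/phyloinformatics/services/reconciliation_viaf.php"
)

def build_queries(names):
    """Build the batch query object: {"q0": {"query": name}, ...}."""
    return {f"q{i}": {"query": name} for i, name in enumerate(names)}

def best_matches(names, response):
    """Map each name to the id of its first result, or None.
    Assumes the standard {"q0": {"result": [...]}} response shape."""
    matches = {}
    for i, name in enumerate(names):
        results = response.get(f"q{i}", {}).get("result", [])
        matches[name] = results[0]["id"] if results else None
    return matches

def reconcile(names, url=SERVICE_URL):
    """POST the queries to the service and return name -> id (or None)."""
    payload = urllib.parse.urlencode(
        {"queries": json.dumps(build_queries(names))}
    ).encode("utf-8")
    with urllib.request.urlopen(url, data=payload, timeout=30) as resp:
        return best_matches(names, json.load(resp))

# Example call (requires network access and a running service):
# print(reconcile(["George A Boulenger", "W Michaelsen"]))
```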