Showing posts with label NHM. Show all posts

Monday, March 23, 2020

Darwin Core Million promo: best and worst

The following is a guest post by Bob Mesibov.
There's still time (to 31 March) to enter a dataset in the 2020 Darwin Core Million, and by way of encouragement I'll celebrate here the best and worst Darwin Core datasets I've seen.
The two best are real stand-outs because both are collections of IPT resources rather than one-off wonders.


The first is published by the Peabody Museum of Natural History at Yale University. Their IPT website has 10 occurrence datasets totalling ca 1.6M records updated daily, and I've only found minor data issues in the Peabody offerings. A recent sample audit of the 151,138 records with 70 populated Darwin Core fields in the botany dataset (as of 2020-03-18) showed refreshingly clean data:
  • entries correctly assigned to DwC fields
  • no missing-but-expected entry gaps
  • consistent, widely accepted vocabularies and formatting in DwC fields
  • no duplicate records
  • no character encoding errors
  • no gremlin characters
  • no excess whitespace or fancy alternatives to simple ASCII characters
The dataset isn't perfect and occurrenceRemarks entries are truncated at 254 characters, but other errors are scarce and easily fixed, such as
  • 14 records with plant taxa mis-classified as animals
  • 4 records with dateIdentified earlier than eventDate
  • minor pseudo-duplication in several fields, e.g. "Anna Murray Vail; Elizabeth G. Britton" and "Anne Murray Vail; Elizabeth G. Britton" in recordedBy
  • minor content errors in some entries, e.g. "tissue frozen; tissue frozen" and "|" (with no other characters in the entry).
I doubt if it would take more than an hour to fix all the Peabody Museum issues besides the truncation one, which for an IPT dataset with 10.5M data items is outstanding. There are even fields in which the Museum has gone beyond what most data users would expect. Entries in vernacularName, for example, are semicolon-separated hierarchies of common names: "dwarf snapdragon; angiosperms; tracheophytes; plants" for Chaenorhinum minus.
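A few of the checks mentioned above are easy to script. The following is a minimal sketch, not the audit method actually used for the Peabody data: it assumes the records have already been read into Python dictionaries keyed by Darwin Core field name, that dates are full ISO 8601 dates (YYYY-MM-DD), and the function name `audit` and the sample values are made up for illustration.

```python
from datetime import date


def audit(records):
    """Return (row index, problem) pairs for a list of dict records.

    Hypothetical helper: assumes every value is a string and that
    eventDate and dateIdentified, when present, are YYYY-MM-DD.
    """
    problems = []
    for i, rec in enumerate(records):
        event = rec.get("eventDate", "")
        identified = rec.get("dateIdentified", "")
        # An identification cannot predate the collecting event.
        if event and identified:
            if date.fromisoformat(identified) < date.fromisoformat(event):
                problems.append((i, "dateIdentified earlier than eventDate"))
        for field, value in rec.items():
            stripped = value.strip()
            # Entries such as "|" contain no letters or digits at all.
            if stripped and not any(ch.isalnum() for ch in stripped):
                problems.append((i, f"{field}: no alphanumeric content"))
            # Entries such as "tissue frozen; tissue frozen" repeat an item.
            parts = [p.strip() for p in stripped.split(";") if p.strip()]
            if len(parts) != len(set(parts)):
                problems.append((i, f"{field}: repeated item"))
    return problems


sample = [
    {"eventDate": "1977-05-03", "dateIdentified": "1976-01-01",
     "preparations": "tissue frozen; tissue frozen", "occurrenceRemarks": "|"},
    {"eventDate": "1977-05-03", "dateIdentified": "1978-01-01",
     "preparations": "skin; skull", "occurrenceRemarks": ""},
]
for row, problem in audit(sample):
    print(row, problem)
```

Pseudo-duplicates such as "Anna Murray Vail" and "Anne Murray Vail" need fuzzy comparison and a human eye, so they are left out of this sketch.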

The second IPT resource worth commending comes from GBIF Portugal and consists of 108 checklist, occurrence record and sampling event datasets. As with the Peabody resource, the datasets are consistently clean with only minor (and scattered) structural, format or content issues.

The problems appearing most often in these datasets are "double-encoding" errors with Portuguese words and no-break spaces in place of plain spaces, and for both of these we can probably blame the use of Windows programs (like Excel) at the contributing institutions. An example of double-encoding: the "ô" in the Portuguese "prôximo" is first encoded in UTF-8 as a 2-byte character, then read by a Windows program as two separate single-byte characters, then converted back to UTF-8, resulting in the gibberish "prÃ´ximo". A large proportion of the no-break spaces in the Portuguese datasets unfortunately occur in taxon name strings, which don't parse correctly and which GBIF won't taxon-match.
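The double-encoding round trip described above can be reproduced in a few lines. This is a minimal sketch that assumes the Windows program read the bytes as Windows-1252 (`cp1252`); the function names are invented for illustration, and a real repair should be checked entry by entry, because not every string that can be reversed this way was actually double-encoded.

```python
NBSP = "\u00a0"  # no-break space


def double_encode(text: str) -> str:
    """Simulate the error: UTF-8 bytes misread as Windows-1252 characters."""
    return text.encode("utf-8").decode("cp1252")


def repair(text: str) -> str:
    """Reverse the error by undoing the mistaken Windows-1252 decoding."""
    return text.encode("cp1252").decode("utf-8")


def plain_spaces(text: str) -> str:
    """Replace no-break spaces with ordinary spaces."""
    return text.replace(NBSP, " ")


garbled = double_encode("prôximo")
print(garbled)           # prÃ´ximo
print(repair(garbled))   # prôximo
print(plain_spaces("Chaenorhinum" + NBSP + "minus"))
```

The "ô" becomes the two bytes C3 B4 in UTF-8, and Windows-1252 displays those bytes as "Ã" and "´", which is where the gibberish comes from.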

And the worst dataset? I've seen some pretty dreadful examples from around the world, but the UK's Natural History Museum sits at the top of my list of delinquent providers. The NHM offers several million records and a disappointingly high proportion of these have very serious data quality problems. These include invalid and inappropriate entries, disagreements between fields and missing-but-expected blanks.

Ironically, the NHM's data portal allows the visitor to select and examine/download records with any one of a number of GBIF issues, like "taxon_match_none". Further, for each record the data portal reports "GBIF quality indicators", as shown in this screenshot:



Clicking on that indicator box gives the portal visitor a list of the things that GBIF found wrong with the record (a list that overlaps incompletely with the list I can find with a data audit). I'm sure the NHM sees this facility differently, but to me it nicely demonstrates that NHM has prioritised Web development over data management. The message I get is
"We know there's a lot wrong with our data, but we're not going to fix anything. Instead, we're going to hand our mess as-is to any data users out there, with cleverly designed pointers to our many failures. Suck it up, people."
In isolation NHM might be seen as doing what it can with the resources it has. In a broader context the publication of multitudes of defective records by NHM is scandalous. Institutions with smaller budgets and fewer staff do a lot better with their data — see above.

Coronavirus

If your institution is closed and you have spare work-from-home time, consider doing some data cleaning. For those not afraid of the command line, I've archived the websites A Data Cleaner's Cookbook (version 2) and its companion blog BASHing data (first 100 posts) in Zenodo with local links between the two, so that the two resources can be downloaded and used offline in any Web browser.

Thursday, March 03, 2016

Cisco Pit Stop: Digitising the Natural History Museum’s collections

Last week (25-26 February) I was in London for the Cisco Pit Stop event. Thursday evening was at the Natural History Museum, where I gave a talk extolling the virtues of linking stuff together:

My slides are here:

Friday we assembled at the Digital Catapult Centre, which, as Sandy Knapp notes, has some amazing views from its 9th floor.

A group of experts (loosely defined, at least, if they include me in that category) and small businesses (with backgrounds in digitisation, text-mining, publishing, etc.) got together to try and come up with workable ideas where we could marry issues in digitisation and subsequent use of that data with tools and markets. A fascinating experience, although I'm not yet sure what the outcome will be. But it's always useful talking (and listening) to people with very different backgrounds and notions of what matters (and what is possible).

Thursday, December 18, 2014

Linking data from the NHM portal with content in BHL

One reason I'm excited by the launch of the NHM data portal is that it opens up opportunities to link publications about specimens in the NHM to the record of the specimens themselves. For example, consider specimen 1977.3097, which is in the new portal as http://data.nhm.ac.uk/dataset/collection-specimens/resource/05ff2255-c38a-40c9-b657-4ccb55ab2feb/record/2336568 (possibly the ugliest URL ever).


This specimen is of the bat Pteralopex acrodonta, shown in the image to the right (by William N. Beckon, taken from the EOL page for this species). This species was described in the following paper:
Hill JE, Beckon WN (1978) A new species of Pteralopex Thomas, 1888 (Chiroptera: Pteropodidae) from the Fiji Islands. Bulletin of the British Museum (Natural History) Zoology 34(2): 65–82. http://biostor.org/reference/8
This paper is in my BioStor project, and if you visit BioStor you'll see that BioStor has extracted a specimen code (BM(NH) 77.3097) and also has a map of localities extracted from the paper.

Looking at the paper we discover that BM(NH) 77.3097 is the type specimen of Pteralopex acrodonta:
HOLOTYPE. BM(NH) 77.3097. Adult . Ridge about 300 m NE of the Des Voeux Peak Radio Telephone Antenna Tower, Taveuni Island, Fiji Islands, 16° 50½' S, 179° 58' W, c. 3840ft (1170 m). Collected 3 May 1977 by W. N. Beckon, died 6-7 May 1977. Caught in mist net on ridge summit : bulldozed land with secondary scrubby growth, adjacent to primary forest. Original number 104. Skin and skull.
Note that the NHM data portal doesn't know that 1977.3097 is the holotype, nor does it have the latitude and longitude. Hence, if we can link 1977.3097 to BM(NH) 77.3097 we can augment the information in the NHM portal.

This specimen has also been cited in a subsequent paper:
Helgen, K. M. (2005, November). Systematics of the Pacific monkey‐faced bats (Chiroptera: Pteropodidae), with a new species of Pteralopex and a new Fijian genus. Systematics and Biodiversity. Informa UK Limited. doi:10.1017/s1477200005001702
You can read this paper in BioNames. In this paper Helgen creates a new genus, Mirimiri, for Pteralopex acrodonta, and cites the holotype (as BMNH 1977.3097). Hence, if we could extract that specimen code from the text and link it to the NHM record, we could have two citations for this specimen, and note that the taxon the specimen belongs to is also known as Mirimiri acrodonta.
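Matching the two published forms of the code to the portal's catalogue number is mostly a normalisation problem. Here is a minimal sketch, not how BioStor actually extracts codes: the regular expression only covers the two variants quoted above ("BM(NH) 77.3097" and "BMNH 1977.3097"), the function name is invented, and expanding a two-digit year to 19xx is an assumption that would be wrong for older registrations.

```python
import re

# Matches "BM(NH) 77.3097" and "BMNH 1977.3097"; other variants are not handled.
CODE_RE = re.compile(r"\bBM\s*\(?\s*NH\s*\)?\s*(\d{2}|\d{4})\.(\d+)\b")


def extract_catalogue_numbers(text: str) -> list:
    """Return specimen codes found in text, normalised to 'YYYY.NNNN'."""
    found = []
    for year, number in CODE_RE.findall(text):
        if len(year) == 2:
            # Assumption: two-digit years belong to the 20th century.
            year = "19" + year
        found.append(f"{year}.{number}")
    return found


print(extract_catalogue_numbers("HOLOTYPE. BM(NH) 77.3097. Adult"))
print(extract_catalogue_numbers("cites the holotype (as BMNH 1977.3097)"))
```

Both calls yield the same string, 1977.3097, which is the form the NHM portal uses for this specimen, so the two citations could be attached to the same record.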

Imagine being able to do this across the whole NHM data portal. The original description of this bat was published in a journal published by the NHM (and part of a volume contributed by the NHM to the Biodiversity Heritage Library). With a *cough* little work we could join up these two NHM digital resources (specimen and paper) to provide a more detailed view of what we know about this specimen. From my perspective this cross-linking between the different digital assets of an institution such as the NHM (as well as linking to external data such as other publications, GenBank sequences, etc.) is where the real value of digitisation lies. It has the potential to be much more than simply moving paper catalogues and publications online.

Wednesday, December 17, 2014

The Natural History Museum launches their data portal

The Natural History Museum has released their data portal (http://data.nhm.ac.uk/). As of now it contains 2,439,827 of the Museum's 80 million specimens, so it's still early days. I gather that soon this data will also appear in GBIF, ending the unfortunate situation where data from one of the premier natural history collections in the world was conspicuous by its absence.

I've not had a chance to explore it in much detail, but one thing I'm keen to do is see whether I can link citations of NHM specimens in the literature (e.g., articles in BioStor) with records in the NHM portal. Being able to do this would enable all sorts of cool things, such as being able to track what researchers have said about particular specimens, as well as develop citation metrics for the collection.


Thursday, September 15, 2011

Anchoring Biodiversity Information: from Sherborn to the 21st century and beyond

Next month I'll be speaking in London at The Natural History Museum at a one-day event, Anchoring Biodiversity Information: From Sherborn to the 21st century and beyond. This meeting is being organised by the International Commission on Zoological Nomenclature and the Society for the History of Natural History, and is partly a celebration of Sherborn's major work Index Animalium and partly a chance to look at the future of zoological nomenclature.

Details are available from the ICZN web site. I'll be giving a talk entitled "Towards an open taxonomy" (no, I don't know what I mean by that either). But it should be a chance to rant about the failure of taxonomy to embrace the Interwebs.


Wednesday, March 18, 2009

London Calling

Busy day yesterday, giving two talks, one at The Natural History Museum, one at the British Library. Slides for the NHM talk are below. Karen James pointed out the irony that a talk where I gave the NHM a hard time for being backward about embracing digitisation can't be viewed on most PCs at the NHM because SlideShare requires a recent version of Flash (which users can't install without IT's permission), and the downloaded presentation won't open because the NHM uses an older version of MS Office. So much for my attempts to share the slides. There will also be a video available at some point.

The second presentation was in the British Library's "Talk Science" series; for some background see the forum on Nature Network. There will be a podcast available of this presentation. In her introduction to my talk, Sarah Kemmitt quoted from a recent paper by Antonio G. Valdecasas ([JACC]1175-5326:1820@41) where he described Vagabundia sci:
Vagabundia comes from the Spanish word 'vagabundo' that means 'wanderer'. It is a feminine substantive; sci refers to Science Citation Index. We pointed out some time ago (Valdecasas et al. 2000) that the popularity of the Science Citation Index (SCI) as a measure of ‘good’ science has been damaging to basic taxonomic work. Despite statements to the contrary that SCI is not adequate to evaluate taxonomic production (Krell 2000), it is used routinely to evaluate taxonomists and prioritize research grant proposals. As with everything in life, SCI had a beginning and will have an end. Before it becomes history, I dedicate this species to this sociological tool that has done more harm than good to taxonomic work and the basic study of biodiversity. Young biologists avoid the 'taxonomic trap' or becoming taxonomic specialists (Agnarsson & Kuntner 2007) due to the low citation rate of strictly discovery-oriented and interpretative taxonomic publications. Lack of recognition of the value of these publications, makes it difficult for authors to obtain grants or stable professional positions.

My own feeling is that SCI probably does a reasonable job of ranking the impact of taxonomic publication; the real task is to broaden our notion of what gets cited.

Friday, February 20, 2009

Talks at The Natural History Museum and British Library

Vince Smith has produced a nice flyer for my forthcoming talk at The Natural History Museum on March 17th (11-12).



It will be a busy day as I'm also talking at the British Library in the evening (6pm - 8:30pm), for which Sarah Kemmitt has produced a flyer, and set up a discussion forum on Nature Network. With all this effort going into the artwork, I'd better actually come up with something useful to say.