iPhylo: parsing

Roderic D. M. Page

Monday, October 25, 2021

Problems with Plazi parsing: how reliable are automated methods for extracting specimens from the literature?

The Plazi project has become one of the major contributors to GBIF with some 36,000 datasets yielding some 500,000 occurrences (see Plazi's GBIF page for details). These occurrences are extracted from taxonomic publications using automated methods. New data is published almost daily (see latest treatments). The map below shows the geographic distribution of material citations provided to GBIF by Plazi, which gives you a sense of the size of the dataset.

By any metric Plazi represents a considerable achievement. But often when I browse individual records on Plazi I find records that seem clearly incorrect. Text mining the literature is a challenging problem, but at the moment Plazi seems something of a "black box". PDFs go in, the content is mined, and data comes out to be displayed on the Plazi web site and uploaded to GBIF. Nowhere does there seem to be an evaluation of how accurate this text mining actually is. Anecdotally it seems to work well in some cases, but in others it produces what can only be described as bogus records.

Finding errors

A treatment in Plazi is a block of text (and sometimes illustrations) that refers to a single taxon. Often that text will include a description of the taxon, and list one or more specimens that have been examined. These lists of specimens ("material citations") are one of the key bits of information that Plazi extracts from a treatment, as these citations get fed into GBIF as occurrences.

To help explore treatments I've constructed a simple web site that takes the Plazi identifier for a treatment and displays that treatment with the material citations highlighted. For example, for the Plazi treatment 03B5A943FFBB6F02FE27EC94FABEEAE7 you can view the marked up version at https://plazi-tester.herokuapp.com/?uri=622F7788-F0A4-449D-814A-5B49CD20B228. Below is an example of a material citation with its component parts tagged:

This is an example where Plazi has successfully parsed the specimen. But I keep coming across cases where specimens have not been parsed correctly, resulting in issues such as single specimens being split into multiple records (e.g., https://plazi-tester.herokuapp.com/?uri=5244B05EFFC8E20F7BC32056C178F496), geographical coordinates being misinterpreted (e.g., https://plazi-tester.herokuapp.com/?uri=0D228E6AFFC2FFEFFF4DE8118C4EE6B9), or collector's initials being confused with codes for natural history collections (e.g., https://plazi-tester.herokuapp.com/?uri=252C87918B362C05FF20F8C5BFCB3D4E).

Parsing specimens is a hard problem so it's not unexpected to find errors. But they do seem common enough to be easily found, which raises the question of just what percentage of these material citations are correct? How much of the data Plazi feeds to GBIF is correct? How would we know?

Systemic problems

Some of the errors I've found concern the interpretation of the parsed data. For example, it is striking that despite including marine taxa no Plazi record has a value for depth below sea level (see GBIF search on depth range 0-9999 for Plazi). But many records do have an elevation, including records from marine environments. Any record that has a depth value is interpreted by Plazi as being elevation, so we have aerial crustacea and fish.

Map of Plazi records with depth 0-9999m

Map of Plazi records with elevation 0-9999m
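
The two searches behind these maps can be expressed as queries against the GBIF occurrence search API. What follows is a minimal sketch, not the exact queries used for the maps: it assumes the `publishingOrg`, `depth` and `elevation` search parameters (the latter two taking a `low,high` range), and it takes the publisher key as an argument rather than hard-coding Plazi's. The function name `count_query` is made up for illustration.

```python
from urllib.parse import urlencode

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def count_query(publishing_org: str, field: str, low: int = 0, high: int = 9999) -> str:
    """Build a GBIF search URL for records with `field` in the range [low, high].

    `field` is either "depth" or "elevation". Setting limit=0 asks GBIF for
    no records, so only the summary (including the total count) is returned.
    """
    if field not in ("depth", "elevation"):
        raise ValueError("field must be 'depth' or 'elevation'")
    params = {
        "publishingOrg": publishing_org,  # GBIF publisher key (a UUID), supplied by the caller
        field: f"{low},{high}",           # range query, e.g. depth=0,9999
        "limit": 0,
    }
    return f"{GBIF_SEARCH}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder key: substitute the publisher's real GBIF UUID.
    for field in ("depth", "elevation"):
        print(count_query("PUBLISHER-KEY-GOES-HERE", field))
```

Fetching each URL and comparing the `count` values in the JSON responses would give the depth versus elevation contrast described above.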

Anecdotally I've also noticed that Plazi seems to do well on zoological data, especially journals like Zootaxa, but it often struggles with botanical specimens. Botanists tend to cite specimens rather differently to zoologists (botanists emphasise collector numbers rather than specimen codes). Hence data quality in Plazi is likely to be taxonomically biased.

Plazi is using GitHub to track issues with treatments so feedback on erroneous records is possible, but this seems inadequate to the task. There are tens of thousands of data sets, with more being released daily, and hundreds of thousands of occurrences, and relying on GitHub issues devolves the responsibility for error checking onto the data users. I don't have a measure of how many records in Plazi have problems, but I suspect it is a significant fraction because for any given day's output I can typically find errors.

What to do?

Faced with a process that generates noisy data there are several things we could do:

Have tools to detect and flag errors made in generating the data.
Have the data generator give estimates of the confidence of its results.
Improve the data generator.
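
The first of these, detecting and flagging errors, could start with a handful of rule-based checks on each parsed record. The sketch below is illustrative only: the field names are Darwin Core terms, but the function `flag_record`, the `marine` argument and the tiny list of collection codes are assumptions made for the example, and a real tool would draw on a proper registry of collections and a source of habitat information.

```python
import re

# Hypothetical sample of collection codes; a real tool would load a registry.
KNOWN_COLLECTION_CODES = {"NHMUK", "MNHN", "USNM"}


def flag_record(record: dict, marine: bool = False) -> list[str]:
    """Return human-readable warnings for one parsed material citation."""
    flags = []

    # Coordinates outside the valid range suggest a misparsed number.
    lat = record.get("decimalLatitude")
    lon = record.get("decimalLongitude")
    if lat is not None and not -90 <= lat <= 90:
        flags.append("latitude out of range")
    if lon is not None and not -180 <= lon <= 180:
        flags.append("longitude out of range")

    # A marine taxon with an elevation and no depth is the "aerial fish" case.
    has_elevation = record.get("minimumElevationInMeters") is not None
    has_depth = record.get("minimumDepthInMeters") is not None
    if marine and has_elevation and not has_depth:
        flags.append("marine record with elevation but no depth")

    # Two or three capital letters that are not a known collection code
    # may be a collector's initials that were tagged as a collection.
    code = record.get("collectionCode")
    if code and code not in KNOWN_COLLECTION_CODES and re.fullmatch(r"[A-Z]{2,3}", code):
        flags.append("collection code may be collector initials")

    return flags


if __name__ == "__main__":
    example = {
        "decimalLatitude": 95.0,
        "decimalLongitude": 10.0,
        "minimumElevationInMeters": 120,
        "collectionCode": "JS",
    }
    for warning in flag_record(example, marine=True):
        print(warning)
```

Checks like these cannot say that a record is correct, only that it looks suspicious, but they are cheap to run over every record before it reaches an aggregator.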

I think a comparison with the problem of parsing bibliographic references might be instructive here. There is a long history of people developing tools to parse references (I've even had a go). State-of-the-art tools such as AnyStyle feature machine learning, and are tested against human curated datasets of tagged bibliographic records. This means we can evaluate the performance of a method (how well does it retrieve the same results as human experts?) and also improve the method by expanding the corpus of training data. Some of these tools can provide a measure of how confident they are when classifying a string as, say, a person's name, which means we could flag potential issues for anyone wanting to use that record.
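
To make "how well does it retrieve the same results as human experts?" concrete, here is a minimal sketch of scoring a parser's token labels against a human-curated answer. It computes per-label precision, recall and F1, which is a common way to evaluate sequence labellers; it is not taken from AnyStyle or any other tool, and the function name and label names are made up for the example.

```python
from collections import Counter


def per_label_scores(gold: list[str], predicted: list[str]) -> dict[str, dict[str, float]]:
    """Compare predicted token labels with human-assigned (gold) labels."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must label the same tokens")

    true_pos, false_pos, false_neg = Counter(), Counter(), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            true_pos[g] += 1
        else:
            false_pos[p] += 1  # parser claimed label p where it was wrong
            false_neg[g] += 1  # parser missed a token the expert labelled g

    scores = {}
    for label in sorted(set(gold) | set(predicted)):
        tp, fp, fn = true_pos[label], false_pos[label], false_neg[label]
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


if __name__ == "__main__":
    gold = ["author", "author", "date", "title", "title", "journal"]
    pred = ["author", "title", "date", "title", "title", "journal"]
    for label, s in per_label_scores(gold, pred).items():
        print(f"{label:8s} P={s['precision']:.2f} R={s['recall']:.2f} F1={s['f1']:.2f}")
```

The same scoring would work for material citations if the labels were things like collector, collection code and locality, provided a human-tagged set of citations existed to score against.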

We don't have equivalent tools for parsing specimens in the literature, and hence have no easy way to quantify how good existing methods are, nor do we have a public corpus of material citations that we can use as training data. I blogged about this a few months ago and was considering using Plazi as a source of marked up specimen data to use for training. However, based on what I've looked at so far, Plazi's data would need to be carefully scrutinised before it could be used as training data.

Going forward, I think it would be desirable to have a set of records that can be used to benchmark specimen parsers, and ideally have the parsers themselves available as web services so that anyone can evaluate them. Even better would be a way to contribute to the training data so that these tools improve over time.

Plazi's data extraction tools are mostly desktop-based, that is, you need to download software to use their methods. However, there are experimental web services available as well. I've created a simple wrapper around the material citation parser; you can try it at https://plazi-tester.herokuapp.com/parser.php. It takes a single material citation and returns a version with elements such as specimen code and collector name tagged in different colours.

Summary

Text mining the taxonomic literature is clearly a gold mine of data, but at the same time it is potentially fraught as we try and extract structured data from semi-structured text. Plazi has demonstrated that it is possible to extract a lot of data from the literature, but at the same time the quality of that data seems highly variable. Even minor issues in parsing text can have big implications for data quality (e.g., marine organisms apparently living above sea level). Historically in biodiversity informatics we have favoured data quantity over data quality. Quantity has an obvious metric, and has milestones we can celebrate (e.g., one billion specimens). There aren't really any equivalent metrics for data quality.

Adding new types of data can sometimes initially result in a new set of quality issues (e.g., GBIF metagenomics and metacrap) that take time to resolve. In the case of Plazi, I think it would be worthwhile to quantify just how many records have errors, and develop benchmarks that we can use to test methods for extracting specimen data from text. If we don't do this then there will remain uncertainty as to how much trust we can place in data mined from the taxonomic literature.

Update

Plazi has responded, see Liberating material citations as a first step to more better data. My reading of their response is that it essentially just reiterates Plazi's approach and doesn't tackle the underlying issue: their method for extracting material citations is error prone, and many of those errors end up in GBIF.

Thursday, July 22, 2021

Citation parsing tool released

Quick note on a tool I've been working on to parse citations, that is to take a series of strings such as:

Möllendorff O (1894) On a collection of land-shells from the Samui Islands, Gulf of Siam. Proceedings of the Zoological Society of London, 1894: 146–156.
de Morgan J (1885) Mollusques terrestres & fluviatiles du royaume de Pérak et des pays voisins (Presqúile Malaise). Bulletin de la Société Zoologique de France, 10: 353–249.
Morlet L (1889) Catalogue des coquilles recueillies, par M. Pavie dans le Cambodge et le Royaume de Siam, et description ďespèces nouvelles (1). Journal de Conchyliologie, 37: 121–199.
Naggs F (1997) William Benson and the early study of land snails in British India and Ceylon. Archives of Natural History, 24:37–88.

and return structured data. This is an old problem, and pretty much a "solved" problem. See for example AnyStyle. I've played with AnyStyle and it's great, but I had to install it on my computer rather than simply use it as a web service. I also wanted to explore the approach a bit more as a possible model for finding citations of specimens.

Loving Sylvester Keil's AnyStyle reference parser https://t.co/pcbvctr5vf, but not loving the whole Ruby experience (whadda mean I need to upgrade Ruby on my Mac to install it?). Oh for a Docker version of a web service... still, very cool tool.
— Roderic Page (@rdmpage) July 7, 2020

After trying to install the underlying conditional random fields (CRF) engine used by AnyStyle and running into a bunch of errors, I switched to a tool I could get working, namely CRF++. After figuring out how to compile a C++ application to run on Heroku I started to wonder how to use this as the basis of a citation parser. Fortunately, I had used the Perl-based ParsCit years ago, and managed to convert the relevant bits to PHP and build a simple web service around it.

Although I've abandoned the Ruby-based AnyStyle I do use AnyStyle's XML format for the training data. I also built a crude editor to create small training data sets that uses a technique published by the author of the blogging tool I'm using to write this post (see MarsEdit Live Source Preview). Typically I use this to correctly annotate examples where the parser failed. Over time I add these to the training data and the performance gets better.
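
As an illustration of that training format, the sketch below reads a small hand-written sample in the style of AnyStyle's XML (a `dataset` of `sequence` elements whose children are named after the fields) and turns it into (token, label) pairs. The sample and the function name are made up for the example; this is not the code behind the editor or the parser.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<dataset>
  <sequence>
    <author>Naggs F</author>
    <date>(1997)</date>
    <title>William Benson and the early study of land snails in British India and Ceylon.</title>
    <journal>Archives of Natural History,</journal>
    <volume>24:</volume>
    <pages>37–88.</pages>
  </sequence>
</dataset>"""


def labelled_tokens(xml_text: str) -> list[list[tuple[str, str]]]:
    """Return one list of (token, label) pairs per <sequence> element."""
    root = ET.fromstring(xml_text)
    sequences = []
    for sequence in root.iter("sequence"):
        pairs = []
        for field in sequence:
            # Every whitespace-separated token inherits its field's tag as label.
            for token in (field.text or "").split():
                pairs.append((token, field.tag))
        sequences.append(pairs)
    return sequences


if __name__ == "__main__":
    for token, label in labelled_tokens(SAMPLE)[0]:
        print(f"{token}\t{label}")
```

Correcting a failed parse then amounts to moving text between elements in the XML and retraining on the enlarged file.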

This is pretty much a side project of a side project, but ultimately the goal is to employ it to help extract citation data from publications, both to generate data to populate BioStor, and also start to flesh out the citation graph for publications in Wikidata.

If you want to play with the tool it is at https://citation-parser.herokuapp.com. At the moment it takes some citation strings and returns the result in CSL-JSON, which is becoming the default way to represent structured bibliographic data. Code is on GitHub.
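
For anyone who hasn't met CSL-JSON, the sketch below shows roughly what one of the citations above looks like in that format. The field names (`container-title`, `issued`, `date-parts` and so on) are standard CSL-JSON; the function `to_csl_json`, its input dictionary and its single-author handling are simplifying assumptions for the example, and the output is not copied from the service.

```python
import json


def to_csl_json(fields: dict) -> dict:
    """Build a CSL-JSON item from already-parsed citation fields.

    Assumes a single author written as "Family Initials".
    """
    family, _, given = fields["author"].partition(" ")
    return {
        "type": "article-journal",
        "author": [{"family": family, "given": given}],
        "issued": {"date-parts": [[int(fields["year"])]]},
        "title": fields["title"],
        "container-title": fields["journal"],
        "volume": fields["volume"],
        "page": fields["pages"],
    }


if __name__ == "__main__":
    parsed = {
        "author": "Naggs F",
        "year": "1997",
        "title": "William Benson and the early study of land snails in British India and Ceylon",
        "journal": "Archives of Natural History",
        "volume": "24",
        "pages": "37–88",
    }
    print(json.dumps(to_csl_json(parsed), indent=2, ensure_ascii=False))
```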