iPhylo: ALA

Roderic D. M. Page


Friday, June 04, 2021

Thoughts on BHL, ALA, GBIF, and Plazi

If you compare the impact that BHL and Plazi have on GBIF, then it's clear that BHL is almost invisible. Plazi has successfully carved out a niche where they generate tens of thousands of datasets from text mining the taxonomic literature, whereas BHL is a participant in name only. It's not as if BHL lacks geographic data. I recently added back a map display in BioStor where each dot is a pair of latitude and longitude coordinates mentioned in an article derived from BHL's scans.

This data has the potential to fill in gaps in our knowledge of species distributions. For example, the Atlas of Living Australia (ALA) shows the following map for the cladoceran (water flea) Simocephalus:

Compare this to the localities mentioned in just one paper on this genus:

Timms, B. V. (1989). Simocephalus Schoedler (Cladocera: Daphniidae) in tropical Australia. The Beagle, 6, 89–96. Retrieved from https://biostor.org/reference/241776

There are records in this paper for species that currently have no records at all in ALA (e.g., Simocephalus serrulatus):

As it stands, BioStor simply extracts localities; it doesn't extract the full "material citation" from the text (that is, the specimen code, date collected, locality, etc. for each occurrence). If it did, it would then be in a position to contribute a large amount of data to ALA and GBIF (and elsewhere). Not only that, if it followed the Plazi model this contribution would be measurable (for example, in terms of numbers of records added, and numbers of data citations). Plazi makes some of its parsing tools available as web services (e.g., http://tb.plazi.org/GgWS/wss/test and https://github.com/gsautter/goldengate-webservices), so in principle we could parse BHL content and extract data in a form usable by ALA and GBIF.
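
To make the locality-extraction step concrete, here is a minimal sketch of pulling latitude/longitude pairs written in degrees, minutes and seconds out of a piece of text. It is not BioStor's or Plazi's actual code: the regular expression, the function names and the pairing rule (a N/S value immediately followed by an E/W value) are all assumptions made for illustration, and real literature uses many more coordinate formats than this handles.

```python
import re

# Assumed pattern: degrees°minutes'seconds"hemisphere, e.g. 19°38'57"S
DMS = re.compile(r"""(\d{1,3})°\s*(\d{1,2})['′]\s*(\d{1,2}(?:\.\d+)?)["″]\s*([NSEW])""")


def dms_to_decimal(degrees, minutes, seconds, hemisphere):
    """Convert one DMS value to signed decimal degrees (S and W are negative)."""
    value = int(degrees) + int(minutes) / 60 + float(seconds) / 3600
    return -value if hemisphere in ("S", "W") else value


def extract_pairs(text):
    """Return (latitude, longitude) tuples for each N/S value followed by an E/W value."""
    hits = [(dms_to_decimal(*m.groups()), m.group(4)) for m in DMS.finditer(text)]
    pairs = []
    i = 0
    while i < len(hits) - 1:
        (lat, lat_hemi), (lon, lon_hemi) = hits[i], hits[i + 1]
        if lat_hemi in ("N", "S") and lon_hemi in ("E", "W"):
            pairs.append((lat, lon))
            i += 2
        else:
            i += 1
    return pairs


sample = ("Namibia: 58 km W of Kamanjab Rest Camp on road to Grootberg Pass "
          "(19°38'57\"S, 14°24'33\"E)")
print(extract_pairs(sample))
```

A full material citation would also need the specimen code, collector and date, which is a much harder parsing problem than the coordinates alone.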

Notes on Plazi web service

The endpoint is http://tb.plazi.org/GgWS/wss/invokeFunction and it accepts POST requests, e.g. data=Namibia%3A%2058%20km%20W%20of%20Kamanjab%20Rest%20Camp%20on%20road%20to%20Grootberg%20Pass%20%2819%C2%B038%2757%22S%2C%2014%C2%B024%2733%22E%29&functionName=GeoCoordinateTaggerNormalizing.webService&dataUrl=&dataFormat=TXT and returns XML.
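
The request above can be sketched in Python using only the standard library. The endpoint, parameter names and function name are taken from the note above; the form-encoded content type, the UTF-8 decoding of the response and the helper names (build_body, tag_coordinates) are assumptions, and there is no guarantee the service is still running at this address.

```python
from urllib.parse import quote, urlencode
from urllib.request import Request, urlopen

ENDPOINT = "http://tb.plazi.org/GgWS/wss/invokeFunction"


def build_body(text, function_name="GeoCoordinateTaggerNormalizing.webService"):
    """Build the POST body; quote_via=quote encodes spaces as %20 rather than +."""
    params = {
        "data": text,
        "functionName": function_name,
        "dataUrl": "",
        "dataFormat": "TXT",
    }
    return urlencode(params, quote_via=quote)


def tag_coordinates(text):
    """POST the text to the service and return the XML response as a string.

    Assumes a form-encoded request and a UTF-8 response (not confirmed by the source).
    """
    request = Request(
        ENDPOINT,
        data=build_body(text).encode("ascii"),
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with urlopen(request) as response:
        return response.read().decode("utf-8")


locality = ("Namibia: 58 km W of Kamanjab Rest Camp on road to Grootberg Pass "
            "(19°38'57\"S, 14°24'33\"E)")
print(build_body(locality))
```

Only build_body runs when the snippet is executed; tag_coordinates makes a network call and is left for the reader to invoke.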

Monday, August 10, 2020

Australian museums and ALA

The following is a guest post by Bob Mesibov.

The Atlas of Living Australia (ALA) adds "assertions" to Darwin Core occurrence records. "Assertions" are indicators of particular data errors, omissions and questionable entries, such as "Coordinates are transposed", "Geodetic datum assumed WGS84" and "First [day] of the century".

Today (8 August 2020) I looked at assertions attached to records in ALA for non-fossil animals in the Australian State museums. There were 62 occurrence record collections from the seven museums (I lumped the two Tasmanian museums together), with 45 different assertions. I then calculated assertions per record for each collection. The worst performer was the Queensland Museum Porifera collection (3.84 ass/rec), and tied for best were the Museums Victoria Herpetology and Ichthyology collections (1.09 ass/rec).
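
The assertions-per-record calculation is simple division, sketched below. The collection names and counts are invented placeholders, not the figures from the analysis above; only the method (total assertions divided by total records, per collection and then per museum) follows the text.

```python
# Hypothetical input: (museum, collection) -> (number of records, number of assertions)
counts = {
    ("Museum A", "Porifera"): (10000, 38400),
    ("Museum A", "Insects"): (30000, 45600),
    ("Museum B", "Herpetology"): (20000, 21800),
}


def per_collection(counts):
    """Assertions per record for each collection."""
    return {key: assertions / records
            for key, (records, assertions) in counts.items() if records}


def per_museum(counts):
    """Pool collections by museum, then divide total assertions by total records."""
    totals = {}
    for (museum, _collection), (records, assertions) in counts.items():
        pooled_records, pooled_assertions = totals.get(museum, (0, 0))
        totals[museum] = (pooled_records + records, pooled_assertions + assertions)
    return {museum: assertions / records
            for museum, (records, assertions) in totals.items()}


# League table: fewest assertions per record first
for museum, rate in sorted(per_museum(counts).items(), key=lambda item: item[1]):
    print(f"{museum}: {rate:.2f} assertions per record")
```

Pooling before dividing weights each collection by its size; averaging the per-collection rates instead would give small collections the same influence as large ones.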

I also aggregated museum collections to build a kind of league table by State:

The clear winner is Museums Victoria.

But how well do ALA's assertions measure the quality of data records? Not all that well, actually.

The tests used to make the assertions generate false positives and false negatives, although at a low rate
The tests aren't independent, so that a single data error can "smear" across several assertions
The tests ignore errors and omissions in DwC fields that many data users would consider important

ALA's assertions also have a strong spatial/geographical bias, with 23 of the 45 assertions in my sample dataset saying something about the "where" of the occurrence. Looking just at those 23 "where" assertions, the museums league table again shows Museums Victoria ahead, this time by a wide margin:

ALA is currently working on better ways for users to filter out records with selected assertions, in what's misleadingly called a "Data Quality Project". The title is misleading because the overall quality of ALA's holdings doesn't improve one bit. Getting data providers to fix their data issues would be a more productive way to upgrade data quality, but I haven't seen any evidence that Australian museums (for example) pay much attention to ALA's assertions. (There are no or minimal changes in assertion totals between data updates.)

It's been pointed out to me that museum and herbarium records amount to only a small fraction of ALA's ca 90 million records, and that citizen scientists are growing the stock of occurrence records far faster than institutions do. True, and those citizen science records are often of excellent quality (see https://www.datafix.com.au/BASHing/2020-02-05.html). However, citizen science observations are strongly biased towards widespread and common species. ALA's records for just six common Australian birds (5,072,599 as of 8 August 2020; https://dashboard.ala.org.au/) outnumber all the museum animal records I looked at in the assertion analysis (4,669,508).

In my humble view, the longer ALA's institutional data providers put off fixing their mistakes, the less valuable ALA becomes as a bridge between biodiversity informatics and biodiversity science.

Tuesday, March 03, 2020

The 2020 Darwin Core Million

The following is a guest post by Bob Mesibov.

You're feeling pretty good about your institution's collections data. After carefully tucking all the data items into their correct Darwin Core fields, you uploaded the occurrence records to GBIF, the Atlas of Living Australia (ALA) or another aggregator, and you got back a great report:

all your scientific names were in the aggregator's taxonomic backbone
all your coordinates were in the countries you said they were
all your dates were OK (and in ISO 8601 format!)
all your recorders and identifiers were properly named
no key data items were missing

OK, ready for the next challenge for your data? Ready for the 2020 Darwin Core Million?

How it works

From the dataset you uploaded to the aggregator, select about one million data items. That could be, say, 50000 records in 20 populated Darwin Core fields, or 20000 records in 50 populated Darwin Core fields, or something in between. Send me the data for auditing before 31 March 2020 as a zipped plain-text file by email to robert.mesibov@gmail.com, together with a DOI or other identifier for their online, aggregated presence.

I'll audit datasets in the order I receive them. If I can't find any data quality problems in your dataset, I'll pay your institution AUD$150 and declare your institution the winner of the 2020 Darwin Core Million here on iPhylo. (One winner only; datasets received after the first problem-free dataset won't be checked.)

If I find data quality problems, I'll let you know by email. If you want to learn what the problems are, I'll send you a report detailing what should be fixed and you'll pay me AUD$150. At 0.3-0.75c/record, that's a bargain compared to commercial data-checking rates. And it would be really good to hear, later on, that those problems had indeed been fixed and corrected data had been uploaded to the aggregator.

What I look for

For a list of data quality problems, see this page in my Data Cleaner's Cookbook. The key problems are:

duplicate records
invalid data items
data items in the wrong fields
data items inappropriate for their field
truncated data items
records with items in one field disagreeing with items in another
character encoding errors
wildly erroneous dates or coordinates
incorrect or inconsistent formatting of dates, names and other data

If you think some of this is just nit-picking, you're probably thinking of your data items as things for humans to read and interpret. But these are digital data items intended for parsing and managing by computers. "Western Hill" might not be the same as "Western Hill" in processing, for example, because the second item might have a no-break space between the words instead of a plain space. Another example: humans see these 22 variations on collector names as "the same", but computers don't.
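
The no-break space example is easy to reproduce. The sketch below shows two strings that look identical on screen but compare as different, plus one way of reporting and normalising the hidden character. The function names are illustrative only and are not taken from the Data Cleaner's Cookbook.

```python
import unicodedata

plain = "Western Hill"        # ordinary space (U+0020)
hidden = "Western\u00a0Hill"  # no-break space (U+00A0)


def odd_characters(value):
    """List (position, Unicode name) for whitespace other than a plain space,
    and for control or format characters."""
    found = []
    for position, char in enumerate(value):
        unusual_space = char.isspace() and char != " "
        invisible = unicodedata.category(char) in ("Cc", "Cf")
        if unusual_space or invisible:
            found.append((position, unicodedata.name(char, "UNNAMED")))
    return found


def normalise_spaces(value):
    """Replace no-break spaces with plain spaces and collapse runs of whitespace."""
    return " ".join(value.replace("\u00a0", " ").split())


print(plain == hidden)                    # the two strings differ
print(odd_characters(hidden))             # where and what the hidden character is
print(normalise_spaces(hidden) == plain)  # equal once normalised
```

Reporting the character before replacing it is deliberate: a data provider usually wants to know where the problem came from, not just have it silently removed.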

You might also be thinking that data quality is all about data correctness. Is Western Hill really at those coordinates? Is the specimen ID correct? Is the barely legible collector name on the specimen label correctly interpreted? But it's possible to have entirely correct digital data that can't be processed by an application, or moved between applications, because the data suffer from one or more of the problems listed above.

I think my money is safe

The problems I look for are all easily found and fixed. However, as mentioned in a previous iPhylo post, the quality of the many institutional datasets that I've sample-audited ranges from mostly OK to pretty awful. I've also audited more than 100 datasets (many with multiple data tables) for Pensoft Publishers, and the occurrence records among them were never error-free. Some of those errors had vanished when the records had been uploaded to GBIF, because GBIF simply deleted the offending data items during processing (GBIF, bless 'em, also publish the original data items).

Neither institutions nor aggregators seem to treat occurrence records with the same regard for detail that you find in real scientific data, the kind that appear in tables in scientific journal articles. A comparison with enterprise data is even more discouraging. I'm not aware of any large museum or herbarium with a Curator of Data on the payroll, probably because no institution's income depends on the quality of the institution's data, and because collection records don't get audited the way company records do, for tax, insurance and good-governance purposes.

So there might be a winner this year, but I doubt it. Maybe next year. ALA has a year-long data quality project underway, and GBIF Executive Secretary Joe Miller (in litt.) says that GBIF is now paying closer attention to data quality. The 2021 Darwin Core Million prize could be yours...

Friday, June 21, 2019

Messages from Melbourne: Towards linking all the things

I'm doing some work with Nicole Kearney (@nicolekearney) at the Melbourne Museum on the general theme of "linking all the things". It's the end of the first full week we've had, so here's a quick update of what we've been up to.

Brainstorming

The things we want to do are being captured as a project on GitHub. This is where we come up with ideas, comment on them, then try to figure out which ones can be done. So far there are three things we've made a serious start on.

Unpaywall

Unpaywall is a project by Impactstory. It is sort of a Sci-Hub without the legal issues (for the record, I think Alexandra Elbakyan's work on Sci-Hub is nothing short of heroic). Unpaywall scans open access archives for legal, freely available versions of articles and makes them easy to find. If you have Firefox or Chrome you can get a plugin that lights up if the paywall article you're looking at has a free version somewhere else.
Nicole has long wanted the BHL to provide data to Unpaywall, because BHL has open access versions of many papers relevant to taxonomy and biodiversity more broadly defined. After a bit of digging we figured out that Unpaywall didn't have access to BHL's data, so we've set about fixing that. We've got the data harvested, but we're still waiting for Unpaywall to process that data. So, for now, we're still waiting for the little green light to appear on pages such as this one: https://doi.org/10.1080/00222932208632640.

Adding taxonomic literature to Atlas of Living Australia

Part of "linking all the things" is making the taxonomic literature a first class citizen of biodiversity databases. It is frankly embarrassing to see how much better the scientific literature is handled by projects such as Wikipedia than scientific databases such as GBIF and the ALA. We've decided to try and do something about this by showing how easily the literature could be embedded into the existing ALA web site. Nicole crafted a mockup of the ALA names tab, and I wrote some code to make it "live". For example, if you click on this link you will see a list of publications for Pauropsalta herveyensis Owen & Moulds, 2016. Note that we have DOIs and links to BHL wherever possible (and we use Unpaywall's API to flag whether an article with a DOI is freely available). We want this literature (the primary evidence for what we know about a species) to be visible and accessible. The demo is powered by my Ozymandias project, but we hope to work out a mechanism for delivering the mapping between taxa and literature to ALA (and, indeed, anyone else) as a dataset.
Because Ozymandias only has data for animals, we've had to exclude plants from this demo. I'm frantically trying to figure out how to work with data in Australia's plant name databases to resolve this. I'm discovering that never mind having more than one name for the same species, taxonomists also delight in having many different ways of representing taxonomic information in their databases. So, plants will be a challenge.
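
For readers curious about the Unpaywall check mentioned above, here is a minimal sketch of how such a lookup might be done. It is not the code behind the demo. It assumes Unpaywall's v2 REST API, which takes a DOI in the URL path plus an email parameter and returns JSON that includes an is_oa flag; the email address and helper names are placeholders.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

UNPAYWALL = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the lookup URL for one DOI (the slash in the DOI is left unescaped)."""
    return f"{UNPAYWALL}{quote(doi, safe='/')}?email={quote(email)}"


def is_freely_available(record):
    """True if the Unpaywall record says an open access copy exists."""
    return bool(record.get("is_oa"))


def fetch_record(doi, email):
    """Fetch the Unpaywall record for a DOI (network call, not run here)."""
    with urlopen(unpaywall_url(doi, email)) as response:
        return json.load(response)


# The DOI is the one quoted in the Unpaywall section; the email is a placeholder.
print(unpaywall_url("10.1080/00222932208632640", "you@example.org"))
```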

Mapping taxonomists to ORCID and Wikidata

One reason for adding literature to taxonomic databases is to make the work of taxonomists more visible. One way to do this is to move beyond using only "dumb strings" as people's names and to link taxonomists to their ORCIDs and to entries in Wikidata (this is something I touched on in Ozymandias, and that David Shorthouse is doing on an epic scale in Bloodhound). We're playing with the idea of being able to generate a list of active taxonomists in Australia, linked to their identifiers and publications, solely based on querying Wikidata. The first step is to try and automate the initial mapping between taxonomists and Wikidata as much as possible; we've only just started looking at this.
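
One small piece of that mapping can be sketched as a Wikidata query: given an ORCID, find the matching Wikidata item. This is an illustration under stated assumptions, not the approach being developed in the project. It relies on Wikidata property P496 (ORCID iD) and the public SPARQL endpoint; the ORCID shown is a dummy value and the User-Agent string is hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

WIKIDATA_SPARQL = "https://query.wikidata.org/sparql"


def orcid_query(orcid):
    """SPARQL for the Wikidata item(s) whose ORCID iD (P496) equals the given value."""
    return (
        "SELECT ?person ?personLabel WHERE {\n"
        f'  ?person wdt:P496 "{orcid}" .\n'
        '  SERVICE wikibase:label { bd:serviceParam wikibase:language "en" . }\n'
        "}"
    )


def items_from_response(payload):
    """Pull the item URIs out of a SPARQL JSON results document."""
    return [binding["person"]["value"] for binding in payload["results"]["bindings"]]


def lookup(orcid, user_agent="example-taxonomist-mapper/0.1 (hypothetical)"):
    """Run the query against Wikidata (network call, not run here)."""
    url = WIKIDATA_SPARQL + "?" + urlencode({"query": orcid_query(orcid), "format": "json"})
    with urlopen(Request(url, headers={"User-Agent": user_agent})) as response:
        return items_from_response(json.load(response))


print(orcid_query("0000-0000-0000-0000"))  # dummy ORCID for illustration
```

Going the other way, from a name string to a Wikidata item, is the hard part, because that is where the "dumb strings" have to be disambiguated.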

Summary

It is early days, and we're still identifying things we could work on. As always, there are so many things which could be done; we're hoping we can make progress on at least some of these in the next few weeks.

Friday, August 10, 2018

Ozymandias: a biodiversity knowledge graph of Australian taxa and taxonomic publications

In the spirit of release early and release often, here is the first workable version of a biodiversity knowledge graph that I've been working on for Australian animals (for some background on knowledge graphs see Towards a biodiversity knowledge graph now in RIO). The core of this knowledge graph is a classification of animals from the Atlas of Living Australia (ALA) combined with data on taxonomic names and publications from the Australian Faunal Directory (AFD). This has been enhanced by adding lots of digital identifiers (such as DOIs) to the publications and, where possible, full text either as PDFs or as page scans from the Biodiversity Heritage Library (BHL) (provided via BioStor). Identifiers enable us to further grow the knowledge graph, for example by adding "cites" and "cited by" links between publications (data from CrossRef), and displaying figures from the Biodiversity Literature Repository (BLR).

The demo is here: https://ozymandias-demo.herokuapp.com/ If you’re looking for starting points, you could try:

Assassin spiders (images from Plazi and citation data from CrossRef) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/64908f75-456b-4da8-a82b-c569b4806c22

Memoirs of Museum Victoria (dynamic query finds record in Wikidata and adds map) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/5c22a8d1-7456-4f8c-9384-1246ecbf15a6

G. R. Allen (we can see from the taxonomic tree of his top 20 taxa that he studies fish - who knew?) https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/%23creator/g-r-allen

Paper on mosquito taxonomy with lots of citations, including material in BHL/BioStor https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/578d1dec-5816-49ec-8916-3f957fd230f5

Paper on Australian flies with full text in BioStor https://ozymandias-demo.herokuapp.com/?uri=https://biodiversity.org.au/afd/publication/0ffe4f28-b8ac-4132-be34-19eb03fbf685

The focus for now is on taxa, publications, journals, and people. Occurrences and sequences are on the “to do” list. As always there’s lots of data cleaning and cross linking to do, but an obvious next step is to link people’s names to identifiers such as ORCID and Wikidata ids, so that we can trace the activities of taxonomists as they discover and describe Australian biodiversity (the choice of Australia is simply to keep things manageable, and because the amount of data and digitisation they’ve done is pretty extraordinary). I’m also working to a deadline as I'm trying to get this demo wrapped up in the next couple of weeks.

Technical details

TL;DR the knowledge graph is implemented as a triple store where the data has been represented using a small number of vocabularies (mostly schema.org with some terms borrowed from TAXREF-LD and the TDWG LSID vocabularies). All results displayed in the first two panels are the result of SPARQL queries, the content in the rightmost panel comes from calls to external APIs. Search is implemented using Elasticsearch. If you are feeling brave you can query the knowledge graph directly in SPARQL. I’m constantly tweaking things and adding data and identifiers, so things are likely to break. More details and documentation will be going up on the GitHub repository.

Monday, September 18, 2017

Guest post: Our taxonomy is not your taxonomy

The following is a guest post by Bob Mesibov.

Do you know the party game "Telephone", also known as "Chinese Whispers"? The first player whispers a message in the ear of the next player, who passes the message in the same way to a third player, and so on. When the last player has heard the whispered message, the starting and finishing versions of the message are spoken out loud. The two versions are rarely the same. Information is usually lost, added or modified as the message is passed from player to player, and the changes are often pretty funny.

I recently compared ca 100 000 beetle records as they appear in the Museums Victoria (NMV) database and in DarwinCore downloads from the Atlas of Living Australia (ALA) and the Global Biodiversity Information Facility (GBIF). NMV has its records aggregated by ALA, and ALA passes its records to GBIF. The "Telephone" effect in the NMV to ALA to GBIF comparison was large and not particularly funny.

Many of the data changes occur in beetle names. ALA checks the NMV-supplied names against a look-up table called the National Species List, which in this case derives from the Australian Faunal Directory (AFD). If no match is found, ALA generalises the record to the next higher supplied taxon, which it also checks against the AFD. ALA also replaces supplied names if they are synonyms of an accepted name in the AFD.

GBIF does the same in turn with the names it gets from ALA. I'm not 100% sure what GBIF uses as its beetle look-up table or tables, but in many other cases the GBIF Backbone Taxonomy mirrors the Catalogue of Life.
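
The matching behaviour described above (follow synonyms to an accepted name, and fall back to the next higher supplied taxon if nothing matches) can be sketched as follows. This is a toy model, not ALA's or GBIF's code. The synonym chain comes from the Schaufussia mona example discussed in this post, and the unmatched name "Aus bus" is a made-up placeholder.

```python
# Toy look-up tables standing in for an aggregator's checklist
ACCEPTED = {"Rytus howittii", "Staphylinidae"}
SYNONYMS = {
    "Schaufussia mona": "Tyrus howittii",
    "Tyrus howittii": "Rytus howittii",
}


def resolve(name, synonyms=SYNONYMS):
    """Follow synonym links until a name with no further synonym entry is reached."""
    seen = set()
    while name in synonyms and name not in seen:
        seen.add(name)  # guard against circular synonymy
        name = synonyms[name]
    return name


def match_record(supplied, accepted=ACCEPTED, synonyms=SYNONYMS):
    """supplied: (rank, name) pairs from most to least specific.

    Returns the first (rank, accepted name) that matches, or None.
    """
    for rank, name in supplied:
        resolved = resolve(name, synonyms)
        if resolved in accepted:
            return rank, resolved
    return None


# A supplied synonym is replaced by the accepted name at the same rank
print(match_record([("species", "Schaufussia mona"), ("family", "Staphylinidae")]))

# An unmatched species is generalised to the next higher supplied taxon
print(match_record([("species", "Aus bus"), ("genus", "Aus"), ("family", "Staphylinidae")]))
```

Running the same procedure twice against two different look-up tables, as happens when records pass from one aggregator to the next, is enough to give the same specimen three different names.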

To give you some idea of the magnitude of the changes, of ca 85000 NMV records supplied with a genus+species combination, about one in five finished up in GBIF with a different combination. The "taxonRank" changes are summarised in the overview below, and note that replacement ALA and GBIF taxon names at the same rank are often different:

Of the species that escaped generalisation to a higher taxon, there are 42 names with genus triples: three different genus names for the same taxon in NMV, ALA and GBIF.

Just one example: a paratype of the staphylinid Schaufussia mona Wilson, 1926 is held in NMV. The record is listed under Rytus howittii (King, 1866) in the ALA Darwin Core download, because AFD lists Schaufussia mona as a junior subjective synonym of Tyrus howitti King, 1866, and Tyrus howittii in AFD is in turn listed as a synonym of Rytus howittii (King, 1866). The record appears in GBIF under Tyraphus howitti (King, 1865), with Rytus howittii (King, 1866) listed as a synonym. In AFD, Rytus howittii is in the tribe Tyrini, while Tyraphus howitti is a different species in the tribe Pselaphini.

ALA gives "typeStatus" as "paratype" for this record, but the specimen is not a paratype of Rytus howittii. In the GBIF download, the "typeStatus" field is blank for all records. I understand this may change in future. If it does, I hope the specimen doesn't become a paratype of Tyraphus howitti through copying from ALA.

There are lots of "Telephone" changes in non-taxonomic fields as well, including some geographical howlers. ALA says that a Kakadu National Park record is from Zambia and another Northern Territory record is from Mozambique, because ALA trusts the incorrect longitude provided by NMV more than it does the NMV-supplied locality text. GBIF blanks this locality text field, leaving the GBIF user with two African records for Australian specimens and no internal contradictions.

ALA trusts latitude/longitude to the extent of changing the "stateProvince" field for localities near Australian State borders, if a low-precision latitude/longitude places the occurrence a short distance away in an adjoining State.

Manglings are particularly numerous in the "recordedBy" field, where name strings are reformatted, not always successfully. Complex NMV strings suffer worst, e.g. "C Oke; Charles John Gabriel" in NMV becomes "Oke, C.|null" in ALA, and "Ms Deb Malseed - Winda-Mara Aboriginal Corporation WMAC; Ms Simone Sailor - Winda-Mara Aboriginal Corporation WMAC" is reformatted in ALA as "null|null|null|null".

Most of the "Telephone" effect in the NMV-ALA-GBIF comparison appears in the NMV-ALA stage. I contacted ALA by email and posted some of the issues on the ALA GitHub site; I haven't had a response and the issues are still open. I also contacted Tim Robertson at GBIF, who tells me that GBIF is working on the ALA-GBIF stage.

Can you get data as originally supplied by NMV to ALA, through ALA? Well, that's easy enough record-by-record on the ALA website, but not so easy (or not possible) for a multi-record download. Same with GBIF, but in this case the "original" data are the ALA versions.