iPhylo: synonymy

Roderic D. M. Page

Wednesday, August 24, 2022

Can we use the citation graph to measure the quality of a taxonomic database?

More arm-waving notes on taxonomic databases. I've started to add data to ChecklistBank and this has got me thinking about the issue of data quality. When you add data to ChecklistBank you are asked to give a measure of confidence based on the Catalogue of Life Checklist Confidence system of one to five stars: ★ - ★★★★★. I'm sceptical about the notion of confidence or "trust" when it is reduced to a star system (see also Can you trust EOL?). I could literally pick any number of stars; there's no way to measure what number of stars is appropriate. This feeds into my biggest reservation about the Catalogue of Life: it's almost entirely authority based, not evidence based. That is, rather than being given evidence for why a particular taxon is valid, we are (mostly) just given a list of taxa and asked to accept those as gospel, based on assertions by one or more authorities. I'm not necessarily doubting the knowledge of those making these lists, it's just that I think we need to do better than the "these are the accepted taxa because I say so" implicit in the Catalogue of Life.

So, is there any way we could objectively measure the quality of a particular taxonomic checklist? Since I have a long-standing interest in linking the primary taxonomic literature to names in databases (since that's where the evidence is), I keep wondering whether measures based on that literature could be developed.

I recently revisited the fascinating (and quite old) literature on rates of synonymy:

Gaston Kevin J. and Mound Laurence A. 1993. Taxonomy, hypothesis testing and the biodiversity crisis. Proc. R. Soc. Lond. B. 251: 139–142 http://doi.org/10.1098/rspb.1993.0020

Andrew R. Solow, Laurence A. Mound, Kevin J. Gaston, Estimating the Rate of Synonymy, Systematic Biology, Volume 44, Issue 1, March 1995, Pages 93–96, https://doi.org/10.1093/sysbio/44.1.93

A key point these papers make is that the observed rate of synonymy is quite high (that is, many "new species" end up being merged with already known species), and that because it can take time to discover that a species is a synonym the actual rate may be even higher. In other words, in diagrams like the one reproduced below, the reason the proportion of synonyms declines the nearer we get to the present day (this paper came out in 1995) is not because we are creating fewer synonyms but because we've not yet had time to do the work to uncover the remaining synonyms.

Put another way, these papers are arguing that the real work of taxonomy is revision, not species discovery, especially since it's not uncommon for > 50% of species in a taxon to end up being synonymised. Indeed, if a taxonomic group has few synonyms then these authors would argue that's a sign of neglect. More revisionary work would likely uncover additional synonyms. So, what we need is a way to measure the amount of research on a taxonomic group. It occurs to me that we could use the citation graph as a way to tackle this. Let's imagine we have a set of taxa (say a family) and we have all the papers that described new species or undertook revisions (or both). The extensiveness of that work could be measured by the citation graph. For example, build the citation graph for those papers. How many original species descriptions are not cited? Those species have been potentially neglected. How many large-scale revisions have there been (as measured by the numbers of taxonomic papers those revisions cite)? There are some interesting approaches to quantifying this, such as using hubs and authorities.
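To make the idea concrete, here is a minimal sketch of those two measures, assuming we already have a list of (citing, cited) pairs for the papers on a group and know which papers are original descriptions. The function name, the paper identifiers, and the toy data are all hypothetical, and the hubs-and-authorities scores use a plain power iteration rather than any particular library. In this framing a revision that cites many descriptions should score as a strong hub, and a description cited by many revisions as a strong authority.

```python
from collections import defaultdict


def citation_metrics(citations, descriptions, iterations=50):
    """Return (uncited descriptions, hub scores, authority scores).

    citations: iterable of (citing_id, cited_id) pairs (hypothetical ids).
    descriptions: set of ids of papers that are original species descriptions.
    """
    out_links = defaultdict(set)  # paper -> papers it cites
    in_links = defaultdict(set)   # paper -> papers citing it
    papers = set(descriptions)
    for citing, cited in citations:
        out_links[citing].add(cited)
        in_links[cited].add(citing)
        papers.update((citing, cited))

    # Original descriptions that nothing in the set cites: candidates for neglect.
    uncited = sorted(p for p in descriptions if not in_links[p])

    # Hubs and authorities (HITS) by simple power iteration.
    hub = {p: 1.0 for p in papers}
    auth = {p: 1.0 for p in papers}
    for _ in range(iterations):
        auth = {p: sum(hub[q] for q in in_links[p]) for p in papers}
        norm = sum(v * v for v in auth.values()) ** 0.5 or 1.0
        auth = {p: v / norm for p, v in auth.items()}
        hub = {p: sum(auth[q] for q in out_links[p]) for p in papers}
        norm = sum(v * v for v in hub.values()) ** 0.5 or 1.0
        hub = {p: v / norm for p, v in hub.items()}
    return uncited, hub, auth


# Toy data: two revisions and a short note citing four original descriptions.
toy_citations = [
    ("rev1", "desc1"), ("rev1", "desc2"), ("rev1", "desc3"),
    ("rev2", "desc1"), ("note", "desc2"),
]
toy_descriptions = {"desc1", "desc2", "desc3", "desc4"}
uncited, hub, auth = citation_metrics(toy_citations, toy_descriptions)
print("Uncited descriptions:", uncited)
print("Strongest hub:", max(hub, key=hub.get))
```

Real citation data would be far messier than this (missing links, self-citation, non-taxonomic papers), so the scores are only a starting point for comparing groups.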

I'm aware that taxonomists have not had the happiest relationship with citations:

Pinto ÂP, Mejdalani G, Mounce R, Silveira LF, Marinoni L, Rafael JA. Are publications on zoological taxonomy under attack? R Soc Open Sci. 2021 Feb 10;8(2):201617. doi: 10.1098/rsos.201617. PMID: 33972859; PMCID: PMC8074659.

Still, I think there is an intriguing possibility here. For this approach to work, we need to have linked taxonomic names to publications, and have citation data for those publications. This is happening on various platforms. Wikidata, for example, is becoming a repository of the taxonomic literature, some of it with citation links.

Page RDM. 2022. Wikidata and the bibliography of life. PeerJ 10:e13712 https://doi.org/10.7717/peerj.13712

Time for some experiments.

Friday, March 15, 2013

BioNames ideas - automatically finding synonyms from the literature

One of the biggest pains (and self-inflicted wounds) in taxonomy is synonymy, the existence of multiple names for the same taxon. A common cause of synonymy is moving species to different genera in order to have their name reflect their classification. The consequence of this is that any attempt to search the literature for basic biological data runs into the problem that observations published at different times by different researchers (e.g., taxonomists, ecologists, parasitologists) may use different names for the same taxon.

Existing taxonomic databases often have lists of synonyms, but these are incomplete, and typically don't provide any evidence why two names are synonyms.

Reading literature extracted from the Biodiversity Heritage Library I'm struck by how often I come across papers such as taxonomic revisions, museum catalogues, and checklists, that list two names as synonyms. Wouldn't it be great if we could mine these to automatically build lists of synonyms?

One quick and dirty way to do this is look for sets of names that have the same species name but different generic names, e.g.

Atlantoxerus getulus
Sciurus getulus
Xerus getulus

If such names appear on the same page (i.e., in close proximity) there's a reasonable chance they are synonyms. So, one of the features I'm building in BioNames is an index of names like this. Hence, if we are displaying a page for the name Atlantoxerus getulus that page could also display Sciurus getulus and Xerus getulus as possible synonyms.
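A minimal sketch of such an index, assuming the names found on each page have already been extracted and are clean binomials. The function name, the page identifiers, and the input structure are hypothetical, not how BioNames actually stores things.

```python
import re
from collections import defaultdict

# Matches a simple "Genus species" binomial with no authorship or subgenus.
BINOMIAL = re.compile(r"^([A-Z][a-z]+) ([a-z]+)$")


def candidate_synonyms(page_names):
    """Map each name to names on the same page with the same species
    name but a different genus.

    page_names: dict of page id -> iterable of name strings on that page.
    """
    candidates = defaultdict(set)
    for names in page_names.values():
        # Group this page's names by species name.
        by_epithet = defaultdict(set)
        for name in names:
            match = BINOMIAL.match(name.strip())
            if match:
                by_epithet[match.group(2)].add((match.group(1), name.strip()))
        # Pair up names that share a species name but differ in genus.
        for group in by_epithet.values():
            for genus_a, name_a in group:
                for genus_b, name_b in group:
                    if genus_a != genus_b:
                        candidates[name_a].add(name_b)
    return candidates


# Hypothetical page ids; the first three names are the example above.
pages = {
    "page1": ["Atlantoxerus getulus", "Sciurus getulus", "Xerus getulus",
              "Sciurus vulgaris"],
    "page2": ["Xerus getulus"],
}
index = candidate_synonyms(pages)
print(sorted(index["Atlantoxerus getulus"]))
```

Names that never share a page with a same-epithet name in another genus simply don't appear in the index, so a lookup for display should treat a missing key as "no candidates".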

There's a lot more that could be done with this sort of approach. For example, this approach only works if the species name remains unchanged. To improve it we'd need to do things like handle changes to the ending of a species name to agree with the gender of the genus, and cases where the taxa are demoted to subspecies (or promoted to species).
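One crude way to handle both cases is to match names on a normalised key: take the last epithet of the name (so a subspecies matches the species it was demoted from) and strip a handful of common Latin gender endings. This is only a sketch under those assumptions; the list of endings is deliberately short and will both miss real matches and create false ones, and the names below are nomenclatural placeholders rather than real taxa.

```python
# A deliberately small, assumed list of gender endings (-us/-a/-um, -is/-e).
ENDINGS = ("us", "um", "is", "a", "e")


def epithet_key(epithet):
    """Strip a common Latin gender ending, keeping at least three letters."""
    for ending in sorted(ENDINGS, key=len, reverse=True):
        if epithet.endswith(ending) and len(epithet) - len(ending) >= 3:
            return epithet[: -len(ending)]
    return epithet


def match_key(name):
    """Key for matching a binomial or trinomial: its stemmed final epithet."""
    parts = name.split()
    if len(parts) < 2:
        return None  # a bare genus name has no epithet to match on
    return epithet_key(parts[-1])


# Placeholder names: same epithet under different genders and ranks.
for name in ["Aus albus", "Bus alba", "Cus album", "Bus niger albus"]:
    print(name, "->", match_key(name))
```

Grouping names by `match_key` instead of the raw species name would let the page index pick up combinations whose ending changed with the genus, at the cost of more false positives to check.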

If we were cleverer still we'd attempt to parse synonymy lists to extract even more synonyms (for an example see Huber and Klump; PDF available here):

Huber, R., & Klump, J. (2009). Charting taxonomic knowledge through ontologies and ranking algorithms. Computers & Geosciences, 35(4), 862–868. doi:10.1016/j.cageo.2008.02.016

Then there's the broader topic of looking at co-occurrence of taxonomic names in general. As I noted a while ago there are examples of pages in BHL that list taxonomically unrelated taxa that are ecologically closely associated (e.g., hosts and parasites). Hence we could imagine automatically building host-parasite databases by mining the literature. Initially we could simply display lists of names that co-occur frequently. Ideally we'd filter out "accidental" co-occurrences, such as indexes or tables of contents, but there seems to be a lot of potential in automating the extraction of basic information from the taxonomic literature.
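As a first pass, counting co-occurring names could look something like the sketch below. It assumes names have already been extracted per page, and it uses a simple cap on the number of names per page as a stand-in for detecting indexes and tables of contents. The function name, the threshold of 20, and the names in the toy data are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(page_names, max_names_per_page=20):
    """Count how many pages each pair of names shares.

    page_names: dict of page id -> iterable of name strings on that page.
    Pages with more than max_names_per_page distinct names are skipped,
    on the assumption that they are indexes or tables of contents.
    """
    counts = Counter()
    for names in page_names.values():
        unique = sorted(set(names))
        if len(unique) > max_names_per_page:
            continue  # probably an "accidental" co-occurrence page
        counts.update(combinations(unique, 2))
    return counts


# Toy data with made-up names: two text pages and one index-like page.
pages = {
    "p1": ["Hostus primus", "Parasitus primus"],
    "p2": ["Parasitus primus", "Hostus primus", "Hostus primus"],
    "index": ["Hostus primus", "Parasitus primus"]
             + [f"Indexus sp{i}" for i in range(50)],
}
for pair, n in cooccurrence_counts(pages).most_common(5):
    print(pair, n)
```

A fixed cap is blunt; a page from a dense checklist would be thrown away along with the indexes, so the threshold would need tuning against real BHL pages.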