iPhylo: cluster maps

Roderic D. M. Page

Showing posts with label cluster maps. Show all posts

Wednesday, August 14, 2013

Cluster maps, papaya plots, and the trouble with GBIF taxonomy

Continuing the theme of the failings of the GBIF classification I've been playing further with cluster maps to visualise the problem (see this earlier post for an introduction).

Browsing through bats in GBIF I keep finding the same species appearing more than once, albeit in different genera. As discussed in the gibbon example, GBIF merges several competing classifications for mammals, and these often don't agree on the "accepted name" for a species. In the absence of a decent database of taxonomic synonyms, GBIF ends up duplicating species, and each duplicate is often associated with different occurence data. If you are trying to get the distribution for a species this can be a disaster.

To get a sense of the scale of the problem I put together a simple tool to create cluster maps. The code is on github) and there is a live service at http://iphylo.org/~rpage/cluster-map/. The service takes a simple tab-delimited file that lists sets and their members, computes the overlap between the sets, calls Graphviz to layout a graph in SVG, then draws in the members of each cluster (phew).

The input file looks something like this:


Molossops	aequatorianus
Chaerephon	aloysiisabaudiae
Tadarida	aloysiisabaudiae
Chaerephon	ansorgei
Tadarida	ansorgei
Molossus	ater
Mormopterus	petrophilus
Sauromys	petrophilus

What can we do with this tool? Well, I created a quick list of all the species of bat in the family Molossidae according to GBIF. The sets are the bat genera, the members are the species (you can see the file here). I then ran this through the cluster map, and got something like this (this is only part of the cluster map):

Bats

(now can you see why I call these "papaya plots"?). Note that there are species names (i.e., specific epithets) in common to more than one genus. Some of these may be perfectly OK (it's not unusual for the same epithet to be used in different species, e.g. "major", etc.). But in many cases these bat species turn out to be the same species, just in different genera in different classifications. For example, GBIF has both Cynomops greenhalli and Molossops greenhalli. These are the same thing. Species in the genus Mormopterus may also occur in other genera. In some cases the issue is competing classifications, sometimes it is conflict over whether a species is a species or merely a subspecies, and some generic conflicts are because some genera are relegated to subgeneric status in some classifications. In short, it's an unholy mess.

Does this matter? Well, consider Mormopterus petrophilus and Sauromys petrophilus, which GBIF both regard as valid species (they're the same thing). Here are the distributions for the two different names in GBIF:

Depending on which name you use you'll get a very different picture of the distribution of this bat.

The next step is to figure out how to fix this. Is there a way we can automate fixing the GBIF classification so that it is not riddled with spurious duplicates like these?

Wednesday, June 13, 2012

Visualising differences between classifications using cluster maps

As part of a project to build a tool to navigate through taxonomic names and classifications I've become interested in quick ways to compare classifications. For example, EOL has multiple classifications for the same taxon, and I'd like to quickly discover what the similarities and differences are.

One promising approach is to use "cluster maps", a technique described by Fluit et al. (see Aduna Cluster Map for an implementation):

Fluit, C., Sabou, M., & Harmelen, F. (2006). Visualizing the Semantic Web. (V. Geroimenko & C. Chen, Eds.) (pp. 45–58). Springer Science + Business Media. doi:10.1007/1-84628-290-X_3 (see also http://www.cs.vu.nl/~frankh/abstracts/VSW05.html)

Cluster map details

Cluster maps can be thought of as fancy Venn Diagrams, in that they can be used to depict the overlap between sets of objects. The diagram is a graph with two kinds of nodes. One represents categories (in the example above, file formats and search terms), the other represents sets of objects that occur in one or more categories (in the example above, these are files that match the search terms "rdf" and "aperture").

I've cobbled together a crude version of cluster maps. For a given taxon (e.g., a genus) I list all the immediate sub-taxa (e.g., species) in each classification in EOL, and then find the sets of sub-taxa that are shared across the classification sources (e.g., ITIS, NCBI, etc.) and those that are unique to one source. I then create the cluster map using Graphviz. Inspired by the hexagonal packing used by Aduna, I've done something similar to display the taxa in each set. Adding these to the output of Graphviz required a little fussing with. First I get Graphviz to output the graph in SVG, then I load the SVG into a program that locates each node in the graph and inserts SVG for the packed circles (given that SVG is XML this is fairly straightforward).

As an example, consider the genus Demansia (http://eol.org/pages/34967/overview). EOL reports four classifications for this genus. Below is a cluster map for this genus:

34967

This diagram show that, for example, the Catalogue of Life (CoL) and Reptile databases share 4 names, these databases share three other names with ITIS. All databases have names unique to themselves, one database (NCBI) is completely disconnected from the other three databases.

One important caveat here is that I'm mapping the scientific names as returned by EOL, and in many cases these contain the taxonomic authority. This is a major headache, prompting this outburst:

Argh, taxonomic databases that don't provide canonical names (i.e., with authority names REMOVED) will be first against the wall @eol
— Roderic Page (@rdmpage) June 13, 2012

If we clean the names by removing the taxonomic authority the clusters overlap rather more:
Demansia

Now we see that only ITIS and the Reptile Database have unique names. This is one reason why I get stroppy when taxonomists start saying databases shouldn't have to supply cleaned "canonical" names. If the names have authorities then I have to clean them, because in many cases the authorities (while useful to know) are inconsistent across databases. For example:

Demansia olivacea GRAY 1842 versus Demansia olivacea (Gray, 1842)
Demansia torquata GÜNTHER 1862 versus Demansia torquata (Günther, 1862)

Taxonomic authorities are frequently misspelt, and people seem confused about when to use parentheses or not. Databases should spare the user some pain and provide clean names (and authority strings separately where they have them).

The visualisation is still incomplete (I need to make it interactive), but it shows promise. The names that are unique to one database are usually worth investigating. In some cases they are names other databases regard as synonyms, in other cases they represent spelling variations. The goal of this visualisation is to highlight the names that the user might want to investigate further.