iPhylo: Wikispecies

Roderic D. M. Page

Thursday, February 03, 2022

Deduplicating bibliographic data

There are several instances where I have a collection of references that I want to deduplicate and merge. For example, in "Zootaxa has no impact factor" I describe a dataset of the literature cited by articles in the journal Zootaxa. This data is available on Figshare (https://doi.org/10.6084/m9.figshare.c.5054372.v4), as is the equivalent dataset for Phytotaxa (https://doi.org/10.6084/m9.figshare.c.5525901.v1). Given that the same articles may be cited many times, these datasets have lots of duplicates. Similarly, articles in Wikispecies often have extensive lists of references cited, and the same reference may appear on multiple pages (for an initial attempt to extract these references see https://doi.org/10.5281/zenodo.5801661 and https://github.com/rdmpage/wikispecies-parser).

There are several reasons I want to merge these references. If I want to build a citation graph for Zootaxa or Phytotaxa I need to merge references that are the same so that I can accurately count citations. I am also interested in harvesting the metadata to help find those articles in the Biodiversity Heritage Library (BHL), and the literature cited section of scientific articles is a potential goldmine of bibliographic metadata, as is Wikispecies.

After various experiments and false starts I've created a repository https://github.com/rdmpage/bib-dedup to host a series of PHP scripts to deduplicate bibliographic data. I've settled on using CSL-JSON as the format for bibliographic data. Because deduplication relies on comparing pairs of references, the standard format for most of the scripts is a JSON array containing a pair of CSL-JSON objects to compare. Below are the steps the code takes.

Generating pairs to compare

The first step is to take a list of references and generate the pairs that will be compared. I started with this approach as I wanted to explore machine learning and wanted a simple format for training data, such as an array of two CSL-JSON objects and an integer flag representing whether the two references were the same or different.

There are various ways to generate CSL-JSON for a reference. I use a tool I wrote (see Citation parsing tool released) that has a simple API: you submit one or more references and it returns each one as structured data in CSL-JSON.

Attempting to do all possible pairwise comparisons rapidly gets impractical as the number of references increases, so we need some way to restrict the number of comparisons we make. One approach I've explored is the "sorted neighbourhood method", where we sort the references (for example by their title) then move a sliding window down the list of references, comparing all references within that window. This greatly reduces the number of pairwise comparisons. So the first step is to sort the references, then run a sliding window over them and output all the pairs in each window (ignoring pairwise comparisons already made in a previous window). Other methods of "blocking" could also be used, such as only including references in a particular year, or a particular journal.
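The sliding window step can be sketched as follows. This is a minimal Python illustration rather than the PHP in the repository; the function name, the default window size, and sorting on a lower-cased `title` field are all assumptions made for the example.

```python
from typing import Dict, Iterator, List, Tuple


def sorted_neighbourhood_pairs(
    refs: List[Dict], window: int = 3
) -> Iterator[Tuple[Dict, Dict]]:
    """Yield every pair of references that falls inside a sliding window.

    `refs` is a list of CSL-JSON-like dicts; `window` is the number of
    neighbouring references compared with each other (hypothetical default).
    """
    # Sort so that likely duplicates end up next to each other.
    ordered = sorted(refs, key=lambda r: r.get("title", "").lower())
    for i, left in enumerate(ordered):
        # Pairing each reference with the next (window - 1) references
        # covers every window position without emitting any pair twice.
        for right in ordered[i + 1 : i + window]:
            yield left, right
```

With four references and a window of three this yields five pairs instead of the six needed for all pairwise comparisons; the saving grows quickly as the list gets longer.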

So, the output of this step is a set of JSON arrays, each with a pair of references in CSL-JSON format. Each array is stored on a single line in the same file in line-delimited JSON (JSONL).
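As a rough illustration of that file format, the sketch below writes each pair as one JSON array per line and reads the pairs back. The function names are hypothetical and this is not the repository's code.

```python
import json


def pairs_to_jsonl(pairs):
    """Serialise each (reference, reference) pair as a JSON array on its own line."""
    lines = [json.dumps([a, b], ensure_ascii=False) for a, b in pairs]
    return "\n".join(lines) + "\n"


def read_pairs(jsonl_text):
    """Parse line-delimited JSON back into a list of (reference, reference) tuples."""
    return [
        tuple(json.loads(line))
        for line in jsonl_text.splitlines()
        if line.strip()  # skip blank lines
    ]
```

Keeping one pair per line means the file can be processed a line at a time without loading every pair into memory.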

Comparing pairs

The next step is to compare each pair of references and decide whether they are a match or not. Initially I explored a machine learning approach used in the following paper:

Wilson DR. 2011. Beyond probabilistic record linkage: Using neural networks and complex features to improve genealogical record linkage. In: The 2011 International Joint Conference on Neural Networks. 9–14. DOI: 10.1109/IJCNN.2011.6033192

Initial experiments using https://github.com/jtet/Perceptron were promising and I want to play with this further, but I decided to skip this for now and just use simple string comparison. So for each CSL-JSON object I generate a citation string in the same format using CiteProc, then compute the Levenshtein distance between the two strings. By normalising this distance by the length of the two strings being compared I can use an arbitrary threshold to decide if the references are the same or not.
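A minimal sketch of the string comparison, assuming the two citation strings have already been formatted. Dividing by the length of the longer string and the threshold of 0.2 are assumptions for illustration; the post does not state the exact normalisation or threshold used.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def normalised_distance(a: str, b: str) -> float:
    """Distance scaled to 0..1 by the length of the longer string (an assumption)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def is_match(a: str, b: str, threshold: float = 0.2) -> bool:
    """Treat two citation strings as the same reference below a hypothetical threshold."""
    return normalised_distance(a, b) <= threshold
```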

Clustering

For this step we read the JSONL file produced above and record whether the two references are a match or not. Assuming each reference has a unique identifier (it need only be unique within the file), we can use those identifiers to record the cluster each reference belongs to. I do this using a disjoint-set data structure. Start with a graph where each node represents a reference and has a pointer to a parent node. Initially each reference is its own parent. A simple implementation is an array indexed by reference identifier, where the value of each cell is the identifier of that node's parent.

As we discover matching pairs we update the parents of the nodes to reflect this, such that once all the comparisons are done we have one or more clusters corresponding to the references that we think are the same. Another way to think of this is that we are getting the components of a graph where each node is a reference and each pair of references that match is connected by an edge.
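The parent-pointer idea can be sketched in a few lines of Python. This is an illustrative disjoint-set with path compression, using a dict keyed by reference identifier in place of an array; the class and method names are my own for the example.

```python
class DisjointSet:
    """Minimal disjoint-set (union-find) keyed by reference identifier."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        """Return the root of the cluster containing x, adding x if unseen."""
        self.parent.setdefault(x, x)  # a new reference is its own parent
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path straight at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        """Record that references a and b matched."""
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a

    def clusters(self):
        """Return the members of each cluster as a list of lists."""
        groups = {}
        for x in list(self.parent):
            groups.setdefault(self.find(x), []).append(x)
        return list(groups.values())
```

Calling `union` once for each matching pair and then `clusters` gives the connected components of the match graph.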

In the code I'm using I write this graph in Trivial Graph Format (TGF), which can be visualised using a tool such as yEd.
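TGF is simple enough to write by hand: one line per node (an identifier followed by a label), a line containing only `#`, then one line per edge. A sketch, with a hypothetical function name and made-up labels:

```python
def to_tgf(nodes, edges):
    """Serialise a graph as Trivial Graph Format.

    `nodes` maps node identifier to label; `edges` is a list of
    (source, target) identifier pairs.
    """
    lines = [f"{node_id} {label}" for node_id, label in nodes.items()]
    lines.append("#")  # separator between the node list and the edge list
    lines.extend(f"{source} {target}" for source, target in edges)
    return "\n".join(lines) + "\n"
```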

Merging

Now that we have a graph representing the sets of references that we think are the same we need to merge them. This is where things get interesting as the references are similar (by definition) but may differ in some details. The paper below describes a simple Bayesian approach for merging records:

Councill IG, Li H, Zhuang Z, Debnath S, Bolelli L, Lee WC, Sivasubramaniam A, Giles CL. 2006. Learning Metadata from the Evidence in an On-line Citation Matching Scheme. In: Proceedings of the 6th ACM/IEEE-CS Joint Conference on Digital Libraries. JCDL ’06. New York, NY, USA: ACM, 276–285. DOI: 10.1145/1141753.1141817.

So the next step is to read the graph with the clusters, generate the sets of bibliographic references that correspond to each cluster, then use the method described in Councill et al. to produce a single bibliographic record for that cluster. These records could then be used to, say, locate the corresponding article in BHL, or populate Wikidata with missing references.
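To give a feel for what merging involves, here is a deliberately simpler stand-in: a per-field majority vote across the records in a cluster. This is not the Bayesian method of Councill et al. and not the repository's code; the function name and the choice to ignore the `id` field are assumptions for illustration.

```python
import json
from collections import Counter


def merge_cluster(records, ignore=("id",)):
    """Build one record by taking the most common value of each field.

    A plain majority vote, used here as a stand-in for a proper
    evidence-weighted merge. Ties go to the value seen first.
    """
    # Collect field names in the order they are first seen.
    fields = []
    for record in records:
        for field in record:
            if field not in fields and field not in ignore:
                fields.append(field)

    merged = {}
    for field in fields:
        # Serialise values so that lists and dicts (e.g. authors) can be counted.
        votes = Counter(
            json.dumps(record[field], sort_keys=True)
            for record in records
            if field in record
        )
        winner, _count = votes.most_common(1)[0]
        merged[field] = json.loads(winner)
    return merged
```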

Obviously there is always the potential for errors, such as trying to merge references that are not the same. As a quick and dirty check I flag as dubious any cluster where the page numbers vary among members of the cluster. More sophisticated checks are possible, especially if I go down the ML route (i.e., I would have evidence for the probability that the same reference can disagree on some aspects of metadata).
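The page-number check might look something like this. It assumes each record carries the CSL-JSON `page` field as a string; normalising en dashes and spaces before comparing is my own addition, and records without page numbers are simply ignored.

```python
def normalise_pages(page: str) -> str:
    """Make '1–10', '1 - 10' and '1-10' compare equal."""
    return page.replace("\u2013", "-").replace(" ", "").strip()


def is_dubious(cluster) -> bool:
    """Flag a cluster whose members disagree on page numbers."""
    pages = {
        normalise_pages(record["page"])
        for record in cluster
        if record.get("page")  # ignore records with no page numbers
    }
    return len(pages) > 1
```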

Summary

At this stage the code is working well enough for me to play with and explore some example datasets. The focus is on structured bibliographic metadata, but I may simplify things and have a version that handles simple string matching, for example to cluster together different abbreviations of the same journal name.

Wednesday, May 31, 2017

Wikidata, WikiCite, and the "bibliography of life"

Last week I was at WikiCite 2017, a fascinating three-day event in Vienna. Wikicite is "a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects", and is attracting increasing attention from academics, librarians, publishers, data geeks, and others. You can get a sense of the project by following @WikiCite on Twitter.

I went to the meeting in part to learn more about WikiCite, and also to spend some time hacking on Wikispecies. I'd been to only one Wiki event before (a Wiki Science Conference) so I'm still finding my way around this community. I spent the first two days listening to talks while coding away (more on this below), but on Wednesday put my own coding aside to join a bunch of people hacking the CrossRef event API in a great session led by Joe Wass. I've put some notes and code in GitHub. The event API tracks what people do with DOIs, including adding them to Wikipedia pages when citing a source in support of an assertion. A significant fraction of DOI resolutions are from Wikipedia pages, which is one reason why CrossRef was present at WikiCite.

Wikidata

In practice WikiCite's goal of building a bibliographic database to serve all Wikimedia projects means that articles, books, and other bibliographic items that are cited by Wikimedia projects will each be added to Wikidata. For example, the ZooKeys paper "Diversity of manota williston (Diptera, mycetophilidae) in ulu temburong national park, brunei" is item Q21188431 in Wikidata. Wikidata stores the key bibliographic metadata, including identifiers such as the DOI (which many at the WikiCite meeting pronounced "doy" much to my initial confusion).

This article was published in ZooKeys, which itself has a Wikidata item (Q219980), so in Wikidata the article is linked to the journal (i.e., "ZooKeys" isn't just a dumb string but a link to another Wikidata item). The article is also linked to two articles that it cites, and each of these is also a Wikidata item.

These citation links are one reason people are interested in WikiCite - it could be the basis of a free and open citation graph (for the benefits of such a graph see this piece by David Shotton doi:10.1038/502295a, a participant at the meeting in Vienna). Already some cool tools are being built on top of citation data in Wikidata, such as Scholia by Finn Årup Nielsen, Daniel Mietchen and Egon Willighagen. Here, for example, is my academic profile based on information in Wikidata. It's woefully incomplete, but intriguing. For a more complete example view Egon Willighagen's profile.

To some extent the utility of tools like Scholia will depend on how complete Wikidata's coverage is of the academic literature, which in turn raises the inevitable question of scope. Does Wikicite want to include just the literature cited in the various Wikimedia projects, or does it want to expand to include the total sum of academic literature?

Wikispecies, Wikidata and the bibliography of life

Wikispecies is one of the Wikimedia projects, and the only one that is topic-specific (the others are typically global in scope but have content in different languages, or host different data types such as images, scanned books, or structured data). As I've sketched out in an earlier post (Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library) I think Wikicite and Wikidata are potentially very important to projects such as BHL and the "bibliography of life". Much of our knowledge about the world's biodiversity is contained in the academic literature, and much of this is poorly known with no central database where we can find it, and much of it is still not digitised. It is tempting to think that Wikidata might be a platform around which the biodiversity community could focus its efforts on assembling a global database of biodiversity literature. Already major taxonomic journals such as ZooKeys are being fed into Wikidata, so it has a significant corpus of biodiversity literature already.

One way to grow this corpus is to focus on Wikispecies. In a post before the Wikicite meeting (Notes for WikiCite 2017: Wikispecies reference parsing) I elaborated on this idea. There are two stumbling blocks, one specific to Wikispecies, one a more general Wikidata issue.

The first issue is that Wikispecies bibliographic data is relatively unstructured, which makes converting it into structured data something of a challenge. I spent much of Wikicite hacking some code to do this on Glitch (more on Glitch later); you can see the results here: https://acoustic-bandana.glitch.me. This web site takes a Wikispecies reference and tries to convert it into CSL-JSON. It is still very much a work in progress, but I've started building tools that use this web site as a service and process larger numbers of Wikispecies citations.

The second issue is how you get data into Wikidata, and this is something that's never been entirely clear to me. There are tools for adding an article using its DOI (sourcemd) but this isn't scalable, and doesn't handle the case of articles that don't have DOIs. This is still a "How do you Snapchat? You just Snapchat" moment. Wikidata desperately needs tools and a clear procedure whereby people like me with lots of bibliographic data can contribute.

Wikispecies

Another reason for my interest in Wikispecies (and other sources of bibliographic data such as the lists of cited literature being made available by CrossRef, see The Initiative for Open Citations) is that this data can be fed into BHL to locate more articles in that archive. Once these articles have been located they are stored in BioStor and BHL itself, but it makes sense to have them more accessible, and Wikidata looks to be an obvious candidate. Given that Wikispecies is essentially a crowd-sourced taxonomic database there is considerable overlap in content between Wikispecies and BHL. The Wikidata data model also allows for some of the things that taxonomists care about, such as linking dates of publication to evidence relative to those dates (in older publications determining the publication date often requires quite extensive research).

Summary

Leaving aside the specific issues about how to get bibliographic data into Wikidata, I guess the question to ask is whether it makes sense to be developing large databases of bibliographic data without either using Wikidata as the platform to hold that data, or at least linking to Wikidata. Projects such as Gene Wiki are migrating from Wikipedia to Wikidata (see "Wikidata as a semantic framework for the Gene Wiki initiative" doi:10.1093/database/baw015). Perhaps those of us interested in biodiversity literature could use projects like Gene Wiki as role models for how we could both contribute to and benefit from Wikidata and Wikicite.

I've barely scratched the surface of what was discussed at Wikicite; for more details see the program. It is a very different sort of meeting in that the participants come from pretty diverse backgrounds, which helps shake up your own assumptions about what matters and how things should be done. It's also great that it's a meeting at which people write code or otherwise hack stuff together, so things actually get done. I've come away with lots to think about, and renewed enthusiasm about the role Wikimedia is playing in structuring our knowledge about the world.

Friday, March 24, 2017

Notes for WikiCite 2017: Wikispecies reference parsing

In preparation for WikiCite 2017 I'm looking more closely at extracting bibliographic information from Wikispecies. The WikiCite project "is a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects". One reason for doing this is so that each factual statement in Wikidata can be linked to evidence for that statement. Practical efforts towards this goal include tools to add details of articles from CrossRef and PubMed straight into Wikidata, and tools to extract citations from Wikipedia (as these are likely to be sources of evidence for statements made in Wikipedia articles).

Wikispecies occupies a rather isolated spot in the Wikipedia landscape. Unlike other sites which are essentially comprehensive encyclopedias in different languages, Wikispecies focusses on one domain - taxonomy. In a sense, it's a prototype of Wikidata in that it provides basic facts (who described what species when, and what is the classification of those species) that in principle can be reused by any of the other wikis. However, in practice this doesn't seem to have happened much.

What Wikispecies has become, however, is a crowd-sourced database of the taxonomic literature. For someone like me who is desperately gathering up bibliographic data so that I can extract articles from the Biodiversity Heritage Library (BHL), this is a potential goldmine. But, there's a catch. Unlike, say, the English language Wikipedia which has a single widely-used template for describing a publication, Wikispecies has its own method of representing articles. It uses a somewhat confusing mix of templates for author names, and then uses barely standardised formatting rules to mark out parts of a publication (such as journal, volume, issue, etc.). Instead of a single template to describe a publication, in Wikispecies a publication may itself be described by a unique template. This has some advantages, in that the same reference can be transcluded into multiple articles (in other words, you enter the bibliographic details once). But this leaves us with many individual templates with multiple, idiosyncratic styles of representing bibliographic data. Some have tried to get the Wikispecies community to adopt the same template as Wikipedia (see e.g., this discussion) but this proposal has met with a lot of resistance. From my perspective as a potential consumer of data, the current situation in Wikispecies is frustrating, but the reality is that the people who create the content get to decide how they structure that content. And understandably, they are less than impressed by requests that might help others (such as data miners) at the expense of making their own work more difficult.

In summary, if I want to make use of Wikispecies I am going to need to develop a set of parsers that can make a reasonable fist of parsing all the myriad citation formats used in Wikispecies (my first attempts are on GitHub). I'm looking at parsing the references and converting them to a more standard format in JSON (I've made some notes on various bibliographic formats in JSON such as BibJSON and CSL-JSON). One outcome of this work will be, I hope, more articles discovered in BHL (and hence added to BioStor), and more links to identifiers, which could be fed back into Wikispecies. I also want to explore linking the authors of these papers to identifiers, as already sketched out in The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor.

Wednesday, January 11, 2017

The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor

I've added an experimental feature to BioStor that uses data from Wikidata and Wikispecies to augment what information BioStor displays on authors. This is a crude first step towards the goal of representing all the data in BioStor as a "knowledge graph" where articles, journals, and authors are all treated as entities, all have identifiers, and we can explore relationships between those entities (e.g., citation, co-authorship, etc.). At the moment this is true of articles, which have BioStor URLs (and in many cases DOIs), and for most journals, which are identified by their ISSN. Using identifiers helps reduce ambiguity, especially if there are multiple ways to represent the same thing (e.g., all the alternative ways to write a journal name can be circumvented by using the journal's ISSN).

However, BioStor doesn't have a way to identify authors beyond simply searching for a name. As a first step to tackling this problem I've added a little widget that displays information about an author based on the name you are searching for. For example, searching for George Albert Boulenger will give you a list of publications where the author name is "George Albert Boulenger", as well as a picture of the author and some identifiers (from sources such as VIAF, ISNI, IPNI, and Wikidata):

For now this widget is independent of the data in BioStor. I don't link an article to its author(s) using identifiers for those authors, nor have I tackled the problem of clustering all the variations in people's names together into one set of names that share the same identifier (see Equivalent author names) nor do I attempt to match names to identifiers (see Reconciling author names using Open Refine and VIAF) other than by an exact text search (for details see below). At this stage I just want to get a sense of what identifiers exist for an author, and what I can learn from those identifiers. I also want to explore the potential of Wikispecies as a source of data on people and publications, and how this relates to Wikidata (for earlier thoughts on using Wikipedia for the same goal see Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library).

Wikispecies

I confess I've never really "got" Wikispecies (e.g., Wikispecies is not a database); it seems to exist in isolation from Wikipedia, which is arguably more informative about many species. But there are a couple of things Wikispecies does very well. Firstly, it is building a rich, crowd-sourced bibliography of papers on the taxonomy of many different species. Readers of iPhylo will recall how many times I've expressed frustration at the nearly evidence-free nature of many online taxonomic databases that simply have lists of names unconnected to the primary literature. Many Wikispecies pages have long lists of papers, making it a potential goldmine. Recently there has been a lot of interest in extracting bibliographic data from Wikipedia (see WikiCite). Wikispecies could also be harvested, although a major obstacle any such project faces is the lack of a consistent format for references in Wikispecies.

The other nice thing about Wikispecies is that it has articles on taxonomic authorities, and these often list publications by those authors, and also list external identifiers for those authors, such as the VIAF and ISNI identifiers used in the library world, IPNI and ZooBank identifiers used in taxonomic databases, and ORCID which is becoming the de-facto identifier for academic researchers. This information also ends up in Wikidata.

Using Wikidata to glue things together

Wikidata is an interesting project that, like Wikispecies, I've been in two minds about (see Wikidata, Wikipedia, and #wikisci). However, I've started to make more use of it recently. Inspired by the Wikidata:SPARQL query service/2016 SPARQL Workshop I decided to explore the SPARQL query interface to Wikidata. I was struck by one of the example queries involving Wikispecies, and so after a little bit of messing about came up with a query that takes the name of an author and returns some identifiers from Wikidata, as well as an image of that person if one is available. I restrict the results to people that have an article about them in Wikispecies, because I want to start exploring using those articles to make assertions about authorship. Here is a query to search for "George Albert Boulenger":

SELECT *
WHERE
{
  ?item rdfs:label "George Albert Boulenger"@en .
  ?article schema:about ?item .
  ?article schema:isPartOf <https://species.wikimedia.org/> .
  OPTIONAL {
    ?item wdt:P213 ?isni .
  }
  OPTIONAL {
    ?item wdt:P214 ?viaf .
  }
  OPTIONAL {
    ?item wdt:P18 ?image .
  }
  OPTIONAL {
    ?item wdt:P496 ?orcid .
  }
  OPTIONAL {
    ?item wdt:P586 ?ipni .
  }
  OPTIONAL {
    ?item wdt:P2006 ?zoobank .
  }
}

This query simply asks whether Wikidata has an item on this person, whether that item is linked to Wikispecies, what identifiers Wikidata has, and whether there is an image of the person. You can see the query "live" here:

I've added some code to BioStor to do this query on the fly, and display the results. The widget shows a result for Boulenger, and one for noted carcinologist Jocelyn Crane, who currently lacks identifiers. A nice surprise was Bernard Landry: note the ORCID 0000-0002-6005-1067. Interestingly, Bernard Landry's ORCID profile doesn't list any publications, whereas we can see lists of these in BioStor and Wikispecies.
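For anyone wanting to try something similar, here is a rough Python sketch of running such a query against the public Wikidata query service. It is not BioStor's code. The property identifiers are the ones in the query above; the function names, the User-Agent string, and the `<https://species.wikimedia.org/>` value used to restrict results to Wikispecies are assumptions.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "https://query.wikidata.org/sparql"

# Variable name -> Wikidata property, as used in the query in the post.
IDENTIFIERS = {
    "isni": "P213",
    "viaf": "P214",
    "image": "P18",
    "orcid": "P496",
    "ipni": "P586",
    "zoobank": "P2006",
}


def build_query(name: str) -> str:
    """Build the SPARQL query for one author name."""
    escaped = name.replace("\\", "\\\\").replace('"', '\\"')
    optionals = "\n".join(
        f"  OPTIONAL {{ ?item wdt:{pid} ?{var} . }}"
        for var, pid in IDENTIFIERS.items()
    )
    return (
        "SELECT * WHERE {\n"
        f'  ?item rdfs:label "{escaped}"@en .\n'
        "  ?article schema:about ?item .\n"
        # Assumed site URI restricting results to people with a Wikispecies page.
        "  ?article schema:isPartOf <https://species.wikimedia.org/> .\n"
        f"{optionals}\n"
        "}"
    )


def build_url(name: str) -> str:
    """URL that asks the query service for JSON results."""
    params = {"query": build_query(name), "format": "json"}
    return ENDPOINT + "?" + urllib.parse.urlencode(params)


def fetch_identifiers(name: str):
    """Run the query (needs network access) and flatten each result row."""
    request = urllib.request.Request(
        build_url(name),
        headers={"User-Agent": "author-widget-sketch/0.1 (example)"},  # hypothetical
    )
    with urllib.request.urlopen(request) as response:
        data = json.load(response)
    return [
        {key: cell["value"] for key, cell in row.items()}
        for row in data["results"]["bindings"]
    ]
```

Calling `fetch_identifiers("George Albert Boulenger")` would return one dict per matching item, containing whichever of the optional identifiers Wikidata holds.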

Where next?

There are several obstacles to mapping the names of authors to identifiers. One is simply the lack of identifiers. This seems to be rapidly becoming less of a problem with the efforts of the library community around VIAF, the rise of ORCID for living researchers, and the creation of Wikidata items for every taxonomist in Wikispecies. The next challenge is clustering the different ways of writing the same person's name into sets that represent the same person. As discussed above, there are tools for this. Furthermore, with Wikipedia and Wikispecies we have sources of lists of publications linked to a person and their identifiers, which should simplify the task considerably. What is nice about this is that it relies on a crowd-sourcing effort which is already well-established, namely those people who, in adding articles to Wikispecies and Wikipedia, are creating a curated database of publications linked to authors. In many cases those publications are linked to BHL (the source that BioStor extracts its articles from), so many of the links between publications and people are essentially lying there, just waiting for some skilful harvesting.

Tuesday, November 29, 2011

Towards the bibliography of life

David King et al.'s paper "Towards the bibliography of life" http://dx.doi.org/10.3897/zookeys.150.2167 has just appeared in a special issue of ZooKeys. I've written a number of posts on this topic, so I've a few comments.

King et al. survey some of the issues, but don't really tackle the big issue of how we're going to build this. If we define the "bibliography of life" somewhat narrowly as the list of all papers that have published a scientific name (or a new combination, such as moving a species from one genus to another), then this is a large, but measurable undertaking. According to ION's metrics page, these are the numbers involved (for animals and protozoa):

Total New Names	1,510,402
Total New Genera / Subgenera	215,242
Total New Species / Subspecies	1,192,366
Total Other New Names	102,794
Total New Combinations	241,296
Total New Synonyms	260,544

Even in the worst-case scenario of one name per publication (clearly not the case) this is a big, but not insurmountable, task.

Publications not taxa
Part of the challenge is figuring out the best way to tackle the problem. In the past, most efforts at building taxonomic bibliographies have focussed on specific taxa, which is natural — the bibliographies are being built by taxonomists and they specialise in particular groups. But I'd argue that this is not the most efficient way to tackle the problem. Because the taxonomic literature is so widely dispersed, after the obvious "low hanging fruit" have been collected, considerable effort must be spent tracking down the harder to find citations. There are few economies of scale in this approach. In contrast, if we focus on publications at, say, the level of journal, then we can build a bibliography much more quickly. Once we've found the source, say, for one article, often we could use that information to harvest many articles from the same source (e.g., write scripts to harvest from a digital repository such as a DSpace server, or a digital library such as Gallica). But if we are focussed on a particular taxon, we will ignore the other articles in that journal ("what do I care about fish, I like turtles").

Put another way, if we imagine a taxa × publication matrix, then we can either go after rows (i.e., a bibliography for a specific taxonomic group), or columns (a list of articles in a specific journal). The article-based approach will be faster, albeit at the cost of finding articles that aren't necessarily relevant to taxonomy. This is why I'm spending what feels like far too much time harvesting article lists and uploading these to Mendeley. It is also one reason BHL has been so successful. They've simply gone after scanning the literature wholesale, rather than focussing on particular taxonomic groups.

Crowd sourcing and Wikispecies
Crowd sourcing often strikes me as a euphemism for "we can't be bothered doing the tedious stuff, let's get the public to do it for us (plus it will look like we're engaged with the public)." I'm not denying it can work, but I suspect it's not a magic bullet. Perhaps the best crowd sourcing is not to try and bring the crowd to a project, but go where the crowd has already gathered. In this case, an obvious crowd is the Wikispecies community. Working with the ION database for my Sherborn presentation, it's clear that the quality of bibliographic data in ION is variable, and rather poor for older references. In contrast, the reference lists on Wikispecies can be very good (e.g., the bibliography for George Boulenger). There are some issues with Wikispecies, notably the lack of a decent bibliographic template (unlike Wikipedia) so parsing references can be *cough* interesting, but there is scope here to use it to improve other databases. Citation matching can be a challenge, but in this case we have citations indexed by taxonomic name (in both ION and Wikispecies), which greatly reduces the scope of possible matches.

Summary
I think building the "bibliography of life" needs a combination of aggressive data gathering, and avoiding building additional tools unless absolutely needed. There are great tools and communities that can already be leveraged (e.g., Mendeley, Wikispecies), let's make use of them.

Tuesday, September 13, 2011

More BHL app ideas

Following on from my previous post on BHL apps and a Twitter discussion in which I appealed for a "sexier" interface for BHL (to which @elyw replied that is what BHL Australia were trying to do), here are some further thoughts on improving BHL's web interface.
Build a new interface
A fun project would be to create a BHL website clone using just the BHL API. This would give you the freedom to explore interface ideas without having to persuade BHL to change its site. In a sense, the app would provide the persuasion.

Third party annotations
It would be nice if the BHL web site made use of third party annotations. For example, BHL itself is extracting some of the best images and putting them on Flickr. How about if you go to the page for an item in BHL and you see a summary of the images from that item in Flickr? At a glance you can see whether the item has some interesting content. For example, if you go to http://biodiversitylibrary.org/item/109846 you see this:

which gives you no idea that it contains images like this:

Tables of contents
Another source of annotations is my own BioStor project, which finds articles in scanned volumes in BHL. If you are looking at an item in BHL it would be nice to see a list of articles that have been found in that item, perhaps displayed in a drop down menu as a table of contents. This would help provide a way to navigate through the volume.

Who links to BHL?
When I suggested third party annotations on Twitter @stho002 chimed in asking about Wikispecies, Species-ID, ZooBank, etc. These resources are different, in that they aren't repurposing BHL content but are linking to it. It would be great if a BHL page for an item could display reverse links (i.e., the pages in those external databases that link to that BHL item).

Implementing reverse links (essentially citation linking) can be tricky, but two ways to do it might be:

Use BHL web server logs to find and extract referrals from those projects
Perhaps more elegantly, encourage external databases to link to BHL content using an OpenURL which includes the URL of the originating page. OpenURL can be messy, but especially in Mediawiki-based projects such as Wikispecies and Species-ID it would be straightforward to make a template that generated the correct syntax. In this way BHL could harvest the inbound links and display them on the item page.

Tuesday, September 01, 2009

Google, Wikipedia, and EOL

One assumption I've been making so far is that when people search for information on an organism using its scientific name, Wikipedia will dominate the search results (see my earlier post for an example of this assumption). I've decided to quantify this by doing a little experiment. I grabbed the Mammal Species of the World taxonomy and extracted the 5416 species names. I then used Google's AJAX search API to look up each name in Google. For each search I took the top 10 hits and recorded for each hit the site URL and the rank in the search results (i.e., 1-10). Below is a table of how many mammal species had a hit in the top 10 Google results (showing just the top 20 most frequent sites).

Site	Hits
en.wikipedia.org	5266
species.wikimedia.org	2934
animaldiversity.ummz.umich.edu	2890
commons.wikimedia.org	1515
www.itis.gov	1418
ctd.mdibl.org	1288
www.bioone.org	1101
www.uniprot.org	1086
encyclopedia.farlex.com	1007
www.thewebsiteofeverything.com	955
www.answers.com	864
vertebrates.si.edu	854
www.interaktv.com	842
www.arkive.org	775
linkinghub.elsevier.com	727
www.springerlink.com	656
www.eol.org	618
www.reference.com	576
doi.wiley.com	572
noctilio.com	566

Wikipedia is the clear winner, with 5266 (97%) of mammals having a Wikipedia page in the top ten Google results. Next comes Wikispecies, then Animal Diversity Web, Wikimedia Commons, ITIS, the Comparative Toxicogenomics Database, BioOne, UniProt (derived from the NCBI taxonomy), and so on. Note that the Encyclopedia of Life comes in 17th.
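
The tally behind a table like this can be sketched in a few lines of Python. This is not the script used for the experiment; it assumes the search results have already been collected into a dictionary mapping each species name to its ranked list of result URLs, and the function name and sample URLs are invented for illustration.

```python
from collections import Counter
from urllib.parse import urlparse


def species_per_site(results):
    """Count, for each site, how many species have it among their top 10 hits.

    `results` maps a species name to its ranked list of result URLs.
    """
    counts = Counter()
    for urls in results.values():
        # Use a set so a site that appears twice for one species counts once.
        counts.update({urlparse(url).hostname for url in urls[:10]})
    return counts


# Invented results for two species.
sample = {
    "Panthera leo": [
        "http://en.wikipedia.org/wiki/Lion",
        "http://en.wikipedia.org/wiki/Panthera",
        "http://www.itis.gov/example",
    ],
    "Mus musculus": [
        "http://en.wikipedia.org/wiki/House_mouse",
        "http://www.eol.org/pages/example",
    ],
}
for site, hits in species_per_site(sample).most_common():
    print(site, hits, sep="\t")
```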

Things get more interesting if we look at the ranking of search results. The graph below plots the cumulative rank of search results for some of the web sites listed above.

Wikipedia dominates things. For 48% of all mammal species Wikipedia is the first result returned by Google. For just under three quarters of all mammal species, the Wikipedia page is either the first or second hit in Google. The next best sites are Animal Diversity Web and Wikispecies, which get a small share of first place for some species (19% and 7% respectively). Note that EOL pages manage to make it into the top 10 for only 11% of all mammal species.
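
For readers who want to reproduce this kind of cumulative rank curve, here is a hedged Python sketch. It assumes the same hypothetical data structure as before (species name mapped to a ranked list of result URLs) and returns, for each rank k, the fraction of species for which a given site appears at rank k or better. The function name and the sample data are assumptions, not the original analysis code.

```python
from urllib.parse import urlparse


def cumulative_rank(results, site, depth=10):
    """Fraction of species for which `site` appears at rank <= k, for k = 1..depth."""
    best_ranks = []
    for urls in results.values():
        ranks = [
            rank
            for rank, url in enumerate(urls[:depth], start=1)
            if urlparse(url).hostname == site
        ]
        if ranks:
            best_ranks.append(min(ranks))  # best (lowest) rank for this species
    total = len(results)
    return [sum(r <= k for r in best_ranks) / total for k in range(1, depth + 1)]


# Invented results: the site is first for two species, second for one, absent for one.
sample = {
    "species A": ["http://en.wikipedia.org/wiki/A", "http://www.itis.gov/a"],
    "species B": ["http://en.wikipedia.org/wiki/B"],
    "species C": ["http://www.itis.gov/c", "http://en.wikipedia.org/wiki/C"],
    "species D": ["http://www.itis.gov/d"],
}
print(cumulative_rank(sample, "en.wikipedia.org", depth=3))
```

Plotting the returned list against rank for each site gives one curve per site.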

What does this all mean? Well, it seems clear that if people are using Google to find information about an organism, then Wikipedia is more likely than anything else to be the first result they see. It is also interesting that for all the energy (and funds) being expended on biodiversity databases (doi:10.1126/science.324_1632), ITIS is the only classical biodiversity database that routinely gets found in these searches (albeit in only a quarter of the searches).

I know I tend to go on a bit about EOL, but if I was running (or funding) EOL, I'd be worried. EOL barely figures in these search results, and is being taken to the cleaners by a volunteer effort (Wikipedia). Furthermore, it seems difficult to envisage what EOL can do to improve things. Sure it can link to (and make use of) content in sites such as Animal Diversity Web, ITIS (and maybe even, gasp, Wikipedia), but that just adds "link love" to those sites. Ironically, perhaps the single thing that would improve EOL's ranking would be if Wikipedia spread some of its link love over EOL, by linking all its taxon pages to the corresponding EOL page.

But there are bigger issues at stake. Site popularity on the web tends to follow a power law, where a very few web sites grab the vast majority of eyeballs. In an old blog post Clay Shirky wrote:

Now, thanks to a series of breakthroughs in network theory by researchers ... we know that power law distributions tend to arise in social systems where many people express their preferences among many options. We also know that as the number of options rise, the curve becomes more extreme. This is a counter-intuitive finding - most of us would expect a rising number of choices to flatten the curve, but in fact, increasing the size of the system increases the gap between the #1 spot and the median spot.

So, creating new and improved biodiversity web sites is likely to have the effect of only increasing the gap between Wikipedia and the rest.
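
To make the "gap between the #1 spot and the median spot" concrete, here is a toy Python calculation. It assumes an idealised power law in which a site's share of attention is proportional to rank raised to the power of minus an exponent (exponent 1 is the classic Zipf case). That functional form and the function name are assumptions for illustration, not something measured from the search data above.

```python
def top_to_median_ratio(n_sites, exponent=1.0):
    """Ratio of the share at rank 1 to the share at the median rank.

    Assumes share is proportional to rank ** -exponent, so the ratio
    simplifies to median_rank ** exponent.
    """
    median_rank = (n_sites + 1) // 2
    return median_rank ** exponent


for n in (10, 100, 1000):
    print(n, top_to_median_ratio(n))
```

Under this assumption, going from 10 to 1000 sites widens the gap between the top site and the median site from 5-fold to 500-fold.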

Lastly, as I've mentioned before regarding Wikipedia and citations of taxonomic work, the graph above suggests to me that for anybody wanting to make basic biodiversity information available on the web, and attract readers to basic taxonomic literature, there really is only one game in town.

Tuesday, August 18, 2009

To wiki or not to wiki?

What follows are some random thoughts as I try and sort out what things I want to focus on in the coming days/weeks. If you don't want to see some wallowing and general procrastination, look away now.

I see four main strands in what I've been up to in the last year or so:

services
mashups
wikis
phyloinformatics

Let's take these in turn.

Services
Not glamorous, but necessary. This is basically bioGUID (see also hdl:10101/npre.2009.3079.1). bioGUID provides OpenURL services for resolving articles (it has nearly 84,000 articles in its cache), looking up journal names, resolving LSIDs, and RSS feeds.

Mashups
iSpecies is my now aging tool for mashing up data from diverse sources, such as Wikipedia, NCBI, GBIF, Yahoo, and Google Scholar. I tweak it every so often (mainly to deal with Google Scholar forever mucking around with their HTML). The big limitation of iSpecies is that it doesn't make its results reusable (i.e., you can't write a script to call iSpecies and return data). However, it's still the place I go to quickly find out about a taxon.

The other mashups I've been playing with focus on taking standardised RSS feeds (provided by bioGUID, see above) and mashing them up, sometimes with a nice front end (e.g., my e-Biosphere 09 challenge entry).

Wikis
I've invested a huge amount of effort in learning how wikis (especially Mediawiki and its semantic extensions) work, documented in earlier posts. I created a wiki of taxonomic names as a sandbox to explore some of these ideas.

I've come to the conclusion that for basic taxonomic and biological information, the only sensible strategy for our community is to use (and contribute to) Wikipedia. I'm struggling to see any justification for continuing with a proliferation of taxonomic databases. After e-Biosphere 09 the game's up: people have started to notice that we've an excess of databases (see Claire Thomas in Science, "Biodiversity Databases Spread, Prompting Unification Call", doi:10.1126/science.324_1632).

Phyloinformatics
In truth I've not been doing much on this, apart from releasing tvwidget (code available from Google Code), and playing with a mapping of TreeBASE studies to bibliographic identifiers (available as a featured download from here). I've played with tvwidget in Mediawiki, and it seems to work quite well.

Where now?
So, where now? Here are some thoughts:

I will continue to hack bioGUID (it's now consuming RSS feeds from journals, as well as Zotero). Everything I do pretty much depends on the services bioGUID provides.

iSpecies really needs a big overhaul to serve data in a form that can be built upon. But this requires decisions on what that format should be, so this isn't likely to happen soon. But I think the future of mashup work is to use RDF and triple stores (providing that some degree of editing is possible). I think a tool linking together different data sources (along the lines of my ill-fated Elsevier Challenge entry) has enormous potential.

I'm exploring Wikipedia and Wikispecies. I'm tempted to do a quantitative analysis of Wikipedia's classification. I think there needs to be some serious analysis of Wikipedia if people are going to use it as a major taxonomic resource.

If I focus on Wikipedia (i.e., using an existing wiki rather than try to create my own), then that leaves me wondering what all the playing with iTaxon was for. Well, actually I think the original goal of this blog (way back in December 2005) is ideally suited to a wiki. Pretty much all the elements are in place to dump a copy of TreeBASE into a wiki and open up the editing of links to literature and taxonomic names. I think this is going to handily beat my previous efforts (TbMap, doi:10.1186/1471-2105-8-158), especially as errors will be easy to fix.

So, food for thought. Now, I just need to focus a little and get down to actually doing the work.

Tuesday, August 11, 2009

Wikispecies RSS feed

Following on from my previous post about Wikispecies (which generated some discussion on TAXACOM) I've played some more with Wikispecies.

As a first step I've added a Wikispecies RSS feed to my list of RSS feeds. This feed takes the original Wikispecies RSS feed for new pages (generated by the page Special:NewPages) and tries to extract some details before reformatting it as an Atom feed. Specifically, I extract GUIDs such as IPNI and Index Fungorum identifiers, bibliographic references (which I will later parse to try and extract identifiers such as DOIs), and latitude and longitude if the Wikispecies page has type locality information. Having the latter means that the RSS feed can be displayed as a map (Google Maps can take an RSS feed with geotagged items and display it on a map for you).
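
To give a flavour of what such a conversion involves, here is a minimal Python sketch that builds a geotagged Atom entry from the text of a page. It is not the feed's actual code. It assumes identifiers appear in the page text as LSIDs and that coordinates appear as a pair of decimal degrees. Both patterns, the use of an Atom category element to carry the identifier, the function name, and the sample page (including the made-up IPNI number) are assumptions for illustration. A complete Atom entry would also need id and updated elements.

```python
import re
import xml.etree.ElementTree as ET

ATOM = "http://www.w3.org/2005/Atom"
GEORSS = "http://www.georss.org/georss"
ET.register_namespace("", ATOM)
ET.register_namespace("georss", GEORSS)

# LSIDs have the form urn:lsid:authority:namespace:object.
LSID = re.compile(r"urn:lsid:[\w.-]+:[\w.-]+:[\w.-]+", re.IGNORECASE)
# Hypothetical: assumes coordinates are written as decimal degrees, "lat, lon".
LATLON = re.compile(r"(-?\d{1,2}\.\d+)\s*,\s*(-?\d{1,3}\.\d+)")


def make_entry(title, url, page_text):
    """Build an Atom entry with any LSIDs and a GeoRSS point found in the text."""
    entry = ET.Element(f"{{{ATOM}}}entry")
    ET.SubElement(entry, f"{{{ATOM}}}title").text = title
    ET.SubElement(entry, f"{{{ATOM}}}link", href=url)
    for lsid in LSID.findall(page_text):
        ET.SubElement(entry, f"{{{ATOM}}}category", term=lsid)
    point = LATLON.search(page_text)
    if point:
        # GeoRSS Simple encodes a point as "latitude longitude".
        geo = ET.SubElement(entry, f"{{{GEORSS}}}point")
        geo.text = f"{point.group(1)} {point.group(2)}"
    return entry


# Invented page text with a made-up identifier and locality.
text = "Type locality: -41.29, 174.78 Name: urn:lsid:ipni.org:names:12345-1"
entry = make_entry("Aus bus", "http://species.wikimedia.org/wiki/Aus_bus", text)
print(ET.tostring(entry, encoding="unicode"))
```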

The map below is live, so it will show any geotagged items in the current Wikispecies feed.

Friday, August 07, 2009

Wikispecies is not a database

This post was prompted by Stephen Thorpe's post on TAXACOM about Wikispecies in which he wrote (in a thread discussing Roger Hyam's recent blog post) that

[i]f it [Wikispecies] isn't a true database, then it is BETTER than a database. It can do anything a database can do, and more, if you know how it works properly.

I beg to differ. Wikispecies runs on a database (the Mediawiki software uses a database to store the wiki), and Mediawiki can be thought of as a database of semi-structured text, but it lacks a lot of the functionality database users would expect. For example, in Wikispecies there's no way to perform basic queries such as how many descendants a given taxon has, what names a particular author has published, or from which geographic region most new names are being described. Much of this information is in Wikispecies, it just isn't in a form that we can readily use.
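
To show what I mean by "basic queries", here is a small Python sketch using SQLite. The table layout, the function names, and the handful of rows are a toy example invented for illustration, not a proposal for how Wikispecies should be structured. Once names, parents, and authors are stored as fields rather than free text, both questions become one-line queries.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema and rows, for illustration only.
conn.executescript("""
    CREATE TABLE taxon (
        id INTEGER PRIMARY KEY, name TEXT, parent_id INTEGER, author TEXT
    );
    INSERT INTO taxon VALUES
        (1, 'Felidae', NULL, 'Fischer de Waldheim'),
        (2, 'Panthera', 1, 'Oken'),
        (3, 'Panthera leo', 2, 'Linnaeus'),
        (4, 'Panthera tigris', 2, 'Linnaeus');
""")


def count_descendants(conn, name):
    """Count all taxa below `name`, walking the tree with a recursive query."""
    row = conn.execute("""
        WITH RECURSIVE descendants(id) AS (
            SELECT id FROM taxon
            WHERE parent_id = (SELECT id FROM taxon WHERE name = ?)
            UNION ALL
            SELECT t.id FROM taxon t JOIN descendants d ON t.parent_id = d.id
        )
        SELECT COUNT(*) FROM descendants
    """, (name,)).fetchone()
    return row[0]


def names_by_author(conn, author):
    """List the names published by a given author."""
    rows = conn.execute(
        "SELECT name FROM taxon WHERE author = ? ORDER BY name", (author,)
    )
    return [r[0] for r in rows]


print(count_descendants(conn, "Felidae"))
print(names_by_author(conn, "Linnaeus"))
```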

These limitations are mostly due to the underlying software (Mediawiki), which fortunately can be extended to address these issues using Semantic Mediawiki. I've explored these ideas earlier. With some restructuring, Wikispecies could become a database, but it would require some serious work.

But this raises the real issue with Wikispecies, namely what is it for? Wikipedia is much more informative for many taxa, and the two wikis are very poorly linked (surely we'd want Wikipedia pages linked to the corresponding Wikispecies pages?). Given that Wikipedia is the basis for some core efforts in linked data (e.g., DBPedia), it seems a no-brainer that we would want our information stored in Wikipedia, rather than Wikispecies.

It seems to me that the split between Wikipedia and Wikispecies parallels that between "taxonomic concepts" and "taxonomic names". Wikipedia provides the former, in that it provides one (consensus) view of what a taxon is. Wikispecies would be ideally placed to be a nomenclatural database (and a great place to put all the synonyms that we've accumulated over time, but which would swamp Wikipedia). But Wikispecies also seems to want to provide a classification, which strikes me as unnecessary (and raises the issue of how this relates to the classification in Wikipedia).

I don't wish to denigrate the efforts of Wikispecies contributors (they are doing some neat things, such as harvesting new names from Zookeys), and by clever use of templates they avoid some of the serious problems with classification in Wikipedia, but it's not a taxonomic database, at least, not yet.

Friday, September 26, 2008

Half-baked ideas. I. Wiki for taxonomy

Next few weeks will be busy with term starting, kids visiting, and other commitments, so time to jot down some ideas. The first is to have a Wiki for taxonomic names. Bit like Wikispecies, but actually useful, by which I mean useful for working biologists. This would mean links to digital literature (DOIs, Handles, etc.), use of identifiers for names and taxa (such as NCBI taxids, LSIDs, etc.), and having it pre-populated with data. Imagine merging the NCBI taxonomy, Catalogue of Life, Index Fungorum, and IPNI, say, and having it automatically updated with sources such as WoRMS and uBio RSS. Why a Wiki? Well, partly to capture all the little textual annotations that are needed to flesh out the taxonomy, and partly to make it easy to correct the numerous mistakes that litter existing databases.

As an initial target, I'd aim for a comprehensively annotated NCBI taxonomy, as this is probably the most important taxonomic database that we have.