iPhylo: WikiCite

Roderic D. M. Page

Monday, December 20, 2021

GraphQL for WikiData (WikiCite)

I've released a very crude GraphQL endpoint for WikiData. More precisely, the endpoint is for a subset of the entities that are of interest to WikiCite, such as scholarly articles, people, and journals. There is a crude demo at https://wikicite-graphql.herokuapp.com. The endpoint itself is at https://wikicite-graphql.herokuapp.com/gql.php. There are various ways to interact with the endpoint; personally I like the Altair GraphQL Client by Samuel Imolorhe.

As I've mentioned earlier it's taken me a while to see the point of GraphQL. But it is clear it is gaining traction in the biodiversity world (see for example the GBIF Hosted Portals) so it's worth exploring. My take on GraphQL is that it is a way to create a self-describing API that someone developing a web site can use without having to bury themselves in the gory details of how data is internally modelled. For example, WikiData's query interface uses SPARQL, a powerful language that has a steep learning curve (in part because of the administrative overhead brought by RDF namespaces, etc.). In my previous SPARQL-based projects such as Ozymandias and ALEC I have either returned SPARQL results directly (Ozymandias) or formatted SPARQL results as schema.org DataFeeds (equivalent to RSS feeds) (ALEC). Both approaches work, but they are project-specific and if anyone else tried to build on these projects they might struggle to figure out what was going on. I certainly struggle, and I wrote them!

So it seems worthwhile to explore this approach a little further and see if I can develop a GraphQL interface that can be used to build the sort of rich apps that I want to see. The demo I've created uses SPARQL under the hood to provide responses to the GraphQL queries. So in this sense it's not replacing SPARQL, it's simply providing a (hopefully) simpler overlay on top of SPARQL so that we can retrieve the data we want without having to learn the intricacies of SPARQL, nor how Wikidata models publications and people.
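To make the "overlay" idea concrete, here is a minimal sketch of how a resolver might translate a list of friendly field names into a SPARQL query. The field names, the `FIELD_TO_PROPERTY` mapping, and `build_sparql` are hypothetical illustrations, not the code behind the endpoint. The Wikidata property identifiers themselves (P1476 title, P356 DOI, P1433 published in, P577 publication date) are real.

```python
import re

# Hypothetical mapping from API field names to Wikidata properties.
FIELD_TO_PROPERTY = {
    "title": "P1476",         # title
    "doi": "P356",            # DOI
    "container": "P1433",     # published in
    "datePublished": "P577",  # publication date
}


def build_sparql(qid, fields):
    """Build a SPARQL SELECT query for the requested fields of one item."""
    if not re.fullmatch(r"Q[1-9]\d*", qid):
        raise ValueError(f"Not a Wikidata item identifier: {qid!r}")
    unknown = [f for f in fields if f not in FIELD_TO_PROPERTY]
    if unknown:
        raise ValueError(f"Unknown field(s): {', '.join(unknown)}")

    variables = " ".join(f"?{f}" for f in fields)
    # Each field is OPTIONAL so that a missing value does not drop the row.
    patterns = "\n".join(
        f"  OPTIONAL {{ wd:{qid} wdt:{FIELD_TO_PROPERTY[f]} ?{f} . }}"
        for f in fields
    )
    return f"SELECT {variables} WHERE {{\n{patterns}\n}}"


if __name__ == "__main__":
    # The caller names the fields it wants and never sees a property number.
    print(build_sparql("Q21188431", ["title", "doi"]))
```

The person writing the query only needs to know that an article has a `title` and a `doi`; the mapping to properties and namespaces stays on the server.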

Thursday, July 22, 2021

Towards a WikiCite search engine

I've released a simple search engine for publications in Wikidata. Wikicite Search takes its name from the WikiCite project, which was an initiative to create a bibliographic database in Wikidata. Since bibliographic data is a core component of taxonomic research (arguably taxonomy is mostly tracing the fate of the "tags" we call taxonomic names) I've spent some time getting taxonomic literature into Wikidata. Since there are bots already adding articles by harvesting sources such as CrossRef and PubMed, I've focussed on literature that is harder to add, such as articles with non-CrossRef DOIs, or those without DOIs at all.

Once you have a big database, you are then faced with the challenge of finding things in that database. Wikidata supports generic search, but I wanted something more closely geared to bibliographic data. Hence Wikicite Search. Over the last few years I've made several attempts at a bibliographic search engine; for this project I've finally settled on some basic ideas:

The core data structure is CSL-JSON, a simple but rich JSON format for expressing bibliographic data.
The search engine is Elasticsearch. The documents I upload include the CSL-JSON for an article, but also a simple text representation of the article. This text representation may include multiple languages if, for example, the article has a title in more than one language. This means that if an article has both English and Chinese titles you can find it searching in either language.
The web interface is very simple: search for a term, get results. If the search term is a Wikidata identifier you get just the corresponding article, e.g. Q98715368.
There is a reconciliation API to help match articles to the database. Paste in one citation per line and you get back matches (if found) for each citation.
Where possible I display a link to a PDF of the article, which is typically stored in the Internet Archive or accessible via the Wayback Machine.
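As a rough illustration of the first three ideas, the sketch below wraps a CSL-JSON record in a document with an extra plain-text field, and checks whether a search term looks like a Wikidata identifier. The function names, the `text` field, and the `extra_titles` argument are assumptions made for this example, and the sample record is invented; the CSL-JSON keys (`author`, `issued`, `title`, `container-title`, `volume`, `page`) are standard.

```python
import re


def is_wikidata_id(term):
    """True if the search term looks like a Wikidata item identifier."""
    return re.fullmatch(r"Q[1-9]\d*", term.strip()) is not None


def build_search_document(csl, extra_titles=()):
    """Pair a CSL-JSON record with a plain-text rendering for full-text search.

    extra_titles holds titles in other languages so the article can be
    found by searching in any of them.
    """
    parts = []
    for author in csl.get("author", []):
        name = " ".join(
            p for p in (author.get("given"), author.get("family")) if p
        ) or author.get("literal", "")
        parts.append(name)

    date_parts = csl.get("issued", {}).get("date-parts", [])
    if date_parts and date_parts[0]:
        parts.append(str(date_parts[0][0]))  # year only

    parts.append(csl.get("title", ""))
    parts.extend(extra_titles)
    parts.append(csl.get("container-title", ""))
    parts.append(csl.get("volume", ""))
    parts.append(csl.get("page", ""))

    text = " ".join(str(p) for p in parts if p)
    return {"csl": csl, "text": text}


if __name__ == "__main__":
    # Invented record, for illustration only.
    record = {
        "type": "article-journal",
        "author": [{"given": "A.", "family": "Smith"}],
        "issued": {"date-parts": [[2010]]},
        "title": "A new species",
        "container-title": "Journal of Examples",
        "volume": "12",
        "page": "1-10",
    }
    print(build_search_document(record, extra_titles=["一新种"])["text"])
    print(is_wikidata_id("Q98715368"))
```

Keeping the CSL-JSON intact alongside the text field means the same document can serve both search and display.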

There are millions of publications in Wikidata, but currently fewer than half a million are in my search engine. My focus is narrowly on eukaryote taxonomy and related topics. I will be adding more articles as time permits. I also periodically reload existing articles to capture updates to the metadata made by the Wikidata community - being a wiki, the data in Wikidata is constantly evolving.

My goal is to have a simple search tool that focusses on matching citation strings. In other words, it is designed to find a reference you are looking for, rather than be a tool to search the taxonomic literature. If that sounds contradictory, consider that my tool will only find a paper about a taxon if it is explicitly named in the title. A more sophisticated search engine would support things like synonym resolution, etc.

The other reason I built this is to provide an API for accessing Wikidata items and displaying them in other formats. For example, an article in the WikiCite search engine can be retrieved in CSL-JSON format, or in RDF as JSON-LD.
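To give a flavour of what moving between these formats involves, here is a minimal sketch that maps a CSL-JSON record to a schema.org description in JSON-LD. The choice of schema.org terms and the function name `csl_to_jsonld` are my assumptions for illustration; the JSON-LD actually returned by the search engine may be structured differently. The DOI in the example is invented.

```python
import json


def csl_to_jsonld(csl, qid):
    """Map a CSL-JSON record to a simple schema.org JSON-LD document."""
    doc = {
        "@context": "https://schema.org",
        "@id": f"http://www.wikidata.org/entity/{qid}",
        "@type": "ScholarlyArticle",
    }
    if csl.get("title"):
        doc["name"] = csl["title"]

    authors = []
    for a in csl.get("author", []):
        person = {"@type": "Person"}
        if a.get("given"):
            person["givenName"] = a["given"]
        if a.get("family"):
            person["familyName"] = a["family"]
        if a.get("literal"):
            person["name"] = a["literal"]
        authors.append(person)
    if authors:
        doc["author"] = authors

    if csl.get("container-title"):
        doc["isPartOf"] = {"@type": "Periodical", "name": csl["container-title"]}
    if csl.get("DOI"):
        doc["sameAs"] = f"https://doi.org/{csl['DOI']}"
    return doc


if __name__ == "__main__":
    # Invented record, for illustration only.
    record = {
        "title": "A new species",
        "author": [{"given": "A.", "family": "Smith"}],
        "container-title": "Journal of Examples",
        "DOI": "10.1234/example",
    }
    print(json.dumps(csl_to_jsonld(record, "Q1"), indent=2))
```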

As always, it's very early days. But I don't think it's unreasonable to imagine that as Wikidata grows we could envisage having a search engine that includes the bulk of the taxonomic literature.

Tuesday, May 18, 2021

Preprint on Wikidata and the bibliography of life

Last week I submitted a manuscript entitled "Wikidata and the bibliography of life". I've been thinking about the "bibliography of life" (AKA a database of every taxonomic publication ever published) for a while, and this paper explores the idea that Wikidata is the place to create this database. The preprint version is on bioRxiv (doi:10.1101/2021.05.04.442638). Here's the abstract:

Biological taxonomy rests on a long tail of publications spanning nearly three centuries. Not only is this literature vital to resolving disputes about taxonomy and nomenclature, for many species it represents a key source - indeed sometimes the only source - of information about that species. Unlike other disciplines such as biomedicine, the taxonomic community lacks a centralised, curated literature database (the “bibliography of life”). This paper argues that Wikidata can be that database as it has flexible and sophisticated models of bibliographic information, and an active community of people and programs (“bots”) adding, editing, and curating that information. The paper also describes a tool to visualise and explore bibliography information in Wikidata and how it links to both taxa and taxonomists.

The manuscript summarises some work I've been doing to populate Wikidata with taxonomic publications (building on a huge amount of work already done), and also describes ALEC, which I use to visualise this content. I've made various (unreleased) knowledge graphs of taxonomic information (and one that I have actually released, Ozymandias). I'm still torn between whether the future is to invest more effort in Wikidata, or construct lighter, faster, domain-specific knowledge graphs for taxonomy. I think the answer is likely to be "yes".

Meantime, one chart I quite like from the submitted version of this paper is shown below.

It's a chart that is a bit tricky to interpret. My goal was to get a sense of whether bibliographic items added to Wikidata (e.g., taxonomic papers) were actually being edited by the Wikidata community, or whether they just sat there unchanged since they were added. If people are editing these publications, for example, by adding missing author names, linking papers to items for their authors, or adding additional identifiers (such as DOIs, ZooBank identifiers, etc.), then there is clear value in using Wikidata as a repository of bibliographic data. So I grabbed a sample of 1000 publications, retrieved their edit history from Wikidata, and plotted the creation timestamp of each item against the timestamps for each edit made to that item. If items were never edited then every point would fall along the diagonal line. If edits are made, they appear to the right of the diagonal. I could have just counted edits made, but I wanted to visualise those edits. As the chart shows, there is quite a lot of editing activity, so there is a community of people (and bots) curating this content. In many ways this is the strongest argument for using Wikidata for a "bibliography of life". Any database needs curation, which means people, and this is what Wikidata offers: a community of people who care about often esoteric details, and get pleasure from improving structured data.
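The data behind a chart like this can be assembled along the following lines. This is a sketch rather than the code used for the paper: it assumes the revision timestamps for each item have already been retrieved as ISO 8601 strings of the form `2021-05-18T10:00:00Z`, and the function names are hypothetical.

```python
from datetime import datetime

TIMESTAMP_FORMAT = "%Y-%m-%dT%H:%M:%SZ"


def edit_points(revision_timestamps):
    """Return (created, edited) pairs for one item.

    The earliest revision is treated as the creation of the item, so the
    first pair is (created, created) and lies on the diagonal. Every later
    revision gives a point away from the diagonal.
    """
    times = sorted(
        datetime.strptime(t, TIMESTAMP_FORMAT) for t in revision_timestamps
    )
    if not times:
        return []
    created = times[0]
    return [(created, t) for t in times]


def count_later_edits(revision_timestamps):
    """Number of revisions made after the item was created."""
    return max(len(edit_points(revision_timestamps)) - 1, 0)


if __name__ == "__main__":
    # Invented timestamps, for illustration only.
    history = ["2020-01-02T00:00:00Z", "2019-06-01T12:00:00Z"]
    for created, edited in edit_points(history):
        print(created.isoformat(), edited.isoformat())
```

Plotting the first element of each pair on one axis and the second on the other reproduces the layout described above: unedited items sit on the diagonal, edited ones spread away from it.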

There are still huge gaps in Wikidata's coverage of the taxonomic literature. Once you move beyond the "low hanging fruit" of publications with CrossRef DOIs the task of adding literature to Wikidata gets a bit more complicated. Then there is the reconciliation problem: given an existing taxonomic database with a list of references, how do we match those references to the corresponding items in Wikidata? There is still a lot to do.

Wednesday, May 31, 2017

Wikidata, WikiCite, and the "bibliography of life"

Last week I was at WikiCite 2017, a fascinating three-day event in Vienna. WikiCite is "a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects", and is attracting increasing attention from academics, librarians, publishers, data geeks, and others. You can get a sense of the project by following @WikiCite on Twitter.

I went to the meeting in part to learn more about WikiCite, and also to spend some time hacking on Wikispecies. I'd been to only one Wiki event before (a Wiki Science Conference) so I'm still finding my way around this community. I spent the first two days listening to talks while coding away (more on this below), but on Wednesday put my own coding aside to join a bunch of people hacking the CrossRef event API in a great session led by Joe Wass. I've put some notes and code in GitHub. The event API tracks what people do with DOIs, including adding them to Wikipedia pages when citing a source in support of an assertion. A significant fraction of DOI resolutions are from Wikipedia pages, which is one reason why CrossRef was present at WikiCite.

Wikidata

In practice WikiCite's goal of building a bibliographic database to serve all Wikimedia projects means that articles, books, and other bibliographic items that are cited by Wikimedia projects will each be added to Wikidata. For example, the ZooKeys paper "Diversity of manota williston (Diptera, mycetophilidae) in ulu temburong national park, brunei" is item Q21188431 in Wikidata. Wikidata stores the key bibliographic metadata, including identifiers such as the DOI (which many at the WikiCite meeting pronounced "doy" much to my initial confusion).

This article was published in ZooKeys, which itself has a Wikidata item (Q219980), so in Wikidata the article is linked to the journal (i.e., "ZooKeys" isn't just a dumb string but a link to another Wikidata item). The article is also linked to two articles that it cites, and each of these is also a Wikidata item.

These citation links are one reason people are interested in WikiCite - it could be the basis of a free and open citation graph (for the benefits of such a graph see this piece by David Shotton doi:10.1038/502295a, a participant at the meeting in Vienna). Already some cool tools are being built on top of citation data in Wikidata, such as Scholia by Finn Årup Nielsen, Daniel Mietchen and Egon Willighagen. Here, for example, is my academic profile based on information in Wikidata. It's woefully incomplete, but intriguing. For a more complete example view Egon Willighagen's profile.

To some extent the utility of tools like Scholia will depend on how complete Wikidata's coverage is of the academic literature, which in turn raises the inevitable question of scope. Does Wikicite want to include just the literature cited in the various Wikimedia projects, or does it want to expand to include the total sum of academic literature?

Wikispecies, Wikidata and the bibliography of life

Wikispecies is one of the Wikimedia projects, and the only one that is topic-specific (the others are typically global in scope but have content in different languages, or host different data types such as images, scanned books, or structured data). As I've sketched out in an earlier post (Thoughts on Wikipedia, Wikidata, and the Biodiversity Heritage Library) I think Wikicite and Wikidata are potentially very important to projects such as BHL and the "bibliography of life". Much of our knowledge about the world's biodiversity is contained in the academic literature, and much of this is poorly known with no central database where we can find it, and much of it is still not digitised. It is tempting to think that Wikidata might be a platform around which the biodiversity community could focus its efforts on assembling a global database of biodiversity literature. Already major taxonomic journals such as ZooKeys are being fed into Wikidata, so it has a significant corpus of biodiversity literature already.

One way to grow this corpus is to focus on Wikispecies. In a post before the Wikicite meeting (Notes for WikiCite 2017: Wikispecies reference parsing) I elaborated on this idea. There are two stumbling blocks, one specific to Wikispecies, one a more general Wikidata issue.

The first issue is that Wikispecies bibliographic data is relatively unstructured, which makes converting it into structured data something of a challenge. I spent much of Wikicite hacking some code to do this on Glitch (more on Glitch later); you can see the results here: https://acoustic-bandana.glitch.me. This web site takes a Wikispecies reference and tries to convert it into CSL-JSON. It is still very much a work in progress, but I've started building tools that use this web site as a service and process larger numbers of Wikispecies citations.

The second issue is how you get data into Wikidata, and this is something that's never been entirely clear to me. There are tools for adding an article using its DOI (sourcemd) but this isn't scalable, and doesn't handle the case of articles that don't have DOIs. This is still a "How do you Snapchat? You just Snapchat" moment. Wikidata desperately needs tools and a clear procedure whereby people like me with lots of bibliographic data can contribute.

Wikispecies

Another reason for my interest in Wikispecies (and other sources of bibliographic data such as the lists of cited literature being made available by CrossRef, see The Initiative for Open Citations) is that this data can be fed into BHL to locate more articles in that archive. Once these articles have been located they are stored in BioStor and BHL itself, but it makes sense to have them more accessible, and Wikidata looks to be an obvious candidate. Given that Wikispecies is essentially a crowd-sourced taxonomic database there is considerable overlap in content between Wikispecies and BHL. The Wikidata data model also allows for some of the things that taxonomists care about, such as linking dates of publication to evidence relative to those dates (in older publications determining the publication date often requires quite extensive research).

Summary

Leaving aside the specific issues about how to get bibliographic data into Wikidata, I guess the question to ask is whether it makes sense to be developing large databases of bibliographic data without either using Wikidata as the platform to hold that data, or at least linking to Wikidata. Projects such as Gene Wiki are migrating from Wikipedia to Wikidata (see "Wikidata as a semantic framework for the Gene Wiki initiative" doi:10.1093/database/baw015). Perhaps those of us interested in biodiversity literature could use projects like Gene Wiki as role models for how we could both contribute to and benefit from Wikidata and Wikicite.

I've barely scratched the surface of what was discussed at Wikicite; for more details see the program. It is a very different sort of meeting in that the participants come from pretty diverse backgrounds, which helps shake up your own assumptions about what matters and how things should be done. It's also great that it's a meeting at which people write code or otherwise hack stuff together, so things actually get done. I've come away with lots to think about, and renewed enthusiasm about the role Wikimedia is playing in structuring our knowledge about the world.

Friday, March 24, 2017

Notes for WikiCite 2017: Wikispecies reference parsing

In preparation for WikiCite 2017 I'm looking more closely at extracting bibliographic information from Wikispecies. The WikiCite project "is a proposal to build a bibliographic database in Wikidata to serve all Wikimedia projects". One reason for doing this is so that each factual statement in Wikidata can be linked to evidence for that statement. Practical efforts towards this goal include tools to add details of articles from CrossRef and PubMed straight into Wikidata, and tools to extract citations from Wikipedia (as these are likely to be sources of evidence for statements made in Wikipedia articles).

Wikispecies occupies a rather isolated spot in the Wikipedia landscape. Unlike other sites which are essentially comprehensive encyclopedias in different languages, Wikispecies focusses on one domain - taxonomy. In a sense, it's a prototype of Wikidata in that it provides basic facts (who described what species when, and what is the classification of those species) that in principle can be reused by any of the other wikis. However, in practice this doesn't seem to have happened much.

What Wikispecies has become, however, is a crowd-sourced database of the taxonomic literature. For someone like me who is desperately gathering up bibliographic data so that I can extract articles from the Biodiversity Heritage Library (BHL), this is a potential goldmine. But there's a catch. Unlike, say, the English language Wikipedia, which has a single widely-used template for describing a publication, Wikispecies has its own method of representing articles. It uses a somewhat confusing mix of templates for author names, and then uses barely standardised formatting rules to mark out parts of a publication (such as journal, volume, issue, etc.). Instead of a single template to describe a publication, in Wikispecies a publication may itself be described by a unique template. This has some advantages, in that the same reference can be transcluded into multiple articles (in other words, you enter the bibliographic details once). But this leaves us with many individual templates with multiple, idiosyncratic styles of representing bibliographic data. Some have tried to get the Wikispecies community to adopt the same template as Wikipedia (see e.g., this discussion) but this proposal has met with a lot of resistance. From my perspective as a potential consumer of data, the current situation in Wikispecies is frustrating, but the reality is that the people who create the content get to decide how they structure that content. And understandably, they are less than impressed by requests that might help others (such as data miners) at the expense of making their own work more difficult.

In summary, if I want to make use of Wikispecies I am going to need to develop a set of parsers that can make a reasonable fist of parsing all the myriad citation formats used in Wikispecies (my first attempts are on GitHub). I'm looking at parsing the references and converting them to a more standard format in JSON (I've made some notes on various bibliographic formats in JSON such as BibJSON and CSL-JSON). One outcome of this work will be, I hope, more articles discovered in BHL (and hence added to BioStor), and more links to identifiers, which could be fed back into Wikispecies. I also want to explore linking the authors of these papers to identifiers, as already sketched out in The Biodiversity Heritage Library meets Wikidata via Wikispecies: adding author identifiers to BioStor.
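To show the kind of parser I have in mind, here is a minimal sketch that handles just one citation style and emits CSL-JSON. It is not the code on GitHub. The style it assumes (an author template, then year, title, journal in wiki italics, volume in wiki bold, optional issue, and a page range) is only one of the many variants found in Wikispecies, and the example reference is invented.

```python
import re

# Author templates such as {{a|Full Name|Display form}} (assumed style).
AUTHOR_TEMPLATE = re.compile(r"\{\{a(?:ut)?\|([^|}]+)(?:\|[^}]*)?\}\}")

# year. title. ''journal'' '''volume'''(issue): first-last
REFERENCE = re.compile(
    r"(?P<year>\d{4})[a-z]?\.\s+"
    r"(?P<title>.+?)\.\s+"
    r"''(?P<journal>[^']+)''\s+"
    r"'''(?P<volume>[^']+)'''"
    r"(?:\((?P<issue>[^)]+)\))?"
    r":\s*(?P<pages>\d+\s*[-–]\s*\d+)"
)


def parse_reference(wikitext):
    """Convert one reference in the assumed style to CSL-JSON, else None."""
    match = REFERENCE.search(wikitext)
    if match is None:
        return None  # a real parser would try the next citation style here

    csl = {
        "type": "article-journal",
        "author": [
            {"literal": name.strip()}
            for name in AUTHOR_TEMPLATE.findall(wikitext)
        ],
        "issued": {"date-parts": [[int(match.group("year"))]]},
        "title": match.group("title").strip(),
        "container-title": match.group("journal").strip(),
        "volume": match.group("volume").strip(),
        # Normalise en dashes and stray spaces in the page range.
        "page": re.sub(r"\s*[-–]\s*", "-", match.group("pages")),
    }
    if match.group("issue"):
        csl["issue"] = match.group("issue").strip()
    return csl


if __name__ == "__main__":
    # Invented reference, for illustration only.
    ref = (
        "{{a|John Smith|Smith, J.}} 2010. A new species of example. "
        "''Journal of Examples'' '''12'''(3): 45–67."
    )
    print(parse_reference(ref))
```

Returning `None` for anything that doesn't match makes it easy to chain several such parsers, one per citation style, and take the first that succeeds.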