Showing posts with label PLoS. Show all posts

Thursday, July 09, 2020

Lists of species don't matter: thoughts on "Principles for creating a single authoritative list of the world’s species"

Garnett et al. recently published a paper in PLoS Biology that starts with the sentence "Lists of species matter":

Garnett, S. T., Christidis, L., Conix, S., Costello, M. J., Zachos, F. E., Bánki, O. S., … Thiele, K. R. (2020). Principles for creating a single authoritative list of the world’s species. PLOS Biology, 18(7), e3000736. doi:10.1371/journal.pbio.3000736

This paper (one of a forthcoming series) is pretty much the kind of paper I try and avoid reading. It has lots of authors so it is a paper by committee, those authors all have a stake in particular projects, and it is an acronym soup of organisations the paper is pitched at. It's a well-worn strategy: write one or more papers making the case that there is a problem, then get funding based on the notion that clearly there's a problem (you've published papers saying so) and that you and your co-applicants are best placed to solve it (clearly, because you wrote the papers identifying the problem in the first place). I'm not criticising the strategy, it's how you get things done in science. It just makes for a rather uninspiring read.

From my perspective focussing on "lists" is a mistake. Lists don't really matter, it is what is on the list that counts. And I think this is where the real prize is. As I play with Wikidata I'm becoming increasingly aware of the clusterfuck mess the taxonomic database community has created by conflating taxonomic names with taxa, and by having multiple identifiers for the same things. We survive this mess by relying on taxonomic names as somewhat fuzzy identifiers, and the hope that we can communicate successfully with other people despite this ambiguity (I guess this is pretty much the basis of all communication). As Roger Hyam notes:

These taxon names we are dealing with are really just social tags that need to be organised in a central place.

Having lots of names (tags) is fine, and Wikidata is busy harvesting all these taxonomic names and their identifiers (ITIS, IPNI, NCBI, EOL, iNaturalist, eBird, etc., etc., etc.). For most of these names all we have is a mapping to other identifiers for the same name, a link to a parent taxon, and sometimes a link to a reference for the name. But what happens if we want to attach data to a taxon? Take, for example, the African Piculet Verreauxia africana. This bird has at least two scientific names, each with a separate entry in Wikidata: Verreauxia africana Q28123873 and Sasia africana Q1266812. This is the same species, yet it has two entries in Wikidata. If I want to add, say, body weight, or population size, or longevity, which Wikidata item do I add that data to?

What we need is an identifier for the species, an identifier that remains unchanged even if the name changes, or if that species moves in the taxonomic hierarchy. Some databases do this already. For example the eBird identifier for Verreauxia africana/Sasia africana is afrpic1. Because the identifier remains unchanged we can do things such as "diffs" between successive classifications showing how the species has moved between different genera (see Taxonomic publications as patch files and the notion of taxonomic concepts):


Ironically it seems that for birds the common name (in this case "African Piculet") is a more stable identifier than the scientific name (although that may well change). By having stable taxon identifiers we can then decide what entity to attach biological data to. Taxonomic names have failed to do this, but are still vital as well known tags. The actual taxon identifiers should be opaque identifiers (like "afrpic1" - not really opaque but close enough - or Avibase's C4DFB5E31495AE94). Make each opaque identifier a DOI, use existing taxonomic names as formalised tags so we aren't disconnected from the literature, use timestamped versions to track changes in species classification over time, and we have something useful.
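To make the idea a little more concrete, here is a minimal sketch of a taxon record keyed by an opaque identifier, with names treated as tags and each classification timestamped. It is not how eBird, Avibase, or Wikidata store anything: the class names, the `diff` helper, and the dates are all hypothetical, and the order of the two names is purely illustrative. The only values taken from the text above are the identifier "afrpic1", the two scientific names, and the common name.

```python
from dataclasses import dataclass, field


@dataclass
class TaxonVersion:
    """One timestamped view of a taxon: the name (tag) and parent in use."""
    date: str    # ISO date of the classification (made-up values below)
    name: str    # scientific name, treated as a tag
    parent: str  # parent taxon (here, the genus)


@dataclass
class Taxon:
    """A taxon keyed by an identifier that stays the same across versions."""
    taxon_id: str
    versions: list = field(default_factory=list)
    traits: dict = field(default_factory=dict)

    def names(self):
        """All names ever attached to this taxon, in order of first use."""
        seen = []
        for version in self.versions:
            if version.name not in seen:
                seen.append(version.name)
        return seen


def diff(taxon, old_date, new_date):
    """Report which attributes changed between two dated classifications."""
    by_date = {version.date: version for version in taxon.versions}
    old, new = by_date[old_date], by_date[new_date]
    changes = {}
    for attr in ("name", "parent"):
        if getattr(old, attr) != getattr(new, attr):
            changes[attr] = (getattr(old, attr), getattr(new, attr))
    return changes


# Dates are invented for illustration only.
piculet = Taxon("afrpic1")
piculet.versions.append(TaxonVersion("2000-01-01", "Sasia africana", "Sasia"))
piculet.versions.append(TaxonVersion("2010-01-01", "Verreauxia africana", "Verreauxia"))

# Data hangs off the stable identifier, not off either name.
piculet.traits["common_name"] = "African Piculet"

print(diff(piculet, "2000-01-01", "2010-01-01"))
```

The point of the sketch is simply that the trait dictionary survives the name change untouched, while both names remain available as tags for finding the taxon in the literature.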

This, I think, is the real prize. Rather than frame the task as making a list of species so that organisations can have a checklist they can all share, why not frame it as providing a framework that we can hang trait data on? We have vast quantities of data residing in siloed databases, spreadsheets, and centuries of biological literature. The argument shouldn't be about what is on a list, it should be how we group that information together and enable people to do their science. By providing stable identifiers that are resistant to name changes we can confidently associate trait data with taxa. Taxonomy could then actually be what it should be, the organisational framework for biological information (see Taxonomy as Information Science).

Monday, August 28, 2017

Let’s rise up to unite taxonomy and technology

Holly Bik (@hollybik) has an opinion piece in PLoS Biology entitled "Let’s rise up to unite taxonomy and technology" https://doi.org/10.1371/journal.pbio.2002231 (thanks to @sjurdur for bringing this to my attention).


It's a passionate plea for integrating taxonomic knowledge and "omics" data. In her article Bik includes a mockup of the kind of tool she'd like to see (based in part on Phinch), and writes:

Step 2: Clicking on a specific data point (e.g., an OTU) will pull up any online information associated with that species ID or taxonomic group, such as Wikipedia entries, photos, DNA sequences, peer-reviewed articles, and geolocated species observations displayed on a map.

This sort of plea has been made many times, and reminds me very much of PLoS's own efforts when they wanted to build a "Biodiversity Hub" and biodiversity informatics basically failed them. The hub itself later closed down. There's clearly a need for a simple way to summarise what we know about a species, but we've yet to really tackle this (on the face of it) fairly simple task.

Quickly summarising the available information about a species was the motivation behind my little tool iSpecies, which I recently reworked to use DBpedia, GBIF, CrossRef, EOL, TreeBASE and OpenTreeofLife as sources. For the nematode featured in Bik's figure (Desmoscolex) there's not a great deal of easily available information (see http://ispecies.org/?q=Desmoscolex). We can get a little more from other sources not queried by iSpecies, such as BioNames, which aggregates the primary taxonomic literature, see http://bionames.org/search/Desmoscolex.

Part of the problem is that taxonomy is fundamentally a "long tail" field, both in terms of the subject matter (a few very well-known species, then millions of poorly known species) and our knowledge of those species (a large, scattered taxonomic literature, much of it not yet digitised, although progress is being made). Furthermore, the names of species (and our conception of them) can change, adding an additional challenge.

But I think we can do a lot better. Simple web-based tools like iSpecies can assemble reasonable information from multiple sources (and in multiple languages) on the fly. It would be nice to expand those sources (the more primary sources the better). The current iSpecies tool searches on species name. This works well if the sources being queried mention that name (e.g., in the title of a paper that has a DOI and is indexed by CrossRef). Given that many of the "omics" datasets Bik works with are likely to have dark taxa, what we'll also need is the ability to search, say, using NCBI taxon ids, and retrieve literature linked to sequences for those taxa.

It would also be useful to package those up in a simple API that other tools could consume. For example, if I wanted to improve the utility of iSpecies, one approach would be to package up the results in a JSON object. Perhaps even use JSON-LD (with global identifiers for taxa, documents, etc.) to make it possible for consumers to easily integrate that data with their own.
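As a rough illustration of what such a JSON-LD object might look like, here is a minimal sketch. It is not the iSpecies output format: the function name `to_jsonld` is hypothetical, and the choice of the schema.org `Taxon` and `ScholarlyArticle` types with the `subjectOf` property is an assumption about one reasonable vocabulary, not a settled convention. The DOI used is the one for Bik's article mentioned above.

```python
import json


def to_jsonld(name, dois, taxon_uri=None):
    """Package a taxon name and the DOIs of related articles as JSON-LD.

    taxon_uri is an optional global identifier for the taxon; it is left
    out of the record when no such identifier is known.
    """
    record = {
        "@context": "https://schema.org/",
        "@type": "Taxon",
        "name": name,
        # Each article is identified by its DOI expressed as a URL, so a
        # consumer can merge it with data about the same DOI from elsewhere.
        "subjectOf": [
            {"@type": "ScholarlyArticle", "@id": f"https://doi.org/{doi}"}
            for doi in dois
        ],
    }
    if taxon_uri is not None:
        record["@id"] = taxon_uri
    return record


result = to_jsonld("Desmoscolex", ["10.1371/journal.pbio.2002231"])
print(json.dumps(result, indent=2))
```

Because the articles are identified by DOI URLs rather than by local keys, a consumer can join this record to whatever else it knows about the same DOI without any extra mapping step.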

Taxonomy could be on the brink of another golden age—if we play our cards right. As it is reinvented and reborn in the 21st century, taxonomy needs to retain its traditional organismal-focused approaches while simultaneously building bridges with phylogenetics, ecology, genomics, and the computational sciences.

Taxonomy is, of course, doing just this, albeit not nearly fast enough. There are some pretty serious obstacles, some of them cultural, but some of them due to the nature of the problem. Taxonomic knowledge is massively decentralised, mostly non-digital, and many of the key sources and aggregations are behind paywalls. There is also a fairly large "technical debt" to deal with. Ian Mulvany was recently interviewed by PLoS and he emphasised that because academic publishers had been online from early on they were pioneers, but at the same time this left them with a legacy of older technologies and approaches that can sometimes get in the way of new ideas. I think taxonomy suffers from some of the same problems. Because taxonomy has long been involved with computers, sometimes we ended up betting on the "wrong" solutions. For example, at one time XML was the new hotness, and people invested a lot of effort in developing XML schema, and then ontologies and RDF vocabularies. Meantime much of the web has moved to simple data formats such as JSON, many specialist vocabularies are gathering dust as schema.org takes off, and projects like Wikidata force us to rethink the need for topic-specific databases.

But these are technical details. For me the key point of "Let’s rise up to unite taxonomy and technology" is that it's a symptom of the continued failure of biodiversity informatics to actually address the needs of its users. People keep asking for fairly simple things, and we keep ignoring them (or explaining why it's MUCH harder than people think, which is another way of ignoring them).

Wednesday, June 24, 2015

Visualising Geophylogenies in Web Maps Using GeoJSON

I've published a short note on my work on geophylogenies and GeoJSON in PLoS Currents Tree of Life:

Page R. Visualising Geophylogenies in Web Maps Using GeoJSON. PLOS Currents Tree of Life. 2015 Jun 23. Edition 1. doi:10.1371/currents.tol.8f3c6526c49b136b98ec28e00b570a1e.
At the time of writing the DOI hasn't registered, so the direct link is here. There is a GitHub repository for the manuscript and code.
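For anyone who hasn't met the idea, a geophylogeny in GeoJSON boils down to points for located nodes and lines for branches. The following is a minimal sketch of that encoding, not the code from the paper or its GitHub repository: the function name, node names, and coordinates are all made up for illustration, and it assumes every node already has a position (a real tool has to work out where to draw the internal nodes).

```python
import json


def tree_to_geojson(nodes, edges):
    """Encode a located tree as a GeoJSON FeatureCollection.

    nodes: {name: (longitude, latitude)} for every node in the tree.
    edges: [(parent_name, child_name)] pairs, one per branch.
    """
    features = []
    # One Point feature per node. GeoJSON positions are [longitude, latitude].
    for name, (lon, lat) in nodes.items():
        features.append({
            "type": "Feature",
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name},
        })
    # One LineString feature per branch, joining parent to child.
    for parent, child in edges:
        features.append({
            "type": "Feature",
            "geometry": {
                "type": "LineString",
                "coordinates": [list(nodes[parent]), list(nodes[child])],
            },
            "properties": {"from": parent, "to": child},
        })
    return {"type": "FeatureCollection", "features": features}


# Invented coordinates for a two-tip tree.
nodes = {"A": (10.0, 5.0), "B": (12.0, 6.0), "root": (11.0, 8.0)}
edges = [("root", "A"), ("root", "B")]
geojson = tree_to_geojson(nodes, edges)
print(json.dumps(geojson)[:80])
```

Since the result is plain GeoJSON, the same output can be handed to any web map that reads the format, which is the property the note relies on.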

I chose PLoS Currents Tree of Life because it is (supposedly) quick and cheap. Unfortunately a perfect storm of delays in reviewing together with licensing issues resulted in the paper taking nearly three months to appear. The licensing issues were a headache. PLoS uses the Creative Commons CC-BY license for all its content. Unfortunately, the original submission included maps from Google Maps and Open Street Map (OSM), to show that the GeoJSON produced by my tool could work with either. Google Maps tile imagery is not freely available, so I had to replace that in order for PLoS to be able to publish my figures. At first I simply replaced the tiles Google Maps displays with ones from OSM, but those tiles are CC-BY-SA, which is incompatible with PLoS's use of CC-BY. Argh! I got stroppy about this on Twitter:

Eventually I discovered maps from CartoDB that have CC-BY licenses, and so could be used in the PLoS Currents article. After replacing Google's and OSM tiles with these maps (and trimming off the "Google" logo) the figures were acceptable to PLoS. Increasingly I think Creative Commons has resulted in a mess of mutually incompatible licenses that make mashing up things hard. The idea was great ("skip the intermediaries" by declaring that your content can be used), but the outcome is messy and frustrating.

But, enough grumbling. The article is out, the code is in GitHub. Now to think about how to use it.

Thursday, July 11, 2013

Barcode Index Number (BIN) System in DNA barcoding explained

Quick note to highlight the following publication:
Ratnasingham, S., & Hebert, P. D. N. (2013). A DNA-Based Registry for All Animal Species: The Barcode Index Number (BIN) System. (D. Fontaneto, Ed.)PLoS ONE, 8(7), e66213. doi:10.1371/journal.pone.0066213
This paper outlines the methods used by the BOLD project to cluster sequences into "BINS", and touches on the issue of dark taxa (taxa that are in GenBank but which lack formal scientific names). Might be time to revisit the dark taxa idea, especially now that I've got a better handle on the taxonomic literature (see BioNames) where the names of at least some dark taxa may lurk.

Tuesday, July 09, 2013

The demise of the @PLoS Biodiversity Hub: what lessons can we learn?

Jonathan Eisen recently wrote that the PLOS Hub for Biodiversity is soon to be retired, and sure enough it's vanished from the web (the original URL hubs.plos.org/web/biodiversity/ now bounces you straight to http://www.plosone.org/, you can still see what it looked like in the Wayback Machine).

Like Jonathan, I was involved in the hub, which was described in the following paper:
Mindell, D. P., Fisher, B. L., Roopnarine, P., Eisen, J., Mace, G. M., Page, R. D. M., & Pyle, R. L. (2011). Aggregating, Tagging and Integrating Biodiversity Research. (S. A. Rands, Ed.)PLoS ONE, 6(8), e19491. doi:10.1371/journal.pone.0019491

In retrospect PLoS's decision to pull the hub is not surprising. The original proposal imagined a web site looking like this, with the goal of building a "dynamic community".


From my perspective the PLoS Hub failed for two reasons. The first is that PLoS weren't nearly as ambitious as they could have been. The second is that the biodiversity informatics community simply couldn't (and arguably still can't) provide the kind of services that PLoS would have needed to make the Hubs something worth nurturing.

After a meeting at the California Academy of Sciences in April 2010 to discuss the hub idea I wrote a ranty blog post (Biodiversity informatics = #fail (and what to do about it)) where I expressed my frustration that we had a group of people (i.e., PLoS) rock up and express serious interest in doing something with biodiversity data, and biodiversity informatics collectively failed them. We could have been aiming for a cool database of "semantically enhanced" publications that we could query taxonomically, geographically, phylogenetically, etc. (at least, that's what I was hoping PLoS were aiming for). Instead it became clear that most of the basic services were simply not available (we didn't have simple code to extract GenBank accession numbers, specimen codes, etc., we couldn't link specimen codes to anything online, and woe betide you if you asked what a taxon name was).
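To give a sense of what "simple code to extract GenBank accession numbers" could look like, here is a minimal sketch. It assumes the common nucleotide accession shapes (one letter plus five digits, or two letters plus six or eight digits, with an optional version suffix) and will happily match anything else in a text that happens to have that shape, so treat it as a starting point rather than a reliable service. The function name and the accession numbers in the example sentence are made up.

```python
import re

# One letter + 5 digits, two letters + 6 digits, or two letters + 8 digits,
# optionally followed by a version suffix such as ".1".
ACCESSION = re.compile(
    r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?\b"
)


def find_accessions(text):
    """Return every accession-shaped string in text, in order of appearance."""
    return ACCESSION.findall(text)


# Invented sentence; the accession numbers are placeholders, not real records.
sample = "Sequences were deposited in GenBank (AB123456, U12345.1); voucher MVZ 12345."
print(find_accessions(sample))
```

Specimen codes are much harder, because each collection has its own conventions, which is part of why linking them to anything online was such a problem.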

In fairness, it also became pretty clear that PLoS weren't going to go too far down the line of an all-singing portal to biodiversity data. They were really looking at a shiny web site that housed a collection of Open Access papers on biodiversity. But my point is it could have been so much more than that. We had a chance to build a platform, a knowledge base for biodiversity data that had an accessible front end (e.g., the traditional publication) but exploded that into its component parts so we could spin the data around and ask other questions.

Inspired by the possibilities I spent the next couple of months playing with some linked data demos (see here and here, the links in these demos have long since died). The idea was to explore how much of what I imagined the PLoS Hub could be it was possible to build using RDF and SPARQL. It was fun, but RDF and SPARQL are awful things to "play" with, and the vast bulk of the data had to be wrapped in custom scripts I wrote because the original data providers didn't supply RDF. As I've written elsewhere, I think the cost of getting to a place where RDF enables you to do meaningful stuff is just too high. Our data are too messy, we lack agreed identifiers, and we either have too many or too few vocabularies (and those we do have invariably spark lengthy, philosophical debates - vocabularies are taxonomies of data, need I say more). The RDF approach is also doomed to fail because it assumes multiple decentralised data repositories are the way forward. In my experience, these cannot deliver the kinds of things we need. The data need to be brought together, cleaned, aligned, augmented, and finally linked together. This is much easier to do if all the data are in one place.

So where does this leave us? In many ways I'd like to attempt something like PLoS Hubs again, or perhaps more precisely, think about building a platform so that if a publisher came along and wanted to do something similar (but more ambitious) we would have the tools in place that could make it happen. What I'd like is a way more sophisticated version of this, where you could explore data in various dimensions (geography, taxonomy, phylogeny), track citation and provenance information (what papers cite this specimen, what sequences is it a voucher for, what trees are built on those sequences). If we had a platform that supported these sorts of queries, not only could we provide a great environment upon which we could embed scientific publications, we could also support the kinds of queries we can't do at the moment (e.g., give me all the molecular phylogenies for species in Madagascar, locate all the data - publications, taxonomic identifications, sequences - about a specimen, etc.).

I'll leave you with a great rant about platforms. It's long but it's fun, and I think it speaks to where we are now in biodiversity informatics (hint, we aren't Amazon).

Monday, December 19, 2011

Towards an interactive taxonomic article: displaying an article from ZooKeys

One of the things I keep revisiting is the way we display scientific articles. Apart from Nature's excellent iPhone and iPad apps, most efforts to re-imagine how we display articles are little more than glorified PDF viewers (e.g., the PLoS iPad app).

Part of the challenge is that if we make the article more interactive we immediately confront the problem of how to link to other content. For example, we may have a lovingly crafted ePub view (e.g., Nature's apps), but what happens when the user clicks on a citation to another paper? If the paper is published by the same journal, then potentially it could be viewed using the same viewer, but if not then we are at the mercy of the other publisher. They will have their own ideas of how to display articles, so the simplest fallback is to display the cited article in a web browser view. The problem with this is that it breaks the user experience - the other publisher is unlikely to follow the same conventions for displaying an article and its links. If we are lucky the cited article might be published in an Open Access journal that provides, say, XML based on the NLM DTD standard. Knowing whether an article is Open Access or not is not straightforward, and different journals have their own unique interpretation of the NLM standard.

Then there is the issue of other kinds of content, such as taxonomic names, specimens, DNA sequences, geographic localities, etc. We lack decent services for many of these objects; as a result, efforts like the PLoS Biodiversity Hub end up being underwhelming collections of reformatted journal articles, rather than innovative integrations of biodiversity knowledge.

With these issues in mind I've started playing with ZooKeys XML, initially looking at ways to display the article beyond the conventional format. Ultimately I'd like to embed the article in a broader web of citations and data. ZooKeys articles are available in PDF, HTML, and XML. The HTML has links to taxon pages, maps, etc., which is nice, but I personally find this a little jarring because it interrupts the reading experience. The ZooKeys web site also surrounds the article with all the paraphernalia of a publisher's web site:

As a first experiment, I've taken the XML for article At the lower size limit for tetrapods, two new species of the miniaturized frog genus Paedophryne (Anura, Microhylidae) http://dx.doi.org/10.3897/zookeys.154.1963 and used an XSLT style sheet to reformat the article. I've borrowed some ideas from Nature's apps, such as the font for the title, displaying the abstract in bold, and showing all the figures in the article as thumbnails near the top. I've also added some basic interactivity, which you can see in the video below. Instead of figures being in one place in the article, wherever a figure is mentioned in the article (e.g., "Fig. 1") if you click on the reference to the figure it appears. If the article displays a point locality using latitude and longitude, instead of launching a separate browser window with a Google map, click on the locality and the map appears. The idea is that the flow of reading isn't interrupted: figures, maps, and citations all appear in the text.


This demo (which you can see live at http://iphylo.org/~rpage/zookeys) is limited, but most of its functionality comes from simply reformatting XML using XSLT. There's a little bit of jQuery for animation, and I ended up having to write a PHP script to convert verbatim latitude and longitude coordinates to the decimal coordinates expected by Google Maps, but it's all very lightweight. It wouldn't take much to add some JSON queries to make the taxon names clickable (e.g., showing a summary of a taxon from EOL). Because ZooKeys uses the NLM DTD for its XML, some of this code could also be applied to other journals, such as PLoS, so we could start to grow a library of linked, interactive taxonomic articles.
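The coordinate conversion is the kind of thing that is easy to sketch. The following is not the PHP script used in the demo, just a minimal Python illustration of the same idea, assuming coordinates written as degrees, optional minutes, optional seconds, and a hemisphere letter. The function name and the example coordinates are made up, and real verbatim coordinates come in many more formats than this handles.

```python
import re

# Degrees, then optional minutes and seconds, then a hemisphere letter.
DMS = re.compile(
    r"(\d+(?:\.\d+)?)\s*°\s*"            # degrees
    r"(?:(\d+(?:\.\d+)?)\s*['′]\s*)?"     # optional minutes
    r'(?:(\d+(?:\.\d+)?)\s*["″]\s*)?'     # optional seconds
    r"([NSEW])"                           # hemisphere
)


def dms_to_decimal(verbatim):
    """Convert a string such as 10°30'S to signed decimal degrees."""
    match = DMS.search(verbatim)
    if match is None:
        raise ValueError(f"unrecognised coordinate: {verbatim!r}")
    degrees, minutes, seconds, hemisphere = match.groups()
    value = float(degrees) + float(minutes or 0) / 60 + float(seconds or 0) / 3600
    # South and west are negative in decimal notation.
    return -value if hemisphere in ("S", "W") else value


# Invented coordinates, not taken from the ZooKeys article.
print(dms_to_decimal("10°30'S"), dms_to_decimal("150°15′36″E"))
```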

Wednesday, September 07, 2011

Suggested apps for BHL's Life and Literature Code Challenge


Since I won't be able to be at the Biodiversity Heritage Library's Life and Literature meeting I thought I'd share some ideas for their Life and Literature Code Challenge. The deadline is pretty close (October 17) so having ideas now isn't terribly helpful I admit. That aside, here are some thoughts inspired by the challenge. In part this post has been inspired by the Results of the PLoS and Mendeley "Call for Apps", where PLoS and Mendeley asked for people (not necessarily developers) to suggest the kind of apps they'd like to see. As an aside, one thing conspicuous by its absence is a prize for winning the challenge. PLoS and Mendeley have an "API Binary Battle" with a prize of $US 10,001, which seems more likely to inspire people to take part.

Visual search engine
I suspect that many BHL users are looking for illustrations (exemplified by the images being gathered in BHL's Flickr group). One way to search for images would be to search within the OCR text for figure and plate captions, such as "Fig. 1". Indexing these captions by taxonomic name would provide a simple image search tool. For modern publications most figures are on the same page as the caption, but for older publications with illustrations as plates, the caption and corresponding image may be separated (e.g., on facing pages), so the search results might need to show pages around the page containing the caption. As an aside, it's a pity the Flickr images only link to the BHL item and not the BHL page. If they did the latter, and the images were tagged with what they depict, you could create a visual search engine using the Flickr API (of course, this might be just the way to implement the visual search engine: harvest images, tag them with PageID and taxon names, upload to Flickr).
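The caption-finding step could start from something as simple as the sketch below. It makes no use of the BHL API: it assumes you already have OCR text keyed by page identifier, and the function name, page ids, and sample text are invented. The pattern only recognises a few caption prefixes at the start of a line, so it would miss plenty in real OCR.

```python
import re

# A caption prefix at the start of a line, a number (Arabic or upper-case
# Roman), optional punctuation, then the rest of the line as caption text.
CAPTION = re.compile(
    r"^[ \t]*(Fig\.|Figure|FIG\.|Plate|PLATE|Pl\.)[ \t]*(\d+|[IVXLC]+)\b[.:]?[ \t]*(.*)$",
    re.MULTILINE,
)


def find_captions(pages):
    """Return (page_id, label, number, caption_text) for each caption found.

    pages: {page_id: ocr_text}
    """
    hits = []
    for page_id, text in pages.items():
        for match in CAPTION.finditer(text):
            hits.append(
                (page_id, match.group(1), match.group(2), match.group(3).strip())
            )
    return hits


# Invented page ids and OCR text.
pages = {
    101: "some running text\nFig. 1. Dorsal view of the holotype\nmore text",
    102: "PLATE IV\n",
}
captions = find_captions(pages)
print(captions)
```

Keeping the page identifier with each hit is what would let a search result show the neighbouring pages when caption and plate are on facing pages.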

Mobile interface
The BHL web site doesn't look great on an iPhone. It makes no concessions to the mobile device, and there are some weird things such as the way the list of pages is rendered. A number of mainstream science publishers are exploring mobile versions of their web sites, for example Taylor and Francis have a jQuery Mobile powered interface for mobile users. I've explored iPad interfaces to scientific articles in previous posts. BHL content poses some challenges, but is fundamentally the same as viewing PDFs: you have fixed pages that you may want to zoom.

OCR correction
There is a lot of scope for cleaning up the OCR text in BHL. Part of the trick would be to have a simple user interface for people to contribute to this task. In an earlier post I discussed a Firefox hOCR add-on that provides a nice way to do this. Take this as a starting point, add a way to save the cleaned up text, and you'd be well on the way to making a useful tool.

Taxon name timeline
Despite the shiny new interface, the Encyclopedia of Life still displays BHL literature in the same clunky way I described in an earlier blog post. It would be great to have a timeline of the usage of a name, especially if you could compare the usage of different names (such as synonyms). In many ways this is the BHL equivalent of the Google Books Ngram Viewer.
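The counting behind such a timeline is straightforward once you have pages with dates. Here is a minimal sketch, assuming the input is a list of (year, page text) pairs that has already been pulled from somewhere; the function name, the placeholder names "Aus bus" and "Xus bus", and the years are all invented, and exact string matching is a deliberate simplification.

```python
from collections import Counter


def name_timeline(pages, names):
    """Count, per name and per year, how many pages mention that name.

    pages: iterable of (year, text) pairs.
    names: the names to compare (for example, a name and its synonyms).
    """
    timeline = {name: Counter() for name in names}
    for year, text in pages:
        for name in names:
            if name in text:  # exact match only; OCR errors are ignored
                timeline[name][year] += 1
    return timeline


# Invented pages using placeholder names.
pages = [
    (1900, "Aus bus was collected in the forest"),
    (1900, "notes on Aus bus"),
    (1950, "Xus bus (formerly Aus bus)"),
]
timeline = name_timeline(pages, ["Aus bus", "Xus bus"])
print({name: dict(counts) for name, counts in timeline.items()})
```

Plotting the two counters against each other by year would give the synonym comparison described above.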

These are just a few hastily put together thoughts. If you have any other ideas or suggestions, feel free to add them as comments below.


Thursday, March 31, 2011

Paper on NCBI and Wikipedia published in PLoS Currents: Tree of Life

My paper describing the mapping between NCBI and Wikipedia has been published in PLoS Currents: Tree of Life. You can see the paper here. It's only just gone live, so it's yet to get a PubMed Central number (one of the nice features of PLoS Currents is that the articles get archived in PMC).

Publishing in PLoS Currents: Tree of Life was a pleasant experience. The Google Knol editing environment was easy to use, and the reviewing process quick. It's obviously a new and rather experimental journal, and there are a few things that could be improved. Automatically looking up articles by PubMed identifier is nice, but it would be great to do this for DOIs as well. Furthermore, the PubMed identifiers aren't displayed as clickable links, which rather defeats the point of having references on the web (I've added DOI links to the articles wherever possible). But, minor grumbles aside, as a way to get an Open Access article published for free, and have it archived in PubMed Central, PLoS Currents is hard to beat. What will be interesting is whether the article receives any comments. This seems to be one area online journals haven't really cracked: providing an environment where people want to engage in discussion.

Tuesday, October 05, 2010

PLoS Biodiversity Hub launches


The PLoS Biodiversity Hub has launched today. There's a PLoS blog post explaining the background to the project, as well as a summary on the Hub itself:

The vision behind the creation of PLoS Hubs is to show how open-access literature can be reused and reorganized, filtered, and assessed to enable the exchange of research, opinion, and data between community members.

PLoS Hubs: Biodiversity provides two main functions to connect researchers with relevant content. First, open-access articles on the broad theme of biodiversity are selected and imported into the Hub. In time, the content will also be enhanced so that the articles are connected with data, and we will provide features to make the articles easier for people to use. These two functions - aggregation and adding value - build on the concept of open access, which removes all the barriers to access and reuse of journal article content.


Readers of iPhylo may recall my account of one of the meetings involved in setting up this hub, in which I began to despair about the lack of readiness of biodiversity informatics to provide much of the information needed for projects such as hubs. Despite this (or perhaps, because of it), I've become a member of the steering committee for the Biodiversity Hub. There's clearly a lot of interest in repurposing the content found in scientific articles, and I think we're going to see an increasing number of similar projects from the major players in science publishing, Open Access or otherwise. One of the challenges is going to be moving beyond the obvious things (such as making taxon names clickable) to enable new kinds of ways of reading, navigating, and querying the literature, and exploring ways to track the use that is made of the information in these articles. Biodiversity studies are ideally placed to explore this as the subject is data rich and much of that data, such as specimens and DNA sequences, persist over time and hence get reused (data citation gets very boring if the data is used just once). We also have obvious ways to enrich navigation, such as spatially and taxonomically.

For now the PLoS Biodiversity Hub is very pretty, but it's more a statement of intent than a real demonstration of what can be done. Let's hope our field gets its act together and seizes the opportunity that initiatives like the Hub represent. Publishers are desperate to differentiate themselves from their competitors by providing added value as part of the publication process, and they provide a real use case for all the data that the biodiversity projects have been accumulating over the last couple of decades.

Friday, September 03, 2010

Viewing scientific articles on the iPad: browsing articles

In previous articles I've looked at how various apps display scientific articles. The apps I looked at were:

So, where next? As Ian Mulvany noted in a comment on an earlier post, I haven't attempted to summarise the best user interface metaphors for navigation. Rather than try and do that in the abstract, I'd like to create some prototypes to play with various ideas. The Sencha Touch framework looks a good place to start. It's web-based, so things can be prototyped rapidly (I'm not going to learn Objective C anytime soon). There's a moderately steep learning curve, unless you've written a lot of Javascript (I've done some, but not a lot), but it seems to offer a lot of functionality. Another advantage of developing a web app is that it keeps the focus on making the content accessible across devices, and using the web as the means to display and interact with content.

Then there is also the issue (in addition to displaying an individual article) of how to browse and find articles to view. Here are some possibilities.

Publisher's stream
Apps such as the Nature app and the PLoS Reader provide you with a stream of articles from a single publisher. This is obviously a bit limiting for the reader, but might have some advantages if the publisher has specifically enhanced their content for devices such as the iPad.

Personal library
Apps such as Mendeley and Papers provide articles from your personal library. These are papers you care about, and ones you may make active use of.

Social
Social readers such as Flipboard show the power of bringing together in one place content derived from social streams, such as Twitter and Facebook, as well as curated sources and publisher streams. Mendeley and other social bookmarking services (e.g., CiteULike, Connotea) could be used to provide similar social streams of papers for an article viewer. Here the goal is probably to find out what papers people you know find interesting.

Spatial
In an earlier post I used a map to explore papers in my BioStor archive. This would be an obvious thing to add to an iPad app, especially as the iPad knows where you are. Hence, you could imagine browsing papers about areas that are near you, or perhaps by authors near you. This would be useful if, say, you wanted to know about ecological or health studies of the area you live in. If the geographic search was for people rather than papers, you could easily discover what kind of research is published by universities or other research bodies that are near your current location.
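Finding "papers near you" mostly comes down to a distance filter. The sketch below uses the standard haversine formula with a mean Earth radius of 6371 km; it is not how BioStor does anything, and the function names, paper titles, and coordinates are invented. It assumes each paper has already been reduced to a single latitude and longitude, which glosses over the hard part of extracting localities from the text.

```python
import math


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two points in degrees."""
    radius_km = 6371.0  # mean Earth radius
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def papers_near(papers, lat, lon, radius_km):
    """Titles of papers within radius_km of (lat, lon), nearest first.

    papers: list of dicts with "title", "lat" and "lon" keys.
    """
    hits = []
    for paper in papers:
        distance = haversine_km(lat, lon, paper["lat"], paper["lon"])
        if distance <= radius_km:
            hits.append((distance, paper["title"]))
    return [title for _, title in sorted(hits)]


# Invented papers and coordinates.
papers = [
    {"title": "Paper B", "lat": 0.0, "lon": 5.0},
    {"title": "Paper A", "lat": 0.0, "lon": 0.5},
]
print(papers_near(papers, 0.0, 0.0, 100))
```

A device that knows its own location would just supply `lat` and `lon`; searching for people instead of papers would use the same filter over institutional addresses.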

Of course, Earth is not the only thing we can explore spatially. Google Maps can display other bodies in the solar system (e.g., Mars), as well as the night sky. Imagine being interested in astronomy and being able to browse papers about specific planetary or stellar objects. Likewise, genomes can be browsed using Google Maps-inspired browsers (e.g., JBrowse), so we could have an app where you could easily retrieve articles about a particular gene or other region of a genome.

Categories
Another way to browse content is by topic. Classifying knowledge into categories is somewhat fraught, but there are some obvious ways this could be useful. A biologist might want to navigate content by taxonomic group, particularly if they want to browse through the thousands of articles published in a journal such as Zootaxa (hence my experiments on browsing EOL). Of course, a tree is not the only way to navigate hierarchical content. Treemaps are another example, and I've played with various versions in the past (see here and here).

qt.png

I have a love-hate relationship with treemaps, but some of the most interesting work I've seen on treemaps has been motivated by displaying information on small screens, e.g. "Using treemaps to visualize threaded discussion forums on PDAs" (doi:10.1145/1056808.1056915).

Summary
These notes list some of the more obvious ways to browse a collection of articles. It would be fun to explore these (and other approaches) in parallel with thinking about how to display the actual articles. These two issues are related, in the sense that the more metadata we can extract from the articles (such as keywords, taxonomic names and other named entities, geographic localities, etc.) the richer the possibilities for finding our way through those articles.

Tuesday, August 24, 2010

Viewing scientific articles on the iPad: the PLoS Reader

Continuing on from my previous post Viewing scientific articles on the iPad: towards a universal article reader, here are some brief notes on the PLoS iPad app that I've previously been critical of.

There are two key things to note about this app. The first is that it uses the page turning metaphor. The article is displayed as a PDF, a page at a time, and the user swipes the page to turn it over. Hence, the app is simulating paper on the iPad screen.

turn.jpg


But perhaps more interesting is that, unlike the Nature app discussed earlier, the PLoS app doesn't use a custom API to retrieve articles. Instead the app uses RSS feeds from the PLoS site. PLoS provides journal-specific RSS feeds, as well as subject-specific feeds within journals (see, for example, the PLoS ONE home page). The PLoS Reader app takes these feeds and uses them to create a list of articles the reader can choose from.

A nice feature of the PLoS ATOM feeds is the provision of links to alternative formats for the article (unlike many journal RSS feeds, which provide just a DOI or a URL). For example, the feed item for the article "Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" doi:10.1371/journal.pone.0012303 contains links to the PDF and XML versions of the article:


<link rel="related"
      type="application/pdf"
      href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=PDF"
      title="(PDF) Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" />
<link rel="related"
      type="text/xml"
      href="http://www.plosone.org/article/fetchObjectAttachment.action?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=XML"
      title="(XML) Transmission of Single HIV-1 Genomes and Dynamics of Early Immune Escape Revealed by Ultra-Deep Sequencing" />


This makes the task of an article reader much easier. Rather than attempt to screen scrape the article web page, or rely on a rule for constructing the link to the desired file, the feed provides an explicit URL to the different available formats.
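As a rough illustration of how little work this leaves for a reader app, here is a minimal Python sketch that pulls the format links out of a single feed entry. It is not the PLoS Reader's code. The sample entry is cut down from the `<link>` elements shown above, the `example.org` URLs are placeholders, and `alternative_formats` is a made-up helper name.

```python
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"

# A cut-down entry modelled on the <link> elements shown above.
# The example.org URLs are placeholders, not real PLoS addresses.
SAMPLE_ENTRY = """<entry xmlns="http://www.w3.org/2005/Atom">
  <title>Example article</title>
  <link rel="related" type="application/pdf"
        href="http://example.org/fetch?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=PDF"/>
  <link rel="related" type="text/xml"
        href="http://example.org/fetch?uri=info:doi/10.1371/journal.pone.0012303&amp;representation=XML"/>
</entry>"""

def alternative_formats(entry_xml):
    """Map MIME type -> URL for every rel="related" link in one Atom entry."""
    entry = ET.fromstring(entry_xml)
    formats = {}
    for link in entry.findall(ATOM + "link"):
        if link.get("rel") == "related" and link.get("type"):
            formats[link.get("type")] = link.get("href")
    return formats

formats = alternative_formats(SAMPLE_ENTRY)
print(formats.get("application/pdf"))
```

A reader could then pick whichever MIME type it knows how to display, falling back to the article web page if the feed offers nothing suitable.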

I've not seen this feature in other journal RSS feeds, although article web pages sometimes provide this information. BMC journals, for example, provide <link rel="alternate"> tags in the web page for each article, from which we can extract links to the XML and PDF versions, and some journals (BMC included) provide the Google Scholar metadata tag <meta name="citation_pdf_url"> to link to the PDF. Hence, a generic article reader will need to be able to extract metadata tags from article web pages as it seeks formats suitable to display.
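A generic reader could do this with nothing more than the standard library's HTML parser. The sketch below is only an assumption about how one might go about it: the class name `FormatLinkParser` and the sample page are invented for illustration, and real article pages will be messier.

```python
from html.parser import HTMLParser

class FormatLinkParser(HTMLParser):
    """Collect <link rel="alternate"> hrefs and the citation_pdf_url meta tag."""

    def __init__(self):
        super().__init__()
        self.alternates = {}   # MIME type -> href
        self.pdf_url = None

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "link" and a.get("rel") == "alternate" and a.get("href"):
            self.alternates[a.get("type") or ""] = a["href"]
        elif tag == "meta" and a.get("name") == "citation_pdf_url":
            self.pdf_url = a.get("content")

# Invented page head; the URLs are placeholders.
SAMPLE_PAGE = """<html><head>
<link rel="alternate" type="text/xml" href="http://example.org/article/1.xml" />
<link rel="stylesheet" href="/style.css" />
<meta name="citation_pdf_url" content="http://example.org/article/1.pdf" />
</head><body></body></html>"""

parser = FormatLinkParser()
parser.feed(SAMPLE_PAGE)
print(parser.alternates, parser.pdf_url)
```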

Friday, August 06, 2010

Extracting semantic goodness from Zootaxa articles

zootaxa.png

I've just come back from a holiday in New Zealand, during which time I spent a morning chatting with Zhi-Qiang Zhang (@Zootaxa, editor of Zootaxa) and Stephen Thorpe (stho002, a major contributor to Wikispecies).

Fresh from playing with PLoS XML to explore ways of redisplaying articles (described in my commentary on the PLoS iPad app), I was extolling the virtues of the XML mark-up that underlies PLoS (and other Open Access journals, such as the BMC series). These publishers provide Open Access XML versions of their papers that are quite richly marked up: internal citations, links to figures, the bibliography, etc. are all clearly identified, although they don't have the semantic mark-up of TaxPub, used in some recent ZooKeys papers.

Talking to Zhi-Qiang Zhang is always a useful reality check. Zootaxa describes itself as the
World's foremost journal in taxonomy; publisher of 15,421 new taxa in 141,518 pages by 7,385 authors worldwide since 2001

This is taxonomic publishing on a grand scale, averaging more than an article a day. Since 2004 Zootaxa has published 12.60% of the new taxa recorded in Zoological Record, an order of magnitude more than its nearest rival. The journal is being tightly run, and doesn't have cash to spare (it has nothing like the funding PLoS has, for example). Any change to the basic work flow (author submits Word file, this is imported into Adobe FrameMaker, which creates the PDF files displayed on the Zootaxa web site) requires compelling justification. Furthermore, any change would have to scale. The level of work required to embellish articles using custom mark-up, such as TaxPub, just isn't feasible.

Zhi-Qiang waxed enthusiastic about Google Books' interface, where basic information such as keywords, geographic location, and references are extracted automatically. Google Books was one inspiration for the article display I use in BioStor, so I wondered how hard it would be to take some of the work I've been doing on BioStor and on adding mark-up to PLoS XML and apply it to Zootaxa PDFs. After some fussing with regular expressions, the bioGUID OpenURL resolver and uBio's FindIT taxonomic name tool, I've some scripts that automate extracting basic information from a Zootaxa PDF, such as the abstract, localities, taxonomic names, GenBank sequences, and the bibliography. You can see some examples at http://iphylo.org/~rpage/zootaxa/. It's all a bit crude, and isn't the same as being able to mark-up the actual text (which could be done, but with rather more effort), but there's potential here to create nice interfaces to Zootaxa papers, as well as extract the data needed to do some interesting queries.
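To give a flavour of the regular-expression end of this, here is a small Python sketch that picks out strings shaped like GenBank accession numbers from plain text. It is not one of the scripts mentioned above. The pattern is a deliberate simplification (one letter plus five digits, or two letters plus six digits), so it will miss newer accession formats and will happily match unrelated codes of the same shape.

```python
import re

# Simplified pattern for classic GenBank nucleotide accessions:
# one letter + five digits, or two letters + six digits.
# Assumption: false positives (e.g. specimen codes) are filtered elsewhere.
ACCESSION = re.compile(r"\b(?:[A-Z]\d{5}|[A-Z]{2}\d{6})\b")

def find_accessions(text):
    """Return candidate accession numbers in order of first appearance."""
    seen = []
    for match in ACCESSION.findall(text):
        if match not in seen:
            seen.append(match)
    return seen

print(find_accessions("Sequences AY322281 and U12345 were used; AY322281 is new."))
```

In practice each candidate would need checking against GenBank itself before being turned into a link.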



Thursday, June 17, 2010

PLoS doesn't "get" the iPad (or the web)

PLoS recently announced a dedicated iPad app that covers all the PLoS journals and is available from the App Store. Given the statement that "PLoS is committed to continue pushing the boundaries of scientific communication" I was expecting something special. Instead, what we get (as shown in the video below) is a PDF viewer with a nice page turning effect (code here). Maybe it's Steve Jobs' fault for showing iBooks when he first demoed the iPad, but the desire to imitate 3D page turning effects leaves me cold (for a nice discussion of how this can lead to horribly mixed metaphors see iA's Designing for iPad: Reality Check).




But I think this app shows that PLoS really don't grok the iPad. Maybe it's early days, but I find it really disappointing that page-turning PDFs is the first thing they come up with. It's not about recreating the paper experience on a device! There's huge scope for interactivity, which the PLoS app simply ignores: you can't select text, and none of the references are clickable. It also ignores the web (without which, ironically, PLoS couldn't exist).

Instead of just moaning about this, I've spent a couple of days fussing with a simple demo of what could be done. I've taken a PLoS paper ("Discovery of the Largest Orbweaving Spider Species: The Evolution of Gigantism in Nephila", doi:10.1371/journal.pone.0007516), grabbed the XML, applied an XSLT style sheet to generate some HTML, and added a little JavaScript functionality. References are displayed as clickable links inline. If you click on one a window pops up displaying the citation, and it then tries to find it for you online (for the technically minded, it's using OpenURL and bioGUID). If it succeeds it displays a blue arrow; click that and you're off to the publisher's web site to view the article.
reference.png
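For readers unfamiliar with OpenURL, the lookup behind each reference amounts to packing the parsed citation into a query string and handing it to a resolver. The Python sketch below shows the general shape using OpenURL 1.0 key/value pairs. The resolver address and the citation are both made up, and this is not the demo's actual JavaScript.

```python
from urllib.parse import urlencode

# Hypothetical resolver address; substitute whichever OpenURL resolver you use.
RESOLVER = "http://resolver.example.org/openurl"

def openurl_for(citation):
    """Build an OpenURL 1.0 key/value query for a journal article citation."""
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
    }
    # Only include the pieces of the citation we actually managed to parse.
    for key in ("atitle", "jtitle", "volume", "spage", "date"):
        if citation.get(key):
            params["rft." + key] = citation[key]
    return RESOLVER + "?" + urlencode(params)

# Made-up citation, purely for illustration.
example = {"jtitle": "Journal of Examples", "volume": "12", "spage": "34", "date": "2001"}
print(openurl_for(example))
```

The appeal of this style of link is that the page doesn't need to know in advance where the cited article lives; the resolver works that out from the metadata.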

Figures are also links, click on one and you get a Lightbox view of the image.
You can view this article live, in a regular browser or on an iPad. Here's a video of the demonstration page:


This is all very crude and rushed. There's a lot more that could be done. For references we could flag which articles are self citations, we could talk to bookmarking services via their APIs to see which citations the reader already has, etc. We could also make data, sequences, and taxonomic names clickable, providing the reader with more information and avenues for exploration. Then there's the whole issue of figures. For graphs we should have the underlying data so that we can easily make new visualisations, phylogenies should be interactive (at least make the taxon names clickable), and there's no need to amalgamate figures into aggregates like Fig. 2 below. Each element (A-E) should be separately addressable so when the text refers to Fig. 2D we can show the user just that element.

journal.pone.0007516.g002.png

The PLoS app and reactions to Elsevier's "Article 2.0" (e.g., Elsevier's 'Article of the Future' resembles websites of the past and The “Article of the Future” — Just Lipstick Again?) suggest publishers are floundering in their efforts to get to grips with the web, and new platforms for interacting with the web.

So, PLoS, I challenge you to show us that you actually "get" the iPad and what it could mean for science publishing. Because at the moment, I've seen nothing that suggests you grasp the opportunity it represents. Better yet, why not revisit Elsevier's Article 2.0 project and have a challenge specifically about re-imagining the scientific article? And please, no more page turning effects.

Thursday, May 06, 2010

Linnaeus meets the Internet: PLoS + Botany = #fail

To much fanfare (e.g., Nature News, "Linnaeus meets the Internet" doi:10.1038/news.2010.221), on May 5th PLoS ONE published Sandy Knapp's "Four New Vining Species of Solanum (Dulcamaroid Clade) from Montane Habitats in Tropical America" doi:10.1371/journal.pone.0010502. To quote the Nature News piece:
The paper represents the culmination of a campaign to institute the electronic publication of scientific names, a case Knapp and others have made in journals including Nature[doi:10.1038/446261a]. Allowing electronic publication should make accessing information easier for scientists worldwide — especially those in developing countries who may not have access to fully stocked libraries. This, in turn, will aid conservation efforts, Knapp says.

Given the profile of this paper, "...the first time new plant names have been published in a purely electronic journal and still complied with ICBN rules", you'd think the participants would ensure the electronic aspects of the publication worked. Sadly, this is not the case.

The four names in question have apparently been deposited in IPNI with the following LSIDs:

  • Solanum aspersum: urn:lsid:ipni.org:names:77103633-1

  • Solanum luculentum: urn:lsid:ipni.org:names:77103634-1

  • Solanum sanchez-vegae: urn:lsid:ipni.org:names:77103635-1

  • Solanum sousae: urn:lsid:ipni.org:names:77103636-1


Today is May 6th. None of these names are returned by a search of IPNI, for example http://www.ipni.org/ipni/simplePlantNameSearch.do?find_wholeName= returns this:

ipni1.png

Resolving the LSID returns this:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:dcterms="http://purl.org/dc/terms/"
         xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#"
         xmlns:tm="http://rs.tdwg.org/ontology/voc/Team#"
         xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#"
         xmlns:p="http://rs.tdwg.org/ontology/voc/Person#"
         xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">
  <tn:TaxonName rdf:about="urn:lsid:ipni.org:names:77103633-1">
    <tcom:versionedAs rdf:resource="urn:lsid:ipni.org:names:77103633-1:1.2"/>
    <tcom:Deleted>Yes</tcom:Deleted>
  </tn:TaxonName>
</rdf:RDF>

Hmmm, so apparently this record has been "deleted"?
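Spotting this programmatically is straightforward, which is rather the point of serving RDF in the first place. Here is a minimal Python sketch that lists the names flagged as deleted in a response like the one above. The sample is trimmed to the three namespaces the check needs, and `deleted_names` is an invented helper, not part of any IPNI or TDWG tooling.

```python
import xml.etree.ElementTree as ET

NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "tn": "http://rs.tdwg.org/ontology/voc/TaxonName#",
    "tcom": "http://rs.tdwg.org/ontology/voc/Common#",
}

# Trimmed copy of the LSID metadata shown above.
SAMPLE_RDF = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:tn="http://rs.tdwg.org/ontology/voc/TaxonName#"
    xmlns:tcom="http://rs.tdwg.org/ontology/voc/Common#">
  <tn:TaxonName rdf:about="urn:lsid:ipni.org:names:77103633-1">
    <tcom:versionedAs rdf:resource="urn:lsid:ipni.org:names:77103633-1:1.2"/>
    <tcom:Deleted>Yes</tcom:Deleted>
  </tn:TaxonName>
</rdf:RDF>"""

def deleted_names(rdf_xml):
    """Return the LSIDs of TaxonName records carrying <tcom:Deleted>Yes</tcom:Deleted>."""
    root = ET.fromstring(rdf_xml)
    flagged = []
    for name in root.findall("tn:TaxonName", NS):
        flag = name.findtext("tcom:Deleted", default="", namespaces=NS)
        if flag.strip().lower() == "yes":
            flagged.append(name.get("{%s}about" % NS["rdf"]))
    return flagged

print(deleted_names(SAMPLE_RDF))
```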

The paper also states that:
The IPNI LSIDs (Life Science Identifiers) can be resolved and the associated information viewed through any standard web browser by appending the LSID contained in this publication to the prefix http://ipni.org/.

This sentence mirrors similar ones in other PLoS ONE papers saying we can resolve ZooBank LSIDs by appending the LSID to http://zoobank.org (e.g., see doi:10.1371/journal.pone.0001787).

Thing is, URLs such as http://ipni.org/urn:lsid:ipni.org:names:77103633-1 return a 404 from Kew (any IPNI LSID I've tried does this).


Update As per Alan Paton's comment below, the http://ipni.org prefix now works.
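The "append the LSID to a prefix" recipe is simple enough to write down. The Python sketch below splits an LSID into its parts (authority, namespace, object, optional revision) and builds the browser-friendly URL the paper describes. The function names are invented, and whether a given prefix actually resolves is, as noted above, another matter.

```python
from collections import namedtuple

LSID = namedtuple("LSID", "authority namespace object revision")

def parse_lsid(lsid):
    """Split urn:lsid:authority:namespace:object[:revision] into its parts."""
    parts = lsid.split(":")
    if len(parts) not in (5, 6) or [p.lower() for p in parts[:2]] != ["urn", "lsid"]:
        raise ValueError("not an LSID: %r" % lsid)
    revision = parts[5] if len(parts) == 6 else None
    return LSID(parts[2], parts[3], parts[4], revision)

def proxy_url(prefix, lsid):
    """Append the LSID to an HTTP prefix, as the paper suggests."""
    parse_lsid(lsid)  # raises ValueError if the string is malformed
    return prefix.rstrip("/") + "/" + lsid

print(parse_lsid("urn:lsid:ipni.org:names:77103633-1:1.2"))
print(proxy_url("http://ipni.org/", "urn:lsid:ipni.org:names:77103633-1"))
```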


So, to recap:

  1. The names aren't in IPNI

  2. The LSIDs state the record has been deleted

  3. The LSIDs can't be resolved by the means stated in the paper

Now, I don't know what happened (perhaps IPNI wanted to hold off until the paper actually appeared before releasing the names), but the paper is out, the buzz in Nature is out, and IPNI doesn't have the resolver in place, let alone the names.

Given the milestone this paper represents, and the fuss over the publication of the name Darwinius, you'd expect the bioinformatics side of it to be, you know, actually working. In these circumstances, how on Earth do we make the case that the LSID and name databasing side of taxonomic publication is useful?

Wednesday, April 14, 2010

Biodiversity informatics = #fail (and what to do about it)

The context for this post is the PLoS markup meeting held at the California Academy of Sciences over the weekend (many thanks to Brian Fisher for the invitation). PLoS are launching a "biodiversity hub" and were looking for ideas on how to implement this. The fact that nobody -- least of all those attending from PLoS -- could adequately explain what a hub was made things a tad tricky, but that didn't matter, because PLoS did know when the first iteration of the hub was going live (later this summer). So, once we got past the fact that PLoS operates with a timeline that says "cool stuff will happen here" then sets about figuring what that cool stuff will actually be (in retrospect you gotta admire this approach), we then tried to figure out what PLoS needed from us.

That's when things got messy. It became very clear that PLoS wanted basic things like, you know, information on names, being able to link to specimens, etc., and our community can't do this, at least not yet. Nor can we provide simple answers to simple questions. For example, Rich Pyle gave an overview of taxonomic names, nomenclature, concepts, and the horrendous alphabet soup of databases (uBio, ZooBank, IPNI, IndexFungorum, GNA, GNUB, GNI, CoL, etc.) that have a stake in this. You could see the look of horror in the eyes of the PLoS developers who were tasked with making the hub happen ("run away, run away now"). And this was after the simple version of things. In a week where taxonomy was in the news because of the possibility that Drosophila melanogaster would have to, *cough*, change its name (doi:10.1038/464825a)1, this was not a great start.

At each step when we outlined some of the stuff that would be cool, it became clear we couldn't deliver what we were actually arguing PLoS should do. For example, we have millions of digitised specimen records, and lots of papers refer to these specimens by name, but because individual specimens don't have URIs we can't refer to them (instead we have horrific query interfaces like TAPIR, see Accessing specimens using TAPIR or, why do we make this so hard?). We're digitising the taxonomic literature, but don't provide a way to link this to modern literature at the level of granularity publishers use (i.e., articles).

Readers of this blog will have heard this all before, but what made this meeting different was we actually had a "customer" rock up and ask for our help to enhance their content and create something useful for the community...and the best we could do was um and er and confess we couldn't really give them what they wanted2.

Think of the children
It's time biodiversity informatics stopped playing "let's make an acronym", stopped trying to keep taxonomists happy (face it, that's never going to happen, and frankly, they'll be extinct soon anyway), and stopped obsessing with who owns the data, and instead focused on delivering some simple, solid services that address the needs of people who, you know, will actually do something useful with them. Otherwise we'll be like digital librarians, who thought people would search the way librarians do, then got their nose out of joint when Google ate their lunch.

It's time to make some simple services, and stop the endless cycle of inward looking meetings where we talk to each other. We need to learn to hide what people don't need (nor want) to see. We need to be able to:

  1. Extract entities from text, e.g. scientific names, specimen codes, localities, GenBank accession numbers.

  2. Look up a taxonomic name and return basic information about that name (rather like iSpecies but as a service).

  3. Make specimen codes resolvable.

  4. Make taxonomic literature accessible using identifiers and tools publishers know about (that means DOIs and OpenURL).


We're close to a lot of this already, but we're still far enough away to make some of this non-trivial. And we keep having meetings about this stuff, and fail to actually get it done. Something is wrong somewhere when E O Wilson has his name on yet another call for megabucks for a biodiversity project (the "Barometer of Life", doi:10.1126/science.1188606). At what point will someone ask "um, we've given you guys a lot of money already, why can't you tell me the stuff we need to know?"

Let me just say that I'm a short term pessimist, but a long term optimist. The things I complain about will get fixed, one day. It's just that I see little evidence they'll get fixed by us. Prove me wrong, go on, I dare you...

  1. Personally I'm intensely relaxed about Drosophila melanogaster remaining Drosophila melanogaster, even if it ends up in a clade surrounded by flies with other generic names. Having (a) a stable name and (b) knowing where it fits in the tree of life is all we need to do science.

  2. At the meeting I couldn't stop thinking of the scene in The West Wing where President Bartlet walks up to the Capitol for an impromptu meeting with the Speaker of the House to sort out the budget, and is left waiting outside while the Speaker sorts out his game plan. By the time the Speaker is ready, the President has turned on his heel and left, making the Speaker look a tad foolish.


Monday, April 20, 2009

Semantic Publishing: towards real integration by linking

PLoS Computational Biology has recently published "Adventures in Semantic Publishing: Exemplar Semantic Enhancements of a Research Article" (doi:10.1371/journal.pcbi.1000361) by David Shotton and colleagues. As a proof of concept, they took Reis et al. (doi:10.1371/journal.pntd.0000228) and "semantically enhanced" it:
These semantic enhancements include provision of live DOIs and hyperlinks; semantic markup of textual terms, with links to relevant third-party information resources; interactive figures; a re-orderable reference list; a document summary containing a study summary, a tag cloud, and a citation analysis; and two novel types of semantic enrichment: the first, a Supporting Claims Tooltip to permit “Citations in Context”, and the second, Tag Trees that bring together semantically related terms. In addition, we have published downloadable spreadsheets containing data from within tables and figures, have enriched these with provenance information, and have demonstrated various types of data fusion (mashups) with results from other research articles and with Google Maps.
The enhanced article is here: doi:10.1371/journal.pntd.0000228.x001. For background on these enhancements, see also David's companion article "Semantic publishing: the coming revolution in scientific journal publishing" (doi:10.1087/2009202, PDF preprint available here). The process is summarised in the figure below (Fig. 10 from Shotton et al., doi:10.1371/journal.pcbi.1000361.g010).



While there is lots of cool stuff here (see also Elsevier's Article 2.0 Contest, and the Grand Challenge, for which David is one of the judges), I have a couple of reservations.

The unique role of the journal article?

Shotton et al. argue for a clear distinction between journal article and database, in contrast to the view articulated by Philip Bourne (doi:10.1371/journal.pcbi.0010034) that there's really no difference between a database and a journal article and that the two are converging. I tend to favour the latter viewpoint. Indeed, as I argued in my Elsevier Challenge entry (doi:10.1038/npre.2008.2579.1), I think we should publish articles (and indeed data) as wikis, so that we can fix the inevitable errors. We can always roll back to the original version if we want to see the author's original paper.

Real linking

But my real concern is that the example presented is essentially "integration by linking", that is, the semantically enhanced version gives us lots of links to other information, but these are regular hyperlinks to web pages. So, essentially we've gone from pre-web documents with no links, to documents where the bibliography is hyperlinked (most online journals), to documents where both the bibliography and some terms in the text are hyperlinked (a few journals, plus the Shotton et al. example). I'm a tad underwhelmed.
What bothers me about this is:
  1. The links are to web pages, so it will be hard to do computation on these (unless the web page has easily retrievable metadata)
  2. There is no reciprocal linking -- the resource being linked to doesn't know it is the target of the link


Web pages are for humans

The first concern is that the marked-up article is largely intended for human readers. Yes, there are associated metadata files in RDF N3, but the core "added value" is really only of use to humans. For it to be of use to a computer, the links would have to go to a resource that the computer can understand. A human clicking on many of the links will get a web page and they can interpret that, but computers are thick and they need a little help. For example, one hyperlinked term is Leptospira spirochete, linked to the uBio namebank record (click on the link to see it). The link resolves to a web page, so it's not much use to a computer (unless it has a scraper for uBio HTML). Ironically, uBio serves LSIDs, so we could retrieve RDF metadata for this name (urn:lsid:ubio.org:namebank:255659), but there's nothing in the uBio web page that tells the computer that.

Of course, Shotton et al. aren't responsible for the fact that most web pages aren't easily interpreted by computers, but simply embedding links to web pages isn't a big leap forward. What could they have done instead? One approach is to link to resources that are computer-readable. For example, instead of linking the term "Oswaldo Cruz Foundation" to that organisation's home page (http://www.fiocruz.br/cgi/cgilua.exe/sys/start.htm?tpl=home), why not use the DBpedia URI http://dbpedia.org/page/Instituto_Oswaldo_Cruz? Now we get both a human-readable page, and extensive RDF that a computer can use. In other words, if we crawl the semantically enhanced PLoS article with a program, I want to be able to have that crawler follow the links and still get useful information, not the dead end of an HTML web page. Quite a few of the institutions listed in the enhanced paper have DBpedia URIs:


Why does this matter? Well, if you use DBpedia URIs you get RDF, plus you get connections with the Linked Data crowd, who are rapidly linking diverse data sets together:


I think this is where we need to be headed, and with a little extra effort we can get there, once we move on from thinking solely about human readers.
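What "a crawler that still gets useful information" looks like in practice is mostly a matter of asking for RDF instead of HTML. The Python sketch below builds, but does not send, such a request. It assumes DBpedia's usual convention that `/page/` URIs are the HTML view and `/resource/` URIs identify the thing itself and honour content negotiation. Whether this particular resource is still there is not something the sketch checks.

```python
import urllib.request

def resource_uri(page_uri):
    """Turn a DBpedia /page/ (HTML view) URI into the /resource/ URI for the thing itself."""
    return page_uri.replace("://dbpedia.org/page/", "://dbpedia.org/resource/", 1)

def rdf_request(uri, mime="application/rdf+xml"):
    """Build (but do not send) a request that asks for RDF rather than HTML."""
    return urllib.request.Request(uri, headers={"Accept": mime})

req = rdf_request(resource_uri("http://dbpedia.org/page/Instituto_Oswaldo_Cruz"))
print(req.full_url, req.get_header("Accept"))
# To actually fetch: urllib.request.urlopen(req).read()  (needs network access)
```

The same crawler pointed at an ordinary home page would get HTML back whatever it asked for, which is exactly the dead end described above.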

An alternative approach (and one that I played with in my Challenge entry, as well as my ongoing wiki efforts) is to create what Vandervalk et al. term a "semantic warehouse" (doi:10.1093/bib/bbn051). Information about each object of interest is stored locally, so that clicking on a link doesn't take you off-site into the world wide wilderness, but to information about that object. For example, the page for the paper Mitochondrial paraphyly in a polymorphic poison frog species (Dendrobatidae; D. pumilio) lists the papers cited; clicking on one takes you to the page about that paper. There are limitations to this approach as well, but the key thing is that one could imagine doing computations over this (e.g., computing citation counts for DNA sequences, or geospatial queries across papers) that simple HTML hyperlinking won't get you.

Reciprocal links

The other big issue I have with the Shotton et al. "integration by linking" is that it is one-way. The semantically enhanced paper "knows" that it links to, say, the uBio record for Leptospira, but uBio doesn't know this. It would enhance the uBio record if it knew that doi:10.1371/journal.pntd.0000228.x001 linked to it.

Links are inherently reciprocal, in the sense that if paper 1 cites paper 2, then paper 2 is cited by paper 1.

Publishers understand this, and the web page of an article will often show lists of papers that cite the paper being displayed. How do we do this for data and other objects of interest? If we database everything, then it's straightforward. CrossRef is storing citation metadata and offers a "forward linking" service; some publishers (e.g., Elsevier and Highwire) offer their own versions of this. In the same way, this record for GenBank sequence AY322281 "knows" that it is cited by (at least) two papers because I've stored those links in a database. Knowing that you're being linked to dramatically enhances discoverability. If I'm browsing uBio I gain more from the experience if I know that the PLoS paper cites Leptospira.
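When everything lives in one database, "cited by" is just the forward links turned around. A minimal Python sketch, with the link table held in memory: the two papers citing the GenBank sequence are given placeholder identifiers because the real ones aren't listed here.

```python
from collections import defaultdict

def cited_by(links):
    """Invert (source, target) pairs into target -> set of sources."""
    index = defaultdict(set)
    for source, target in links:
        index[target].add(source)
    return index

# (citing object, cited object). "paper:A" and "paper:B" are placeholders.
links = [
    ("doi:10.1371/journal.pntd.0000228.x001", "urn:lsid:ubio.org:namebank:255659"),
    ("paper:A", "genbank:AY322281"),
    ("paper:B", "genbank:AY322281"),
]

index = cited_by(links)
print(sorted(index["genbank:AY322281"]))
```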

Knowing when you're being linked to

If we database everything locally then reciprocal linking is easy. But, realistically, we can't database everything (OK, maybe that's not strictly true, one can think of Google as a database of everything). The enhanced PLoS paper "knows" that it cites the uBio record, how can the uBio record "know" that it has been cited by the PLoS paper? What if the act of linking was reciprocal? How can we achieve this in a distributed world? Some possibilities:
  • we have an explicit API embedded in the link so that uBio can extract the source of the link (could be spoofed, need authentication?)
  • we use OpenURL-style links that embed the PLoS DOI, so that uBio knows the source of the link (OpenURL is a mess, but potentially very powerful)
  • uBio uses the HTTP Referer header to get the source of the link, then parses the PLoS HTML to extract metadata and the DOI (ugly screen scraping, but no work for PLoS)
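The OpenURL-style option can be sketched in a few lines of Python. The assumption here is that the citing paper's DOI travels in the OpenURL `rfe_id` key (the "referring entity"), and that the target simply reads it back off the incoming URL. The target address is a placeholder, both function names are invented, and nothing here addresses spoofing.

```python
from urllib.parse import urlencode, urlparse, parse_qs

def link_with_source(target_url, source_doi):
    """Citing side: append the citing paper's DOI to an outgoing link."""
    sep = "&" if urlparse(target_url).query else "?"
    return target_url + sep + urlencode({"rfe_id": "info:doi/" + source_doi})

def source_of(request_url):
    """Target side: recover the citing DOI from an incoming URL, if present."""
    values = parse_qs(urlparse(request_url).query).get("rfe_id", [])
    for value in values:
        if value.startswith("info:doi/"):
            return value[len("info:doi/"):]
    return None

# Placeholder target address standing in for a name record.
url = link_with_source("http://names.example.org/record/255659",
                       "10.1371/journal.pntd.0000228.x001")
print(url)
print(source_of(url))
```

The target could then store the (source, target) pair and show it on the record, giving it the same "cited by" view a local database would.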

Obviously this needs a little more thought, but I think that real integration by linking requires that the resources being linked are both computer and human readable, and that both resources know about the link. This would create much more powerful "semantically enhanced" publications.