iPhylo: Kew

Roderic D. M. Page

Showing posts with label Kew. Show all posts

Wednesday, August 10, 2016

On asking for access to data

In between complaining about the lack of open data in biodiversity (especially taxonomy), and scraping data from various web sites to build stuff I'm interested in, I occasionally end up having interesting conversations with the people whose data I've been scraping, cleaning, cross-linking, and otherwise messing with.

Yesterday I had one of those conversations at Kew Gardens. Kew is a large institution that is adjusting to a reduced budget, a changing ditigal landscape, and a rethinking of it's science priorities. Much of Kew's data has not been easily accessible to the outside world, but this is changing. Part of the reason for this is that Defra, which part-funds Kew, is itself opening up (see Ellen Broad's fascinating post Lasers, hedgehogs and the rise of the Age of Yoghurt: reflections on #OpenDefra).

During this conversation I was asked "Why didn't you just ask for the data instead of scraping it? We would most likely have given it to you." My response to this was "well, you might have said no". In my experience saying "no" is easy because it is almost always the less risky approach. And I want a world where we don't have to ask for data, in the same way that we don't ask to get source code for open source software, and we don't ask to download genomic data from GenBank. We just do it and, hopefully, do cool things with it. Just as importantly, if things don't work out and we fail to make cool things, we haven't wasted time negotiating access for something that ultimately didn't work out. The time I lose is simply the time I've spent playing with the data, not any time negotiating access. The more obstacles you put in front of people playing with your data, the fewer innovative uses of that data you're likely to get.

But it was pointed out to me that a consequence of just going ahead and getting the data anyway is that it doesn't necessarily help people within an organisation make the case for being open. The more requests for access to data that are made, the easier it might be to say "people want this data, lets work to make it open". Put another way, by getting the data I want regardless, I sidestep the challenge of convincing people to open up their data. It solves my problem (I want the data now) but doesn't solve it for the wider community (enabling everyone to have access).

I think this is a fair point, but I'm going to try and wiggle away from it. From a purely selfish perspective, my time is limited, there are only so many things I can do, and making the political case for opening up specific data sets is not something I really want to be doing. In a sense, I'm more interested in what happens when the data is open. In other words, let's assume the battle for open has been won, what do we then? So, I'm essentially operating as if the data is already open because I'm betting that it will be at some point in time.

Without wishing to be too self-serving, I think there are ways that treating closed data as effectively open can help make the case that the data should (genuinely) open. For example, one argument for being open is that people will come along and do cool things with the data. In my case, "cool" means cross linking taxonomic names with the primary literature, eventually to original decsriptions and fundamental data about the organisms tagged with the taxonomic names (you may feel that this stretches the definitoon of "cool" somewhat). But adding value to data is hard work, and takes time (in some cases I've invested years in cleaning and linking the data). The benefits from being open may take time, especially if the data is messy, or relatively niche so that few people are prepared to invest the time necessary to do the work.

Some data, such as the examples given in Lasers, hedgehogs and the rise of the Age of Yoghurt: reflections on #OpenDefra will likely be snapped up and give rise to nice visualisations, but a lot of data won't. So, imagine that you're making the case for data to be open, and one of your arguments is "people will do cool things with it", eventually you win that argument, the data is opened up... and nothing happens. Wouldn't it be better if once the data is open, those of us who have been beavering away with "illicit" copies of the data can come out of the woodwork and say "by the way, here are some cool things we've been doing with that data"? OK, this is a fairly self-serving argument, but my point is that while internal arguments about being open are going on I have three choices:

Wait until you open the data (which stops me doing the work I want to do)
Help make the case for being open (which means I engage in politics, an area in which I have zero aptitude)
Assume you will be open eventually, and do the work I want to do so that when you're open I can share that work with you, and everyone else

Call me selfish, but I choose option 3.

Tuesday, May 10, 2016

State of open knowledge about the World's Plants

A1BHupvR Kew has released a new report today, entitled the State of the World's Plants, complete with it's own web site https://stateoftheworldsplants.com. Its aim:

...by bringing the available information together into one document, we hope to raise the profile of plants among the global community and to highlight not only what we do know about threats, status and uses, but also what we don’t. This will help us to decide where more research effort and policy focus is required to preserve and enhance the essential role of plants in underpinning all aspects of human wellbeing.

This is, of course, a laudable goal, and a lot of work has gone into this report, and yet there are some things about the report that I find very frustrating.

PDF but no ePub It's nice to have an interactive web site as well as a glossy PDF, but why restrict yourself to a PDF? Why not an ePub so people can view it and rescale fonts for their device, etc. Why not provide the original text in a form people can translate? The report states that much of the newly discovered plant biodiversity is found in Brazil and China, why not make it easier to support automatic translation into Portuguese and Chinese?
Why no DOI for the report? If this is such an important document, why doesn't it have a DOI so it can be easily cited?
Why no DOIs for cited literature? The report cites 219 references, very few of them are accompanied by a DOI, yet most of the references have them. Why not include the DOI so readers can click on that and go straight to the literature. Surely you want to encourage readers to engage with the subject by reading more? The whole point of having digital documents online is that they can link to other documents.
No open access taxonomy Sadly the examples of exciting new plant species discovered are all in closed access publications, including The Gilbertiodendron species complex (Leguminosae: Caesalpinioideae), Central Africa DOI:10.1007/s12225-015-9579-4 published in Kew's own journal Kew Bulletin. This article costs $39.95 / €34.95 / £29.95 to read. Why do taxonomists continue to publish their research, often about taxa in the developing world, behind paywalls?
Why is the data not open? Much of the section on "Describing the world’s plants" uses data from Kew's database IPNI. This database is not open, so how does the reader verify the numbers in the report? Or, more importantly, how does the reader explore the data further and ask questions not asked in the report?

These may seem like small issues given the subject of the report (the perilous state of much of the planet's biodiversity), but if we are to take seriously the goal of "help[ing] us to decide where more research effort and policy focus is required to preserve and enhance the essential role of plants in underpinning all aspects of human wellbeing" then I suggest that open access to knowledge about plant diversity is a key part of that goal.

Over a decade ago Tom Moritz wrote of the need for a "biodiversity commons": DOI:10.1045/june2002-moritz

Provision of free, universal access to biodiversity information is a practical imperative for the international conservation community — this goal should be accomplished by promotion of the Public Domain and by development of a sustainable Biodiversity Information Commons adapting emergent legal and technical mechanisms to provide a free, secure and persistent environment for access to and use of biodiversity information and data. - "Building the Biodiversity Commons" DOI:10.1045/june2002-moritz

The report itself alludes to the importance of "opening up of global datasets with long-time series (such as maps of forest loss)", and yet botany has been slow to do this for much of its data (see Why are botanists locking away their data in JSTOR Plant Science?). We need data on plant taxonomy, systematics, traits, sequences, and distribution to be open and freely available to all, not closed behind paywalls or limited access APIs. Indeed, Donat Agosti has equated copyright to biopiracy (Biodiversity data are out of local taxonomists' reach DOI:10.1038/439392a.

It would be nice to think that Kew, as well as leading the way in summarising the state of the world's plants, would also be leading the way in making that knowledge about those plants open to all.

Wednesday, December 29, 2010

The Plant List: nice data, shame it's not open

The Plant List (http://www.theplantlist.org/) has been released today, complete with glowing press releases. The list includes some 1,040,426 names. I eagerly looked for the Download button, but none is to be found. You can grab download individual search results (say, at family level), but not the whole data set.

OK, so that makes getting the complete data set a little tedious (there are 620 plant families in the data set), but we can still do it without too much hassle (in fact, I've grabbed the complete data set while writing this blog post). Then I see that the data is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs (CC BY-NC-ND) license. Creative Commons is good, right? In this case, not so much. The CC BY-NC-ND license includes the clause:

You may not alter, transform, or build upon this work.

So, you can look but not touch. You can't take this data (properly attributed, or course) and build your own list, for example with references linked to DOIs, or to the Biodiversity Heritage Library (which is, of course, exactly what I plan to do). That's a derivative work, and the creators of the Plant List don't want you to do that. Despite this, the Plant List want us to use the data:

Use of the content (such as the classification, synonymised species checklist, and scientific names) for publications and databases by individuals and organizations for not-for-profit usage is encouraged, on condition that full and precise credit is given to The Plant List and the conditions of the Creative Commons Licence are observed.

Great, but you've pretty much killed that by using BY-NC-ND. Then there's this:

If you wish to use the content on a public portal or webpage you are required to contact The Plant List editors at editors@theplantlist.org to request written permission and to ensure that credits are properly made.

Really? The whole point of Creative Commons is that the permissions are explicit in the license. So, actually I don't need your permission to use the data on a public portal, CC BY-NC-ND gives me permission (but with the crippling limitation that I can't make a derivative work).

So, instead of writing a post congratulating the Royal Botanic Gardens, Kew and Missouri Botanical Garden (MOBOT) for releasing this data, I'm left spluttering in disbelief that they would hamstring its use through such a poor choice of license. Kew and MOBOT could have made the Plant List available as open data using one of the licenses listed on the Open Definition web site, such as putting the data in the public domain (for example, or using a Creative Commons CC0 license). Instead, they've chosen a restrictive license which makes the data closed, effectively killing the possibility for people to build upon the effort they've put into creating the list. Why do biodiversity data providers seem determined to cling to data for dear life, rather than open it up and let people realise its potential?