iPhylo: glitch

Roderic D. M. Page

Thursday, July 05, 2018

GBIF at 1 billion - what's next?

How to cite: Page, R. (2018). GBIF at 1 billion - what's next? https://doi.org/10.59350/d8dwz-3v524

GBIF has reached 1 billion occurrences which is, of course, something to celebrate:

#GBIF1billion has arrived! Merci beaucoup, @Le_Museum @INPN_MNHN et @gbiffrance!

Thanks and congratulations, too, to the 1,217 data publishers and 92 participants who make the GBIF network go! More details to follow Thursday (champagne doesn't drink itself)… pic.twitter.com/xQ2f5fIt2x
— GBIF (@GBIF) July 4, 2018

An achievement on this scale represents a lot of work by many people over many years, years spent developing simple standards for sharing data, agreeing that sharing is a good thing in the first place, building tools to enable sharing, and creating a place to aggregate all that shared data (GBIF).

So, I asked a question:

So I guess the real #GBIF1billion question is what can we do with a billion data points that we couldn't do with, say, a hundred million? Does more data simply mean more of same kind of analyses, or does it enable something new (and exciting)? @GBIF
— Roderic Page (@rdmpage) July 4, 2018

My point is not to do this:

Hey, don't spoil the party!
— Dimitri Brosens (@Dimibro) July 4, 2018

Rather it is to encourage a discussion about what happens when we have large amounts of biodiversity data. Is it the case that as we add data we simply enable more of the same kind of science, only better (e.g., more data for species distribution modelling), or do we reach a point where new things become possible?

To give a concrete example, consider iNaturalist. This started out as a Master's project to collect photos of organisms on Flickr. As you add more images you get better coverage of biodiversity, but you still have essentially a bunch of pictures. But once you have LOTS of pictures, and those are labelled with species names, you reach the point where it is possible to do something much more exciting: automatic species identification. To illustrate, I recently took the photos below:

Note the reddish tubular growths on the leaves. I asked iNaturalist to identify these photos and within a few seconds it came back with Eriophyes tiliae, the Red Nail Gall Mite. This feels like magic. It doesn't rely on complicated analysis of the image (as many earlier efforts at automated identification have done); it simply "knows" that images that look like this are typically of the galls of this mite because it has seen many such images before. (Another example of the impact of big data is Google Translate, initially based on parsing lots of examples of the same text in multiple languages.)

Okay, but then not sure I see what you're looking for. Why would 1 billion, as opposed to, say, 100 million, mean a paradigm shift? Do you have any (even hypothetical) answers to suggest yourself?
— Leif Schulman (@Leif_Sch) July 5, 2018

The "1 billion" number is not, by itself, meaningful. Rather, I hope that while we're popping the champagne and celebrating a welcome, if somewhat arbitrary, milestone, someone, somewhere is thinking about whether biodiversity data on this scale enables something new.

Do I have answers? Not really, but here's one fairly small-scale example. One of the big challenges facing GBIF is getting georeferenced data. We spend a lot of time using a variety of tools and databases to convert text descriptions of collection localities into latitude and longitude. Many of these descriptions include phrases such as "5 mi NW of", and so we've developed parsers to attempt to make sense of these. All of these phrases and the corresponding latitude and longitude coordinates have ended up in GBIF. This raises the possibility that, after a point, pretty much any locality phrase will be in GBIF, so a way to georeference a locality is simply to search GBIF for that locality and use the associated latitude and longitude. GBIF itself becomes the single best tool to georeference specimen data. To explore this idea I've built a simple tool on Glitch (https://lyrical-money.glitch.me) that takes a locality description and geocodes it using GBIF.
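To give a feel for what parsing a phrase like "5 mi NW of" involves, here is a minimal sketch in Python. It is not any of the parsers mentioned above: the function name `parse_offset`, the handful of units and compass points it recognises, and the example place name are all assumptions made for illustration.

```python
import re

# Compass points mapped to bearings in degrees clockwise from north.
BEARINGS = {
    "N": 0.0, "NE": 45.0, "E": 90.0, "SE": 135.0,
    "S": 180.0, "SW": 225.0, "W": 270.0, "NW": 315.0,
}

# Conversion factors to kilometres for the units this sketch recognises.
UNIT_TO_KM = {"mi": 1.609344, "km": 1.0}

# Matches phrases such as "5 mi NW of Flagstaff" or "2.5 km S of the bridge".
# Two-letter compass points are listed first so "NW" is not read as "N".
OFFSET_PATTERN = re.compile(
    r"(?P<distance>\d+(?:\.\d+)?)\s*"
    r"(?P<unit>mi|km)\.?\s+"
    r"(?P<direction>NE|NW|SE|SW|N|E|S|W)\.?\s+"
    r"of\s+(?P<place>.+)",
    re.IGNORECASE,
)


def parse_offset(locality):
    """Return distance (km), bearing (degrees) and place name, or None.

    Hypothetical helper: real locality strings are far messier than this.
    """
    match = OFFSET_PATTERN.search(locality)
    if match is None:
        return None
    unit = match.group("unit").lower()
    direction = match.group("direction").upper()
    return {
        "distance_km": float(match.group("distance")) * UNIT_TO_KM[unit],
        "bearing_deg": BEARINGS[direction],
        "place": match.group("place").strip(),
    }
```

Calling `parse_offset("5 mi NW of Flagstaff")` gives a distance in kilometres, a bearing of 315 degrees and the place name; a string with no offset phrase returns `None`. Turning that into a latitude and longitude still needs coordinates for the named place, which is the hard part.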

You paste in a locality string and it attempts to find that locality on a map based on data in GBIF. This could be automated, so you could imagine being able to georeference whole collections as part of the process of uploading the data to GBIF. Yes, the devil is in the details, and we'd need ways to flag errors or doubtful records, but the scale of GBIF starts to open up possibilities like this.
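As a rough idea of how such a lookup might work, here is a minimal Python sketch that queries the GBIF occurrence search API for records matching a locality string and summarises their coordinates. This is not the code behind the Glitch tool. The function names are made up for illustration, and taking the median of the returned coordinates is my own assumption about how to collapse many records into one point.

```python
import json
from statistics import median
from urllib.parse import urlencode
from urllib.request import urlopen

GBIF_SEARCH = "https://api.gbif.org/v1/occurrence/search"


def build_query(locality, limit=50):
    """Build a search URL for georeferenced occurrences with this locality."""
    params = {"locality": locality, "hasCoordinate": "true", "limit": limit}
    return GBIF_SEARCH + "?" + urlencode(params)


def summarise(records):
    """Collapse occurrence records into one point (assumption: use the median)."""
    points = [
        (r["decimalLatitude"], r["decimalLongitude"])
        for r in records
        if r.get("decimalLatitude") is not None
        and r.get("decimalLongitude") is not None
    ]
    if not points:
        return None
    return {
        "latitude": median(p[0] for p in points),
        "longitude": median(p[1] for p in points),
        "n": len(points),  # how many records support this point
    }


def geocode(locality):
    """Fetch matching records from GBIF and summarise them (needs network)."""
    with urlopen(build_query(locality), timeout=30) as response:
        payload = json.load(response)
    return summarise(payload.get("results", []))
```

The count `n` is a crude handle on confidence: a point backed by one record deserves more suspicion than one backed by fifty that agree. A real tool would also want to look at how widely the matching coordinates are scattered before trusting the result.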

So, my question is, "what's next?".

Wednesday, May 31, 2017

Programming with Glitch: microservices and serverless computing

Yes, this post is indeed an attempt to fit as many buzzwords that I don't really understand as possible into the title. I've been playing around with Glitch, which is a delightful project from Fog Creek (makers of Trello and co-creators of Stack Overflow).

At first glance Glitch looks weirdly retro, and it took a little while for me to get the hang of things. But it's fun and very powerful. Basically it's a place where you can start creating web apps in your browser, and each app is automatically hosted online. If you see an app that you like you can see the source code (just like you can see HTML using "view source" in your browser). If you want to hack on the code you can simply create a copy and it's yours to play with (this is called "remixing", like forking on GitHub). Your copy gets a cute name (possibly annoyingly cute) and away you go.

If you're a developer, then at this point you're probably wondering what is actually happening under the hood. Each Glitch app is a node.js app, which means you're programming in Javascript (you can just use HTML and client side Javascript if you want to avoid node.js). I'm very new to node.js, so Glitch has been a fun way to experiment.

There are two things which make Glitch very powerful. The first is the "remix" feature. Don't know where to start? Find an app that looks like it might do something you want to do, remix it, and hack away. The code is edited online, and the editor works very well. It also checks your code for Javascript errors as you type, which is helpful (usually).

The second great feature is that you get built-in hosting for free. As soon as you remix an app you have a functioning web site. Remixing is very like forking in GitHub, and if you're running node.js on your local machine then the benefits of Glitch might not seem obvious. But hosting is often a pain: either you need to set up your own servers, or use a hosting service. Glitch takes care of this for you, so your app is instantly available for others to use.

So, what can you do with Glitch? There are some great examples on the Glitch site, but I want to show an almost trivial example. I've created an app called "enchanting-bongo" (https://enchanting-bongo.glitch.me; yes, the name is a bit irritating) that does one simple thing. You give it a DOI for an article and enchanting-bongo tells you whether any of the authors of that work have an ORCID. For example, try the DOI 10.3897/zookeys.555.6173. Why did I write this? I'm interested in ways to link people to the work that they've done, especially work that ends up being aggregated in large-scale biodiversity databases like GBIF (see Possible project: #itaxonomist, combining taxonomic names, DOIs, and ORCID to measure taxonomic impact).

The app does one thing. It takes the DOI and calls the ORCID API to see if anyone has claimed authorship of the paper with that DOI. You can use the app with a web browser, or you can use an HTTP client and call the API (e.g., https://enchanting-bongo.glitch.me/search?q=10.3897%2Fzookeys.555.6173).
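For anyone who wants to call the API from a script rather than a browser, here is a minimal Python sketch of a client. The app itself is written in node.js; Python is used here only for illustration. The function names are hypothetical, and because the response format isn't described above, the sketch simply returns whatever text the service sends back.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

SERVICE = "https://enchanting-bongo.glitch.me/search"


def search_url(doi):
    """Build the API URL, percent-encoding the "/" in the DOI."""
    return SERVICE + "?" + urlencode({"q": doi})


def fetch(doi):
    """Call the service and return the raw response text (needs network).

    Assumption: the service is reachable and replies with UTF-8 text.
    """
    with urlopen(search_url(doi), timeout=30) as response:
        return response.read().decode("utf-8")
```

Note that the slash in the DOI has to be encoded as `%2F`, which `urlencode` does for you; `search_url("10.3897/zookeys.555.6173")` reproduces the example URL above.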

Glitch is an example of serverless computing, where you don't have to worry about physical servers or the software infrastructure that runs on them (e.g., the web server itself); you just write code. Like any buzzword, it has attracted some pushback (see, for example, "What Is “Serverless”? An Alternative Take"), but for a fascinating essay I recommend "Why the fuss about serverless?". Still, the notion that I can simply hack away on some code and have an instantly available web app is very attractive.

The other buzzword is "microservices". I'm forever needing to do tasks such as find a DOI for a paper, match a "microcitation" to the enclosing article, locate a specimen in GBIF based on a catalogue number in a paper, or parse some text into structured data, such as a reference, geographic coordinates, etc. These are tools that I need in lots of contexts, and I've written software to do this on my machine, often as part of larger projects. "Microservices" is the idea that instead of large, monolithic apps we write a series of minimal tools that typically do one thing, and do it well. We then chain them together to do various tasks. Having small tools means that we can treat each problem independently, and if the tools communicate over the web (HTTP) then it doesn't matter what programming language we use. I've started thinking more and more about adopting this model and developing a bunch of small services to perform many of the tasks I need. Hosting these services then becomes an issue: I have web servers in my office but they are a pain to maintain (my university is forever insisting that I upgrade their software), so cloud-based hosting seems the obvious way forward. Free hosting looks ideal, so Glitch is looking very attractive.
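To make "one thing, done well, over HTTP" concrete, here is a minimal sketch of such a service in Python, using only the standard library. It is not one of my actual tools and it is not how Glitch apps are written (those are node.js). The task (pulling a DOI out of a string of text), the regular expression, and the `q` parameter name are all assumptions chosen for illustration.

```python
import json
import re
from urllib.parse import parse_qs

# A deliberately simple pattern for DOIs; real DOIs can be messier than this.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")


def extract_doi(text):
    """Return the first DOI-like string in the text, or None."""
    match = DOI_PATTERN.search(text)
    if match is None:
        return None
    # Drop punctuation that often trails a DOI at the end of a sentence.
    return match.group(0).rstrip(".,;)")


def app(environ, start_response):
    """WSGI app: respond to ?q=some+text with JSON containing any DOI found."""
    query = parse_qs(environ.get("QUERY_STRING", ""))
    text = query.get("q", [""])[0]
    body = json.dumps({"q": text, "doi": extract_doi(text)}).encode("utf-8")
    start_response("200 OK", [
        ("Content-Type", "application/json"),
        ("Content-Length", str(len(body))),
    ])
    return [body]

# To try it locally, hand `app` to any WSGI server, for example:
#   from wsgiref.simple_server import make_server
#   make_server("", 8000, app).serve_forever()
```

Because the service speaks plain HTTP and JSON, whatever calls it next in the chain can be written in any language, which is the whole point.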

So, I'm hoping to experiment more with this approach. One thing I might do is create a series of services very like enchanting-bongo, each with a simple web interface and an API that the web interface calls. That way users can play with a service in their web browser, then call it via the API if it does something useful. As a more sophisticated example of a service, I'm working on tools to parse Wikispecies reference strings, and link specimen codes to records in GBIF.

One reason I'm enthusiastic about Glitch is that it is fun! Some of the best shifts in technology that I've made have been because a tool made something easy and fun to do. For example, CouchDB made working with structured data fun, and that was a revelation (databases, fun? Surely not). Fun is a much-neglected characteristic of the tools we use.