iPhylo: ElasticSearch

Roderic D. M. Page

Showing posts with label ElasticSearch. Show all posts

Thursday, July 22, 2021

Towards a WikiCite search engine

I've released a simple search engine for publications in Wikidata. Wikicite Search takes its name from the WikiCite project, which was an initiative to create a bibliographic database in Wikidata. Since bibliographic data is a core component of taxonomic research (arguably taxonomy is mostly tracing the fate of the "tags" we call taxonomic names) I've spent some time getting taxonomic literature into Wikidata. Since there are bots already adding articles by harvesting sources such as CrossRef and PubMed, I've focussed on literature that is harder to add, such as articles with non-CrossRef DOIs, or those without DOIs at all.

Once you have a big database, you are then faced with the challenge of finding things in that database. Wikidata supports generic search, but I wanted something more closely geared to bibliographic data. Hence Wikicite Search. Over the last few years I've made several attempts at a bibliographic search engine, for this project I've finally settled on some basic ideas:

The core data structure is CSL-JSON, a simple but rich JSON format for expressing bibliographic data.
The search engine is Elasticsearch. The documents I upload include the CSL-JSON for an article, but also a simple text representation of the article. This text representation may include multiple languages if, for example, the article has a title in more than one language. This means that if an article has both English and Chinese titles you can find it searching in either language.
The web interface is very simple: search for a term, get results. If the search term is a Wikidata identifier you get just the corresponding article, e.g. Q98715368.
There is a reconciliation API to help match articles to the database. Paste in one citation per line and you get back matches (if found) for each citation.
Where possible I display a link to a PDF of the article, which is typically stored in the Internet Archive or accessible via the Wayback Machine.

There are millions of publications in Wikidata, currently less than half a million are in my search engine. My focus is narrowly on eukaryote taxonomy and related topics. I will be adding more articles as time permits. I also periodically reload existing articles to capture updates to the metadata made by the Wikidata community - being a wiki the data in Wikidata is constantly evolving.

My goal is to have a simple search tool that focusses on matching citation strings. In other words, it is designed to find a reference you are looking for, rather than be a tool to search the taxonomic literature. If that sounds contradictory, consider that my tool will only find a paper about a taxon if it is explicitly named in the title. A more sophisticated search engine would support things like synonym resolution, etc.

The other reason I built this is to provide an API for accessing Wikidata items and displaying them in other formats. For example, an article in the WikiCite search engine can be retrieved in CSL-JSON format, or in RDF as JSON-LD.

As always, it's very early days. But I don't think it's unreasonable to imagine that as Wikidata grows we could envisage having a search engine that includes the bulk of the taxonomic literature.

Friday, March 24, 2017

This is what phylodiversity looks like

Following on from earlier posts exploring how to map DNA barcodes and putting barcodes into GBIF it's time to think about taking advantage of what makes barcodes different from typical occurrence data. At present GBIF displays data as dots on a map (as do I in http://iphylo.org/~rpage/bold-map/). But barcodes come with a lot more information than that. I'm interested in exploring how we might measure and visualise biodiversity using just sequences.

Based on a talk by Zachary Tong (Going Organic - Genomic sequencing in Elasticsearch) I've started to play with n-gram searches on DNA barcodes using Elasticsearch, an open source search engine. The idea is that we break the DNA sequence into every possible "word" of length n (also called a k-mer or k-tuple, where k = n).

For example, for n = 5, the sequence GTATCGGTAACGAACTT would look like this:

GTATCGGTAACGAACTT

GTATC
 TATCG
  ATCGG
   TCGGT
    CGGTA
     GGTAA
      GTAAC
       TAACG
        AACGAA
         ACGAAC
          CGAACT
           GAACTT

The sequence GTATCGGTAACGAACTT comes from Hajibabaei and Singer (2009) who discussed "Googling" DNA sequences using search engines (see also Kuksa and Pavlovic, 2009). If we index sequences in this way then we can do BLAST-like searches very quickly using Elasticsearch. This means it's feasible to take a DNA barcode and ask "what sequences look like this?" and return an answer qucikly enoigh for a user not to get bored waiting.

Another nice feature of Elasticsearch is that it supports geospatial queries, so we can ask for, say, all the barcodes in a particular region. Having got such a list, what we really want is not a list of sequences but a phylogenetic tree. Traditionally this can be a time consuming operation, we have to take the sequences, align them, then input that alignment into a tree building algorithm. Or do we?

There's growing interest in "alignment-free" phylogenetics, a phrase I'd heard but not really followed up. Yang and Zhang (2008) described an approach where every sequences is encoded as a vector of all possible k-tuples. For DNA sequences k = 5 there are 4⁵ = 1024 possible combinations of the bases A, C, G, and T, so a sequence is represented as a vector with 1024 elements, each one is the frequency of the corresponding 5-tuple. The "distance" between two sequences is the mathematical distance between these vectors for the two sequences. Hence we no longer need to align the sequences being comapred, we simply chunk them into all "words" of 5 bases in length, and compare the frequencies of the 1024 different possible "words".

In their study Yang and Zhang (2008) found that:

We compared tuples of diﬀerent sizes and found that tuple size 5 combines both performance speed and accuracy; tuples of shorter lengths contain less information and include more randomness; tuples of longer lengths contain more information and less random- ness, but the vector size expands exponentially and gets too large and computationally ineﬃcient.

So we can use the same word size for both Elasticsearch indexing and for computing the distance matrix. We still need to create a tree, for which we could use something quick like neighbour-joining (NJ). This method is sufficiently quick to be available in Javascript and hence can be computed by a web browser (e.g., biosustain/neighbor-joining).

Putting this all together, I've built a rough-and-ready demo that takes some DNA barcodes, puts them on a map, then enables you to draw a box on a map and the demo will retrieve the DNA barcodes in that area, compute a distance matrix using 5-tuples, then build a NJ tree, all on the fly in your web browser.

Phylodiversity on the fly from Roderic Page on Vimeo.

This is all very crude, and I need to explore scalability (at the moment I limit the results to the first 200 DNA sequences found), but it's encouraging. I like the idea that, in principle, we could go to any part of the globe, ask "what's there?" and get back a phylogenetic tree for the DNA barcodes in that area.

This also means that we could start exploring phylogenetic diversity using DNA barcodes, as Faith & Baker (2006) wanted a decade ago:

...PD has been advocated as a way to make the best-possible use of the wealth of new data expected from large-scale DNA “barcoding” programs. This prospect raises interesting bio-informatics issues (discussed below), including how to link multiple sources of evidence for phylogenetic inference, and how to create a web-based linking of PD assessments to the barcode–of-life database (BoLD).

The phylogenetic diversity of an area is essentially the length of the tree of DNA barcodes, so if we build a tree we have a measure of diversity. Note that this contrasts with other approaches, such as Miraldo et al.'s "An Anthropocene map of genetic diversity" which measured genetic diversity within species but not between (!).

Practical issues

There are a bunch of practical issues to work through, such as how scalable it is to compute phylogenies using Javascript on the fly. For example, could we do something like generate a one degree by one degree grid of the Earth, take all the barcodes in each cell and compute a phylogeny for each cell? Could we do this in CouchDB? What about sampling, should we be taking a finite, random sample of sequences so that we try and avoid sampling bias?

There are also data management issues. I'm exploring downloading DNA barcodes, creating a Darwin Core Archive file using the Global Genome Biodiversity Network (GGBN) data standard, then converting the Darwin Core Archive into JSON and sending that to Elasticsearch. The reason for the intermediate step of creating the archive is so that we can edit the data, add missing geospatial informations, etc. I envisage having a set of archives, hosted say on GitHub. These archives could also be directly imported into GBIF, ready for the time that GBIF can handle genomic data.

References

Faith, D. P., & Baker, A. M. (2006). Phylogenetic diversity (PD) and biodiversity conservation: some bioinformatics challenges. Evol Bioinform Online. 2006; 2: 121–128. PMC2674678
Hajibabaei, M., & Singer, G. A. (2009). Googling DNA sequences on the World Wide Web. BMC Bioinformatics. Springer Nature. https://doi.org/10.1186/1471-2105-10-s14-s4
Kuksa, P., & Pavlovic, V. (2009). Efficient alignment-free DNA barcode analytics. BMC Bioinformatics. Springer Nature. https://doi.org/10.1186/1471-2105-10-s14-s9
Miraldo, A., Li, S., Borregaard, M. K., Florez-Rodriguez, A., Gopalakrishnan, S., Rizvanovic, M., … Nogues-Bravo, D. (2016, September 29). An Anthropocene map of genetic diversity. Science. American Association for the Advancement of Science (AAAS). https://doi.org/10.1126/science.aaf4381
Yang, K., & Zhang, L. (2008, January 10). Performance comparison between k-tuple distance and four model-based distances in phylogenetic tree reconstruction. Nucleic Acids Research. Oxford University Press (OUP). https://doi.org/10.1093/nar/gkn075

Thursday, December 17, 2015

Will JSON, NoSQL, and graph databases save the Semantic Web?

OK, so the title is pure click bait, but here's the thing. It seems to me that the Semantic Web as classically conceived (RDF/XML, SPARQL, triple stores) has had relatively little impact outside academia, whereas other technologies such as JSON, NoSQL (e.g., MongoDB, CouchDB) and graph databases (e.g., Neo4J) have got a lot of developer mindshare.

In biodiversity informatics the Semantic Web has been a round for a while. We've been pumping out millions of RDF documents (mostly served by LSIDs) since 2005 and, to a first approximation, nothing has happened. I've repeatedly blogged about why I think this is (see this post for a summary).

I was an early fan of RDF and the Semantic Web, but soon decided that it was far more hassle than it was worth. The obsession with ontologies, the problems of globally unique identifiers based on HTTP (http-14 range, anyone?), the need to get a lot of ducks in a row all mad it a colossal pain. Then I discovered the NoSQL document database CouchDB, which is a JSON store that features map-reduce views rather than on the fly queries. To somebody with a relational database background this is a bit of a headfuck:

But CouchDB has a great interface, can be replicated to the cloud, and is FUN (how many times can you say that about a database?). So I starting playing with CouchDB for small projects, then used it to build BioNames and more recently moved BioStor to CouchDB hosted both locally and in the cloud.

Then there are graph databases such as Neo4J, which has some really cool things such as GraphGists which is a playground where you can create interactive graphs and query them (here's an example I created). Once again, this is FUN.

Another big trend over the last decade is the flight from XML and its hideous complexities (albeit coupled with great power) to the simplicity of JSON (part of the rise of JavaScript). JSON makes it very easy to pass around data in simple key-value documents (with more complexity such as lists if you need them). Pretty much any modern API will serve you data in JSON.

So, what happened to RDF? Well, along with a plethora of formats (XML, triples, quads, etc., etc.) it adopted JSON in the form of JSON-LD (see JSON-LD and Why I Hate the Semantic Web for background). JSON-LD lets you have data in JSON (which both people and machines find easy to understand) and all the complexity/clarity of having the data clearly labelled using controlled vocabularies such as Dublin Core and schema.org. This complexity is shunted off into a "@context" variable where it can in many cases be safely ignored.

But what I find really interesting is that instead of JSON-LD being a way to get developers interested in the rest of the Semantic Web stack (e.g. HTTP URIs as identifiers, SPARQL, and triple stores), it seems that what it is really going to do is enable well-described structured to get access to all the cool things being developed around JSON. For example, we have document databases such as CouchDB which speaks HTTP and JSON, and search servers such as ElasticSearch which make it easy to work with large datasets. There are also some cool things happening with graph databases and Javascript, such as Hexastore (see also Weiss, C., Karras, P., & Bernstein, A. (2008, August 1). Hexastore. Proc. VLDB Endow. VLDB Endowment. http://doi.org/10.14778/1453856.1453965, PDF here) where we create the six possible indexes of the classic RDF [subject,predicate,object] triple (this is the sort of thing can also be done in CouchDB). Hence we can have graph databases implemented in a web browser!

So, when we see large-scale "Semantic Web" applications that actually exist and solve real problems, we may well be more likely to see technologies other than the classic Semantic Web stack. As an example, see the following paper:

Szekely, P., Knoblock, C. A., Slepicka, J., Philpot, A., Singh, A., Yin, C., … Ferreira, L. (2015). Building and Using a Knowledge Graph to Combat Human Trafficking. The Semantic Web - ISWC 2015. Springer Science + Business Media. http://doi.org/10.1007/978-3-319-25010-6_12

There's a free PDF here, and a talk online. The consortium behind this project researchers did extensive text mining, data cleaning and linking, creating a massive collection of JSON-LD documents. Rather than use a triple store and SPARQL, they indexed the JSON-LD using ElasticSearch (notice that they generated graphs for each of the entities they care about, in a sense denormalising the data).

I think this is likely to be the direction many large-scale projects are going to be going. Use the Semantic Web ideas of explicit vocabularies with HTTP URIs for definitions, encode the data in JSON-LD so it's readable by developers (no developers, no projects), then use some of the powerful (and fun) technologies that have been developed around semi-structured data. And if you have JSON-LD, then you get SEO for free by embedding that JSON-LD in your web pages.

In summary, if biodiversity informatics wants to play with the Semantic Web/linked data then it seems obvious that some combination of JSON-LD with NoSQL, graph databases, and search tools like ElasticSearch are the way to go.