Showing posts with label XSLT. Show all posts
Showing posts with label XSLT. Show all posts

Monday, August 10, 2015

Demo of full-text indexing of BHL using CouchDB hosted by Cloudant

One of the limitations of the Biodiversity Heritage Library (BHL) is that, unlike say Google Books, its search functions are limited to searching metadata (e.g., book and article titles) and taxonomic names. It doesn't support full-text search, by which I mean you can't just type in the name of a locality, specimen code, or a phrase and expect to get back much in the way of results. In fact, in many cases when I Google a phrase that occurs in BHL content I'm more likely to find that phrase in content from the Internet Archive, and then it's a matter of following the links to the equivalent item in BHL.

So, as an experiment I've created a live demo of what full-text search in BHL could look like. I've done this using the same infrastructure the new BioStor is built on, namely CouchDB hosted by Cloudant. Using BHL's API I've grabbed some volumes of the British Ornithological Club's Bulletin and put them into CouchDB (BHL's API serves up JSON, so this is pretty straightforward to do). I've added the OCR text for each page, and asked Cloudant to index that. This means that we can now search on a phrase in BHL (in the British Ornithological Club's Bulletin) and get a result.

I've made a quick and dirty demo of this approach and you can see it in the "Labs" section on BioStor, so you can try it here. You should see something like this:

Bhl couchdb The page image only appears if you click on the blue labels for the page. None of this is robust or optimised, but it is a workable proof-of-concept of how fill-text search could work.

What could we do with this? Well, all sorts of searches are no possible. We can search for museum specimen codes, such as 1900.2.27.13. This specimen is in GBIF (see http://bionames.org/~rpage/material-examined/www/?code=BMNH%201900.2.27.13) so we could imagine starting to link specimens to the scientific literature about that specimen. We can also search for locations (such as Mt. Albert Edward), or common names (such as crocodile).

Note that I've not completed uploading all the page text and XML. Once I do I'll have a better idea of how scalable this approach is. But the idea of having full-text search across all of BHL (or, at least the core taxonomic journals) is tantalising.

Technical details

Initially I simply displayed a list of the pages that matched the search term, together with a fragment of text with the search term highlighted. Cloudant's version of CouchDB provides these highlights, and a "group_field" that enabled me to group together pages from the same BHL "item" (roughly corresponding to a volume of a journal).

This was a nice start, but I really wanted to display the hits on the actual BHL page. To do this I grabbed the DjVu XML for each BHL page for British Ornithological Club's Bulletin, and used a XSLT style-sheet that renders the OCR text on top of the page image. You can't see the text because it I set the colour of the text to "rgba(0, 0, 0, 0)" (see http://stackoverflow.com/a/10835846) and set the "overflow" style to "hidden". But the text is there, which means you can select with the mouse and copy and paste it. This still leaves the problem of highlighting the text that matches the search term. I originally wrote the code for this to handle species names, which comprise two words. So, each DIV in the HTML has a "data-one-word" and "data-two-words" attribute set, which contains the first (and forst plus second) word in the search term, respectively. I then use a JQuery selector to set the CSS of each DIV that has a "data-one-word" or "data-two-words" attribute that matches the search term(s). Obviously, this is terribly crude, and doesn't do well if you've more than two word sin your search query.

As an added feature, I use CSS to convert the BHL page scan to a black-and-white image (works in Webkit-based browsers).

Monday, December 19, 2011

Towards an interactive taxonomic article: displaying an article from ZooKeys

One of the things I keep revisiting is the way we display scientific articles. Apart from Nature's excellent iPhone and iPad apps, most efforts to re-imagine how we display articles are little more than glorified PDF viewers (e.g., the PLoS iPad app).

Part of the challenge is that if we make the article more interactive we immediately confront the problem of how to link to other content. For example, we may have a lovingly crafted ePub view (e.g., Nature's apps), but what happens when the user clicks on a citation to another paper? If the paper is published by the same journal, then potentially it could be viewed using the same viewer, but if not then we are at the mercy of the other publisher. They will have their own ideas of how to display articles, so the simplest fallback is to display the cited article in a web browser view. The problem with this is that it breaks the user experience - the other publisher is unlikely to follow the same conventions for displaying an article and its links. If we are lucky the cited article might be published in an Open Access journal that provides, say, XML based on the NLM DTD standard. Knowing whether an article is Open Access or not is not straightforward, and different journals have their own unique interpretation of the NLM standard.

Then there is the issue of other kinds of content, such as taxonomic names, specimens, DNA sequences, geographic localities, etc. We lack decent services for many of these objects, as a result efforts like PLoS Biodiversity Hub end up being underwhelming collections of reformatted journal articles, rather then innovative integrations of biodiversity knowledge.

With these issues in mind I've started playing with ZooKeys XML, initially looking at ways to display the article beyond the conventional format. Ultimately I'd like to embed the article in a broader web of citations and data. ZooKeys articles are available in PDF, HTML, and XML. The HTML has links to taxon pages, maps, etc., which is nice, but I personally find this a little jarring because it interrupts the reading experience. The ZooKeys web site also surrounds the article with all paraphernalia of a publisher's web site:

Zookeys
As a first experiment, I've taken the XML for article At the lower size limit for tetrapods, two new species of the miniaturized frog genus Paedophryne (Anura, Microhylidae) http://dx.doi.org/10.3897/zookeys.154.1963 and used a XSLT style sheet to reformat the article. I've borrowed some ideas from Nature's apps, such as the font for the title, displaying the abstract in bold, and showing all the figures in the article as thumbnails near the top. I've also added some basic interactivity, which you can see in the video below. Instead of figures being in one place in the article, wherever a figure is mentioned in the article (e.g., "Fig. 1") if you click on the reference to the figure it appears. If the article display a point locality using latitude and longitude, instead of launching a separate browser window with a Google map, click on the locality and the map appears. The idea is that the flow of reading isn't interrupted, figures, maps, and citations all appear in the text.


This demo (which you can see live at http://iphylo.org/~rpage/zookeys) is limited, but most of its functionality comes from simply reformatting XML using XSLT. There's a little bit of jQuery for animation, and I ended up having to write a PHP script to convert verbatim latitude and longitude coordinates to the decimal coordinates expected by Google Maps, but it's all very light weight. It wouldn't take much to add some JSON queries to make the taxon names clickable (e.g., showing a summary of a taxon from EOL). Because ZooKeys uses the NLM DTD for its XML, some of this code could also be applied to other journals, such as PLoS, so we could start to grow a library of linked, interactive taxonomic articles.

Wednesday, March 24, 2010

DjVu XML to HTML

This post is simply a quick note on some experiments with DjVu that I haven't finished. Much of BHL's content is available as DjVu files, which contain both the scanned images and OCR text, complete with co-ordinates of each piece of text. This means that it would, in principle, be trivial to lay out the bounding boxes of each text element on a web page. Reasons for doing this include:

  1. To support Chris Freeland's Holy Grail of Digital Legacy Taxonomic Literature, where user can select text overlaid on BHL scan image.

  2. Developing a DjVu viewer along the lines of Google's very clever Javascript-based PDF viewer (see How does the Google Docs PDF viewer work?).

  3. Highlighting search results on a BHL page image (by highlighting the boxes containing terms the user was searching for).


As an example, here is a BHL page image:



and here's the bounding boxes of the text recognised by OCR overlain on the page image:





























































































































































































































































and here's the bounding boxes of the text recognised by OCR without the page image:































































































































































































































































The HTML is generated using a XSL transformation that take two parameters, an image name and a scale factor, where 1.0 generates HTML at the same size as the original image (which may be rather large). The view above were generated with a scale of 0.1. The XSL is here:


<?xml version='1.0' encoding='utf-8'?>
<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">

<xsl:output method="html" version="1.0" encoding="utf-8" indent="yes"/>

<xsl:param name="scale"/>
<xsl:param name="image"/>

<xsl:template match="/">
<xsl:apply-templates select="//OBJECT"/>
</xsl:template>

<xsl:template match="//OBJECT">
<div>
<xsl:attribute name="style">
<xsl:variable name="height" select="@height"/>
<xsl:variable name="width" select="@width"/>
<xsl:text>position:relative;</xsl:text>
<xsl:text>border:1px solid rgb(128,128,128);</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="$width * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="$height * $scale"/>
<xsl:text>px;</xsl:text>
</xsl:attribute>

<img>
<xsl:attribute name="src">
<xsl:value-of select="$image"/>
</xsl:attribute>
<xsl:attribute name="style">
<xsl:variable name="height" select="@height"/>
<xsl:variable name="width" select="@width"/>
<xsl:text>margin:0px;padding:0px;</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="$width * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="$height * $scale"/>
<xsl:text>px;</xsl:text>
</xsl:attribute>
</img>

<xsl:apply-templates select="//WORD"/>

</div>
</xsl:template>

<xsl:template match="//WORD">
<div>
<xsl:attribute name="style">
<xsl:text>position:absolute;</xsl:text>
<xsl:text>border:1px solid rgb(128,128,128);</xsl:text>
<xsl:variable name="coords" select="@coords"/>
<xsl:variable name="minx" select="substring-before($coords,',')"/>
<xsl:variable name="afterminx" select="substring-after($coords,',')"/>
<xsl:variable name="maxy" select="substring-before($afterminx,',')"/>
<xsl:variable name="aftermaxy" select="substring-after($afterminx,',')"/>
<xsl:variable name="maxx" select="substring-before($aftermaxy,',')"/>
<xsl:variable name="aftermaxx" select="substring-after($aftermaxy,',')"/>
<xsl:variable name="miny" select="substring-after($aftermaxy,',')"/>

<xsl:text>left:</xsl:text>
<xsl:value-of select="$minx * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>width:</xsl:text>
<xsl:value-of select="($maxx - $minx) * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>top:</xsl:text>
<xsl:value-of select="$miny * $scale"/>
<xsl:text>px;</xsl:text>
<xsl:text>height:</xsl:text>
<xsl:value-of select="($maxy - $miny) * $scale"/>
<xsl:text>px;</xsl:text>

</xsl:attribute>

<!-- actual text -->
<!-- <xsl:value-of select="." /> -->
</div>
</xsl:template>

</xsl:stylesheet>


Sunday, March 09, 2008

CrossRef adds more information to OpenURL resolver

Tom Pasley recently drew my attention to CrossRef's addition of a XML format parameter to their OpenURL resolver. Adding &format=xml to the OpenURL request retrieves bibliographic metadata in "unixref" format (for those who like this sort of thing, the XML schema is here). The biggest change is now the metadata lists more than one author for multi-author papers.

I tend to use JSON for my work now, so a common task is to convert XML data streams into JSON. I've modified my bioGUID OpenURL resolver to make use of the unixref format, which meant I had to write a XSLT file to convert unixref to JSON. If you're interested, you can grab a copy here. It's not pretty, but it seems to work OK.

For some years now I've relied on Marc Liyanage's excellent tool TestXSLT to develop XSLT files. If you have a Mac and work with XSLT, then I do yourself a favour and grab a copy of this free tool.