iPhylo: JATS

Roderic D. M. Page

Showing posts with label JATS. Show all posts

Wednesday, December 04, 2013

Towards BioStor articles marked up using Journal Archiving Tag Set

A while ago I posted BHL to PDF workflow which was a sketch of a work flow to generate clean, searchable PDFs from Biodiversity Heritage Library (BHL) content:

I've made some progress on putting this together, as well as expanded the goal somewhat. In fact, there are several goals:

BioStor articles need to be archived somewhere. At the moment they live on my server, and metadata is also served by BHL (as the "parts" you see in a scanned volume). Long term maybe PubMed Central is a possibility (BHL essentially becomes a publisher). Imagine PubMed Central becoming the primary archival repository for biodiversity literature.
BioStor articles could be more useful if the OCR text was cleaned up and marked up (e.g., highlighting taxon names, localities, extracting citations, etc.).
If BioStor articles were marked up to same extent as ZooKeys then we could use tools developed for ZooKeys (see Towards an interactive taxonomic article: displaying an article from ZooKeys) for a richer reading experience.
Cleaned OCR text could also be used to generate searchable PDFs, which are still the most popular way for people to read articles (see Why do scientists tend to prefer PDF documents over HTML when reading scientific journals?). BioStor already generates PDFs, but these are simply made by wrapping page images in a PDF. Searchable PDFs would be much friendlier.

For BioStor articles to be archived in PubMed Central they would need to be marked up using the Journal Archiving and Interchange Tag Suite (formerly the NLM DTDs). This is the markup used by many publishers, and also the tag suite that TaxPub build upon.

The idea of having BioStor marked up in JATS is appealing, but on the face of it impossible because the all we have is page scans and some pretty ropey OCR. But because the NLM has also been heavily involed in scanning the historical literature they are used to dealing with scanned literature, and JATS can accommodate articles ranging from scans to fully marked up text. For example, take a look at the article "Microsporidian encephalitis of farmed Atlantic salmon (Salmo salar) in British Columbia" which is in PubMed Central (PMC1687123). PMC has basic metadata for the article, scans of the pages, and two images extracted from those pages. This is pretty much what BioStor already has (minus the extracted images).

With this in mind, I dusted off some old code, put it together and created an example of the first baby steps towards BioStor and JATS. The code is in github, and there is a live example here.

Jats

The example takes BioStor article 65706, converts the metadata to JATS, links in the page scans, and also extracts images from the page scans based on data in the ABBYY OCR files. I've also generated HTML from the DjVu files, and this HTML includes hOCR tags that embed information about the OCR text. This format can be edited by tools such as (see Jim Garrison's moz-hocr-edit discussed in Correcting OCR using hOCR in Firefox). This HTML can be processed to output a PDF that includes the page scans but also has the OCR text as "hidden text" so the reader can search for phrases, or copy and paste the text (try the PDF for article 65706).

I've put the HTML (and all the XML and images) in github, so one obvious model for working on an article is to put it into a github repository, push any edits made to the repository, then push that to a web server that displays the articles.

There are still a lot of rough edges, and I think we can buld nicer interfaces than moz-hocr-edit (e.g., using the "contenteditable" attribute in the HTML), althogh moz-hocr-edit has the nice feature of being able to save the edits straight back to the HTML file (saving edited HTML to disk is a non-trivial task in most web browsers). I also need to add the code for building the initial JATS file (currently this is hidden on the BioStor server). There are also issues about PDF quality. At the moment I output black and white PNGs, which look nice and clean but can mangle plates and photos. I need to tweak that aspect of the process.

One application of these tools would be to take a single journal and convert all the BioStor articles into JATS, then make it available for people to further clean and markup as needed. There is an extraordinary amount of information locked away in this literature, it would be nice if we made better use of that treasure trove.

Wednesday, July 17, 2013

Augmenting ZooKeys bibliographic data to flesh out the citation graph

In a previous post (Learning from eLife: GitHub as an article repository) I discussed the advantages of an Open Access journal putting its article XML in a version-controlled repository like GitHub. In response to that post Pensoft (the publisher of ZooKeys) did exactly that, and the XML is available at https://github.com/pensoft/ZooKeys-xml.

OK, "now what?" I hear you ask. Originally I'd used the example of incorrect bibliographic data for citations as the motivation, but there are other things we can do as well. For example, when reading a ZooKeys article (say, using my eLife Lens-inspired viewer) I notice references that should have a DOI but which don't. With the XML available I could add this. This adds another link in the citation graph (in this case connecting the ZooKeys paper with the article it cites). If Pensoft were to use that XML to regenerate the HTML version of the article on their web site then the reader will be able to click on the DOI and read the cited article (instead of the "cut-and-paste-and-Google-it" dance). Furthermore, Pensoft could update the metadata they've submitted to CrossRef, so that CrossRef knows that the reference with the newly added DOI has been cited by the ZooKeys paper.

To experiment with this I've written some scripts that take ZooKeys XML, extract each citation from the list of literature cited, and look up DOIs for each reference that lacks them (using the CrossRef metadata search API). If a DOI is found then I insert it into the original XML. I then push this XML to my fork of Pensoft's repository (https://github.com/rdmpage/ZooKeys-xml). I can then ask Pensoft to update their repository (by issuing a "pull request"), and if Pensoft like what they see, they can accept my edits.

Automating the process makes this much more scalable, although manual editing will still be useful in some cases, especially where the original references haven't been correctly atomised into title, journal, etc.

So that the output is visible independently of Pensoft deciding whether to accept it, I've updated my Zookeys article viewer to fetch the XML not from the ZooKeys web site, but from my GitHub repository. This means you get the latest version of the XML, complete with additional DOIs (if any have been added).

Initial experiments are encouraging, but it's also apparent that lots of citations lack DOIs. However, this doesn't mean that they aren't online. Indeed, a growing number of articles are available through my BioStor repository, and through BioNames. Both of these sites have an API, so the next step is to add them to the script that augments the XML. This brings us a little closer to the ultimate goal of having every taxonomic paper online and linked to every paper that either cites, or is cited by, that paper.