iPhylo: URI

Roderic D. M. Page


Thursday, September 08, 2022

Local global identifiers for decentralised wikis

I've been thinking a bit about how one could use a Markdown wiki-like tool such as Obsidian to work with taxonomic data (see earlier posts Obsidian, markdown, and taxonomic trees and Personal knowledge graphs: Obsidian, Roam, Wikidata, and Xanadu).

One "gotcha" would be how to name pages. If we treat the database as entirely local, then the page names don't matter, but what if we envisage sharing the database, or merging it with others (for example, if we divided a taxon up into chunks, and different people worked on those different chunks)?

This is the attraction of globally unique identifiers. You and I can independently work on the same thing, such as data linked to a scientific paper, safe in the knowledge that if we both use the DOI for that paper we can easily combine what we've done. But global identifiers can also be a pain, especially if we need to use a service to look them up ("is there a DOI for this paper?", "what is the LSID for this taxonomic name?").

Life would be easier if we could generate identifiers "locally", but had some assurance that they would be globally unique, and that anyone else generating an identifier for the same thing would arrive at the same identifier (this eliminates things such as UUIDs, which are intentionally designed to prevent people generating the same identifier). One approach is "content addressing" (see, e.g., Principles of Content Addressing, a dead link but available in the Wayback Machine; see also btrask/stronglink). For example, we can generate a cryptographic hash of a file (such as a PDF) and use that as the identifier.
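As a rough sketch of what content addressing looks like in practice, the following Python computes a hash of a file's bytes. The function name, the chunked reading, and the default choice of SHA-1 are assumptions made for illustration; any cryptographic hash would serve, provided everyone agrees on the same one.

```python
import hashlib
from pathlib import Path


def content_identifier(path, algorithm="sha1", chunk_size=65536):
    """Return a hex digest of the file's bytes, usable as a content address.

    The algorithm name is passed to hashlib.new(); "sha1" is only an
    illustrative default, not something the linked sources prescribe.
    """
    digest = hashlib.new(algorithm)
    with Path(path).open("rb") as handle:
        # Read in chunks so that a large PDF need not fit in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Two people holding byte-identical copies of a file will get the same string, but a PDF that has been re-saved or watermarked will hash differently, so this identifies the file rather than the work.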

Now the problem is that we have globally unique, but ugly and unfriendly identifiers (such as "6c98136eba9084ea9a5fc0b7693fed8648014505"). What we need are nice, easy to use identifiers we can use as page names. Wikispecies serves as a possible role model, where taxon names serve as page names, as do simplified citations (e.g., authors and years). This model runs into the problem that taxon names aren't unique, nor are author + year combinations. In Wikispecies this is resolved by having a centralised database where it's first come, first served. If there is a name clash you have to create a new name for your page. This works, but what if you have multiple databases run by different people? How do we ensure the identifiers are the same?

Then I remembered Roger Hyam's flight of fantasy over a decade ago: SpeciesIndex.org – an impractical, practical solution. He proposed the following rules to generate a unique URI for a taxonomic name:

The URI must start with "http://speciesindex.org" followed by one or more of the following separated by slashes.
First word of name. Must only contain letters. Must not be the same as one of the names of the nomenclatural codes (icbn or iczn). Optional but highly recommended.
Second word of name. Must only contain letters and not be a nomenclatural code name. Optional.
Third word of name. Must only contain letters and not be a nomenclatural code name. Optional.
Year of publication. Must be an integer greater than 1650 and equal to or less than the current year. If this is an ICZN name then this should be the year the species (epithet) was published as is commonly cited after the name. If this is an ICBN name at species or below then it is the date of the combination. Optional. Recommended for zoological names if known. Not recommended for botanical names unless there is a known problem with homonyms in use by non-taxonomists.
Nomenclatural code governing the name of the taxon. Currently this must be either 'icbn' or 'iczn'. This may be omitted if the code is unknown or not relevant. Other codes may be added to this list.
Qualifier. This must be a Version 4 RFC-4122 UUID. Optional. Used to generate a new independent identifier for a taxon for which the conventional name is unknown or does not exist or to indicate a particular taxon concept that bears the embedded name.
The whole speciesindex.org URI string should be considered case sensitive. Everything should be lower case apart from the first letter of words that are specified as having upper case in their relevant codes e.g. names at and above the rank of genus.
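The rules above are simple enough to sketch in a few lines of Python. This is only my reading of them, not code from Roger's proposal: the function name and parameters are made up, the base is taken to be http://speciesindex.org, and capitalisation is left to the caller rather than checked.

```python
import datetime
import uuid

BASE = "http://speciesindex.org"
CODES = {"icbn", "iczn"}  # the two codes named in the rules


def species_index_uri(words=(), year=None, code=None, qualifier=None):
    """Assemble a URI from optional name words, year, code and UUID qualifier."""
    if len(words) > 3:
        raise ValueError("at most three name words are allowed")
    parts = []
    for word in words:
        if not word.isalpha():
            raise ValueError(f"name word must contain only letters: {word!r}")
        if word.lower() in CODES:
            raise ValueError(f"name word clashes with a code name: {word!r}")
        parts.append(word)
    if year is not None:
        # Greater than 1650 and no later than the current year.
        if not 1650 < year <= datetime.date.today().year:
            raise ValueError(f"year out of range: {year}")
        parts.append(str(year))
    if code is not None:
        if code not in CODES:
            raise ValueError(f"unknown nomenclatural code: {code!r}")
        parts.append(code)
    if qualifier is not None:
        parsed = uuid.UUID(str(qualifier))
        if parsed.version != 4:
            raise ValueError("qualifier must be a version 4 UUID")
        parts.append(str(parsed))
    if not parts:
        raise ValueError("at least one component is required")
    return BASE + "/" + "/".join(parts)
```

For example, species_index_uri(["Abronia"], code="iczn") gives "http://speciesindex.org/Abronia/iczn", which separates the zoological Abronia from the botanical one without needing a lookup service.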

Roger is basically arguing that while names aren't unique (i.e., we have homonyms such as Abronia) they are pretty close to being so, and with a few tweaks we can come up with a unique representation. Another way to think about this is that if we had a database of all taxonomic names, we could construct a trie and for each name find the shortest set of name parts (genus, species, etc.), year, and code that gave us a unique string for that name. In many cases the species name may be all we need; in other cases we may need to add year and/or nomenclatural code to arrive at a unique string.
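To make the trie idea a little more concrete, here is a minimal Python sketch. It is a simplification under stated assumptions: each record is a made-up (name, year, code) tuple, parts are only ever added in that fixed order, and instead of building an explicit trie it counts how many records share each prefix (which is the same information a trie node's visit count would hold).

```python
from collections import Counter


def shortest_unique_keys(records):
    """Map each (name, year, code) tuple to its shortest unique prefix."""
    # Counting prefixes is equivalent to storing visit counts on trie nodes.
    prefix_counts = Counter(
        record[:depth]
        for record in records
        for depth in range(1, len(record) + 1)
    )
    keys = {}
    for record in records:
        for depth in range(1, len(record) + 1):
            prefix = record[:depth]
            if prefix_counts[prefix] == 1:
                # Spaces inside the name become slashes, as in the URI scheme.
                keys[record] = "/".join(prefix).replace(" ", "/")
                break
        else:
            # Identical records cannot be separated by these parts alone.
            keys[record] = None
    return keys


# Placeholder names, not real taxa.
records = [
    ("Aus bus", "1900", "iczn"),
    ("Aus bus", "1950", "iczn"),
    ("Aus cus", "1900", "iczn"),
    ("Xus", "1800", "icbn"),
    ("Xus", "1800", "iczn"),
]
keys = shortest_unique_keys(records)
```

Here "Aus cus" needs nothing beyond the name, the two "Aus bus" homonyms need the year, and the two "Xus" names need the code as well. The catch, of course, is that the answer depends on having the complete list of names to hand.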

What about bibliographic references? Well, many of us will have databases (e.g., Endnote, Mendeley, Zotero, etc.) which generate "cite keys". These are typically short, memorable identifiers for a reference that are unique within that database. There is an interesting discussion on the JabRef forum regarding a "Universal Citekey Generator", and source code is available at cparnot/universal-citekey-js. I've yet to explore this in detail, but it looks a promising way to generate unique identifiers from basic metadata (echoes of more elaborate schemes such as SICIs). For example,

Senna AR, Guedes UN, Andrade LF, Pereira-Filho GH. 2021. A new species of amphipod Pariphinotus Kunkel, 1910 (Amphipoda: Phliantidae) from Southwestern Atlantic. Zool Stud 60:57. doi:10.6620/ZS.2021.60-57.

becomes "Senna:2021ck". So if two people have the same core metadata for a paper they can generate the same key.
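The general idea can be illustrated with a toy scheme. To be clear, this is not the Universal Citekey algorithm (I haven't checked how that derives its suffix), and it will not reproduce the key above; the function name and the hash-of-title approach are invented purely to show how a short, deterministic suffix can fall out of core metadata.

```python
import hashlib
import re
import string


def sketch_citekey(first_author_surname, year, title):
    """Build a 'Surname:YYYYxx' key; a hypothetical scheme for illustration."""
    # Normalise the title so that case and punctuation differences
    # between two people's records do not change the key.
    normalised = re.sub(r"[^a-z0-9]+", " ", title.lower()).strip()
    digest = hashlib.sha1(normalised.encode("utf-8")).digest()
    # Map the first two bytes of the hash onto two lowercase letters.
    letters = string.ascii_lowercase
    suffix = letters[digest[0] % 26] + letters[digest[1] % 26]
    return f"{first_author_surname}:{year}{suffix}"
```

Two letters give only 676 possible suffixes per author and year, so clashes are unlikely rather than impossible, which is the trade-off for keeping the key short.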

Hence it seems that with a few conventions (and maybe some simple tools to support them) we could have decentralised wiki-like tools that used the same identifiers for the same things, and yet those identifiers were short and human-friendly.

Wednesday, June 02, 2010

TreeBASE II RDF

One of the potentially powerful features of TreeBASE II is the availability of an RDF version of a study. This means that, in principle, one could take the RDF for a TreeBASE study, combine it with RDF from other sources, and generate a richer view of a particular study. For example, if a TreeBASE study has a DOI, then we could link it to bibliographic details for the study, and through them to other information, such as GenBank sequences, specimens, etc. (see my little linked data browser for an example of some of this linking). If we added a phylogeny viewer, then we'd have a great tool for browsing the basic components of a phylogenetic study.

Unfortunately, we're not there yet. I've been trying to make sense of TreeBASE II RDF, and frankly, it's a mess. Here are some of the problems:

TreeBASE URIs aren't linked data compliant
The canonical URI for a study (e.g., http://purl.org/phylo/treebase/phylows/study/TB2:S10423) doesn't conform to the linked data approach. In fact, the URI crashes the linked data validator, so I tried another test.


curl --include \
   --header "Accept: application/rdf+xml" \
   http://purl.org/phylo/treebase/phylows/study/TB2:S10423

To be a valid linked data resource this request should return a 303 HTTP status code. Instead we get a 302 and some HTML. Linked data clients won't be able to extract information from this URI.
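The same check can be scripted. The Python below is a minimal sketch, with function names of my own choosing: it uses http.client, which does not follow redirects, so we see the first response exactly as a linked data client would. The test for "303 plus a Location header" is just the criterion stated above, not a full validator.

```python
import http.client
from urllib.parse import urlsplit


def first_response(uri, accept="application/rdf+xml"):
    """Return (status, Content-Type, Location) without following redirects."""
    parts = urlsplit(uri)
    connection_class = (
        http.client.HTTPSConnection
        if parts.scheme == "https"
        else http.client.HTTPConnection
    )
    connection = connection_class(parts.netloc, timeout=10)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        connection.request("GET", path, headers={"Accept": accept})
        response = connection.getresponse()
        return (
            response.status,
            response.getheader("Content-Type"),
            response.getheader("Location"),
        )
    finally:
        connection.close()


def is_303_redirect(status, location):
    """True if the response is a 303 See Other pointing somewhere."""
    return status == 303 and bool(location)
```

Calling first_response on the study URI and passing the status and Location to is_303_redirect reproduces the curl test; a 302 with HTML fails it.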

SKOS matching
There are some odd things going on in the RDF. It contains statements of the form:


<rdf:Description rdf:ID="otu1789319">
   <skos:closeMatch rdf:resource="http://purl.uniprot.org/taxonomy/76066.rdf"/>
</rdf:Description>

(I've tidied this up a little from the original, rather verbose RDF.) This asserts that the TreeBASE OTU otu1789319 corresponds to the NCBI taxon with the taxonomy id 76066 (represented by the Uniprot URI). Except, it doesn't really. As far as I understand it, SKOS is about matching concepts, not documents. The URI http://purl.uniprot.org/taxonomy/76066.rdf is a document URI (specifically, an RDF document), whereas the URI http://purl.uniprot.org/taxonomy/76066 is the taxon. The match should really be to http://purl.uniprot.org/taxonomy/76066. Then I've come across statements that match TreeBASE OTUs to http://purl.uniprot.org/taxonomy/0.rdf. This URI doesn't exist (we get a 404). This seems an odd way to say that we don't have a match -- if we don't have a match, don't include it in the RDF.
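Until the RDF is fixed, a client could patch things up itself. The following Python is a small sketch of that clean-up, with a function name I've made up: it strips the ".rdf" suffix to get from the document URI to the taxon URI, and treats the taxonomy id 0 as "no match".

```python
def concept_uri(match_uri):
    """Return the taxon URI for a Uniprot taxonomy match, or None for id 0."""
    suffix = ".rdf"
    # Drop the document suffix so the URI names the taxon, not the RDF file.
    uri = match_uri[: -len(suffix)] if match_uri.endswith(suffix) else match_uri
    taxon_id = uri.rstrip("/").rsplit("/", 1)[-1]
    if taxon_id == "0":
        # A placeholder rather than a real taxon: omit the match entirely.
        return None
    return uri
```

So "http://purl.uniprot.org/taxonomy/76066.rdf" becomes "http://purl.uniprot.org/taxonomy/76066", and the bogus "0.rdf" match yields None and can be dropped.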

Local URIs for trees don't work
The RDF is full of local URIs such as http://purl.org/phylo/treebase/phylows/#tree1790755, which don't resolve. In fact they generate a rather spectacular Tomcat exception. I don't understand why we need local URIs. Everything in TreeBASE should have a global URI. Then we can avoid unnecessary statements such as:

http://purl.org/phylo/treebase/phylows/#tree1790755 owl:sameAs http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899

which links a local resource to a global one http://purl.org/phylo/treebase/phylows/tree/TB2:Tr7899. Incidentally, this URI doesn't resolve, despite claims that this bug has been fixed.

No links between tree and study
But the show stopper for me is that there is no link between a study and a tree! There is no triple in the RDF specifying any relationship between these two entities. To me this is just about the most important thing I need. I want to be able to query TreeBASE RDF using a study identifier (either from TreeBASE itself, or from an external identifier such as a DOI or a PubMed number). As it stands the TreeBASE II RDF is almost useless. I can't get it via a linked data client, it's full of URIs that don't resolve, and it lacks key triples that would glue things together.

RDF != XML

I can't help thinking that the RDF output hasn't been designed with end use in mind. I know from my own experience that it's not until you try to do something with the RDF that you realise how poor some design decisions may have been.

It's not enough to pump out RDF and hope for the best. RDF is not XML, which is just a verbose format for moving data around. RDF brings with it all sorts of expectations about how clients will resolve it, how they will interpret URIs, and the kinds of queries that will be performed. We are achingly close to being able to tie everything together, but not with the RDF TreeBASE II is currently making available.

Thursday, May 29, 2008

When DOIs collide and then disappear: when is a unique, resolvable identifier a bad idea?

As much as I like the idea of a globally unique, resolvable identifier, my recent experience with JSTOR is making me wonder.

JSTOR has three identifiers for the articles it archives: DOIs, SICIs, and stable URLs (the latter being introduced with the new platform released April 4, 2008). Previously JSTOR would publish DOIs for many of its articles. However, not all of these work, and many are now embedded in the HTML (say, in Dublin Core meta elements) but not publicly displayed.

I suspect the issue is the moving wall:

Journals in JSTOR have "moving walls" that define the time lag between the most current issue published and the content available in JSTOR. The majority of journals in the archive have moving walls of between 3 and 5 years, but publishers may elect walls anywhere from zero to 10 years.

Now, imagine that a publisher has an article on its web site, complete with a DOI, and that article is then added to JSTOR, but is still displayed on the publisher's site.

To make this concrete, consider the article by Baum et al. On the InformaWorld site this is displayed with doi:10.1080/106351598260879. The same article is also in JSTOR, with the URL http://www.jstor.org/pss/2585367. No DOI is displayed on the page, but if you look at the HTML source, we find:
<meta name="dc.Identifier" scheme="doi" content="10.2307/2585367">. The DOI prefix 10.2307 is used for all JSTOR DOIs, and some for Systematic Biology still work, e.g. 10.2307/2413524.
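Digging the hidden DOI out of the page source is easy to automate. Here is a minimal Python sketch using the standard library's HTML parser; the class and function names are mine, and it assumes the DOI always sits in a meta element with name "dc.Identifier" and scheme "doi", as in the example above.

```python
from html.parser import HTMLParser


class DoiMetaParser(HTMLParser):
    """Collect DOIs from <meta name="dc.Identifier" scheme="doi"> elements."""

    def __init__(self):
        super().__init__()
        self.dois = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attributes = {name: (value or "") for name, value in attrs}
        if (
            attributes.get("name", "").lower() == "dc.identifier"
            and attributes.get("scheme", "").lower() == "doi"
        ):
            self.dois.append(attributes.get("content", ""))


def embedded_dois(html):
    """Return every DOI found in Dublin Core meta elements of the HTML."""
    parser = DoiMetaParser()
    parser.feed(html)
    parser.close()
    return parser.dois
```

Fed the meta element shown above, embedded_dois returns ["10.2307/2585367"]. Whether that DOI still resolves is, of course, a separate question.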

Now, what happens when the JSTOR moving wall overlaps with a publisher's material? What happens if a publisher digitises back issues, then assigns them DOIs? Do the JSTOR DOIs then die (as some of them seem to have already done)? And what happens to the poor sap like me, who has been linking to JSTOR DOIs in the naive belief that DOIs don't die?

Suddenly separating identity from resolution is starting to look very attractive...

Tuesday, July 17, 2007

LSID wars

Well, the LSID discussion has just exploded in the last few weeks. I touched on this in my earlier post Rethinking LSIDs versus HTTP URI, but since then the TDWG discussion has become more vigorous (albeit mainly focussed on technical issues, although I suspect that these are symptoms of a larger problem), while the public-semweb-lifesci@w3.org list for July has mushroomed into a near slugfest of discussion about URIs, LSIDs, OWL, the goals of the Semantic Web, etc. There are also blog posts, such as Benjamin Good's The main problem with LSIDs, Mark Wilkinson's numerous posts on his blog, and Pierre Lindenbaum's commentary.

I have no comment to make on this; I'm merely bookmarking these for when I find the time to wade through it all...