
Friday, November 20, 2009

Putting Linked Data Boilerplate in a Box

Humans have always been digital creatures, and not just because we have fingers. We like to put things in boxes, in clearly defined categories. Our brains so dislike ambiguity that when musical tones are too close in pitch, the dissonance almost hurts.

The aesthetics of technical design frequently ask us to separate one thing from another. It's often said that software should separate code from content and that web-page mark-up should separate presentation from content. XML allows us to separate element content from attribute data; well designed XML schemas make clear and consistent decisions about what should go where.

In ontology design, the study of description logics has given us boxes for two types of information, which have been not-so-helpfully named the "A-Box" and the "T-Box". The T-Box is for terminology and the A-Box is for assertions. When you're designing an ontology, an important decision is how much information should be built into your terminology and how much should be left for users of the terminology to assert.

It's not always easy to decide where to draw the terminology vs. assertion line. For example, if you're building a dog ontology, you might want to have a BlackDog class for dogs that are black. Users of your ontology could then make the single assertion that Fido is a BlackDog, saving them the trouble of making the pair of assertions that Fido is a Dog and that Fido is colored black. The audience, on the other hand, would have to understand the added terminology to be able to understand what you've said. In the first case, the binding of color to dogs is done in the T-Box; in the second, in the A-Box. The choice boils down to whether users would rather have a concise assertion box paired with a complex terminology box, or a verbose assertion box paired with a simple terminology box.
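To make the two designs concrete, here's a rough sketch in RDF/XML (the example.org namespace and all the class and property names are invented for illustration). The first block is the T-Box flavor, in which BlackDog is defined once in terms of Dog and a color restriction; the last two blocks show the corresponding A-Box choices for describing Fido:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:ex="http://example.org/dogs#">

  <!-- T-Box choice: BlackDog is defined, once, as a Dog whose color is black -->
  <owl:Class rdf:about="http://example.org/dogs#BlackDog">
    <owl:intersectionOf rdf:parseType="Collection">
      <owl:Class rdf:about="http://example.org/dogs#Dog"/>
      <owl:Restriction>
        <owl:onProperty rdf:resource="http://example.org/dogs#hasColor"/>
        <owl:hasValue>black</owl:hasValue>
      </owl:Restriction>
    </owl:intersectionOf>
  </owl:Class>

  <!-- A-Box, concise version: one assertion, but readers must know what BlackDog means -->
  <ex:BlackDog rdf:about="http://example.org/dogs#Fido"/>

  <!-- A-Box, verbose version: two assertions, but only the simple terminology is needed -->
  <ex:Dog rdf:about="http://example.org/dogs#Fido">
    <ex:hasColor>black</ex:hasColor>
  </ex:Dog>

</rdf:RDF>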

Although I designed my first RDF Schema over ten years ago, I had not had a chance to try out OWL for ontology design. Since OWL 2 has just become a W3C Recommendation, I figured it was about time for me to dive in. I was also curious to find out what kind of ontology designs are preferred for linked data deployment, and I'd never even heard of description logic boxes.

Since I gave the New York Times an unfairly hard time for the mistakes it made in its initial Linked Data release, I felt somewhat obligated to do what I could to participate helpfully in their Linked Open Data Community. (Good stuff is going on there- if you're interested, go have a look!) The licensing and attribution metadata in the Times' Linked Data struck me as highly repetitive, and I wondered if this boilerplate metadata could be cleaned up by moving it into an OWL ontology. It could; if you're interested in details, go to the Times Data Community site and see.

It's not obvious which box this boilerplate information should be in. It's really context information, or assertions about other assertions. The Times wants people to know that it has licensed the data under a Creative Commons license, and that it wants attribution. If it's really the same set of assertions for everything the Times wants to express (i.e. it's boilerplate), then one would think there would be a better way than mindless repetition.
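To give the flavor of the approach (a simplified sketch, not the actual ontology I posted to the community; the example.org class name is invented), the trick is to define a class whose very definition carries the boilerplate, so that a single rdf:type statement entails the license and attribution assertions:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:owl="http://www.w3.org/2002/07/owl#">

  <!-- T-Box: every NYTConcept is entailed to have the boilerplate rights holder
       (similar restrictions would carry the creator, license and attribution) -->
  <owl:Class rdf:about="http://example.org/nyt-boilerplate#NYTConcept">
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:onProperty rdf:resource="http://purl.org/dc/terms/rightsHolder"/>
        <owl:hasValue>The New York Times Company</owl:hasValue>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>

  <!-- A-Box: the per-concept boilerplate shrinks to a single type assertion -->
  <rdf:Description rdf:about="http://data.nytimes.com/N24334380828843769853">
    <rdf:type rdf:resource="http://example.org/nyt-boilerplate#NYTConcept"/>
  </rdf:Description>

</rdf:RDF>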

My ontology for New York Times attribution and licensing boilerplate had the effect of compacting the A-Box at the cost of making the T-Box more complex. I asked if that was a desirable thing or not, and the answer from the community was a uniform NOT. The problem is that there are many consumers of linked data who are reluctant to do the OWL reasoning necessary to unveil the boilerplate assertions embedded in the ontology. Since a business objective for the Times is to enable as many users as possible to make use of its data and ultimately to drive traffic to its topic pages, it makes sense to keep technical barriers as low as possible. Mindlessness is a feature.

I could only think of one reason that a real business would want to use my boilerplate-in-ontology scheme. Since handling an ontology may require some human intervention, the use of a custom ontology could be a mechanism to enforce downstream consideration of and assent to license terms, analogous to "click-wrap" licensing. Yuck!

The conclusion, at least for now, is that for most linked data publishing it is desirable to keep the terminology as simple as possible. Linked Data Pidgin is better than Linked Data Creole.

Thursday, November 5, 2009

The Blank Node Bother and the RDF Copymess

There were many comments on my post about the problems in the Linked Data released by the New York Times, including some back and forth by Kingsley Idehen, Glenn MacDonald, Cory Casanave and Tim Berners-Lee that many readers of this blog may have found to be somewhat inexplicable. On the surface, the comments appeared to be about how to deal with the potentially toxic scope of "owl:sameAs". At a deeper level, the comments revolved around how to deal with a limitation of RDF. A better understanding of this issue will also help you understand the difficulties faced by the New York Times and other enterprises trying to benefit from the publication of Linked Data.

Let's suppose that you have a dataset that you want to publish for the world to use. You've put a lot of work into it, and you want the world to know who made the data. This can benefit you by enhancing your reputation, but you might also benefit from others who can enhance the data, either by adding to it or by making corrections. You also may want people to be able to verify the status of facts that you've published. You need a way to attach information about the data's source to the data. Almost any legitimate business model that might support the production and maintenance of datasets depends on having some way to connect data with its source.

One way to publish a dataset is to do as the New York Times did, publish it as Linked Data. Unfortunately, RDF, the data model underlying Linked Data and the Semantic Web, has no built-in mechanism to attach data to its source. To some extent, this is a deliberate choice in the design of the model, and also a deep one. True facts can't really have sources, so a knowledge representation system that includes connections of facts to their sources is, in a way, polluted. Instead, RDF takes the point of view that statements are asserted, and if you want to deal with assertions and how they are asserted in a clean logic system, the assertions should be reified.

I have previously ranted about the problems with reification, but it's important to understand that the technological systems that have grown up around the Semantic Web don't actually do reification. Instead, these systems group triples into graphs and keep track of data sets using graph identifiers. Because these identified graphs are not part of the RDF model they tend to be implemented differently from system to system and thus the portability of statements made about the graph as a whole, such as those that connect data to their source, is limited.

At last week's International Semantic Web Conference, Pat Hayes gave an invited talk about how to deal with this problem. I've discussed Pat's work previously, and in my opinion, he is able to communicate a deeper understanding of RDF and its implications than anyone else in the world. In his talk (I wasn't there, but his presentation is available), he argues that when an RDF graph is moved about on the Web, it loses its self-consistency.

To see the problem, ask yourself this: "If I start with one fact, and copy it, how many facts do I have?" The answer is one fact. "One plus one equals two" is a single fact no matter how many times you copy it! You can think of this as a consequence of the universality of the concepts labeled by the English words "one" and "two".

I haven't gotten to the problem yet. As Pat Hayes points out, the problem is most clearly exposed by blank nodes. Blank nodes are parts of a knowledge representation that don't have global identity; they're put in as a kind of glue that connects parts of a fact. For example, let's suppose that we're representing a fact that's part of the day's semantic web numerical puzzle: "number x plus number y equals two". "Number x" and "number y" are labels we're assigning to numbers that semantic web puzzle solvers around the world might attempt to map to universal concepts. Now suppose I copy this fact into another puzzle. How many facts do I have? This time, the answer is two, because "number x" might turn out to be a different number in the second puzzle. So what happens if I copy a graph with a blank node a hundred times? Do the blank nodes multiply while the universally identified nodes don't? Nobody knows!
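Here's the puzzle fact written out in RDF/XML (a loose sketch, with an invented example.org puzzle vocabulary). The rdf:nodeID labels "x" and "y" are blank node identifiers, and they have meaning only within this one document:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:puzzle="http://example.org/puzzle#">

  <!-- "number x plus number y equals two": the sum and its two addends are blank nodes -->
  <rdf:Description>
    <puzzle:addend rdf:nodeID="x"/>
    <puzzle:addend rdf:nodeID="y"/>
    <puzzle:equals rdf:datatype="http://www.w3.org/2001/XMLSchema#integer">2</puzzle:equals>
  </rdf:Description>

  <!-- copy this graph into a second puzzle document and the blank nodes do NOT merge:
       the "x" over there may turn out to be a different number than the "x" here -->

</rdf:RDF>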

I hope you can see that making copies of knowledge elements and moving them to different contexts is much trickier than you would have imagined. To be able to manage it properly you need more than just the RDF model. In his talk, Pat Hayes proposes something he calls "Blogic" which adds the concept of "surfaces" to provide the context for a knowledge representation graph. If we had RDF surfaces, or something like that, then the connections between data and its source would be much easier to express and maintain across the web. Similarly, it would be possible to limit the scope of potentially toxic but useful assertions such as "owl:sameAs".

There are of course other ways to go about "fixing up" RDF, but I'm guessing the main problem is a lack of enthusiasm from W3C for the project. The view of Kingsley Idehen and Tim Berners-Lee appears to be that existing machinery, perhaps bolstered by graph IDs or document IDs, is good enough and that we should just get on with putting data onto the web. I'm not sure, but there may be a bit of "information just wants to be free" ideology behind that viewpoint. There may be a feeling that information should be disconnected from its source to avoid entanglements, particularly of the legal variety. My belief is a bit different- it's that knowledge just wants to be worth something. And that providing solid context for data is ultimately what gives it the most value.

P.S. Ironically, in the very first comment on my last post, Ed Summers hints at a very elegant way that the Times could have avoided a big part of the problem- they could have used entailed attribution. It's probably worth another post just to explain it.


Friday, October 30, 2009

The New York Times Blunders Into Linked Data, Pillages Freebase and DBPedia

Notwithstanding Larry Lessig, when you try to use the precision of code to express the squishiness of the legal system, you are bound to run into problems, as I've explored in my posts on copyright.

This Thursday, the New York Times took advantage of the International Semantic Web Conference to make good on their previous promise to begin releasing the New York Times subject index as Linked Data. No matter how you look at it, this is a big advance for the semantic web and the Linked Data movement. It's also a potential legal disaster for the New York Times.

To understand what the New York Times did wrong, you have to understand a little bit about the workings of RDF, the data model underlying the semantic web. In particular, you have to understand entailment. Entailments are the sets of facts that can be deduced from the meaning of semantic web data. The crucial difference between plain-old data and Linked Data is that Linked Data includes these entailments.

Consider the English-language statement "apples are red". Because it is expressed in a language, it has meaning in addition to the single fact that apples are red. If we also assert that a specific object is an apple, then there is an entailment that the object is also red.

The New York Times Linked Data is expressed in the RDF language and uses vocabularies called OWL, SKOS, Dublin Core, and Creative Commons (denoted here by the prefixes "owl:", "skos:", "dc:" or "dcterms:", and "cc:"). You can download it yourself at http://data.nytimes.com/people.rdf (11.9 MB download)

Here's a simplified bit of the New York Times Linked Data. It defines a concept about C. C. Sabathia, a baseball pitcher who lost a game on Wednesday for the New York Yankees:
<rdf:Description rdf:about="http://data.nytimes.com/N24334380828843769853">
<skos:prefLabel>Sabathia, C C</skos:prefLabel>
<owl:sameAs rdf:resource="http://dbpedia.org/resource/CC_Sabathia"/>
<owl:sameAs rdf:resource="http://rdf.freebase.com/rdf/en.c_c_sabathia"/>

<dc:creator>The New York Times Company</dc:creator>
<cc:License>http://creativecommons.org/licenses/by/3.0/us/</cc:License>
<dcterms:rightsHolder>The New York Times Company</dcterms:rightsHolder>
<cc:attributionName>The New York Times Company</cc:attributionName>
</rdf:Description>
The first thing this does is create an identifier, "http://data.nytimes.com/N24334380828843769853", for the "C. C. Sabathia" subject concept. The New York Times uses this set of subjects to create topic pages, and the main purpose of releasing this data set is to help people link concepts throughout the internet to the appropriate New York Times topic pages.

Next, it gives a label for this concept, "Sabathia, C C". So far so good. The next two statements say that the New York Times topic labeled "Sabathia, C C" is the same concept previously identified by DBPedia, a Linked Data version of Wikipedia, and by Freebase, another large collection of Linked Data. This is even better, because it tells us that we can use information from Wikipedia and Freebase to help us infer facts about the New York Times C. C. Sabathia topic. "sameAs" is a term defined as part of the OWL standard vocabulary, which specifies how machines should process these assertions of sameness.

The last four statements (dc:creator, cc:License, dcterms:rightsHolder and cc:attributionName) assert that the C. C. Sabathia concept was created by "The New York Times Company", which is the rights holder for the C. C. Sabathia concept, and that if you want to use the C. C. Sabathia concept, the New York Times Company will license the concept to you under the terms of a particular Creative Commons License.

There are two separate blunders in those four statements. The first blunder is that the New York Times is attempting to say that the C. C. Sabathia concept is a work "PROTECTED BY COPYRIGHT AND/OR OTHER APPLICABLE LAW." This is complete rubbish. The information provided by the New York Times about the C. C. Sabathia concept consists of a few facts that cannot be protected by copyright or any other law that I know of. (The entire 5,000-entity collection, however, is probably protectable in countries other than the US.)

The second blunder is much worse. Where the first blunder is merely silly, the second blunder is akin to attempted property theft. Because the New York Times has asserted that it holds the rights to the C. C. Sabathia topic, and further, that the C. C. Sabathia topic is the same as the Freebase "c_c_sabathia" topic and the Wikipedia "CC_Sabathia" topic, by entailment, the New York Times is asserting that it is the rights holder for those concepts as well.
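Spelled out, here is the sort of statement an OWL-aware reasoner is entitled to add once it combines the owl:sameAs assertions with the boilerplate (this is an entailed statement, not something the Times published directly; the prefixes are the same as in the excerpt above):
<rdf:Description rdf:about="http://dbpedia.org/resource/CC_Sabathia">
  <dcterms:rightsHolder>The New York Times Company</dcterms:rightsHolder>
  <cc:attributionName>The New York Times Company</cc:attributionName>
</rdf:Description>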

You might argue that this is a harmless error. But in fact, there is real harm. Computers aren't sophisticated enough to deal with squishy legal concepts. If you load the New York Times file into an OWL-aware data store, the resulting collection will report that the New York Times Company is the rights holder for 4,770 concepts defined by Wikipedia and 4,785 concepts defined by Freebase.

Now before you start bashing the New York Times, it's important to acknowledge that RDF and Linked Data don't make it particularly easy to attach licenses or attributions to semantic web data. The correct ways to do this are all ugly and not standardized. You would think that this would be a requirement for the commercial viability of the semantic web.

People trying to use New York Times Linked Data can deal with this in three ways. They can decide not to use data from the New York Times, they can ignore all licensing and attribution assertions that the Times makes, or they can hope that the problem goes away soon.

A fourth way would be to sue the New York Times Company for damages. At long last there's a lucrative business model for Linked Open Data.

Update: I have two follow-up posts: The Blank Node Bother and the RDF CopyMess and The New York Times Gets It Right; Does Linked Data Need a Crossref or an InfoChimp?

Saturday, September 5, 2009

RDF Properties on Magic Shelves

Book authors and politicians who go on talk shows, whether it's the Daily Show, Charlie Rose, Fresh Air, Oprah, Letterman, whatever, seem to preface almost every answer with the phrase "That's a really good question, (Jon|Teri|Stephen|Conan)". The Guest never says why it's a good question, because the real meaning of that phrase is "Thanks for letting me hit one out of the ballpark." Talk shows have so little in common with baseball games or even tennis matches. On the rare occasion when a guest doesn't adhere to form, the video goes viral.

I've been promising to come back to my discussion of Martha Yee's questions on putting bibliographic data on the semantic web. Karen Coyle has managed to discuss all of them at least a little bit, so I'm picking and choosing just the ones that interest me. In this post, I want to talk about Martha's question #11:
Can a property have a property in RDF?
The rest of my post is divided into two parts. First, I will answer the question, then in the second part, I will discuss some of the reasons that it's a really good question.

Yes, a property can have a property in RDF. The W3C Recommendation entitled RDF Semantics states: "RDF does not impose any logical restrictions on the domains and ranges of properties; in particular, a property may be applied to itself." So not only can a property have a property in RDF, it can even use itself as a property!

OK, that's done with. Not only is the answer yes, but it's yes almost to the point of absurdity. Why would you ever want a property to be applied to itself? How can a hasColor property have a hasColor property? If you read and enjoyed Gödel, Escher, Bach, you're probably thinking that the only use for such a construct is to define a self-referential demonstration of Gödel's Incompleteness Theorem. But there actually are uses for properties which can be applied to themselves. For example, if you want to use RDF properties to define a schema, you probably want to have a "documentation" property, and certainly the documentation property should have its own documentation.
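For example (a sketch, with an invented example.org schema namespace), a documentation property can carry its own documentation:
<rdf:Property rdf:about="http://example.org/schema#documentation"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:ex="http://example.org/schema#">
  <!-- the property applied to itself -->
  <ex:documentation>Use this property to attach a human-readable note to any term in this schema.</ex:documentation>
</rdf:Property>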

If you're starting to feel queasy about properties having properties, then you're starting to understand why Yee question 11 is a good one. Just when you think you understand the RDF model as blobby entities connected by arcs, you find out that the arcs can have arcs. Our next question to consider is whether properties that have properties accomplish what someone with a library metadata background intends them to accomplish, and even if they do, whether that's the right way to accomplish it.

In my previous post on the Yee questions, I pointed out that ontology development is a sort of programming. One of the most confusing concepts that beginning programmers have to burn into their brains is the difference between a class and a class instance. In the library world, there are some very similar concepts that have been folded up into a neat hierarchy in the FRBR model. Librarians are familiar with expressions of works that can be instantiated in multiple manifestations, each of which can be instantiated in multiple items. Each layer of this model is an example of the class/instance relationship that is so important for programmers to understand. This sort of thinking needs to be applied to our property-of-a-property question. Are we trying to apply a property to an instance of a property, or do we want to apply properties to property "classes"?

Here we need to start looking at examples, or else we will get hopelessly lost in abstraction-land. Martha's first example is a model where the dateOfPublication is a property of a publishedBy relationship. In this case, what we really want is a property instance from the class of publishedBy properties that we modify with a dateOfPublication property. Remember, there is a URI associated with the property piece of any RDF triple. If we were to simply hang a dateOfPublication on a globally defined publishedBy, we would have made that modification for every item in our database that uses the publishedBy attribute. That's not what we want. Instead, for each publishedBy relation we want to assert, we need to create a new property, with a new URI, related to publishedBy using the RDF Schema property subPropertyOf.
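In RDF/XML, the per-assertion approach looks something like this (a sketch; the example.org bib vocabulary and all the names are invented). Notice that every publishedBy statement needs its own freshly minted property:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:bib="http://example.org/bib#">

  <!-- a one-off property, minted just for this single publishedBy assertion -->
  <rdf:Property rdf:about="http://example.org/bib#publishedBy_0001">
    <rdfs:subPropertyOf rdf:resource="http://example.org/bib#publishedBy"/>
    <bib:dateOfPublication>1876</bib:dateOfPublication>
  </rdf:Property>

  <!-- the assertion that uses the one-off property -->
  <rdf:Description rdf:about="http://example.org/bib#TomSawyer">
    <bib:publishedBy_0001 rdf:resource="http://example.org/bib#AmericanPublishingCo"/>
  </rdf:Description>

</rdf:RDF>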

Let's look at Martha's other example. She wants to attach a type to her variantTitle property to denote spine title, key title, etc. In this case, what we want to do is create global properties that retain variantTitleness while making the meaning of the metadata more specific. Ideally, we would create all our variant title properties ahead of time in our schema or ontology. As new cataloguing data entered our knowledgebase, our RDF reasoning machine would use that schema to infer that spineTitle is a variantTitle so that a search on variantTitle would automatically pick up the spineTitles.
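A sketch of this second approach (again with invented bib: names): the specialized property is declared once, in the schema, and the reasoner does the rest:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
         xmlns:bib="http://example.org/bib#">

  <!-- in the schema, once: every spineTitle is also a variantTitle -->
  <rdf:Property rdf:about="http://example.org/bib#spineTitle">
    <rdfs:subPropertyOf rdf:resource="http://example.org/bib#variantTitle"/>
  </rdf:Property>

  <!-- in the data: a search on variantTitle will pick this up by inference -->
  <rdf:Description rdf:about="http://example.org/bib#someBook">
    <bib:spineTitle>Adventures of Tom Sawyer</bib:spineTitle>
  </rdf:Description>

</rdf:RDF>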

Is making new properties by adding a property to a subproperty the right way to do things? In the second example, I would say yes. The new properties composed from other properties make the model more powerful, and allow the data expression to be simpler. In the first example, where a new property is composed for every assertion, I would say no. A better approach might be to make the publication event a subject entity with properties including dateOfPublication, publishedBy, publishedWhat, etc. The resulting model is simpler, flatter, and more clearly separates the model from the data.
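Here's a sketch of that flatter, event-based model (again, the names are invented): the publication event becomes an ordinary subject entity, and no new properties need to be minted for each assertion:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:bib="http://example.org/bib#">

  <!-- the publication event is an entity of its own, with plain, globally defined properties -->
  <bib:PublicationEvent>
    <bib:publishedWhat rdf:resource="http://example.org/bib#TomSawyer"/>
    <bib:publishedBy rdf:resource="http://example.org/bib#AmericanPublishingCo"/>
    <bib:dateOfPublication>1876</bib:dateOfPublication>
  </bib:PublicationEvent>

</rdf:RDF>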

We can contrast the RDF approach of allowing new properties to be created and modified by other properties with that of MARC. MARC makes you put data in fields and subfields and subfields with modifiers, but the effect is sort of like having lots of dividers on lots of shelves in a bookcase- there's one place for each and every bit of data- unless there's no place. RDF is more like a magic shelf that allows things to be in several places at once and can expand to hold any number of things you want to put there.

"Thanks for having me, Martha, it's been a real pleasure."

Tuesday, August 4, 2009

Can Librarians Be Put Directly Onto the Semantic Web?


The professor who taught "Introduction to Computer Programming" my freshman year of college told us that it was easier to teach a (doctor, lawyer, architect) to program a computer than it was to teach a computer programmer to be a (doctor, lawyer, architect). I was never really sure whether he meant that it was easy to teach people programming, or whether he meant that it was impossible to teach programmers anything else. Many years later, I met the doctor he collaborated a lot with, and decided that my professor's conclusion was based on an unrepresentative data set, because the doctor had the personality of a programmer who accidentally went to medical school.

I was reminded of that professor by one of Martha Yee's questions in her article "Can Bibliographic Data Be Put Directly Onto the Semantic Web?":
Do all possible inverse relationships need to be expressed, or can they be inferred? My model is already quite large, and I have not yet defined the inverse of every property as I really should to have a correct RDF model. In other words, for every property there needs to be an inverse property; for example, the property isCreatorOf needs to have the inverse property isCreatedBy; thus "Twain" has the property isCreatorOf , while "Adventures of Tom Sawyer" has the property isCreatedBy. Perhaps users and inputters will not actually have to see the huge, complex RDF data model that would result from creating all the inverse relationships, but those who maintain the model will need to deal with a great deal of complexity. However, since I'm not a programmer, I don't know how the complexity of RDF compares to the complexity of existing ILS software.
Although there are many incorrect statements in this passage, the most important one to correct here is in the last sentence. Whether she likes it or not, Martha Yee has become a programmer. Congratulations, Martha!

In many respects, the most important question for the library world in examining semantic web technologies is whether librarians can successfully transform their expertise in working with metadata into expertise in working with ontologies or models of knowledge. Whereas traditional library metadata has always been focused on helping humans find and make use of information, semantic web ontologies are focused on helping machines find and make use of information. Traditional library metadata is meant to be seen and acted on by humans, and as such has always been an uncomfortable match with relational database technology. Semantic web ontologies, in contrast, are meant to make metadata meaningful and actionable for machines. An ontology is thus a sort of computer program, and the effort of making an RDF schema is the first step of telling a computer how to process a type of information. Martha Yee's development of an RDF class to represent an Author is precisely analogous to a Java programmer's development of a Java class to do the same thing.

RDF is the first layer of the program; OWL (Web Ontology Language) is the next layer. In OWL, you can describe relationships and constraints on classes and properties. For example, an ontology could contain the statement:
<owl:ObjectProperty rdf:ID="isCreatorOf">
<owl:inverseOf rdf:resource="#isCreatedBy" />
</owl:ObjectProperty>
which defines isCreatorOf as the inverse of isCreatedBy. With this definition, a reasoning engine that encounters an isCreatorOf relationship will know that it can simplify the data graph by replacing it with the inverse isCreatedBy relationship. This does NOT MEAN that a good ontology should have inverses of all the properties it defines- in fact quite the opposite is true. The OWL properties inverseOf and sameAs are meant to make it easier to link separate ontologies, not to encourage ontologies to have redundant property definitions.
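To make the entailment concrete (a sketch that extends the snippet above with hypothetical resources; namespace declarations are omitted, as in the snippet): assert one direction and the other comes for free:
<!-- asserted -->
<rdf:Description rdf:about="#Twain">
  <isCreatorOf rdf:resource="#AdventuresOfTomSawyer"/>
</rdf:Description>

<!-- entailed by the owl:inverseOf declaration above; no need to assert it separately -->
<rdf:Description rdf:about="#AdventuresOfTomSawyer">
  <isCreatedBy rdf:resource="#Twain"/>
</rdf:Description>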

I'm not sure where the notion that "for every property there needs to be an inverse property" came from, but I'll venture two guesses. It's true that if you want to browse easily in both directions from one entity to a related entity, you need to have the relationship expressed at both ends, particularly in a distributed data environment. Most application scenarios for RDF data involve gathering the data into large datastores for this reason. But you don't need an inverse property to be defined for this purpose.

Another possible source for the inverse property confusion comes from the way that relational databases work. In order to efficiently display sorted lists using a relational database, you need to have prepared indices for each field you want to use. So if you want to display authors alphabetically by book title, and also books alphabetically by author name, you need to have relationships defined in both directions. If, by contrast, you're using an RDF tuple store, all the data goes into a single table, and the indices are all predefined.

The fact that ontologies are programs that encode domain knowledge should remove a lot of mechanical drudgery for "users and inputters". To take a trivial example, the cataloguer of a new version of "Adventures of Tom Sawyer" would not have to enter "Samuel Clemens" as an alternate author name for "Mark Twain" once the isCreatedBy relationship has been made. In fact, if the ontology contained a relationship "isVersionOf", then the cataloguer wouldn't even need to enter the title or create a new isCreatedBy relationship. A library catalog that used semantic web technologies wouldn't need separate programming to make these relationships; they would come directly from the ontology being used.
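One way to encode that kind of rule (a sketch with invented example.org names, using an OWL 2 property chain; the particulars would depend on the ontology) is to say that the creator of a work is also the creator of any version of it:
<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:owl="http://www.w3.org/2002/07/owl#"
         xmlns:bib="http://example.org/bib#">

  <!-- T-Box rule: if x isVersionOf y and y isCreatedBy z, then x isCreatedBy z -->
  <owl:ObjectProperty rdf:about="http://example.org/bib#isCreatedBy">
    <owl:propertyChainAxiom rdf:parseType="Collection">
      <owl:ObjectProperty rdf:about="http://example.org/bib#isVersionOf"/>
      <owl:ObjectProperty rdf:about="http://example.org/bib#isCreatedBy"/>
    </owl:propertyChainAxiom>
  </owl:ObjectProperty>

  <!-- all the cataloguer enters for the new edition -->
  <rdf:Description rdf:about="http://example.org/bib#TomSawyer2009Edition">
    <bib:isVersionOf rdf:resource="http://example.org/bib#AdventuresOfTomSawyer"/>
  </rdf:Description>

  <!-- a reasoner then supplies: TomSawyer2009Edition isCreatedBy Twain -->

</rdf:RDF>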

To some extent, the success of the semantic web in any domain is predicated on the successful embodiment of that domain's knowledge in ontological code. Either coders need to learn the domain knowledge, or domain experts need to learn to code. People need to talk.

Friday, July 31, 2009

Ignition Timing for Semantic Web Library Automation Engines

Last weekend, I had a chance to learn how to drive a 1915 Model-T Ford. It's not hard, but a Model-T driver needs to know a bit more about his engine and drivetrain than the driver of a modern automobile. There is a clutch pedal that puts the engine into low gear when you press it- high gear is when the pedal is up and neutral is somewhere in between. The brake is sort of a stop gear, and you need to make sure the clutch is in neutral before you step on the brake. The third pedal is reverse.

There are a lot more engine controls than on a modern car. In addition to the throttle and a choke, there is another lever that controls the ignition timing. A modern Model-T driver doesn't have to worry much about the timing once the engine has started, because modern fuel has much higher octane than fuel had in 1915. I would not have understood this except that I recently got a new car whose manual says you should use only premium fuel, and so I did some wikipedia research to find out what octane had to do with automobile engines. But I could have lived blissfully in ignorance. Believe it or not, I have opened the hood of my new car only once since I got it in December.

It occurs to me that in many ways, the library automation industry is still in the Model-T era, particularly in regards to the relationship of the technology to its managers. Libraries still need to keep a few code mechanics on staff, and the librarians who use library automation to deliver services still need to know a lot more about their data engines than I know about my automobile engine. The industry as a whole is trying to evaluate changes roughly analogous to the automobile industry switching to diesel engines.

I've been reading Martha Yee's paper entitled "Can Bibliographic Data Be Put Directly Onto the Semantic Web?" and Karen Coyle's commentary on this paper. I greatly admire Martha Yee's courage to say, essentially, "I don't understand this as well as I need to, here are some questions I would really appreciate help with". When I worked at Bell Labs, I noticed that the people who asked questions like that were the people who had won or would later win Nobel prizes. Karen has done a great job with Martha's queries, but also expresses a fair amount of uncertainty.

I was going to launch into a few posts to help fill in some gaps, but I find that I have difficulty knowing which things are important to explain. Somehow I don't think that Model-T drivers really needed to know about the relationship between octane and ignition timing, for example. But I think that people running trucking companies need to know some of the differences between diesel engines and gasoline engines as they build their trucking fleets, just as community leaders like Martha Yee and Karen Coyle probably need to know the important differences between RDF tuple-stores and relational databases. But the more I think about it, the less I'm sure about which of the differences are the important ones for people looking to apply them in libraries.

Another article I've been reading has been Greg Boutin's article "Linked Data, a Brand with Big Problems and no Brand Management", which suggests that the technical community that is pushing RDF Linked Data has not been doing a good job of articulating the benefits of RDF and Linked Data principles in a way that potential customers can understand clearly and consistently.

Engineers tend to have a different sort of knowledge gap. I have a very good friend who designs advanced fuel injectors. He is able to do this because he has specialized so that he knows everything there is to know about fuel injectors. He doesn't need to know anything about radial tires or airbag inflators or headlamps. But to make his business work, he needs to be able to articulate to potential customers the benefits of his injectors in the context of the entire engine and engine application. Whether the technology is Linked Data or fuel injectors, that can be really difficult.

My first guess was that it would be most useful for librarians to understand how indexing and searching are almost the same thing, and that indexing is done quite differently in RDF tuple-stores and in relational databases. But on second thought, that's more like telling the trucking company that diesel engines don't need spark plugs. It's good to know, but the higher-level fact that diesels burn less fuel is a lot more relevant. Isn't it more important to know that an RDF tuple-store trades off performance for flexibility? How do you find the right questions to ask when you don't know where to start? We find ourselves working across many disciplines, each of which is more and more specialized, and we need more communications magic to make everything work together.

I'll try to do some gap-filling next week.

Thursday, June 18, 2009

Triple Stores Aren't

Once a thing has acquired a name, it's rare that it can escape that name even if the underlying concept has changed so as to make the name inaccurate. Only when the name causes misunderstandings will people start to adopt a new, more accurate name. I am trained as an engineer, but I know very little about engines; so far that has never caused any problems for me. It's sometimes funny when someone worries about getting lead poisoning from a pencil lead, but it doesn't cause great harm. It's no big deal that there's hardly any nickel in nickels. Columbus "discovered" the "Indians" in 1492; we've known that these people were not in India for a long time, but it's only recently that we've started using the more respectful and more accurate term "Native Americans".

I'm going to see some old friends this evening, and I'm sure they'll be pretty much how I remember them, but I'll really notice how the kids have grown. That's what this week has been like for me at the Semantic Technology Conference. I've not really worked in the semantic technology area for at least 7 years (though I've been making good use of its ideas), but a lot of the issues and technologies were like old friends, wiser and more complex. But being away for a while makes me very aware of things that have changed- things that people who have been in the field for the duration might not have been conscious of, because the change has occurred gradually. One of the things I've noticed also involves a name that's no longer accurate. It might confuse newcomers to the field, and may even cause harm by lulling people into thinking they know something that isn't true. It's the fact that triple stores are no longer triple stores.

RDF (subject,predicate,object) triples are the "atom" of knowledge in a semantic-technology information store. One of the foundational insights of semantic technology is that there is great flexibility and development efficiency to be gained by moving data models out of relational database table designs and into semantic models. Once you've done that, you can use very simple 3-column tables to store the three pieces of the triples. You need to do much more sophisticated indexing, but it's the same indexing for any data model. Thus, the triple store.

As I discussed in my "snotty" rants on reification, trying to rely on just the triples keeps you from doing many things that you need to do in many types of problems. It's much more natural to treat the triple as a first-class object, either by reification or by objectification (letting the triple have its own identifier). What I've learned at this conference is that all the triple stores in serious use today use more than 3 columns to store the triples. Instead of triples, RDF atoms are now stored as 4-tuples, 5-tuples, 6-tuples or 7-tuples.

Essentially all the semantic technology information stores use at least an extra column for a graph id (used to identify the graph that a particular triple is part of). At the conference, I was told that this is needed in order to implement the contextual part of SPARQL. (FROM NAMED, I assume. Note to self: study SPARQL on the plane going home!) In addition, some of the data stores have a triple id column. In a post on the Freebase Blog, Scott Meyer reported that Freebase uses tuples which have IDs, 6 "primitives" and "a few odds and ends" to store an RDF "triple" (the pieces which store the triple are called left, right, type and value). Freebase is an append-only data store, so it needs to keep track of revisions, and it also tracks the creator of the tuple.

Is the misnomerization of "triple" harmful enough for the community to try its best to start talking about "tuples"? I think it is. Linked Data is the best example of how a focus on the three-ness of triples can fool people into sub-optimal implementations. I heard this fear expressed several times during the conference, although not in those words. More than once, people expressed concern that once data had been extracted via SPARQL and gone into the Linked Data cloud, there was no way to determine where the data had come from, what its provenance was, or whether it could be trusted. They were absolutely correct- if the implementation was such that the raw triple was allowed to separate from its source. If there were a greater understanding of the un-three-ness of real RDF tuple stores, then implementers of linked data would be more careful not to obliterate the id information that could enable trust and provenance. I come away from the conference both excited by Linked Data and worried that the Linked Data promoters seemed to brush off this concern.

I'll write some more thoughts from the conference tomorrow, after I've googled a few things with Bing.

Wednesday, June 17, 2009

Is Semantic Web Technology Scalable?

"Scalable" is a politician of a word. It has attractiveness to obtain solid backing from diverse factions- it has something to offer both the engineericans and the marketerists. At the same time it has the dexterity to mean different things to different people, so that the sales team can always argue that the competition's product lacks "scalability". The word even supports multiple mental images- you can think of soldiers scaling a wall or climbers scaling a mountain; a more correct image is that of scaling a picture to making it bigger. Even technology cynics can get behind the word "scalable": if a technology is scalable, they would argue, that means it hasn't been scaled.

The fact is that scalability is a complex attribute, more easily done in the abstract than in the concrete. I've long been a cynic about scalability. A significant fraction of engineers who worry about scalability end up with solutions that are too expensive or too late to meet the customer problems at hand, or else they build systems that scale poorly along an axis of unexpected growth. Another fraction of engineers who worry too little about scalability get lucky and avoid problems by the grace of Moore's Law and its analogs in memory storage density, processor power and bandwidth. On the other hand, ignorance of scalability issues in the early phases of a design can have catastrophic effects if a system or service stops working once it grows beyond a certain size.

Before considering the scalability of semantic technology, let's define terms a bit. The overarching definition of scalability in information systems is that the resources needed to solve a problem should not grow much faster than the size of the problem. From the business point of view, it's a requirement that 100 customers should cost less to serve than 100 times what it would cost to serve one customer (the scaling should be less than linear). If you are trying to build a Facebook, for example, you can tolerate linear scaling in the number of processors needed per million customers if you have sublinear costs for other parts of the technology or significantly superlinear revenue per customer. Anything superlinear will eventually kill you. If there are any bits of your technology which scale quadratically or even exponentially, then you will very quickly "run into a brick wall".

In my post on curated datasets, I touched on an example where a poorly designed knowledge model could "explode" a semantic database. This is one example of how the Semantic Web might fail the scalability criterion. My understanding of the model scaling issue is that it's something that can be addressed, and is in fact addressed in the best semantic technology databases. The semantic analysis component of semantic technology can quite easily be parallelized, so that appears to pose no fundamental problems. What I'd like to address here is whether there are scalability issues in the semantic databases and inference engines that are at the core of Semantic Web technology.

Enterprise-quality semantic databases (using triple-stores) are designed to scale well in the sense that the number of RDF triples they can hold and process scales linearly with the amount of memory available to the CPU. So if you have a knowledge model that has 1 billion triples, you just need to get yourself a box with 8GB of RAM. This type of scaling is called "vertical scaling". Unfortunately, if you wanted to build a Semantic Google or a Semantic Facebook, you would probably need a knowledge model with trillions of triples. You would have a very hard time doing it with a reasoning triple store, because you can't buy a CPU with that much RAM attached. The variety of scaling you would want to have to solve bigger problems is called "horizontal scaling". Horizontal scaling distributes a problem across a farm of servers, and the scaling imperative is that the number of servers required should scale with the size of the problem. At this time, there is NO well-developed capability for semantic databases with inference engines to distribute problems across multiple servers. (Mere storage is not a problem.)

I'll do my best to explain the difficulties of horizontal scaling in semantic databases. If you're an expert in this, please forgive my simplifications (and please comment if I've gotten anything horribly wrong.) Horizontal scaling in typical web applications uses partitioning. Partitioning of a relational database typically takes advantage of the structure of the data in the application. So for example, if you're building a Facebook, you might choose to partition your data by user. The data for any particular user would be stored on one or two of a hundred machines. Any request for your information is routed to the particular machine that holds your data. That machine can make processing decisions very quickly if all your data is stored on the same machine. So instead of sharing one huge Facebook web application with 100 million other Facebook users, you might be sharing one of a hundred identical Facebook application servers with "only" a million other users. This works well if the memory size needed for 1 million users is a good match to that available on a cheap machine.

In a semantic (triplestore) database, information is chopped up into smaller pieces (triples), with the result that much of the information will be dispersed into multiple pieces. A partitioned semantic database would need to intelligently distribute the information across machines so that closely related information resides on the same machine. Communication between machines is typically 100 times slower than communication within the same machine, so the consequences of doing a bad job of distributing information can be disastrous. Figuring out how to build partitioning into a semantic database is not impossible, but it's not easy.

I'm getting ahead of myself a bit, because a billion triples is nothing to sneeze at. Semantic database technology is exciting today in applications where you can put everything on one machine. But if you read my last post, you may remember my argument that the Semantic Web is NOT about loading information into a big database of facts. It's a social construct for connections of meaning between machines. Current semantic database technology is designed for reasoning on facts loaded onto a single machine; it's capable of building semantic spaces up to a rather large size, but it's not capable of building a semantic Google, for example.

I've learned a lot at the Semantic Technology Conference around this analysis. What I see is that there is a divergence in the technologies being developed. One thread is to focus on the problems that can be addressed on single machines. In practice, that technology has advanced so that the vast majority of problems, particularly the enterprise problems, can be addressed by vertically scaled systems. This is a great achievement, and is one reason for the excitement around semantic technologies. The other thread is to achieve horizontal scaling by layering the semantic technologies on top of horizontally scaled conventional database technologies.

I've been going around the conference provoking people into interesting conversations by asserting that there is no such thing (today) as Semantic Web Technology- there are only Semantic Technology and Web Technology, and combinations thereof. The answer to the question in the title is then that if there was such a thing as Semantic Web technology, then it would be scalable.

Friday, June 5, 2009

When are you collecting too much data?

Sometimes it can be useful to be ignorant. When I first started a company, more than 11 years ago, I decided that one thing the world needed was a database, or knowledgebase, of how to link to every e-journal in the world, and I set out to do just that. For a brief time, I had convinced an information industry veteran to join me in the new company. One day, as we were walking to a meeting in Manhattan, he turned to me and asked "Eric, are you sure you understand how difficult it is to build and maintain a big database like that?" I thought to myself, how hard could it be? I figured there were about 10,000 e-journals total, and we were up to about 5000 already. I figured that 10,000 records was a tiny database- I could easily do 100,000 records even on my Powerbook 5400. I thought that a team of two or three software developers could do a much better job sucking up and cleaning up data than the so-called "database specialists" typically used by the information industry giants. So I told him "It shouldn't be too hard." But really, I knew it would be hard, I just didn't know WHAT would be hard.

The widespread enthusiasm for Linked Data has reminded me of those initial forays into database building. Some important things have changed since then. Nowadays, a big database has at least 100 million records. Semantic Web software was in its infancy back then; my attempts to use RDF in my database 11 years ago quickly ran into hard bits in the programming, and I ended up abandoning RDF while stealing some of its most useful ideas. One thing that hasn't changed is something I was ignorant of 11 years ago- maintaining a big database is a big, difficult job. And as a recently exed ex-physicist, I should have known better.

The fundamental problem of maintaining a large knowledgebase is known in physics as the second law of thermodynamics, which states that the entropy of the universe always increases. An equivalent formulation is that perpetual motion machines are impossible. In terms that non-ex-physicist librarians and semantic websperts can understand, the second law of thermodynamics says that errors in databases accumulate unless you put a lot of work into them.

This past week, I decided to brush off my calculus and write down some formulas for knowledgebase error accumulation so that I wouldn't forget the lessons I've learned, and so I could make some neat graphs.

Let's imagine that we're making a knowledgebase to cover something with N possible entities. For example, suppose we're making a knowledgebase of books, and we know there are at most one billion possible books to cover. (Make that two billion, a new prefix is now being deployed!) Let's assume that we're collecting n records at random, so each record has a 1/N chance of covering any specific entity. At some point, we'll start to get records that duplicate information we've already put into the database. How many records, n, will we need to collect to get 99% coverage? It turns out this is an easy calculus problem, one that even a brain that has spent 3 years as a middle manager can do. The answer is:
Coverage fraction, f = 1 - exp(-n/N)
So to get 99% coverage of a billion entities, you'd need to acquire about 4.6 billion records. Of course there are some simplifications in this analysis, but the formula gives you a reasonable feel for the task.
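For the record, here's where the formula comes from. Each record, drawn at random, covers a given entity with probability 1/N, so the chance that the entity is still uncovered after n records is
(1 - 1/N)^n ≈ exp(-n/N)
and the coverage fraction is one minus that. Setting f = 0.99 gives n = N ln(100) ≈ 4.6 N.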

I haven't gotten to the hard part yet. Suppose there are errors in the data records you pull in. Let's call the fraction of records with errors in them epsilon, or ε. Then we get a new formula for the errorless coverage fraction, F:
Errorless coverage fraction, F = exp(-εn/N) - exp(-n/N)
This formula behaves very differently from the previous one. Instead of rising asymptotically to one for large n, it rises to a peak, and then drops exponentially to zero for large n. That's right, in the presence of even a small error rate, the more data you pull in, the worse your data gets. There's also a sort of magnification effect on errors- a 1% error rate limits you to a maximum of 95% errorless coverage at best; a 0.1% error rate limits you to 99.0% coverage at best.
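For anyone who wants to check those numbers: setting dF/dn = 0 puts the peak at
n/N = ln(1/ε) / (1 - ε)
and plugging that back into F gives a maximum errorless coverage of about 0.95 for ε = 0.01 and about 0.99 for ε = 0.001.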

I can think of three strategies to avoid the complete dissipation of knowledgebase value caused by accumulation of errors.
  1. stop collecting data once you're close to reaching the maximum. This is the strategy of choice for collections of information that are static, or don't change with time.
  2. spend enough effort detecting and resolving errors to counteract the addition of errors into the collection.
  3. find ways to eliminate errors in new records.
A strategy I would avoid would be to pretend that perpetual motion machines are possible. There ain't no such thing as a free lunch.

Tuesday, June 2, 2009

Who's the Boss, Steinbrenner or Springsteen?

Having started playing with Mathematica when it first came out (too late for me to use it for the yucky path integrals in my dissertation), I just had to try Wolfram|Alpha. The vanity search didn't work; assuming that's what most people find, it's probably the death knell for W|A as a search engine. Starting with something more appropriately nerdy, I asked W|A about "Star Trek"; it responded with facts about the new movie, and suggested some other movies I might mean, apparently unaware that there was a television show that preceded it. Looking for some subtlety with a deliberately ambiguous query, I asked about "House" and it responded "Assuming "House" is a unit | Use as a surname or a character or a book or a movie instead". My whole family is a big fan of Hugh Laurie, so I clicked on "character" and was very amused to see that to Wolfram|Alpha, the character "House" is Unicode character x2302, "⌂". Finally, not really expecting very much, I asked it about the Boss.

In New Jersey, where I live, there's only one person who is "The Boss", and that's Bruce Springsteen. If you leave off the "The", and you're also a Yankees fan, then maybe George Steinbrenner could be considered a possible answer, and Wolfram|Alpha gets it exactly right. Which is impressive, considering that somewhere inside Wolfram|Alpha is Mathematica crunching data. The hype around Wolfram|Alpha is that it runs on a huge set of "curated data", so this got me wondering what sort of curated dataset knows who "The Boss" really is. To me, "curated" implies that someone has studied and evaluated each component item, and somehow I doubt that anyone at Wolfram has thought about the boss question.

The Semantic Web community has been justifiably gushing about "Linked Data", and the linked datasets available are getting to be sizable. One of the biggest datasets is "DBpedia". DBpedia is a community effort to extract structured information from Wikipedia and to make this information available on the Web. According to its "about" page, the dataset describes 2.6 million "things", and is currently comprised of 274 million RDF triples. It may well be that Wolfram Alpha has consumed this dataset and entered facts about Bruce Springsteen into its "curated data" set. (The Wikimedia Foundation is listed as a reference on its "the boss" page.) If you look at Bruce's Wikipedia page, you'll see that "The Boss" is included as the "Alias" entry in the structured information block that you see if you pull up the "edit this page" tab, so the scenario seems plausible.

Still, you have to wonder how any machine can consume lots of data and make good judgments about who is "The Boss". Wikipedia's "Boss" disambiguation page lists 74 different interpretations of "The Boss". Open Data's Uriburner has 1327 records for "the boss" (1676 triples, 1660 properties), but I can't find the Alias relationship to Bruce Springsteen. How can Wolfram|Alpha, or indeed any agent trying to make sense of the web of Linked Data, deal with this ever-increasing flood of data?

Two weeks ago, I had the good fortune to spend some time with Atanas Kiryakov, the CEO of Ontotext, a Bulgarian company that is a leading developer of core semantic technology. Their product OWLIM is claimed to be "the fastest and most scalable RDF database with OWL inference", and I don't doubt it, considering the depth of understanding that Mr. Kiryakov displayed. I'll write more about what I learned from him, but for the moment I'll just focus on a few bits I learned about how semantic databases work. The core of any RDF-based database is a triple-store; this might be implemented as a single huge 3-column data table in conventional database management software; I'm not sure exactly what OWLIM does, but it can handle a billion triples without much fuss. When a new triple is added to the triple store, the semantic database also does "inference". In other words, it looks at all the data schemas related to the new triple, and from them, it tries to infer all the additional triples implied by the new triple. So if you were to add the triples ("I", "am a fan of", "my dog") and ("my dog", "is also known as", "the Boss"), then a semantic database will add these triples, and depending on the knowledge model used, it might also add a triple for ("I", "am a fan of", "the Boss"). If the database has also consumed "is a fan of" data for millions of other people, then it might be able to figure out with a single query that Bruce Springsteen, with a million fans, is a better answer to the question "Who is known as 'the Boss'?" than your dog, who, though very friendly, has only one fan.

As you can imagine, a poorly designed data schema can result in explosions of data triples. For example, you would not want your knowledge model to support a property such as "likes the same music" because then the semantic database would have to add a triple for every pair of persons that like the same music- if a million people liked Bruce Springsteen's music, you would need a trillion triples to support the "likes the same music" property. So part of the answer to my question about how software agents can make sense of linked data floods is that they need to have well thought-out knowledge models. Perhaps that's what Wolfram means when they talk about "curated datasets".

Thursday, May 28, 2009

Part 3: Reification Considered Harmful

About two years ago, we had some landscaping done on our yard. Part of the work was to replace the crumbling walkways. I suggested making some of the walkways curved, to create more esthetic shapes for the garden beds in front of our house. The landscape designer we were working with suggested that I reconsider, because it is well known in the landscape design world that people never, ever, follow a curved path. Studies have been made using hidden cameras showing that people always walk in straight paths, no matter what the landscape design tries to coax them into doing. The usual result of a curved pathway is the creation of a footworn path that makes the curved path straight. As you can see, we took our designer's advice.

This is the third part in a series of posts on reification. In Part 1, I tried to explain what reification is; in my second post I gave some examples of how to use reification in RDFa. In a philosophical interlude on truth on the internet, I made it pretty clear why I think it's really important to include and retain sourcing and provenance information whenever you try to collect information from the internet. In this Part 3, I promised to discuss the pros and cons of reification. I lied. RDF reification has been nothing but disastrous for the semantic web. The problem is that RDF tries to lead implementers along a strangely curved path if they want to do the "right" thing and keep track of the sourcing and provenance of the knowledge loaded into a triple-store. I have a strong suspicion that no one, anywhere, ever in the history of RDF, has made significant use of the reification machinery. I have asked a fair number of semantic web implementers, and none of them has ever used reification.

Semantic Web implementers certainly don't ignore the imperatives of sourcing and provenance, but what they do instead of using reification is make the equivalent of straight, worn dirt paths. Typically they won't use pure triple stores; instead they treat triples as first-class data objects that can be joined to separate tables of provenance information, or else they build knowledge models that make the provenance and source explicit, as Google does with the review models it supports in RDFa.
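
One common version of that worn dirt path is to store quads instead of triples, tagging each triple with the graph it came from. Here is a minimal sketch using rdflib's Dataset; the example URIs and the rating property are invented for illustration.

from rdflib import Dataset, Literal, Namespace

EX = Namespace("http://example.org/")   # hypothetical vocabulary
ds = Dataset()

# Each source gets its own named graph, so every triple carries its
# provenance along as a fourth component (a quad) instead of being reified.
ds.graph(EX.acmeReviews).add((EX.BlastEmUp, EX.rating, Literal(5)))
ds.graph(EX.someBlog).add((EX.BlastEmUp, EX.rating, Literal(2)))

# Ask not just what was asserted, but which graph (source) asserted it.
for s, p, o, source in ds.quads((EX.BlastEmUp, EX.rating, None, None)):
    print(o, "according to", source)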

Alternatively, Semantic Web implementers may choose to ignore the retention of provenance and sourcing and treat their RDF triple-store as a pristine, never-changing collection of truth. For many applications this works quite well, but it rapidly becomes unworkable when many sources of information have to be merged. RDF works great for the collection, transmission and processing of unchanging, unpolluted, uncontroversial knowledge; on this blog, I will from now on refer to this sort of information as UnKnowledge.

To my mind, there is a deeper problem with reification, and it relates to what an RDF triple really means. My view is that an RDF triple means absolutely nothing, and that it is only the action of asserting a triple that has meaning. The practical problem with reification is that it's hard to do, and thus nobody does it. It also forces implementers to think too much about semantics, and thinking too much about semantics is always a bad thing. Too often you end up dizzy, like a dog chasing its tail.

The RDF working group has produced an entire document trying to clarify what the semantics of RDF are. Here is an example paragraph to study:
The semantic extension described here requires the reified triple that the reification describes - I(_:xxx) in the above example - to be a particular token or instance of a triple in a (real or notional) RDF document, rather than an 'abstract' triple considered as a grammatical form. There could be several such entities which have the same subject, predicate and object properties. Although a graph is defined as a set of triples, several such tokens with the same triple structure might occur in different documents. Thus, it would be meaningful to claim that the blank node in the second graph above does not refer to the triple in the first graph, but to some other triple with the same structure. This particular interpretation of reification was chosen on the basis of use cases where properties such as dates of composition or provenance information have been applied to the reified triple, which are meaningful only when thought of as referring to a particular instance or token of a triple.
I've read that sentence over and over again; I've finally concluded that it is an example of steganography. Here is how I have decoded it:
the semantic extension described HEre requires the reified tripLe that the reification describes - i(_:xxx) in the above example - to be a Particular token or InstAnce of a triple in a (real or notional) rdf docuMent, rAther than an 'abstract' triPle consideRed as a grammatIcal form. there could be Several such entities which have the same subject, predicate and Object properties. although a graph is defiNed as a set Of tRiples, several such tokens wIth the same triple structure might occur i different documents. thus, it would be meNAningful to Claim thAt the blank node in the second Graph abovE does not refer to the triPLE in the first grAph, but to Som other tERiplE with the Same struCtUrE. THIS Particular interpretatiOn Of Reification was choSen On the basis of Use cases where properties such as dates of composition or provenance information have been appLied to the reified triple, whicH are mEaningfuL only when thought of as referring to a Particular instance or token of a triple.
I'll try to suggest some ways that we might rescue RDF and the Semantic Web in a future post.

Tuesday, May 26, 2009

There is no truth on the internet

In his retirement, my father took up genealogy as a hobby, and after he died, his database of thousands of ancestors (most of them in northern Sweden) passed to me. If you're interested, you can browse through them on the hellman.net website. Having all this data up on the web has been rather entertaining. Every month or so, I get an e-mail from some sixth cousin or such who has discovered a common ancestor through a Google search, and the resulting exchanges of data allow me to make occasional corrections and additions.

Since I took the database on, huge amounts of genealogical information have become available on the internet. When I first started finding this information, I made the mistake of trying to suck it into my database, since I had become more or less a professional data sucker and spewer in my work life. Once I had spent hour after hour pulling data in, I started to wonder what the point of it all was. Could I really determine, and did I really care, whether Erik Eriksson, born 1837 in Backfors, was my fourth cousin thrice removed or not? What is the relationship between the data I sucked in and the truth about all the real people listed in the database? I quickly regretted my data gluttony.

Traditional genealogists focusing on Sweden use a variety of material as primary sources of information. Baptismal records typically give a child's name and birthdate along with the names of the parents; burial and marriage records similarly give names and dates. The genealogist's job is to connect names on different records to construct a family tree. But things are not always simple. Probably 20% of the males in the Backfors region were named Erik, and since patronymics were used, 20% of those males were also named Eriksson, though the name might be abbreviated in the records as "Ersson". To judge whether the Hanna listed on an 1877 birth record naming "Erik Eriksson" as the father is really the daughter of the Erik Eriksson born in 1837 in Backfors, the genealogist must weigh all the available information together with conditional probabilities.

The internet genealogist (e.g., me) has a different task. Rather than looking at the birth records and assessing the likelihood of name coincidences, the internet genealogist looking at the same question searches the internet and finds that the website "sikhallan.se" lists Hanna as Erik's daughter. The internet genealogist then makes a judgment about the reliability of the Sikhallan website. For example, how do we know that Sikhallan's source for Erik's birthdate isn't just the hellman.net website? If the two databases disagree, who should be believed? In my case, I just look at my father's meticulous notes about where his information came from, and if he noted some uncertainty, then I'm much more likely to believe the other sources available to me. Unless, of course, my data came from one of my data-sucking binges, in which case the source has been lost and I can no longer judge its reliability.

In my last two posts on reification (Part 1, Part 2), I promised that I would have a third post evaluating whether the reification machinery in RDF is worth the trouble. This is not that third post; this is more of a philosophical interlude. You see, another way to look at genealogical information on the internet is to think of it as a web of RDF triples. For example, imagine if Sikhallan made its data available as a set of triples, e.g. (subject: Erik Eriksson; predicate: had daughter; object: Hanna). Then we could load all the triples into an RDF-enabled genealogy database, and all our problems would be solved, right? Well, yes, unless of course we wanted to retain all the supporting information behind the data, the data provenance, all the extra care in citation of sources taken by my father and ignored by me in my data-sucking orgies. In reality, the triple by itself is worthless, devoid of assessable truth. If the triple were associated with provenance information, its truth would become assessable, and thus valuable. The mechanism that RDF provides for doing things like this is... reification.
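
To preview what that mechanism looks like, here is a minimal rdflib sketch that reifies the Erik-Hanna triple and attaches a source to it; everything except the standard RDF reification vocabulary and dc:source is a made-up name for illustration.

from rdflib import BNode, Graph, Literal, Namespace
from rdflib.namespace import DC, RDF

EX = Namespace("http://example.org/genealogy/")   # hypothetical vocabulary
g = Graph()

# Reify the statement (Erik Eriksson, had daughter, Hanna) so that
# something can be said *about* it.
stmt = BNode()
g.add((stmt, RDF.type, RDF.Statement))
g.add((stmt, RDF.subject, EX.ErikEriksson1837))
g.add((stmt, RDF.predicate, EX.hadDaughter))
g.add((stmt, RDF.object, EX.Hanna))

# The payoff: the claim now has an assessable source.
g.add((stmt, DC.source, Literal("http://sikhallan.se/")))

Four of those five triples are pure bookkeeping; whether that price is worth paying is the subject of the post above.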

Wikipedia is the most successful knowledge aggregation on the internet today and is also, not coincidentally, the best example of the value of comprehensive retention of provenance and attribution. Wikipedia keeps track of the date and author of every change in its database, and relentlessly purges anything that is not properly cited. Wikipedia is, in my opinion, the best embodiment of my view that there is no truth on the internet: there are only reified assertions.

Wednesday, May 20, 2009

Reif#&cation Part 2: The Future of RDF, RDFa, and the Semantic Web is Behind Us

In Reif#&cation Part 1, I introduced the concept of reification and its role in RDF and the Semantic Web. In Part 3, I'll discuss the pros and cons of reification. Today, I'll show some RDFa examples.

I've spent the last couple of days catching up on lots of things that have happened over the last few years while the semantic web part of my brain was on vacation. I was hoping to give some examples of reification in RDFa using the vocabulary that Google announced it was supporting, but I'm not going to be able to do that, because the Google vocabulary is structured so that you can't do anything useful with reification. There are some useful lessons to draw from this little fact. First of all, you can usually design your domain model to avoid reification, and you probably should if you can. In the Google vocabulary, a Review is a first-class object with a reviewer property; the assertion that a product has a rating of 3 stars is not made directly by a reviewer, but indirectly by a review created by a reviewer.

Let's take a look at the HTML snippet presented by Google on their help page for RDFa (feel free to skip past the code if you like):


<div xmlns:v="http://rdf.data-vocabulary.org/#"
typeof="v:Review">
<p><strong><span property="v:itemReviewed">
Blast 'Em Up</span>
Review</strong></p>
<p>by <span rel="v:reviewer">
<span typeof="v:Person">
<span property="v:name">Bob Smith</span>,
<span property="v:title">Senior
Editor</span> at ACME Reviews
</span>
</span></p>
<p><span property="v:description">This is a great
game. I enjoyed it from the opening battle to the final
showdown with the evil aliens.</span></p>

</div>

(Note that I've corrected a bunch of Google's sloppy mistakes here: the help page erroneously had "v:person", "v:itemreviewed" and "v:review" where "v:Person", "v:itemReviewed" and "v:Review" would have been correct according to their published documentation. I've also removed an affiliation assertion that is hard to fix for reasons that are not relevant to this discussion, and I've fixed the non-well-formedness of the Google example.)

The six RDF triples embedded here are:

subject: this block of html (call it "ThisReview")
predicate: is of type
object: google-blessed-type "Review"

subject: ThisReview
predicate: is reviewing the item
object: "Blast 'Em Up"

subject: ThisReview
predicate: has reviewer
object: a google-blessed-type "Person"

subject: a thing of google-blessed-type "Person"
(call it BobSmith)
predicate: is named
object: "Bob Smith"

subject: BobSmith
predicate: has title
object: "Senior Editor"

subject: ThisReview
predicate: gives description
object: "This is a great game. I enjoyed it from the
opening battle to the final showdown with the evil
aliens."

Notice that in Google's favored vocabulary, Person and Review are first-class objects and the item being reviewed is not (though they defined a class that might be appropriate). An alternate design would be to make the item a first-class object and the review a predicate that could be applied to RDF statements. The seven triples for that would be:

subject: a thing of google-blessed-type "Product"
(call it BlastEmUp)
predicate: is named
object: "Blast 'Em Up"

subject: BobSmith
predicate: is named
object: "Bob Smith"

subject: BobSmith
predicate: has title
object: "Senior Editor"

subject: an RDF statement (call it TheReview)
predicate: has creator
object: BobSmith

subject: TheReview
predicate: has subject
object: BlastEmUp

subject: TheReview
predicate: has predicate
object: gives description

subject: TheReview
predicate: has object
object: "This is a great game. I enjoyed it from the
opening battle to the final showdown with the evil
aliens."

To put those triples in the same HTML, I do this:


<div xmlns:v="http://rdf.data-vocabulary.org/#"
xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
xmlns:dc="http://purl.org/dc/elements/1.1/"
typeof="rdf:Statement"
rel="dc:creator"
href="#BobSmith">
<p><strong>
<span property="rdf:subject">
<span typeof="v:Product">
<span property="v:name">Blast 'Em Up</span>
</span>
</span> Review</strong></p>
<p>by <span typeof="v:Person" id="BobSmith">
<span property="v:name">Bob Smith</span>,
<span property="v:title">Senior Editor</span>
at ACME Reviews
</span></p>
<p><span property="rdf:predicate"
resource="v:description"/>
<span property="rdf:object">This is a great
game. I enjoyed it from the opening battle
to the final showdown with the evil
aliens.</span></p>
</div>

I've drawn one extra term, "dc:creator", from the venerable Dublin Core vocabulary to do this.

Some observations:
  1. Reification requires a bit of gymnastics even for something simple; if I wanted to reify more than one triple, it would start to look really ugly.
  2. Through use of a thought-out knowledge model, I can avoid the need for reification.
  3. The knowledge model has a huge impact on the way I embed the information.

This last point is worth thinking about further. It means that for you and me to exchange knowledge using RDFa or RDF, we need to share more than a vocabulary; we need to share a knowledge model. It reminds me of another story I heard on NPR, about the Aymara people of the Andean highlands, whose language expresses the future as being behind them, whereas in English and other western languages the future is thought of as being in front of us. We could learn the Aymara words for front and back, but because we don't share the same knowledge model, we wouldn't be able to speak successfully with an Aymara speaker about the past and the future.

Friday, May 15, 2009

Reif#&cation Part 1: RDF and the dry martini

A man walks into a bar. The bartender asks him what he wants. "Nothing," he says.
"So why did you come in here for nothing?" asks the bartender.
"Because nothing is better than a dry martini."

This joke is an example of reification. An abstract concept, "nothing", is linguistically twisted into a real object, resulting in a humorous absurdity. I first encountered the concept when, ten years ago, I learned RDF (Resource Description Framework), the data model designed to be the fundamental underpinning of the semantic web. At that time, I was sure that "reification" was a completely made-up word, jargon stolen from the knowledge representation community. It's only this week that I learned that, in fact, "reification" is a "macaronic calque" translation of a completely made-up German word used prominently by Karl Marx, "Verdinglichung". Somehow that doesn't make me feel much better about the word. If you learn nothing else from reading this, remember that you can use "reification" as a code word to gain admittance to any gathering of the Semantic Webnoscenti.

In RDF, reification is necessary so that stores of triples can avoid self-contradiction. Let me translate that into English. RDF is just a way to say things about other things so that machines can understand. The model is simple enough that machines can gather together large numbers of RDF statements, apply mathematical machinery to the lot of them, and then spit out new statements that make it seem as though the machines are reasoning. The problem is that machines are really stupid, so if you tell them that the sky is blue, and also that the sky is not blue, they can't resolve the contradiction, and they start emitting greenhouse gases out the wazoo and millions of people in low-lying countries lose their homes to flooding. What you need to do instead is to "reify" the contradictory statements and tell the machine "Eric said 'the sky is blue'" and "Bruce said 'the sky is not blue'". RDF, as a system, can't talk about the assertions that it contains without doing the extra step of reifying them.

So let's see how the RDF model accomplishes this (remember, RDF represents assertions as a set of (subject, predicate, object) triples). We start with:
Subject: The sky
Predicate: is colored
Object: blue
And after reification, we have:
Subject: statement x
Predicate: has Subject
Object: The sky

Subject: statement x
Predicate: has Predicate
Object: is colored

Subject: statement x
Predicate: has Object
Object: blue

Subject: Eric
Predicate: said
Object: statement x
So now the statement about the color of the sky has become a real thing within the RDF model, and I can do all sorts of things with it, such as compare it to a dry martini. The downside is that this comes at the cost of turning one triple into three (four, if you count the rdf:type assertion that standard reification also calls for).
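
Here is a hedged sketch of the same thing in Python with rdflib, using the standard reification vocabulary plus an invented ex:said property. It also exposes a subtlety worth noting: reifying a statement describes it, but does not assert it.

from rdflib import BNode, Graph, Namespace
from rdflib.namespace import RDF

EX = Namespace("http://example.org/")   # hypothetical names for the sky, Eric, etc.
g = Graph()

x = BNode()                               # "statement x" from the listing above
g.add((x, RDF.type, RDF.Statement))       # the type triple standard reification also adds
g.add((x, RDF.subject, EX.sky))
g.add((x, RDF.predicate, EX.isColored))
g.add((x, RDF.object, EX.blue))
g.add((EX.Eric, EX.said, x))              # Eric said statement x

# The reified statement is described, not asserted:
print((EX.sky, EX.isColored, EX.blue) in g)   # False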

Reification has analogs in other disciplines. Software developers familiar with object-oriented programming may want to think of reification as making the assertion into a first-class object. Physicists and people who just want their minds blown may want to compare reification to "second quantization". At this point, I'll don my ex-physicist hat (even though I never wore a hat while doing physics!) and tell you that second quantization is the mathematical machinery that allows field theory to treat bundles of waves as if they were real particles that can be created and annihilated.

Whether you're doing linked open data or quantum field theory, it's a good idea to focus on things that behave as if they were real. Otherwise, no dry martinis for you!

This is the first part of three articles on reification. In Part 2, I'll show how reification is applied in a real example, using the newly trendy RDFa. In Part 3, I'll write about whether reification is a good idea.

Tuesday, May 12, 2009

Google, RDFa, and Reusing Vocabularies

Yesterday, I wrote about one difficulty of having machines talk to other machines: propagation and re-use of vocabularies is not something that today's machines know how to do on their own. I thought it would be instructive to work out a real example of how I might find and reuse vocabulary to express that a work has a certain ISBN (International Standard Book Number). What I found (not to my great surprise) was that it wasn't that easy for me, a moderately intelligent human with some experience in RDF development, to find RDF terminology to use. I tried Knoodl, Google, and SchemaWeb to help me.

Before I complete that thought, I should mention that today Google announced that it has begun supporting RDFa and microformats in what it calls "rich snippets". RDFa is a mechanism for embedding RDF in static HTML web pages, while microformats are a simpler and less formalized way to embed metadata in web pages. Using either mechanism, web page authors can hide information meant to be read by machines in the same web pages that humans read.

Concentrating on just the RDFa mechanism, it's interesting to see how Google expects vocabulary to be propagated to agents that want to contribute to the semantic web: Google will announce the vocabulary that it understands, and everyone else will use that vocabulary. Resistance is futile. Not only does Google have the market power to set a de facto standard, but it also has the intellectual power to do a good job of it: one of the engineers on the Google team working on "rich snippets" is Ramanathan V. Guha, who happens to be one of the inventors of RDF.

You would think that it would be easy to find an RDF property declared for use in assertions like "the ISBN of 'Digital Copyright' is 1-57392-889-5". No such luck. Dublin Core, a schema developed in part by the library community, has an "identifier" element that can be qualified to indicate that it contains an ISBN, but no ISBN property. Maybe I just couldn't find it. Similarly, MODS, which is closely related to library standards, has an identifier element that can contain an ISBN, but you have to add type="isbn" to the element to make it one. Documentation for RDFa wants you to use the ISBN to make a URN and then make that URN the subject of your assertion, not an attribute (ignoring the fact that an ISBN identifies the thing you sell in a bookstore, for example the paperback version of a book, rather than what most humans think of as a book). I also found entries for ISBN in schemes like the Agricultural Metadata Element Set v.1.1 and a mention in the IMS Learning Resource Meta-Data XML Binding. Finally, I should note that while OpenURL (a standard that I worked on) provides an XML format that includes an ISBN element, it's defined in such a way that it can't be used in other schemas.
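
For what it's worth, here is roughly what those two workarounds look like as triples in rdflib; dc:identifier and dc:title are real Dublin Core properties, but the book URI and the convention of packing the ISBN into a URN are just the least-bad options I found, not a standard.

from rdflib import Graph, Literal, Namespace, URIRef
from rdflib.namespace import DC

EX = Namespace("http://example.org/books/")   # hypothetical URI for the work
g = Graph()

# Workaround 1: Dublin Core's generic identifier, with the ISBN packed
# into the value because there is no isbn property to use.
g.add((EX.digitalCopyright, DC.identifier, Literal("urn:isbn:1-57392-889-5")))

# Workaround 2 (what the RDFa documentation suggests): make the URN itself
# the subject, so the ISBN identifies the thing being described.
g.add((URIRef("urn:isbn:1-57392-889-5"), DC.title, Literal("Digital Copyright")))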

The case of ISBN illustrates some of the barriers to vocabulary reuse, and although some people are criticizing Google for not reusing vocabulary, you can see why Google thinks it will work better if it just defines vocabulary by fiat.