Showing posts with the label data set

2013-07-15: Temporal Intention Relevancy Model (TIRM) Data Set

On the third anniversary of the Haiti earthquake, President Barack Obama held a press conference and discussed the need to keep helping the Haitian community and to invest more in rebuilding the economy. A user who was watching the press conference tweeted about it on the 14th of January and provided a link to the streamed news. A couple of days later, when I read this tweet and clicked on the link, instead of seeing anything related to the press conference, Haiti, or President Obama, I got a stream feed of the Mercedes-Benz Superdome in New Orleans in preparation for the 2013 Super Bowl. It is worth mentioning that by the time of writing this post the tweet had been deleted, showing that social posts don't persist over time, as we discussed in our earlier post. This scenario illustrates the problem we are trying to detect, model, and solve. The inconsistency between what is intended at the time of sharing and what the reader sees at the time of c...

2011-06-17: The "Book of the Dead" Corpus

We are delighted to introduce the "Book of the Dead", a corpus of missing web pages. The corpus contains 233 URIs, all of which are dead, meaning they result in a 404 "Page not Found" response. The pages were collected during a crawl, conducted by the Library of Congress between 2004 and 2006, of web pages related to the topics of federal elections and terror. We created the corpus to test the performance of our methods to rediscover missing web pages, introduced in the paper "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure", published at JCDL 2010. In addition, we now thankfully have Synchronicity, a tool that can help overcome, in real time, the damage that 404 errors do to everyone's browsing experience. To the best of our knowledge, the Book of the Dead is the first corpus of this kind. It is publicly available, and we hope that fellow researchers can benefit from it in related work. The corpus can be downloaded at:...