Showing posts with the label web page title

2010-02-17: Using Web Page Titles to Rediscover Lost Web Pages

The goal of my project was to determine whether a web page's title can be used to find the page in the Yahoo search engine's cache. Lost pages, for this project, are pages that return a 404. A 404 response code is an error message indicating that the client was able to communicate with the server but the server could not find what was requested. There are many reasons why a page or an entire web site may disappear. Such pages may survive only in the caches of search engines or in web archives, or they may simply have moved from one URI to another. In the context of this experiment, titles are denoted by the TITLE element within a web page. There can be only one title in a web page. The title may not contain anchors, highlighting, or paragraph marks. Ideally, this experiment would use all URIs as its collection set. Regrettably, using the entire web as our test set is unrealistic. Capturing a representative sample set of web-sites...
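As a rough illustration of the two building blocks described above, the following sketch extracts the text of the first TITLE element from an HTML document and flags a response as "lost" when its status code is 404. It uses only the Python standard library. The names `TitleParser`, `extract_title`, and `is_lost` are hypothetical and are not taken from the project itself, which may have handled these steps differently.

```python
from html.parser import HTMLParser


class TitleParser(HTMLParser):
    """Collects the text of the first TITLE element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False      # set once the first TITLE has been closed
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag names, so <TITLE> arrives as "title".
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True

    def handle_data(self, data):
        if self._in_title:
            self._chunks.append(data)

    @property
    def title(self):
        # Collapse runs of whitespace; return None if no title text was found.
        text = " ".join("".join(self._chunks).split())
        return text or None


def extract_title(html):
    """Return the page title as a string, or None if there is none."""
    parser = TitleParser()
    parser.feed(html)
    parser.close()
    return parser.title


def is_lost(status_code):
    """A page counts as lost here when the server answers with a 404."""
    return status_code == 404
```

The extracted title string could then be submitted as a search query; how the query is phrased and sent to a search engine is outside the scope of this sketch.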

2009-07-17: Technical Report "Evaluating Methods to Rediscover Missing Web Pages from the Web Infrastructure"

This week I uploaded the technical report, co-authored with Michael L. Nelson, to the e-print service arxiv.org. The underlying idea of this research is to utilize the web infrastructure (search engines, their caches, the Internet Archive, etc.) to rediscover missing web pages, that is, pages that return the 404 "Page not Found" error. We apply various methods to generate search engine queries based on the content of the web page and on user-created annotations about the page. We then compare the retrieval performance of all methods and introduce a framework to combine such methods to achieve the optimal retrieval performance. The applied methods are:

- 5- and 7-term lexical signatures of the page
- the title of the page
- tags users annotated the page with on delicious.com
- 5- and 7-term lexical signatures of the page neighborhood (up to 50 pages linking to the missing page)

We query the big three search engines (Google, Yahoo and MSN Live) with the outcome of all methods and analyze t...
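To make the idea of a k-term lexical signature more concrete, here is a minimal sketch that ranks the terms of a page by a TF-IDF style score and keeps the top k. This is an assumption about how such a signature can be computed, not a description of the report's exact procedure: the stopword list, the tokenizer, the smoothed IDF formula, and the `doc_freq` and `n_docs` inputs (document frequencies from some reference corpus) are all hypothetical choices made for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not the report's list).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}


def tokenize(text):
    """Lowercase the text and keep alphanumeric tokens that are not stopwords."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS]


def lexical_signature(text, doc_freq, n_docs, k=5):
    """Return the k terms of `text` with the highest TF-IDF style score.

    doc_freq maps a term to the number of corpus documents containing it;
    n_docs is the corpus size. Both are assumed to come from elsewhere.
    """
    tf = Counter(tokenize(text))

    def score(term):
        df = doc_freq.get(term, 0)
        # Smoothed inverse document frequency: rare terms score higher.
        return tf[term] * math.log((n_docs + 1) / (df + 1))

    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(tf, key=lambda t: (-score(t), t))
    return ranked[:k]
```

Calling `lexical_signature(page_text, doc_freq, n_docs, k=7)` would give a 7-term variant; the resulting terms can be joined with spaces to form a search engine query.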