
Showing posts with the label Wikipedia

2023-10-10: In appreciation of the "ridiculous and unworkable" projects that make the Internet great and research possible

[Image: xkcd #2085, https://xkcd.com/2085/]

The Internet Archive is hosting their annual celebration this week (October 12, 2023), and I wanted to take this opportunity to both 1) encourage your attendance (virtual for most of us, but if you're in San Francisco, you can attend in person), and 2) express my appreciation and gratitude for the continued existence of the Internet Archive, their evolving products and services, and their support of the research community.

The ongoing devolvement of Twitter into 4chan has caused me to reflect on the platforms, services, and corpora on which I have built a research program over the last 20+ years. Discussing the Twitter situation will be the topic of a future post, but here I want to laud the Internet Archive, specifically the Wayback Machine, and by extension the suite of other public web archives, such as Archive.Today, Arquivo.pt, and the many members of the IIPC. In the past I've referred to the Internet Archive as th...

2019-06-05: Wikis Are Archives: Integrating Memento and Mediawiki

Since 2013, I have been a principal contributor to the Memento MediaWiki Extension. We recently released version 2.2.0 to support MediaWiki versions 1.31.1 and greater. During the extension's development, I have detailed some of its concepts on this blog, presented it at WikiConference USA 2014, and even helped the W3C adopt it. It became the cornerstone of my Master's Thesis, where I showed how the Memento MediaWiki Extension could help people avoid spoilers on fan wikis.

Why do Memento and MediaWiki belong together?

[Figure: the "dimensions of genericity" table from "Web Architecture: Generic Resources" by Tim Berners-Lee (1996), annotated to display the RFCs that implemented these dimensions for the Web.]

Memento is not limited to web archives. When Tim Berners-Lee was developing the Web, he identified four dimensions of genericity: time, language, content-type, and target medium. HTTP enthusiasts will recognize that three of thes...
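The time dimension is what Memento (RFC 7089) standardizes: a client asks a TimeGate for a resource as it existed at a given datetime via the Accept-Datetime request header, and the TimeGate redirects to the closest memento. Here is a minimal sketch of that exchange in Python using the requests library against the Internet Archive's public TimeGate; the target URL is just an illustrative example, not tied to the extension itself:

```python
import requests

# Memento (RFC 7089) adds a time dimension to HTTP content negotiation:
# the client sends Accept-Datetime to a TimeGate, which redirects to the
# memento (archived copy) captured closest to the requested datetime.
# web.archive.org/web/ is the Internet Archive's public TimeGate.
timegate = "http://web.archive.org/web/"
target = "http://www.cnn.com/"  # illustrative target resource

response = requests.get(
    timegate + target,
    headers={"Accept-Datetime": "Tue, 01 Jan 2013 00:00:00 GMT"},
)

# After following the redirect, the Memento-Datetime header states when
# this copy was captured, and the Link header relates it to the original
# resource and its TimeMap (the list of all known mementos).
print(response.url)
print(response.headers.get("Memento-Datetime"))
print(response.headers.get("Link", "")[:200])
```

A wiki already stores every revision of every page, so the Memento MediaWiki Extension can answer the same Accept-Datetime negotiation from the wiki's own revision history rather than from a crawled archive.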

2018-12-03: Using Wikipedia to build a corpus, classify text, and more

Wikipedia is an online encyclopedia, available in 301 different languages and constantly updated by volunteers. Wikipedia is not only an encyclopedia; it has also been used as an ontology to build corpora, classify entities, cluster documents, create annotations, recommend documents to users, etc. Below, I review some of the significant publications in these areas.

Using Wikipedia as a corpus: Wikipedia has been used to create corpora for text classification or annotation. In "Named entity corpus construction using Wikipedia and DBpedia ontology" (LREC 2014), YoungGyum Hahm et al. created a method that uses Wikipedia, DBpedia, and SPARQL queries to generate a named entity corpus, and the method can be applied to any language. Fabian Suchanek used Wikipedia, WordNet, and Geonames to create an ontology called YAGO, which contains over 1.7 million entities and 15 million facts. The paper "YAGO: A large ontology from Wikipedia ...
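To give a flavor of the DBpedia side of such a pipeline, here is a minimal sketch (not taken from the paper) that sends a SPARQL query to the public DBpedia endpoint for entities typed as dbo:Person, the kind of typed seed list from which a named entity corpus could be built; the endpoint and query are illustrative assumptions:

```python
import requests

# Query DBpedia's public SPARQL endpoint for a small sample of
# entities typed as dbo:Person, along with their English labels.
endpoint = "https://dbpedia.org/sparql"
query = """
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?person ?label WHERE {
  ?person a dbo:Person ;
          rdfs:label ?label .
  FILTER (lang(?label) = "en")
}
LIMIT 10
"""

response = requests.get(
    endpoint,
    params={"query": query, "format": "application/sparql-results+json"},
)
response.raise_for_status()

# Each binding pairs a DBpedia resource URI with its English label;
# such labeled, typed entities can seed a named-entity corpus.
for binding in response.json()["results"]["bindings"]:
    print(binding["person"]["value"], "->", binding["label"]["value"])
```

Swapping the language tag in the FILTER clause is what makes this style of corpus construction portable across Wikipedia's language editions.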