Posts

Showing posts with the label public web archives

2025-06-10: Comparing the Archival Rate of Arabic and English News Stories Published Between 1999 and 2022

About 0.5% of websites publish their content in Arabic, which ranks 20th among languages on the web; however, Arabic is the 6th most spoken language in the world, at 3.4%. A considerable portion of Arabs live in English-speaking countries. For example, Arabs make up roughly 1.2% of the U.S. population. Some of them, mainly the first generation, can consume news in Arabic as well as English. Second-, third-, and fourth-generation Arabs might be interested in the Arabic narrative of news stories, but they prefer English since it is their first language. In this post, we present a quantitative study of the archival rate of news webpages published in Arabic compared with news pages published in English by Arabic media from 1999 to 2022. We reveal that, contrary to the general conjecture that web archives favor English webpages, the archival rate of Arabic webpages is increasing more rapidly than the archival rate of English webpages. The Dataset Our d...

2025-06-10: The Wayback Machine is now much larger than the sum of all other web archives

The overlap in web archives' holdings of URIs in our sample. In this post, we summarize our study of the archiving rate of news stories published in Arabic and English by four major news outlets: Aljazeera Arabic, Aljazeera English, Alarabiya, and Arab News. We found that 45% of news stories' URIs published between 1999 and 2022 were not archived at all. Furthermore, 65% of news stories published between 1999 and 2013 were not archived, whereas only 21% of stories published between 2013 and 2022 were not archived. Our findings line up with Ainsworth et al. (2011), who found that between 30% and 90% of the web is archived. Our results indicate a notable improvement in web archiving within the last decade; however, we found that improvement to be limited to the Internet Archive. An earlier study by Alsum et al. (2014), on a different dataset, found that it is possible to retrieve full TimeMaps for 55% of their dataset using the top three web archives excl...