Showing posts with label twitter.

Thursday, February 11, 2021

More On Archiving Twitter

Himarsha Jayanetti from Michael Nelson's group at Old Dominion follows up on the work I discussed in "Michael Nelson's Group On Archiving Twitter" with "Twitter rewrites your URLs, but assumes you’ll never rewrite theirs: more problems replaying archived Twitter":
Source
URLs shared on Twitter are automatically shortened to t.co links. Twitter does this to track its engagements and also protect its users from sites with malicious content. Twitter replaces these t.co URLs with HTML that suggests the original URL so that the end-user does not see the t.co URLs while browsing. When these t.co URLs are replayed through web archives, they are rewritten to an archived URL (URI-M) and should be rendered in the web archives as in the live web, without displaying these t.co URI-Ms to the end-user.
But, as the screen-grab from the Wayback Machine shows, they may not be. Below the fold, a look at Jayanetti's explanation.
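The rewriting step itself is simple to sketch. A minimal illustration, assuming a Wayback-style URI-M layout (the replay prefix, timestamp, and t.co token below are made up for this example, not taken from Jayanetti's post):

```python
# Minimal sketch of replay-time link rewriting, assuming a Wayback-style
# URI-M layout. The replay prefix, timestamp, and t.co token are made up
# for illustration; they are not taken from Jayanetti's post.
import re

REPLAY_PREFIX = "https://web.archive.org/web/20210211000000/"  # hypothetical memento timestamp

def rewrite_tco_links(html: str) -> str:
    """Rewrite every https://t.co/... href into an archived URL (URI-M)."""
    return re.sub(
        r'href="(https?://t\.co/[A-Za-z0-9]+)"',
        lambda m: f'href="{REPLAY_PREFIX}{m.group(1)}"',
        html,
    )

snippet = '<a href="https://t.co/AbCd1234">example.com/some-article</a>'
print(rewrite_tco_links(snippet))
# The reader still sees "example.com/some-article" as the link text;
# only the href should point back into the archive.
```

The failure Jayanetti documents is what happens when this archive-side rewriting and Twitter's own URL handling interact badly, leaving raw t.co URI-Ms visible to the reader.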

Tuesday, December 29, 2020

Michael Nelson's Group On Archiving Twitter

The rise and fall of the Trump administration has amply illustrated the importance of Twitter in the historical record. Alas, Twitter has no economic motivation to cater to the needs of historians. As they work to optimize Twitter's user experience, the engineers are likely completely unaware of the problems they are causing the Web archives trying to preserve history. Even if they were aware, they would be unable to justify the time and effort necessary to mitigate them.

Over the last six months Michael Nelson's group at Old Dominion University have continued their excellent work to evaluate exactly how much trouble future historians will have to contend with, in three new blog posts from Kritika Garg and Himarsha Jayanetti.
Below the fold, some commentary on each of them.

Tuesday, July 21, 2020

Twitter Fails Security 101 Again

Source
On July 15 the New York Times reported on the day's events at Twitter:
It was about 4 in the afternoon on Wednesday on the East Coast when chaos struck online. Dozens of the biggest names in America — including Joseph R. Biden Jr., Barack Obama, Kanye West, Bill Gates and Elon Musk — posted similar messages on Twitter: Send Bitcoin and the famous people would send back double your money.
Two days later Nathaniel Popper and Kate Conger's Hackers Tell the Story of the Twitter Attack From the Inside was based on interviews with some of the perpetrators:
Mr. O'Connor said other hackers had informed him that Kirk got access to the Twitter credentials when he found a way into Twitter’s internal Slack messaging channel and saw them posted there, along with a service that gave him access to the company’s servers. People investigating the case said that was consistent with what they had learned so far. A Twitter spokesman declined to comment, citing the active investigation.
Below the fold, some commentary on this and other stories of the fiasco.

Thursday, March 28, 2019

The 47 Links Mystery

Nearly a year ago, in All Your Tweets Are Belong To Kannada, I blogged about Cookies Are Why Your Archived Twitter Page Is Not in English. It describes some fascinating research by Sawood Alam and Plinio Vargas into the effect of cookies on the archiving of multi-lingual web-sites.

Sawood Alam just followed up with Cookie Violations Cause Archived Twitter Pages to Simultaneously Replay In Multiple Languages, another fascinating exploration of these effects. Follow me below the fold for some commentary.
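The root cause, as I read those posts, is that Twitter's language selection is sticky, so a crawler that reuses cookies across captures carries the language from one capture to the next. A rough sketch of that failure mode (the `?lang=kn` parameter and the cookie name `lang` are my illustrative assumptions about Twitter's behaviour at the time, not verified here):

```python
# Rough sketch of the contamination, assuming Twitter set a sticky language
# preference via a cookie. The "?lang=kn" parameter and the cookie name
# "lang" are illustrative assumptions, not verified details.
import requests

with requests.Session() as crawler:   # many crawlers reuse cookies like this
    # One capture that happens to carry an explicit language parameter
    # leaves a language cookie behind on the session...
    crawler.get("https://twitter.com/?lang=kn", timeout=30)
    print("language cookie now held by the crawler:", crawler.cookies.get("lang"))

    # ...so a later, unrelated capture in the same session is served in
    # Kannada, and that is the version that gets written into the archive.
    memento_html = crawler.get("https://twitter.com/BarackObama", timeout=30).text
```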

Monday, July 2, 2018

Josh Marshall on Facebook

Last September in Josh Marshall on Google, I wrote:
a quick note to direct you to Josh Marshall's must-read A Serf on Google's Farm. It is a deep dive into the details of the relationship between Talking Points Memo, a fairly successful independent news publisher, and Google. It is essential reading for anyone trying to understand the business of publishing on the Web.
Marshall wasn't happy with TPM's deep relationship with Google. In Has Web Advertising Jumped The Shark? I quoted him:
We could see this coming a few years ago. And we made a decisive and longterm push to restructure our business around subscriptions. So I'm confident we will be fine. But journalism is not fine right now. And journalism is only one industry the platform monopolies affect. Monopolies are bad for all the reasons people used to think they were bad. They raise costs. They stifle innovation. They lower wages. And they have perverse political effects too. Huge and entrenched concentrations of wealth create entrenched and dangerous locuses of political power.
Have things changed? Follow me below the fold.

Tuesday, April 24, 2018

All Your Tweets Are Belong To Kannada

Gerd Badur, CC BY-SA 3.0, Source
Sawood Alam and Plinio Vargas have a fascinating blog post documenting their investigation into why:
47% of mementos of Barack Obama's Twitter page were in non-English languages, almost half of which were in Kannada alone. While language diversity in web archives is generally a good thing, in this case though, it is disconcerting and counter-intuitive.
Kannada is an Indian language spoken by only about 38 million people. Below the fold, some commentary.
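Their starting point, surveying what languages the archived copies are actually in, can be reproduced in outline. A hedged sketch (the sample size, the CDX API parameters, and the use of langdetect are my choices for illustration, not necessarily what Alam and Vargas did):

```python
# Sketch of a language survey: list mementos of the page via the Wayback
# Machine CDX API, fetch a sample, and guess each copy's language.
import requests
from langdetect import detect   # pip install langdetect

cdx = requests.get(
    "https://web.archive.org/cdx/search/cdx",
    params={"url": "twitter.com/BarackObama", "output": "json",
            "filter": "statuscode:200", "limit": "50"},
    timeout=60,
).json()

header, rows = cdx[0], cdx[1:]
ts, orig = header.index("timestamp"), header.index("original")
for row in rows:
    memento = f"https://web.archive.org/web/{row[ts]}/{row[orig]}"
    html = requests.get(memento, timeout=60).text
    print(row[ts], detect(html))   # crude: detects on raw HTML, not extracted text
```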

Wednesday, January 21, 2015

New Yorker on Web Archiving

Do not hesitate, do not pass Go, right now please read Jill Lepore's really excellent New Yorker article Cobweb: can the Web be archived?

Monday, April 7, 2014

What Could Possibly Go Wrong?

I gave a talk at UC Berkeley's Swarm Lab entitled "What Could Possibly Go Wrong?" It was an initial attempt to summarize for non-preservationistas what we have learnt so far about the problem of preserving digital information for the long term in the more than 15 years of the LOCKSS Program. Follow me below the fold for an edited text with links to the sources.

Tuesday, August 20, 2013

Annotations

Caroline O'Donovan at the Nieman Journalism Lab has an interesting article entitled Exegesis: How early adapters, innovative publishers, legacy media companies and more are pushing toward the annotated web. She discusses the way media sites including The New York Times, The Financial Times, Quartz and SoundCloud and platforms such as Medium are trying to evolve from comments to annotations as a way to improve engagement with their readers. She also describes the work hypothes.is is doing to build annotations into the Web infrastructure. There is also an interesting post on the hypothes.is blog from Peter Brantley on a workshop with journalists. Below the fold, some thoughts on the implications for preserving the Web.

Tuesday, August 13, 2013

Winston Smith Lives!

Three years ago I wrote a post on the importance of a tamper-resistant system for government documents, and another a year ago. Governments cannot resist the temptation to re-write history to their advantage, and every so often they get caught, which is an excuse for me to repeat the message. Below the fold, this year's version of the message.

Tuesday, January 29, 2013

DAWN vs. Twitter

I blogged three weeks ago about the Library of Congress ingesting the Twitter feed, noting that the tweets were ending up on tape. The collection is over 130TB and growing by about 190GB/day. The Library is still trying to work out how to provide access to it; for example, they cannot afford the infrastructure that would allow readers to perform keyword searches. This stymies the 400-odd researchers who have already expressed a need for access to the collection. The British Library is also running into problems providing access to large collections, although none as large as Twitter's. They are reduced to delivering 30TB NAS boxes to researchers, the same approach Amazon and other services have taken to moving large amounts of data.
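The arithmetic behind shipping physical boxes instead of providing network access is stark. A back-of-envelope sketch (the link speeds are my illustrative assumptions, not figures from either library):

```python
# Back-of-envelope: why ship a 30TB NAS box instead of using the network?
# The link speeds below are illustrative assumptions, not figures from the post.
TB = 1e12          # bytes
box = 30 * TB      # one NAS box full of data

for name, bits_per_second in [("100 Mb/s", 100e6), ("1 Gb/s", 1e9)]:
    days = box * 8 / bits_per_second / 86400
    print(f"transfer over {name}: {days:.1f} days")
# 100 Mb/s: ~28 days; 1 Gb/s: ~2.8 days. An overnight courier moves the box
# in about a day, an effective ~2.8Gb/s, before the researcher even has the
# storage and compute needed to search it. Meanwhile the full collection
# grows by roughly 190GB/day, so any shipped copy starts going stale at once.
```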

I mentioned this problem in passing in my earlier post, but I have come to understand that this observation has major implications for the future of digital preservation. Follow me below the fold as I discuss them.

Friday, January 4, 2013

Go Library of Congress!

Carl Franzen at talkingpointsmemo.com pointed me to the report from the Library of Congress on the state of their ingest of the Twitter-stream. Congratulations to the team for two major achievements:
  • getting to the point where they have caught up with ingesting the past, even though some still remains to be processed into its final archival form,
  • and having an automated process in place capable of ingesting the current tweets in near-real-time.
The numbers are impressive:
On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies.
Notice the roughly 10-to-1 compression ratio; each copy of the archive would be in the region of 1.3PB uncompressed. The average compressed tweet takes up about 130*10^12/(2*170*10^9) ≈ 380 bytes, so the metadata is far bigger than the 140 or fewer characters of the tweet itself. The Library is ingesting about 0.5*10^9 tweets/day at roughly 380 bytes/tweet, or about 190GB/day, or about 2.2MB/s of bandwidth ignoring overhead (a short calculation below checks these numbers). These numbers will grow as the flow of tweets increases. The data ends up on tape:
Tape archives are the Library’s standard for preservation and long-term storage. Files are copied to two tape archives in geographically different locations as a preservation and security measure.
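For readers who want to check the arithmetic, here it is as a short calculation using the report's figures (the half-billion tweets/day rate is the same rough contemporary figure used above):

```python
# Checking the back-of-envelope numbers against the Library's reported figures.
total_tweets   = 170e9      # tweets held as of December 1, 2012
compressed     = 133.2e12   # bytes for two compressed copies
copies         = 2

per_tweet = compressed / copies / total_tweets
print(f"compressed size per tweet: {per_tweet:.0f} bytes")        # ~392, i.e. roughly 380-390

tweets_per_day = 0.5e9      # rough contemporary rate assumed above
daily = tweets_per_day * per_tweet
print(f"daily ingest: {daily / 1e9:.0f} GB/day")                  # ~196 GB/day

print(f"sustained rate: {daily / 86400 / 1e6:.1f} MB/s "
      f"({daily * 8 / 86400 / 1e6:.1f} Mb/s)")                    # ~2.3 MB/s, ~18 Mb/s
```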
The scale and growth rate of this collection explain the difficulties the Library has in satisfying the 400-odd requests it already has from scholars to access it for research purposes:
The Library has assessed existing software and hardware solutions that divide and simultaneously search large data sets to reduce search time, so-called “distributed and parallel computing”. To achieve a significant reduction of search time, however, would require an extensive infrastructure of hundreds if not thousands of servers. This is cost-prohibitive and impractical for a public institution.
This is a huge and important effort. Best wishes to the Library as they struggle with providing access and keeping up with the flow of tweets.