Library of Congress Compiling Huge Twitter Archive


The U.S. Library of Congress has finished archiving every Twitter post from the first four years after the site's launch. Unfortunately, making that archive useful has proven extremely difficult.

Back in April 2010, the Library of Congress signed a deal to archive Twitter posts, known as 'tweets.' The Library's Gayle Osterberg says those tweets are an important and valid research source.

"As society turns to social media as a primary method of communication and creative expression, social media is supplementing, and in some cases supplanting, letters, journals, serial publications and other sources routinely collected by research libraries," Osterberg said. (Source: loc.gov)

Under the terms of the deal, Twitter agreed to provide all tweets, both past and present, to the Library for archiving. The Library would then be allowed to make any tweet more than six months old available to legitimate researchers.

However, the deal does not allow the Library of Congress to make the Twitter archive available online for download by the general public.

170 Billion Tweets Already Archived

Collating and organizing those tweets by date has turned out to be a major technological challenge. The Library has only just completed archiving those first four years of tweets, totaling approximately 170 billion messages.

That leaves the Library archive nearly three years behind the current activity on Twitter. Furthermore, the number of tweets posted daily has more than tripled since the Library of Congress project began.

Osterberg says the Library has already received more than 400 requests for access to the archive. Those interested include researchers investigating topics ranging from vaccination against disease to stock market activity.

Unfortunately, the Library hasn't been able to fulfill any of these requests, as it's still trying to figure out how to make the data available in a useful manner.

Officials have taken the approach of breaking the archive into individual files, each covering one hour's worth of Twitter posts from around the world.
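To make the hourly layout concrete, here is a minimal Python sketch of grouping posts into one bucket per hour. It is an illustration only, not the Library's actual system: the function names, the bucket naming scheme, the use of UTC, and the sample tweets are all assumptions made for this example.

```python
from collections import defaultdict
from datetime import datetime, timezone


def hourly_bucket(ts):
    """Return an hour-level bucket name for a timestamp (hypothetical naming scheme)."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H")


def partition_by_hour(tweets):
    """Group (timestamp, text) pairs so each bucket holds one hour of posts."""
    buckets = defaultdict(list)
    for ts, text in tweets:
        buckets[hourly_bucket(ts)].append(text)
    return dict(buckets)


# Made-up sample data: two posts in the same hour, one in the next.
sample = [
    (datetime(2010, 4, 14, 15, 5, tzinfo=timezone.utc), "first tweet"),
    (datetime(2010, 4, 14, 15, 59, tzinfo=timezone.utc), "same hour"),
    (datetime(2010, 4, 14, 16, 0, tzinfo=timezone.utc), "next hour"),
]
buckets = partition_by_hour(sample)
```

Each bucket could then be written out as its own file, which keeps any single file manageable but leaves tens of thousands of files to look through.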

Single Search Could Take 24 Hours

The problem now is that searching through the files one at a time is extremely time-consuming. Officials believe that, as things stand, a single search could take about 24 hours to complete. (Source: digitaltrends.com)
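A rough back-of-the-envelope calculation shows what those figures imply. The sketch below assumes exactly four 365-day years of hourly files and an even spread of tweets across them; neither assumption comes from the Library, so the results are ballpark numbers only.

```python
# Figures reported in the article.
TOTAL_TWEETS = 170_000_000_000      # approximately 170 billion messages
SEARCH_SECONDS = 24 * 60 * 60       # one search taking about 24 hours

# Assumption: four 365-day years, one file per hour (leap days ignored).
hourly_files = 4 * 365 * 24

# Average time a sequential scan can spend on each file.
seconds_per_file = SEARCH_SECONDS / hourly_files

# Average number of tweets per file if they were spread evenly.
tweets_per_file = TOTAL_TWEETS / hourly_files

print(f"{hourly_files:,} files")
print(f"{seconds_per_file:.2f} seconds per file")
print(f"{tweets_per_file:,.0f} tweets per file on average")
```

Under these assumptions there are roughly 35,000 files, each holding several million tweets on average, and a 24-hour search works out to only about two and a half seconds per file. The slowness comes from the sheer number of files, not from any one of them.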

The most practical solution would be to provide hundreds or even thousands of machines for use by searchers. That way, a single search request could run across many archive files simultaneously.
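The idea of searching many files at once can be sketched on a small scale. In the Python example below, worker threads on one computer stand in for the many machines described above, and small in-memory lists stand in for the hourly archive files. The function names, file names, and sample tweets are hypothetical and chosen only for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def search_file(item, keyword):
    """Return (file name, tweet) pairs from one file that contain the keyword."""
    name, lines = item
    return [(name, line) for line in lines if keyword.lower() in line.lower()]


def parallel_search(files, keyword, workers=4):
    """Search every file concurrently and merge the hits in file order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        per_file = pool.map(lambda item: search_file(item, keyword), files.items())
        return [hit for hits in per_file for hit in hits]


# Made-up stand-ins for hourly archive files.
files = {
    "2010-04-14T15": ["Got my flu vaccine today", "Lunch was great"],
    "2010-04-14T16": ["Stocks are up again", "Vaccine clinic opens at 9"],
    "2010-04-14T17": ["Nothing to see here"],
}
hits = parallel_search(files, "vaccine")
```

Because each file is searched independently, adding more workers shortens the total wait, which is why the approach scales with the number of machines and also why it becomes expensive.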

Insiders believe such a tactic could help speed up the research process. Sadly, the Library claims that strategy "is cost-prohibitive and impractical for a public institution."

Officials say they are now looking at ways to partner with private firms, using outside technology or financing to make archive access faster and more efficient.
