Twitter to Release All Tweets to Scientists (scientificamerican.com)
153 points by digital55 on May 27, 2014 | 53 comments



I've heard that before.

* Library of Congress: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archi...

* Twitter Data grants: https://blog.twitter.com/2014/introducing-twitter-data-grant...

I'll admit, I haven't applied for access through either one, but neither have I seen any papers cite access through those venues—and I read quite a few NLP + Twitter papers.


This article is just talking about the Twitter Data Grants, for which 6 universities were selected as winners [0]. You won't see papers from these grants yet because, well, the winners were announced only about 40 days ago!

[0] https://blog.twitter.com/2014/twitter-datagrants-selections


from http://www.loc.gov/today/pr/2013/files/twitter_report_2013ja...

"Transfer of Data to the Library

In December, 2010, Twitter named a Colorado-based company, Gnip, as the delivery agent for moving data to the Library. Shortly thereafter, the Library and Gnip began to agree on specifications and processes for the transfer of files - "current" tweets - on an ongoing basis.

In February 2011, transfer of "current" tweets was initiated and began with tweets from December 2010.

On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.

As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies."

I find the quantities hilarious. But since they haven't been able to provide access yet, I'm pessimistic about their prospects of doing so any time soon.

Can we do something to help them?

I've been thinking that GPU-accelerated databases like MapD could mitigate the cost issue for them, but I'm pretty sure that doesn't go all the way to solving the problem...
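A quick back-of-envelope check of those numbers, just as a sketch:

```python
# Rough arithmetic from the Library of Congress report quoted above.
uncompressed_bytes = 20e12   # ~20 TB uncompressed for the 2006-2010 archive
tweet_count = 21e9           # ~21 billion tweets in that archive

bytes_per_tweet = uncompressed_bytes / tweet_count
print(f"~{bytes_per_tweet:.0f} bytes per tweet, metadata included")  # ~952 bytes
```

Which lines up with the roughly-a-kilobyte-per-tweet figure mentioned elsewhere in this thread.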


Twitter started granting access to datasets some time back (now closed) [0], awarded on the merits of a short proposal. The number of groups that eventually got access to the data was very small [1]. I hope they increase the number of grants in the future.

[0] https://blog.twitter.com/2014/introducing-twitter-data-grant...

[1] https://blog.twitter.com/2014/twitter-datagrants-selections


Yes, I am pretty sure this article is just rehashing the Twitter grants (I believe there were only 6 to 8 awards), rather than announcing full open data to any researchers (which makes the title misleading).


This is exciting to me; does anyone know how Twitter will go about this? Will there be a public dataset available for download? A research contract through the recently-acquired GNIP? Or just firehose access for future streams?


Considering there's at least 400GB of data generated per day, I don't think it'll be readily available to the public as a download.


Jeez, 400 GB of text per day? How the hell?

EDIT: That's 11.5k tweets/sec. How do you get eleven thousand people to tweet every second?


Most of that data is not the content of the tweet itself, but the metadata associated with it. When I last checked, we were storing about a kilobyte of data for every tweet.

Also, tweets are limited to 140 characters, not bytes - Chinese tweets typically take about 200-250 bytes, for example.
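To illustrate the characters-versus-bytes point, a tiny sketch (the strings are made up, not real tweets):

```python
# UTF-8 byte counts for two 140-character strings.
ascii_tweet = "a" * 140
cjk_tweet = "中" * 140   # most CJK characters take 3 bytes in UTF-8

print(len(ascii_tweet.encode("utf-8")))  # 140 bytes
print(len(cjk_tweet.encode("utf-8")))    # 420 bytes
```

Real tweets mix scripts, URLs and whitespace, which is why the typical figure lands in the 200-250 byte range rather than at either extreme.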


Yeah, I figured 300 bytes per tweet, to be generous, but didn't realize it would take 1 KB of metadata. Thanks for that detail.


I have access to the Twitter gardenhose (which is slightly less than 10% of the full volume). These are the RX and TX statistics from the machine that I've been using to gather data for a few months now:

RX bytes:18053724505080 (16.4 TiB) TX bytes:2686623557042 (2.4 TiB)

It works out at around 70GB/day, so I'd actually think that the full firehose would use considerably more data than 400GB/day (likely closer to 800GB).
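Scaling that up is straightforward, assuming the gardenhose really is about a 10% sample:

```python
# Extrapolate the observed gardenhose volume to the full firehose.
gardenhose_gb_per_day = 70     # observed above
sample_fraction = 0.10         # gardenhose is slightly less than 10% of the full volume

print(f"~{gardenhose_gb_per_day / sample_fraction:.0f} GB/day for the full firehose")  # ~700 GB/day
```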


Have you ever heard of the 'Human Brain Project'?

Well, they produce 100 GB of data _per second_.

Welcome to the real big data world.


Woah, 100 GB per second???!! What exactly do they do?

Edit: Never mind, this is what they are doing

"We would like to develop some kind of ‘google' brain where we can zoom in and out, see it from different perspectives and understand how brain structure and function is related."


Pick a topic you are familiar with. Open up a twitter search for it. Wait a while. See the inevitable storm of tweet-spam that sort of looks like social sharing.

Thousands of fake accounts are tweeting out nonsense all the time. Another example: there are multiple accounts that tweet items from HN (and presumably lots of other RSS feeds).


Depends on how much metadata there is with every tweet.

I took a random tweet on my timeline, and the JSON representation, with added metadata, weighs 8740 bytes.

With this figure, you would "only" need 500 tweets/sec to get 400GB a day.

There's a lot of people on Twitter, and a lot of bots, so that doesn't seem unreasonable to me.
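Checking that figure, just as arithmetic:

```python
# Tweets per second needed to generate 400 GB/day at ~8.7 KB per tweet.
bytes_per_tweet = 8740
bytes_per_day = 400e9
seconds_per_day = 24 * 60 * 60

print(f"~{bytes_per_day / seconds_per_day / bytes_per_tweet:.0f} tweets/sec")  # ~530 tweets/sec
```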


Maybe it includes images.


It does not include images. A tweet object does include a bunch of metadata though: https://dev.twitter.com/docs/platform-objects/tweets
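For a rough sense of what that metadata looks like, here is an abridged sketch of a Tweet object; the field names come from the docs linked above, the values are invented, and many fields are omitted:

```python
# Abridged Tweet object (illustrative values only).
tweet = {
    "id_str": "123456789012345678",
    "created_at": "Tue May 27 12:00:00 +0000 2014",
    "text": "Hello world",
    "lang": "en",
    "retweet_count": 0,
    "favorite_count": 0,
    "coordinates": None,   # only set if the user opted in to geotagging
    "place": None,
    "entities": {"hashtags": [], "urls": [], "user_mentions": []},
    "user": {
        "id_str": "12345",
        "screen_name": "example",
        "location": "Atlanta, GA",   # free-text profile location, not GPS
        "followers_count": 42,
    },
}
```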


That's a hell of a lot of data! First they'll have to find some alternative to zip to compress that much data!


Doesn't Twitter make a fair bit of money from selling access to various slices of their data? I'd be surprised if they released it all to the general public. I imagine scientists would have to be under some sort of NDA.


The Twitter terms of service prohibit sharing the data in the Tweets. Researchers are allowed to share tweet IDs and user IDs, which can be used to retrieve the corresponding Tweets. Currently, Twitter collections are shared using this method -- I recently released a dataset of 120 million tweet IDs covering a sample of a month's worth of data, and numerous other researchers have used these IDs to crawl Twitter and obtain the same dataset I used in my experiments.
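A rough sketch of that hydration step, assuming you already have OAuth credentials and a list of tweet ID strings (the helper name is mine, not from any official tool):

```python
# Hydrate tweet IDs back into full Tweet objects via statuses/lookup,
# which accepts up to 100 IDs per request.
import requests
from requests_oauthlib import OAuth1

auth = OAuth1("CONSUMER_KEY", "CONSUMER_SECRET", "ACCESS_TOKEN", "ACCESS_SECRET")
LOOKUP_URL = "https://api.twitter.com/1.1/statuses/lookup.json"

def hydrate(tweet_ids):
    """Yield full Tweet objects for the given ID strings, 100 at a time."""
    for i in range(0, len(tweet_ids), 100):
        batch = tweet_ids[i:i + 100]
        resp = requests.get(LOOKUP_URL, params={"id": ",".join(batch)}, auth=auth)
        resp.raise_for_status()
        for tweet in resp.json():
            yield tweet
```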


This would likely make a great natural language data set for compression algorithms.


@JoshTriplett tweets r alrdy #compressed. hth


Upvoted for much needed humour on HN.


True, but doesn't Twitter already provide an API for access to a fraction of the firehose? Surely that would be enough data. If Twitter doesn't have a good API, Reddit allows full access to all comments through their API (although Reddit has orders of magnitude less data).


Twitter's API is too limited for historical data (you'll hit the rate limits quickly for any meaningful volume). Reddit's rate limits, however, let you process a million comments every day.


It appears as if the data is only available to those scientists who apply for the data grant and win it. Furthermore, applications for the grant have been closed since midway through March. Yeah, I'm not surprised Twitter isn't making its historical data public. That would effectively end Gnip, which is a revenue source for Twitter not based on advertising to users.


How about just loosening the API rate limits, or making a better token request process with resource allocation, e.g. "I'd like 1500 requests per 15-minute window (as opposed to 15, for some things) for 72 hours." I guess this could be limited to those with an academic email address if they insist.
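To put the difference in perspective, a quick back-of-envelope, assuming 100 tweets per request:

```python
# Tweets per day obtainable at different per-window request quotas.
tweets_per_request = 100
windows_per_day = 24 * 4   # number of 15-minute windows in a day

for requests_per_window in (15, 1500):
    per_day = requests_per_window * tweets_per_request * windows_per_day
    print(f"{requests_per_window} req/window -> {per_day:,} tweets/day")
# 15 req/window -> 144,000 tweets/day
# 1500 req/window -> 14,400,000 tweets/day
```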


One thing I've wondered: is it possible to follow "everyone" on Twitter? If not, what type of cap does Twitter enforce on the number of accounts you're allowed to follow? I realize it'd be difficult to know which new accounts to add as people join, but how far could you push a roll-your-own stream of the Twitter firehose?


Hypothetically: I'm sure you could do it algorithmically; if your program sees a retweet from someone who is not on your following list, you then follow them. You might miss a few, but you would get most everyone.
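A rough sketch of that idea with the tweepy 3.x streaming API (Twitter's follow caps and rate limits would stop this long before you reached "everyone"; it is only meant to show the shape of the approach):

```python
# Follow the author of any retweeted tweet we haven't seen before,
# using the public ~1% sample stream as the discovery source.
import tweepy

auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
auth.set_access_token("ACCESS_TOKEN", "ACCESS_SECRET")
api = tweepy.API(auth, wait_on_rate_limit=True)

seen = set()

class FollowEveryoneListener(tweepy.StreamListener):
    def on_status(self, status):
        # Retweets carry the original tweet under retweeted_status.
        original = getattr(status, "retweeted_status", None)
        if original is not None and original.user.id not in seen:
            seen.add(original.user.id)
            api.create_friendship(user_id=original.user.id)

stream = tweepy.Stream(auth=api.auth, listener=FollowEveryoneListener())
stream.sample()
```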



A release for journalists would be nice too...


I can't wait to see someone legitimately design a better sewerage system by using twitter's geolocation.


Their geo-data is utter crap. The vast majority of it is based on 'profile location', which means that there are almost a million people tweeting from the exact center of Atlanta. It's a crowded spot, must be a Starbucks there or something.


Just find those mass locations and remove them from the data set.
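Something like this could serve as a crude filter; the field name and threshold here are made up for the sketch:

```python
# Drop coordinates that an implausible number of tweets share exactly,
# e.g. city centroids filled in from free-text profile locations.
from collections import Counter

def drop_mass_locations(tweets, max_per_point=1000):
    """Keep only tweets whose exact (lat, lon) pair is shared by few others."""
    counts = Counter(t["coords"] for t in tweets if t.get("coords"))
    return [t for t in tweets
            if t.get("coords") and counts[t["coords"]] <= max_per_point]
```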


You end up with about 0.01% of tweets having locations after that. It's basically just iPhone users.


It is all available to the public to begin with anyway. I don't see the dilemma here.


Not all tweets are public.


Surely they aren't coughing up private tweets?


That was my interpretation of "all", but reading back it seems there's another difference:

"Although a majority of tweets are public, if scientists want to freely search the lot, they do it through Twitter's application programming interface, which currently scours only 1 percent of the archive. But that is about to change: in February the company announced that it will make all its tweets, dating back to 2006, freely available to researchers."


I think all the data should be available.


Can anyone find a primary source?


There is a link to apply for a "Data Grant" here: https://engineering.twitter.com/research. Unfortunately, it looks like submissions are closed.


So much for the "protected" tweet illusion.


Who cares? Isn't it already available?


Yes it is, however not in Excel. (Written so a non-tech person could understand.)


Article date is Jun 1, 2014. Is the author from the future?


There is no magic discovery about the nature of man hidden away in that data. Nothing your average stand-up comedian hasn't already written a bit about.


That was a good one.

On a more serious note, there is one area of research for which Twitter data is very valuable: how information and disinformation are created and spread during major news events such as wars, catastrophes, uprisings, and school shootings.

My gut feeling is that today a lot of quality journalism happens outside of the traditional journalistic organisations. The downside is that a lot of wild speculation and rumours are also spread, but it would be valuable to see how good this modern "crowd" journalism is. A skilled research group can use Twitter data and the Internet Archive to track down the original sources of information pretty well.


And momentum stock trades. I've always felt an adept programmer who knew how to data mine and worked for Twitter could make a fortune in the stock market.


That's a little cynical. I like to think we can predict big historical social events (regime changes, major protests, climate change?) based on Twitter's data.

You never know unless you try.


Are you familiar with Atrocity Watch? If not, you should look them up - that's exactly what they do.


I am not. Checking them out now; looks pretty cool.


Not to be skeptical, but I'm pretty sure one of these scientists may happen to work for the NSA.



