I'll admit, I haven't applied for access through either one, but neither have I seen any papers cite access through those venues—and I read quite a few NLP + Twitter papers.
This article is just talking about the Twitter Data Grants, for which 6 universities were selected as winners [0]. You won't see papers from these grants yet because, well, the winners were announced only about 40 days ago!
"In December, 2010, Twitter named a Colorado-based company, Gnip, as the delivery agent for moving data to the Library.
Shortly thereafter, the Library and Gnip began to agree on specifications and processes for the transfer of files - "current" tweets - on an ongoing basis.
In February 2011, transfer of "current" tweets was initiated and began with tweets from December 2010.
On February 28, 2012, the Library received the 2006-2010 archive through Gnip in three compressed files totaling 2.3 terabytes. When uncompressed the files total 20 terabytes. The files contained approximately 21 billion tweets, each with more than 50 accompanying metadata fields, such as place and description.
As of December 1, 2012, the Library has received more than 150 billion additional tweets and corresponding metadata, for a total including the 2006-2010 archive of approximately 170 billion tweets totaling 133.2 terabytes for two compressed copies."
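A quick back-of-the-envelope check on the quoted figures, as a rough sketch only. It assumes decimal terabytes (the post does not say whether TB means 10^12 or 2^40 bytes) and that the two compressed copies are the same size; the function name is made up for illustration.

```python
# Figures quoted from the Library of Congress update above.
TB = 10**12  # assumption: decimal terabytes; the post does not specify

archive_tweets = 21 * 10**9        # 2006-2010 archive, approximate
archive_uncompressed = 20 * TB     # size of that archive when uncompressed
total_tweets = 170 * 10**9         # total as of December 1, 2012, approximate
total_two_copies = 133.2 * TB      # two compressed copies combined


def bytes_per_tweet(total_bytes, tweets, copies=1):
    """Average stored bytes per tweet for one copy of the data."""
    return total_bytes / copies / tweets


# Roughly 950 bytes per tweet uncompressed, roughly 390 compressed.
uncompressed = bytes_per_tweet(archive_uncompressed, archive_tweets)
compressed = bytes_per_tweet(total_two_copies, total_tweets, copies=2)
print(round(uncompressed), round(compressed))
```

Under those assumptions the uncompressed figure lands a little under a kilobyte per tweet, metadata included.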
I find the quantities hilarious.
But since they haven't managed to provide access yet, I'm pessimistic about their prospects of doing so any time soon.
Can we do something to help them?
I've been thinking that GPU-accelerated databases like MapD could mitigate the cost issue for them, but I'm pretty sure that doesn't go all the way to solving the problem...
Some time back, Twitter started granting datasets on the merits of a short proposal (applications are now closed)[0]. The number of groups that eventually got access to the data was very small[1]. I hope they increase the number of grants in the future.
Yes, I am pretty sure this article is just rehashing the Twitter grants (I believe there were only 6 to 8 awards), rather than announcing full open data for any researcher (which makes the title misleading).
This is exciting to me; does anyone know how Twitter will go about this? Will there be a public dataset available for download? A research contract through the recently-acquired GNIP? Or just firehose access for future streams?
Most of that data is not the content of the tweet itself, but the metadata associated with it. When I last checked, we were storing about a kilobyte of data for every tweet.
Also, tweets are limited to 140 characters, not bytes. Chinese tweets typically take about 200-250 bytes, for example.
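To illustrate the characters-versus-bytes point, here is a minimal sketch. It assumes UTF-8 encoding and uses Python's `len()`, which counts Unicode code points; whether that matches Twitter's own character counting exactly is an assumption. The helper name and the sample strings are made up.

```python
def utf8_size(text):
    """Return (character count, UTF-8 byte count) for a string."""
    return len(text), len(text.encode("utf-8"))


# Hypothetical worst-case tweets: 140 characters each.
english = "a" * 140   # ASCII letters take 1 byte each in UTF-8
chinese = "推" * 140  # this CJK character takes 3 bytes in UTF-8

print(utf8_size(english))
print(utf8_size(chinese))
```

A tweet made entirely of CJK characters can therefore need up to three times as many bytes as characters, so a 200-250 byte average would correspond to tweets that are shorter than the limit or that mix in ASCII.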
I have access to the Twitter gardenhose (which is equal to slightly less than 10% of the full volume). These are the RX and TX statistics from the machine that I've been using to gather data for a few months now:
It works out at around 70GB/day, so I'd actually think that the full firehose would use considerably more data than 400GB/day (likely closer to 800GB).
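The extrapolation above is simple scaling, sketched here under the assumption that the sample is a uniform fraction of the full stream. The function name and the 8.75% figure are illustrative, not measured; note also that RX/TX counters include protocol overhead, so they overstate the payload somewhat.

```python
def extrapolate_full_volume(sample_gb_per_day, sample_fraction):
    """Scale a sampled stream's daily volume up to the full stream."""
    if not 0 < sample_fraction <= 1:
        raise ValueError("sample_fraction must be in (0, 1]")
    return sample_gb_per_day / sample_fraction


# At exactly 10%, 70 GB/day scales to about 700 GB/day.
at_ten_percent = extrapolate_full_volume(70, 0.10)

# "Slightly less than 10%": a hypothetical 8.75% gives about 800 GB/day.
at_lower_fraction = extrapolate_full_volume(70, 0.0875)

print(round(at_ten_percent), round(at_lower_fraction))
```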
Woah, 100 GB per second???!!
What exactly do they do?
Edit: Never mind, this is what they are doing
"We would like to develop some kind of ‘google' brain where we can zoom in and out, see it from different perspectives and understand how brain structure and function is related."
Pick a topic you are familiar with. Open up a Twitter search for it. Wait a while. See the inevitable storm of tweet-spam that sort of looks like social sharing.
Thousands of fake accounts are tweeting out nonsense all the time. Another example: there are multiple accounts that tweet items from HN (and presumably lots of other RSS feeds).
Doesn't Twitter make a fair bit of money from selling access to various slices of their data? I'd be surprised if they released it all to the general public. I imagine scientists would have to be under some sort of NDA.
The Twitter terms of service prohibit sharing the data in the Tweets. Researchers are allowed to share Tweet IDs and User IDs, which can be used to identify a Tweet. Currently, Twitter collections are shared using this method -- I recently released a dataset of 120 million Tweet IDs which covers a sample of a month's worth of data, and numerous other researchers have used these IDs to crawl Twitter and obtain the same dataset I used in my experiments.
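Re-fetching tweets from a shared ID list usually starts by grouping the IDs into batches, one batch per lookup request. The sketch below only does the batching and makes no network calls; the default batch size of 100 is an assumption about the lookup endpoint's per-request limit, and the function name is hypothetical.

```python
def batch_ids(tweet_ids, batch_size=100):
    """Yield comma-separated ID strings with at most batch_size IDs each.

    batch_size=100 is an assumed per-request limit; check the API docs.
    """
    batch = []
    for tweet_id in tweet_ids:
        batch.append(str(tweet_id))
        if len(batch) == batch_size:
            yield ",".join(batch)
            batch = []
    if batch:  # emit the final, partially filled batch
        yield ",".join(batch)


# Hypothetical IDs for illustration: 250 IDs become 3 batches.
batches = list(batch_ids(range(1000, 1250)))
print(len(batches))
```

Each yielded string could then be passed as the ID parameter of a single lookup request, subject to whatever rate limits apply.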
True, but doesn't Twitter already provide an API for access to a fraction of the firehose? Surely that would be enough data. If Twitter doesn't have a good API, Reddit allows full access to all comments through their API (although Reddit has orders of magnitude less data).
Twitter's API is too limited for historical data (you'll hit the rate limits quickly for any meaningful volume). Reddit's rate limits, however, let you process a million comments every day.
It appears as if the data is only available to those scientists who apply for the data grant and win it. Furthermore, applications for the grant have been closed since midway through March. Yeah, I'm not surprised Twitter isn't making its historical data public. That would effectively end Gnip, which is a revenue source for Twitter not based on advertising to users.
How about just loosening the API rate limits, or making a better token request process with resource allocation? E.g. I'd like 1500 requests per 15-minute window (as opposed to 15, for some things) for 72 hours. I guess this could be limited to those with an academic email address if they insist.
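To put numbers on that request, here is a small sketch of how many calls fit into a time budget under a fixed-window rate limit. It assumes the limit resets cleanly every window and that every window is fully used; the function name is made up.

```python
def requests_available(hours, requests_per_window, window_minutes=15):
    """Total requests allowed over `hours` under a fixed-window limit."""
    windows = int(hours * 60 // window_minutes)  # whole windows only
    return windows * requests_per_window


# 72 hours contain 288 fifteen-minute windows.
current = requests_available(72, 15)     # 15 requests per window
proposed = requests_available(72, 1500)  # 1500 requests per window
print(current, proposed)
```

Over 72 hours that is 4,320 requests at the lower limit versus 432,000 at the proposed one.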
One thing I've wondered: is it possible to follow "everyone" on Twitter? If not, what type of cap does Twitter enforce on the number of accounts you're allowed to follow? I realize it'd be difficult to know which new accounts to add as people join, but how far could you push a roll-your-own stream of the Twitter firehose?
Hypothetically: I'm sure you could do it algorithmically; if your program sees a retweet from someone who is not on your following list, you then follow them. You might miss a few, but you would get most everyone.
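The retweet-based idea can be sketched as a toy simulation. Everything here is hypothetical: the event format, the account names, and the `follow_cap` parameter are made up, and nothing talks to a real API. Events from accounts you don't follow are treated as invisible.

```python
def snowball_follow(events, seeds, follow_cap=None):
    """Grow a following set by following authors of retweeted tweets.

    events: iterable of (author, original_author_or_None) in time order.
    follow_cap: optional hypothetical limit on accounts followed.
    """
    following = set(seeds)
    for author, original_author in events:
        if author not in following:
            continue  # tweets from unfollowed accounts are never seen
        if original_author is None or original_author in following:
            continue  # not a retweet, or author already followed
        if follow_cap is not None and len(following) >= follow_cap:
            break  # cap reached, stop expanding
        following.add(original_author)
    return following


events = [
    ("alice", "bob"),   # alice retweets bob -> follow bob
    ("carol", "dave"),  # carol not followed yet -> dave is missed
    ("bob", "carol"),   # bob retweets carol -> follow carol
    ("carol", "erin"),  # carol retweets erin -> follow erin
    ("zed", None),      # zed is never retweeted -> never discovered
]
print(sorted(snowball_follow(events, {"alice"})))
```

In this toy run, dave and zed are never discovered, which matches the "you might miss a few" caveat: accounts that are never retweeted by someone already followed stay invisible.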
Their geo-data is utter crap. The vast majority of it is based on 'profile location', which means that there are almost a million people tweeting from the exact center of Atlanta. It's a crowded spot, must be a Starbucks there or something.
That was my interpretation of "all", but reading back it seems there's another difference:
"Although a majority of tweets are public, if scientists want to freely search the lot, they do it through Twitter's application programming interface, which currently scours only 1 percent of the archive. But that is about to change: in February the company announced that it will make all its tweets, dating back to 2006, freely available to researchers."
There is no magic discovery about the nature of man hidden away in that data. Nothing your average stand-up comedian hasn't already written a bit about.
On a more serious note, there is one area of research that Twitter data is very valuable for: how information and disinformation are created and spread during major news events: wars, catastrophes, uprisings, school shootings etc.
My gut feeling is that today a lot of quality journalism happens outside of the traditional journalistic organisations. The downside is that a lot of wild speculation and rumours are spread as well, but it would be valuable to see how good this modern "crowd" journalism is. A skilled research group can use Twitter data and the Internet Archive to track down the original sources of information pretty well.
And momentum stock trades. I've always felt an adept programmer who knew how to data mine and worked for Twitter could make a fortune in the stock market.
That's a little cynical. I like to think we can predict big historical social events (regime changes, major protests, climate change?) based on Twitter's data.
* Library of Congress: http://blogs.loc.gov/loc/2013/01/update-on-the-twitter-archi...
* Twitter Data grants: https://blog.twitter.com/2014/introducing-twitter-data-grant...