I always ask there if I can't find what I'm looking for.
Here are more data sets. These are general ones. Email me if you have a specific kind of data set in mind (e.g. web-as-corpus, spam, images, social, reviews). I have a big file of information.
Good instructions:
http://corpus.leeds.ac.uk/internet.html#description
http://sslmit.unibo.it/~baroni/bootcat.html
http://www.drni.de/wac-tk/index.php/Documentation
The Wikipedia dump is great, but I've started using http://wiki.dbpedia.org/, which has an API to query the dumps.
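For anyone curious what querying DBpedia might look like, here is a minimal sketch that only builds the request URL for a SPARQL query. The endpoint `https://dbpedia.org/sparql`, the `format` parameter value, the `Hacker_News` resource, and the helper name `build_dbpedia_url` are my assumptions for illustration, so check the DBpedia docs before relying on any of them.

```python
from urllib.parse import urlencode

# Assumed public SPARQL endpoint; verify against the DBpedia documentation.
DBPEDIA_ENDPOINT = "https://dbpedia.org/sparql"

def build_dbpedia_url(sparql, endpoint=DBPEDIA_ENDPOINT):
    """Return a GET URL that submits `sparql` to the endpoint (hypothetical helper)."""
    # "query" is the standard SPARQL protocol parameter; "format" is an
    # assumption about what this particular endpoint accepts.
    params = {"query": sparql, "format": "application/sparql-results+json"}
    return endpoint + "?" + urlencode(params)

# Example query: the English abstract of one resource (resource name assumed).
QUERY = """
SELECT ?abstract WHERE {
  <http://dbpedia.org/resource/Hacker_News> <http://dbpedia.org/ontology/abstract> ?abstract .
  FILTER (lang(?abstract) = "en")
}
"""

url = build_dbpedia_url(QUERY)

# To actually run it (needs network access):
#   import json, urllib.request
#   with urllib.request.urlopen(url) as resp:
#       results = json.load(resp)
```

Building the URL separately from fetching it makes it easy to inspect or cache the query before sending anything.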
Thanks for these, iisbum. I wish more public data were available in DB, XML, or similar structures. Too often I find myself scraping government sites or PDFs to get the tables I need.
I looked at the site, and I see some data, but I didn't find what I had hoped for. I couldn't find yield curves or historical exchange rates up to today (available on the ECB site in XML format). Certainly I would have thought yield curves were a front-page item.
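For what it's worth, the ECB exchange-rate XML is easy to turn into a table. Here is a rough sketch, assuming the feed uses nested `Cube` elements (one per date, containing one per currency) under the namespace shown below. The sample document is made up to mimic that layout, its rates are not real, and `parse_rates` is just a name I picked.

```python
import xml.etree.ElementTree as ET

# Illustrative sample mimicking the nested <Cube> layout assumed for the
# ECB reference-rate feed; the dates and rates here are invented.
SAMPLE = """<gesmes:Envelope xmlns:gesmes="http://www.gesmes.org/xml/2002-08-01"
                 xmlns="http://www.ecb.int/vocabulary/2002-08-01/eurofxref">
  <Cube>
    <Cube time="2011-01-04">
      <Cube currency="USD" rate="1.30"/>
      <Cube currency="GBP" rate="0.85"/>
    </Cube>
    <Cube time="2011-01-03">
      <Cube currency="USD" rate="1.25"/>
    </Cube>
  </Cube>
</gesmes:Envelope>
"""

# Namespace of the Cube elements (assumed; check against the real file).
NS = {"fx": "http://www.ecb.int/vocabulary/2002-08-01/eurofxref"}

def parse_rates(xml_text):
    """Return {date: {currency: rate}} from ECB-style XML (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    rates = {}
    # Only the per-day Cube elements carry a "time" attribute.
    for day in root.findall(".//fx:Cube[@time]", NS):
        rates[day.get("time")] = {
            c.get("currency"): float(c.get("rate"))
            for c in day.findall("fx:Cube", NS)
        }
    return rates

rates = parse_rates(SAMPLE)
```

With the real file you would pass the downloaded text to `parse_rates` instead of `SAMPLE`.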
Things that would be very cool: 1. Financial statements in a database format. I know you can scrape these, but I don't know if they are available legitimately.
2. Historical implied volatilities and historical observed volatilities.
They write that most of the data is available for download. I can't find it anywhere though, only the various APIs. Have they removed the possibility of downloading the data?
For those interested in transit data, check out the GTFS Data Exchange, a directory of many agencies' scheduling and map data, following the Google Transit Feed Specification.
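A GTFS feed is a zip archive of CSV text files, so the standard library is enough to read one. Here is a minimal sketch that pulls stop locations out of `stops.txt`. The tiny in-memory feed and its two stops are invented for illustration, and `read_stops` is a hypothetical helper name.

```python
import csv
import io
import zipfile

def read_stops(feed):
    """Yield (stop_id, stop_name, lat, lon) from stops.txt in a GTFS zip."""
    with zipfile.ZipFile(feed) as zf:
        with zf.open("stops.txt") as raw:
            # utf-8-sig tolerates an optional byte-order mark at the start.
            text = io.TextIOWrapper(raw, encoding="utf-8-sig", newline="")
            for row in csv.DictReader(text):
                yield (row["stop_id"], row["stop_name"],
                       float(row["stop_lat"]), float(row["stop_lon"]))

# Build a tiny in-memory feed; the stops are made up for this example.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("stops.txt",
                "stop_id,stop_name,stop_lat,stop_lon\n"
                "S1,Main St,37.0,-122.0\n"
                "S2,Oak Ave,37.5,-122.5\n")
buf.seek(0)

stops = list(read_stops(buf))
```

For a real feed, pass the path of the downloaded zip file to `read_stops` instead of the in-memory buffer.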
http://www.naturalearthdata.com/
From the website: Natural Earth is a public domain map dataset available at 1:10m, 1:50m, and 1:110 million scales as tightly integrated vector and raster data ...
Does anybody have precinct-level election results for the USA? A set for recent elections would be great for public access redistricting apps that will become relevant this year.
Anyone know of a dataset that has dates for when companies were registered or announced in the news? For example, I would like to see the date Hacker News was launched.
I've had trouble finding geographical boundaries for neighborhoods in U.S. cities (e.g. downtown areas and residential neighborhoods). Anyone know where I can find these?
It's not exactly neighborhoods, but the US Census TIGER database has block and blockgroup boundaries with associated demographic data. You could probably synthesize that into "neighborhood" definitions. http://www.census.gov/geo/www/tiger/tgrshp2010/tgrshp2010.ht...
Thank you all for posting links and links to links to datasets. I have an unrelenting interest in data aggregation and machine learning, and didn't even know where to start. So helpful, and I am no longer stuck. :)
Do all of them have some uniform API? That would be great, ideally: query and cache all of them on demand from your own app without additional programming.
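Absent a uniform API, the "cache on demand" half is at least simple to do yourself. Here is a rough sketch, assuming each dataset is reachable at a plain URL. `fetch_cached` is a hypothetical helper, and the example URL and fake fetcher are placeholders so the snippet runs without network access.

```python
import hashlib
import pathlib
import tempfile
import urllib.request

def fetch_cached(url, cache_dir, fetch=None):
    """Return the bytes at `url`, downloading only on the first request."""
    cache_dir = pathlib.Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # One cache file per URL, named by a hash of the URL.
    path = cache_dir / hashlib.sha256(url.encode("utf-8")).hexdigest()
    if path.exists():
        return path.read_bytes()
    if fetch is None:
        # Default fetcher: a plain HTTP GET.
        def fetch(u):
            with urllib.request.urlopen(u) as resp:
                return resp.read()
    data = fetch(url)
    path.write_bytes(data)
    return data

# Demonstration with a fake fetcher that records how often it is called.
calls = []

def fake_fetch(u):
    calls.append(u)
    return b"demo"

with tempfile.TemporaryDirectory() as tmp:
    first = fetch_cached("http://example.com/data.csv", tmp, fake_fetch)
    second = fetch_cached("http://example.com/data.csv", tmp, fake_fetch)
```

The second call is served from disk, so the fetcher runs only once. This does nothing about differing formats between sources, which is the harder part of the wish.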
http://datasets.reddit.com
http://opendata.reddit.com
and
http://www.quora.com/Where-can-I-get-large-datasets-open-to-...
for some good lists of available stuff.