There are a few services popping up that aim to provide data repositories for analysis/ML (Kaggle, data.world, /r/datasets).
As someone who likes making analyses from random datasets, I have a few issues with these types of services:
1) There is often no indication of the distribution rights of the data, or whether the data was obtained ethically from the source (i.e., in accordance with the site's ToS). I made this mistake when I used an OKCupid dataset released on an open data repository; it turned out the data was scraped with a logged-in account, and the dataset was later taken down via a DMCA request.
2) There is no indication of the quality of the data, and as a result, it may take an absurd amount of time to clean the data for accuracy. Some datasets may not be salvageable at all.
3) Bandwidth. Good datasets contain lots of data for training better models, and these sites may not be able to host or serve files that large. (BigQuery public datasets solve this problem, however.)
Good points.
We are soon going to release a p2p system for sharing datasets, backed by Hadoop clusters. You install the Hadoop stack (on localhost or distributed), and then you can free-text search for datasets that have been made 'public' on any Hadoop cluster that participates in the 'ecosystem'. We expect it to be self-policing, but there will be a way to report illegal distribution of datasets.
The solution is based on a variant of BitTorrent where files are downloaded in order (rather than effectively at random, as with BitTorrent's rarest-first piece selection). Files can be downloaded either to HDFS or to a Kafka topic. We will demo it in 2 weeks here:
https://fosdem.org/2017/schedule/event/democratizing_deep_le...
The system will be bootstrapped with lots of interesting big datasets: ImageNet, 10m images, YouTube-8M, Reddit comments, HN comments, etc. Our experience is that researchers need a central point for easy access to open datasets that doesn't require an AWS or GCE account.
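To make the piece-selection difference concrete, here is a minimal sketch in Python. It only illustrates the idea; fetch_piece and sink are hypothetical callables standing in for the real peer protocol and the HDFS/Kafka writers, not our actual API.

    # Minimal sketch: in-order piece selection versus BitTorrent's rarest-first.
    from typing import Callable, Dict, List

    def rarest_first(missing: List[int], availability: Dict[int, int]) -> int:
        # Classic BitTorrent: pick the piece held by the fewest peers.
        return min(missing, key=lambda idx: availability.get(idx, 0))

    def in_order(missing: List[int], availability: Dict[int, int]) -> int:
        # In-order variant: always take the lowest-index missing piece, so the
        # file can be written (or streamed to Kafka) sequentially.
        return min(missing)

    def download_file(num_pieces: int,
                      fetch_piece: Callable[[int], bytes],
                      sink: Callable[[int, bytes], None],
                      availability: Dict[int, int],
                      strategy=in_order) -> None:
        missing = list(range(num_pieces))
        while missing:
            idx = strategy(missing, availability)
            sink(idx, fetch_piece(idx))   # with in_order, pieces arrive sequentially
            missing.remove(idx)

The point of in-order delivery is that a consumer reading from the sink can start processing before the whole transfer has finished.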
Could you explain what problem you're trying to solve here? Are there really that many researchers who have access to modern (and expensive) GPU hardware but don't have bandwidth or disk space available? Or are there many researchers who put in lots of time assembling a dataset but don't have the bandwidth to distribute it?
It's more a case of providing a quick and easy way to share large datasets, backed by HDFS. Right now, researchers don't have a good way to share datasets (apart from AWS/GCE).
We work with climate science researchers who have multi-TB datasets and no efficient way to share them. The same goes for genomics researchers, who routinely pay a lot of money for Aspera licenses just to download datasets faster than TCP allows. We are using a LEDBAT-based protocol tuned to give good bandwidth over high-latency links, but it only scavenges available bandwidth, since it runs at lower priority than TCP.
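For anyone curious what "scavenging" means in practice, here is a toy sketch of the LEDBAT idea (RFC 6817), not our actual implementation: the sender grows its window while the measured queuing delay stays below a target and backs off as the queue builds, so it yields to competing TCP flows. Constants are simplified assumptions.

    # Toy sketch of delay-based LEDBAT-style congestion control.
    TARGET = 0.100    # target queuing delay in seconds (RFC 6817 uses 100 ms)
    GAIN = 1.0
    MSS = 1452        # bytes per packet; an assumption for this sketch
    MIN_CWND = 2 * MSS

    class LedbatWindow:
        def __init__(self):
            self.cwnd = float(MIN_CWND)
            self.base_delay = float("inf")   # lowest one-way delay seen so far

        def on_ack(self, one_way_delay: float, bytes_acked: int) -> None:
            # The minimum observed delay approximates the empty-queue delay.
            self.base_delay = min(self.base_delay, one_way_delay)
            queuing_delay = one_way_delay - self.base_delay
            # off_target > 0: queue below target, grow; < 0: queue building, back off.
            off_target = (TARGET - queuing_delay) / TARGET
            self.cwnd += GAIN * off_target * bytes_acked * MSS / self.cwnd
            self.cwnd = max(self.cwnd, float(MIN_CWND))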
For the machine learning researcher:
"I'd like to test this RNN on the Reddit comments dataset..." ...three days later, after finding a poor-quality torrent... "Oh, now I can do it."
On our system: search, find, click to download. We will move towards downloading (random) samples of very large datasets (even into Kafka, from where they can be processed as they are downloaded).
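As an illustration of the "process as it downloads" part, a minimal sketch using the kafka-python client; the topic name and broker address are just placeholders.

    # Consume a dataset as it lands in a Kafka topic, so processing starts
    # before the transfer finishes.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "reddit_comments",                 # hypothetical topic the downloader writes to
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for record in consumer:
        comment = record.value
        # ...feed the comment into the model's input pipeline here...
        print(comment.get("body", "")[:80])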
Sounds nice. Could you consider making it more general than sharing datasets for ML? I mean, it sounds like a really generic solution that anyone could benefit from, not just researchers.
Just say yours uses the lower bound of the Wilson score confidence interval for a Bernoulli parameter [0]. It has been used for sorting shopping items by ratings for years.
If a person figured out a way to apply it usefully to some other area - which doesn't seem hard at all - (job skill ranking? :D), MS is the kind of company that would attempt to collect $$$ from it regardless. :(
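For reference, the lower bound in question is only a few lines; a small sketch, with z = 1.96 for roughly 95% confidence:

    # Lower bound of the Wilson score interval for a Bernoulli parameter:
    # a pessimistic estimate of an item's positive-rating rate, given how
    # many ratings it has.
    from math import sqrt

    def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
        if total == 0:
            return 0.0
        phat = positive / total
        denom = 1 + z * z / total
        centre = phat + z * z / (2 * total)
        margin = z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
        return (centre - margin) / denom

    # 9/10 positive ratings ranks below 900/1000 positive ratings:
    # wilson_lower_bound(9, 10)     ~= 0.60
    # wilson_lower_bound(900, 1000) ~= 0.88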
That's why we're working on a ranking/feedback system (a la The Pirate Bay) so that users can see other users' comments on the datasets.
Regarding decentralization: we are starting with centralized search, and download times are much better for popular datasets because they are downloaded in parallel from many peers. We are also using a congestion-control protocol (LEDBAT) that runs at lower priority than TCP, so you will not notice it using your bandwidth to share data while you are downloading/uploading over TCP.
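Roughly, the parallel-download part looks like this sketch; the peer list and fetch_from_peer are hypothetical placeholders, not our actual API.

    # Fetch pieces concurrently from whichever peers hold them.
    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable, Dict, List

    def parallel_download(num_pieces: int,
                          peers: List[str],
                          fetch_from_peer: Callable[[str, int], bytes],
                          max_workers: int = 8) -> Dict[int, bytes]:
        pieces: Dict[int, bytes] = {}
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(fetch_from_peer, peers[i % len(peers)], i): i
                       for i in range(num_pieces)}
            for future, idx in futures.items():
                pieces[idx] = future.result()
        return pieces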
I’m one of the cofounders at data.world and you definitely make good points… We encourage all users to post a license with their datasets but, as with open source, not everyone will maintain or honor these. This is definitely something that we feel we can help the open data community with and encourage even more. (As an aside, a lot of data scraped from websites is something you'll need to be careful with in general.)
When it comes to data quality, the world is a messy place and the data that comes from it is messy too. Most professional data scientists spend an inordinate amount of time cleaning datasets, doing feature engineering, etc.; it's part of their job description. We're trying to eliminate some of that repetitive work by making sure that people can comment on, contribute to, and give some signal back on the quality of a dataset. We also think a dataset is more than just the data: on data.world you can upload code, notebooks, images, etc., anything that helps add context to the data.
Finally, when it comes to size... ML definitely needs it. However, there's a lot of interesting data out there that's still very complicated and very useful but not that big. Most datasets in the world are well under a terabyte (or even 100s of GBs) in size. We're rapidly expanding the size of datasets we support because we want that stuff too, but we really want to help people understand all the data in the world!