There are a few services popping up that aim to provide data repositories for analysis/ML (Kaggle, data.world, /r/datasets).
As someone who likes making analyses from random datasets, I have a few issues with these types of services:
1) There is often no indication of the distribution rights of the data, or whether the data was obtained ethically from the source (i.e., in accordance with the site's ToS). I made this mistake when I used an OKCupid dataset released on an open data repository; it turned out the data was scraped with a logged-in account, and the dataset was later taken down via a DMCA request.
2) There is no indication of the quality of the data, and as a result, it may take an absurd amount of time to clean the data for accuracy. Some datasets may not be salvageable at all.
3) Bandwidth. Good datasets contain lots of data for training better models, and these sites may not be able to host or serve files that large. (BigQuery public datasets solve this problem, however.)
Good points.
We are soon going to release a p2p system for sharing datasets, backed by Hadoop clusters. You install the Hadoop stack (on localhost or distributed), and then you can free-text search for datasets that have been made 'public' on any Hadoop cluster that participates in the 'ecosystem'. We expect it to be self-policing, but there will be a way to report illegal distribution of datasets.
The solution is based on a variant of BitTorrent where files are downloaded in order (rather than effectively at random, as with BitTorrent's rarest-first piece selection). Files can be downloaded either to HDFS or to a Kafka topic. We will demo it in 2 weeks here:
https://fosdem.org/2017/schedule/event/democratizing_deep_le...
The system will be bootstrapped with lots of interesting big datasets: ImageNet, 10m images, YouTube-8M, Reddit comments, HN comments, etc. Our experience is that researchers need a central point for easy access to open datasets that doesn't require an AWS or GCE account.
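To make the piece-selection difference concrete, here is a minimal sketch in Python. It only illustrates the idea; fetch_piece and sink are hypothetical callables standing in for the real peer protocol and the HDFS/Kafka writers, not our actual API.

    # Minimal sketch: in-order piece selection versus BitTorrent's rarest-first.
    from typing import Callable, Dict, List

    def rarest_first(missing: List[int], availability: Dict[int, int]) -> int:
        # Classic BitTorrent: pick the piece held by the fewest peers.
        return min(missing, key=lambda idx: availability.get(idx, 0))

    def in_order(missing: List[int], availability: Dict[int, int]) -> int:
        # In-order variant: always take the lowest-index missing piece, so the
        # file can be written (or streamed to Kafka) sequentially.
        return min(missing)

    def download_file(num_pieces: int,
                      fetch_piece: Callable[[int], bytes],
                      sink: Callable[[int, bytes], None],
                      availability: Dict[int, int],
                      strategy=in_order) -> None:
        missing = list(range(num_pieces))
        while missing:
            idx = strategy(missing, availability)
            sink(idx, fetch_piece(idx))   # with in_order, pieces arrive sequentially
            missing.remove(idx)

The point of in-order delivery is that a consumer reading from the sink can start processing before the whole transfer has finished.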
Could you explain what problem you're trying to solve here? Are there really that many researchers who have access to modern (and expensive) GPU hardware but don't have bandwidth or disk space available? Or are there many researchers who put in lots of time assembling a dataset but don't have the bandwidth to distribute it?
It's more a case of providing a quick and easy way to share large datasets, backed by HDFS. Right now, researchers don't have a good way to share datasets (apart from AWS/GCE).
We work with climate science researchers who have multi-TB datasets and no efficient way to share them. The same goes for genomics researchers, who routinely pay a lot of money for Aspera licenses just to download datasets faster than TCP allows. We are using a LEDBAT-based protocol tuned to give good bandwidth over high-latency links, but it only scavenges available bandwidth, since it runs at lower priority than TCP.
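For anyone curious what "scavenging" means in practice, here is a toy sketch of the LEDBAT idea (RFC 6817), not our actual implementation: the sender grows its window while the measured queuing delay stays below a target and backs off as the queue builds, so it yields to competing TCP flows. Constants are simplified assumptions.

    # Toy sketch of delay-based LEDBAT-style congestion control.
    TARGET = 0.100    # target queuing delay in seconds (RFC 6817 uses 100 ms)
    GAIN = 1.0
    MSS = 1452        # bytes per packet; an assumption for this sketch
    MIN_CWND = 2 * MSS

    class LedbatWindow:
        def __init__(self):
            self.cwnd = float(MIN_CWND)
            self.base_delay = float("inf")   # lowest one-way delay seen so far

        def on_ack(self, one_way_delay: float, bytes_acked: int) -> None:
            # The minimum observed delay approximates the empty-queue delay.
            self.base_delay = min(self.base_delay, one_way_delay)
            queuing_delay = one_way_delay - self.base_delay
            # off_target > 0: queue below target, grow; < 0: queue building, back off.
            off_target = (TARGET - queuing_delay) / TARGET
            self.cwnd += GAIN * off_target * bytes_acked * MSS / self.cwnd
            self.cwnd = max(self.cwnd, float(MIN_CWND))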
For the machine learning researcher:
"I'd like to test this RNN on the Reddit comments dataset..." ...three days later, after finding a poor-quality torrent... "Oh, now I can do it."
On our system: search, find, click to download. We will move towards downloading (random) samples of very large datasets (even into Kafka, from where they can be processed as they are downloaded).
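As an illustration of the "process as it downloads" part, a minimal sketch using the kafka-python client; the topic name and broker address are just placeholders.

    # Consume a dataset as it lands in a Kafka topic, so processing starts
    # before the transfer finishes.
    import json
    from kafka import KafkaConsumer

    consumer = KafkaConsumer(
        "reddit_comments",                 # hypothetical topic the downloader writes to
        bootstrap_servers="localhost:9092",
        auto_offset_reset="earliest",
        value_deserializer=lambda v: json.loads(v.decode("utf-8")),
    )

    for record in consumer:
        comment = record.value
        # ...feed the comment into the model's input pipeline here...
        print(comment.get("body", "")[:80])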
Sounds nice. Could you consider making it more general than sharing datasets for ML? I mean, it sounds like a really generic solution that anyone could benefit from, not just researchers.
Just say yours uses the lower bound of the Wilson score confidence interval for a Bernoulli parameter [0]. It has been used for sorting shopping items by ratings for years.
If a person figured out a way to apply it usefully to some other area - which doesn't seem hard at all - (job skill ranking? :D), MS is the kind of company that would attempt to collect $$$ from it regardless. :(
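For reference, the lower bound in question is only a few lines; a small sketch, with z = 1.96 for roughly 95% confidence:

    # Lower bound of the Wilson score interval for a Bernoulli parameter:
    # a pessimistic estimate of an item's positive-rating rate, given how
    # many ratings it has.
    from math import sqrt

    def wilson_lower_bound(positive: int, total: int, z: float = 1.96) -> float:
        if total == 0:
            return 0.0
        phat = positive / total
        denom = 1 + z * z / total
        centre = phat + z * z / (2 * total)
        margin = z * sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
        return (centre - margin) / denom

    # 9/10 positive ratings ranks below 900/1000 positive ratings:
    # wilson_lower_bound(9, 10)     ~= 0.60
    # wilson_lower_bound(900, 1000) ~= 0.88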
That's why we're working on a ranking/feedback system (a la The Pirate Bay) so that users can see other users' comments on the datasets.
Regarding decentralization: we are starting with centralized search, and download times are much better for popular datasets because they are downloaded in parallel from many peers. We are also using a congestion-control protocol (LEDBAT) that runs at lower priority than TCP, so you will not notice it using your bandwidth to share data while you are downloading/uploading over TCP.
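Roughly, the parallel-download part looks like this sketch; the peer list and fetch_from_peer are hypothetical placeholders, not our actual API.

    # Fetch pieces concurrently from whichever peers hold them.
    from concurrent.futures import ThreadPoolExecutor
    from typing import Callable, Dict, List

    def parallel_download(num_pieces: int,
                          peers: List[str],
                          fetch_from_peer: Callable[[str, int], bytes],
                          max_workers: int = 8) -> Dict[int, bytes]:
        pieces: Dict[int, bytes] = {}
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            futures = {pool.submit(fetch_from_peer, peers[i % len(peers)], i): i
                       for i in range(num_pieces)}
            for future, idx in futures.items():
                pieces[idx] = future.result()
        return pieces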
I’m one of the cofounders at data.world and you definitely make good points… We encourage all users to post a license with their datasets but, as with open source, not everyone will maintain or honor these. This is definitely something that we feel we can help the open data community with and encourage even more. (As an aside, a lot of data scraped from websites is something you'll need to be careful with in general.)
When it comes to data quality, the world is a messy place and the data that comes from it is messy too. Most professional data scientists spend an inordinate amount of time cleaning datasets, doing feature engineering, etc.; it's part of their job description. We're trying to eliminate some of that repetitive work by making sure that people can comment on, contribute to, and give some signal back on the quality of a dataset. We also think a dataset is more than just the data: on data.world you can upload code, notebooks, images, etc., anything that helps add context to the data.
Finally, when it comes to size... ML definitely needs it. However, there's a lot of interesting data out there that's still very complicated and very useful but not that big. Most datasets in the world are well under a terabyte (or even 100s of GBs) in size. We're rapidly expanding the size of datasets we support because we want that stuff too, but we really want to help people understand all the data in the world!