
There are a few services popping up with the aim of providing data repositories for analysis/ML (Kaggle, data.world, /r/datasets).

As someone who likes building analyses from random datasets, I have a few issues with these types of services:

1) There is often no indication of the distribution rights of the data, or of whether the data was obtained ethically from the source (i.e. in compliance with the ToS). I made this mistake when I used an OKCupid dataset released on an open data repository; it turned out it had been scraped with a logged-in account, and the dataset was taken down via a DMCA request.

2) There is no indication of the quality of the data, so it may take an absurd amount of time to clean it for accuracy. Some datasets may not be salvageable.

3) Bandwidth. Good datasets contain lots of data for training better models, which these sites may not be able to serve. (BigQuery public datasets solve this problem, however.)
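
For illustration, here's a minimal Python sketch of querying a public dataset directly on BigQuery instead of downloading it (using the google-cloud-bigquery client; it assumes a GCP project with billing set up, and the table here is just one example of a public table):

    from google.cloud import bigquery

    client = bigquery.Client()  # uses your default GCP project/credentials
    sql = """
        SELECT corpus, SUM(word_count) AS total_words
        FROM `bigquery-public-data.samples.shakespeare`
        GROUP BY corpus
        ORDER BY total_words DESC
        LIMIT 5
    """
    # The query executes on Google's side; only the small result set
    # comes back over the network.
    for row in client.query(sql).result():
        print(row.corpus, row.total_words)

The point is that the heavy lifting (and the bandwidth) stays on Google's side; you only pull back the aggregated result.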




Good points. We are soon going to release a p2p system for sharing datasets, backed by Hadoop clusters. You install the Hadoop stack (localhost or distributed), and then you can free-text search for datasets that have been made 'public' on any Hadoop cluster that participates in the 'ecosystem'. We expect it to be self-policing, but there will be a way to report illegal distribution of datasets. The solution is based on a variant of BitTorrent where files are downloaded in order (rather than in the effectively random order produced by BitTorrent's rarest-piece-first selection). Files can be downloaded either to HDFS or to a Kafka topic. We will demo it in two weeks here: https://fosdem.org/2017/schedule/event/democratizing_deep_le...

The system will be bootstrapped with lots of interesting big datasets: ImageNet, a 10M-image collection, YouTube-8M, Reddit comments, HN comments, etc. Our experience is that researchers need a central point for easy access to open datasets that doesn't require an AWS or GCE account.


Could you explain what problem you're trying to solve here? Are there really that many researchers who have access to modern (and expensive) GPU hardware but don't have bandwidth or disk space available? Or are there many researchers who put in lots of time assembling a dataset but don't have the bandwidth to distribute it?


It's more about providing a quick and easy way to share large datasets, backed by HDFS. Right now, researchers don't have a good way to share datasets (apart from AWS/GCE).

We work with climate science researchers who have multi-TB datasets and no efficient way to share them. The same goes for genomics researchers, who routinely pay lots of money for Aspera licenses just to download datasets faster than TCP allows. We are using a LEDBAT-based protocol tuned to give good bandwidth over high-latency links, but it only scavenges available bandwidth, as it runs at lower priority than TCP.

For the machine learning researcher: I'd like to test this RNN on the Reddit comments dataset... three days later, after finding a poor-quality torrent... oh, now I can do it. On our system: search, find, click to download. We will move towards downloading (random) samples of very large datasets, even to Kafka, from where they can be processed as they are downloaded.
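
To make that last part concrete, here is a hypothetical Python sketch of consuming a dataset from a Kafka topic while it is still downloading (kafka-python client; the topic name and broker address are placeholders, not part of our system):

    from kafka import KafkaConsumer

    # Hypothetical topic the dataset is being written into as it downloads.
    consumer = KafkaConsumer(
        'reddit_comments',
        bootstrap_servers='localhost:9092',
        auto_offset_reset='earliest')

    n = 0
    for message in consumer:
        text = message.value.decode('utf-8', errors='ignore')
        n += 1  # replace with real parsing / feature extraction
        if n % 100000 == 0:
            print('processed', n, 'records')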


Sounds nice. Could you consider making it more general than sharing datasets for ML? I mean, it sounds like a really generic solution that anyone could benefit from, not just researchers.


Will you build a ranking system for rating users and datasets?

I recommend using TrueSkill for rating users and datasets.

Keep up the good work.


It appears TrueSkill is patented - https://en.wikipedia.org/wiki/TrueSkill


Ugh, that's a fairly egregious patent. It's literally a patent on a mathematical method. :(


Just say yours uses the lower bound of the Wilson score confidence interval for a Bernoulli parameter [0]. It has been used for sorting shopping items by rating for years.

[0] http://www.evanmiller.org/how-not-to-sort-by-average-rating....
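
For reference, the formula from [0] fits in a few lines; a minimal Python sketch (z = 1.96 corresponds to a 95% confidence level):

    import math

    def wilson_lower_bound(positive, total, z=1.96):
        # Lower bound of the Wilson score confidence interval
        # for a Bernoulli parameter (fraction of positive ratings).
        if total == 0:
            return 0.0
        phat = positive / total
        denom = 1 + z * z / total
        centre = phat + z * z / (2 * total)
        margin = z * math.sqrt((phat * (1 - phat) + z * z / (4 * total)) / total)
        return (centre - margin) / denom

    # A dataset with 9/10 upvotes ranks below one with 450/500 upvotes,
    # because the larger sample gives a tighter interval.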


Thanks, didn't know about that. :)


Doesn't the patent's language necessarily tie it to that application, i.e., player skill determination?


I'm not a lawyer, so no idea.

If a person figured out a way to apply it usefully to some other area (which doesn't seem hard at all; job skill ranking? :D), MS is the kind of company that would attempt to collect $$$ from it regardless. :(


Or an Elo system.
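
The basic Elo update is also simple enough to sketch in a few lines of Python (K and the starting ratings below are the usual defaults, not anything specific to this system):

    def elo_update(r_winner, r_loser, k=32):
        # Expected score of the winner, then shift both ratings by K * error.
        expected_win = 1.0 / (1.0 + 10 ** ((r_loser - r_winner) / 400.0))
        delta = k * (1.0 - expected_win)
        return r_winner + delta, r_loser - delta

    # e.g. treat "dataset A was preferred over dataset B" as one match:
    new_a, new_b = elo_update(1500, 1500)  # both start at the default 1500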


That only addresses issue #3, which is much less important than issues #1 and #2. (And it's arguably worse: decentralization makes proper sourcing harder.)


That's why we're working on a ranking/feedback system (à la The Pirate Bay), so that users can see other users' comments on the datasets. Regarding decentralization: we are starting with centralized search, and download times are much better for popular datasets because they are downloaded in parallel from many peers. We are also using a congestion-control protocol (LEDBAT) that runs at lower priority than TCP, so you won't notice it using your bandwidth to share data while you are downloading/uploading over TCP.


I'm one of the cofounders at data.world, and you definitely make good points. We encourage all users to post a license with their datasets but, as with open source, not everyone will maintain or honor these. This is definitely something we feel we can help the open data community with and encourage even more. (As an aside: in general, you need to be careful with any data scraped from websites.)

When it comes to data quality, the world is a messy place and the data that comes from it is messy too. Most professional data scientists spend an inordinate amount of time cleaning datasets, doing feature engineering, etc.; it's part of the job description. We're trying to eliminate some of that repetitive work by making sure that people can comment on, contribute to, and give some signal back on the quality of a dataset. We also think a dataset is more than just the data: on data.world you can upload code, notebooks, images, etc., anything that helps add context to the data.

Finally, when it comes to size: ML definitely needs it. However, there's a lot of interesting data out there that's still very complicated and very useful but not that big. Most datasets in the world are well under a terabyte (or even hundreds of GBs) in size. We're rapidly expanding the sizes of datasets we support because we want that stuff too, but we really want to help people understand all the data in the world!


Thanks for the valuable feedback.


Is there actual case law that prohibits the use of copyrighted material for corpora and other training data?

Sure, distribution can have issues, but do you have any references regarding simple possession for use as training and test data?


If you're gathering data for your own business, it would be difficult for anyone to know, sure.

But if the data/analysis is published, then the data source would need to be disclosed.

See more on the OKCupid case I mentioned above: http://www.vox.com/platform/amp/2016/5/12/11666116/70000-okc...



