Kaggle Datasets – Discover and analyze open data

benhamner · on March 26, 2018

Our goal with Kaggle Datasets is to provide the best place to publish, collaborate on, and consume public data.

As a data publisher, you have an easy way to publish data online, see how it's used, and interact with the users of the data. You can create the dataset via a simple web interface, and update it through the interface or an API. We automatically version these updates under the hood.

As a data consumer, you can browse the data online and download it (through the web or an API). You can see the code and insights others have generated on the data through Kaggle Kernels (hosted, versioned IPython notebooks that run in Docker containers). You can fork their code to get started on the data, or start coding from scratch on your own analysis. If you find improvements that could be made to the metadata (dataset/file/column-level descriptions), you can make those directly.

We're rapidly iterating on this product and expanding it's functionality, and would love any feedback and suggestions.

lgierth · on March 26, 2018

First of all, this looks like a great tool for datasets, thank you.

Do you have plans for adding file hashes to the datasets, e.g. sha256? This would make it a lot easier to integrate with other systems.

amrrs · on March 26, 2018

Sorry for a noob, could you please explain how adding hashes would help in better integration?

lgierth · on March 26, 2018

They mainly help in four ways:

- avoid data corruption when downloading/transferring/copying datasets

- notice changes/updates in the original dataset

- dataset versioning (think how e.g. git turns directories and files into hash trees -- also called content-addressing)

- most importantly: stable names without a naming authority

prepend · on March 26, 2018

How does this apply when you can filter / conditional exports? Is the idea that the csv has a fixed hash and if you trust that, you can trust anything else?

benhamner · on March 26, 2018

Thanks and great point! Added this to our list

QasimK · on March 26, 2018

How about you let me download them without creating an account before calling them “public”?

benhamner · on March 26, 2018

Thanks for the feedback. This is likely a "not quite yet" vs. "never".

Definitely understand the motivations from a user standpoint for not needing to login to download.

There's some non-obvious benefits we get as a small team by requiring login, in addition to new user growth. Bandwidth for hosting data can be large, and it's easier to reason about and prevent abuse in the context of authenticated users.

We do enable previewing the dataset while logged out, and the preview functionality will become more full-featured.

QasimK · on March 26, 2018

Thanks for your response. I spoke slightly evocatively because this has caused issues for me and hence I didn’t like to see it advertised as “public”.

The biggest issue is that I cannot share my work with others (easily) if I use a dataset from Kaggle. For example, ideally I just want to have a notebook online somewhere which can be instantly downloaded and run by anyone. Having to (automatically) download a dataset is a hinderance, and having to create a kaggle account first is an outright blocker.

On the other hand using, for example, IPFS or a torrent would be better, because you can reference the dataset using a global identifier and anyone can easily get access to it.

hyperocular · on March 26, 2018

A torrent option would go a long way to offsetting hosting costs on more popular items. Also I appreciate the preview option, it's good to see what is in a dataset before committing to downloading and extracting hundreds of gigabytes.

prepend · on March 26, 2018

Part of the best practices of public data involve unrestricted downlod - https://project-open-data.cio.gov/principles/

There are certainly benefits to you as a host if you restrict access to as a host. But if you have bandwidth concerns, perhaps just link to the source.

As it is now, I may be confused that you are using open data as a way to drive user growth.

shepardrtc · on March 26, 2018

I disagree with the parent. You've taken the time to organize and host these datasets. The least we can do is create an account to download them.

diggan · on March 26, 2018

Sure, but then maybe don't call it "public" as people will think it can be downloaded without creating an account. Calling something public but require account creation is misleading.

omg_ketchup · on March 26, 2018

I'm not sure I'd agree- the data is accessible to anyone who wants it for free. There's no restrictions on creating the account.

Radio is free, but you hear advertisements. Probably better to create an account than to have product placement in the actual data.

diggan · on March 26, 2018

The restriction I'm talking about is creating the account. "Public" (at least for me) does not mean that I need to agree to some lengthy "terms of service", "privacy policy" and create an account. Public means it's public and can be accessed from curl or my browser of choice without signing a contract.

Not sure were you are from, but where I'm from (Sweden), public radio (not free but public) and public TV is free of advertisement and does not require me to sign up for an account to be able to listen to/watch it. That's what I call public.

- https://www.kaggle.com/terms

- https://www.kaggle.com/about/privacy

shepardrtc · on March 26, 2018

When Kaggle is saying "public" dataset, they're implying the origin of the dataset is public. Meaning, the datasets were created by various groups/companies/institutions and made available for the general public. Kaggle is simply hosting them again. They're doing us a service by organizing them all into one location and eating the bandwidth costs. My argument is that in return for that service to us, the least we can do is create an account with them.

diggan · on March 26, 2018

I have no problem with them offering datasets to the public and just requiring to sign up for an account. But call then Kaggle-Public, Semi-Public or anything else, public data has a meaning that is not what they are doing.

For example, the government where I live (Catalunya) has public data. So I can go to the website and click download, no account required. If that data was distributed via Kaggle and requires account signup to get, I would not consider what they are providing public.

gertye · on March 26, 2018

AFAIK, Kaggle is a part of Google, therefore is not operated independently by a "small team". You just actively try not to be associated with the behemoth.

solarkraft · on March 26, 2018

Kaggle has been a part of Google for about a year. https://techcrunch.com/2017/03/07/google-is-acquiring-data-s... I do wonder why they're only giving it a "small team" when they could give them lots of resources and advanced abuse protection - though this could just be an allocation issue, it seems like a wasted PR opportunity.

mynewtb · on March 26, 2018

> new user growth

That sounds like you are inflating your user counts with garbage accounts.

has2k1 · on March 26, 2018

As a company that deals with data analysis, Kaggle can surely tell between different levels of user participation.

I interpret it charitably as "some fraction of users forced to register in order to download data go on to become active users of the platform".

prepend · on March 26, 2018

I’m sure they can. But they can also choose not to distinguish a high quality user from a low quality one when promoting their site.

For example, this post - http://blog.kaggle.com/2017/06/06/weve-passed-1-million-memb... - talks about a million users. But does not say how many are low quality, inactive ones. Or even describe how they filtered out low quality users from their analysis.

omg_ketchup · on March 26, 2018

Welcome to marketing?

random4369 · on March 26, 2018

> As a company that deals with data analysis, Kaggle can surely tell between different levels of user participation.

Kaggle outsources data analysis to bright students willing to work for a tiny fraction of their actual worth. Their business model has nothing to do with their own analytics talent.

has2k1 · on March 26, 2018

> Their business model has nothing to do with their own analytics talent

I recall — some years ago — when they were advertising for Jobs, some of the criteria for potential technical fitness was a high placement in some of the competitions. Plus, I have engaged in a few of the competitions and you have to have a team of capable data scientists to design them. They are not clueless people.

alxvio · on March 26, 2018

This is highly unreasonable. If I'm hosting large files, you bet I'm going to require the public to create an account so I can protect my quality of service for others. They're not restricting who can create accounts... they're not requiring a payment...

antirez · on March 26, 2018

This is gold. When I wrote the NeuralRedis module I had so much fun downloading a few random datasets from Kaggle and wrap it in a few lines of Ruby script to check what the results were in terms of predictions. Normally the data is very high quality, the format well documented, and so forth. However make sure to check the license for the details depending on what use you plan to do.

Radim · on March 26, 2018

What happens when the company changes direction? If there's a shift of priorities, an internal restructuring, a "strategic startup pivot", an acquisition?

Not to assume bad faith on Kaggle's part, but we got burned one too many times with private companies pushing their proprietary ("open") platforms for gobbling up data. The "it's free! just create an account — data lock-in — gap after project death/monetization" pattern leaves me a little cynical.

It's awesome that resources like these exist, but I'd be more comfortable paying attention if this was hosted as raw data somewhere (Github?), with a clear licensing and access model.

benhamner · on March 26, 2018

We joined Google via acquisition one year ago, and Kaggle Datasets has grown from 450 datasets to over 13,000 in that timeframe. We are firmly committed to supporting and growing this platform.

neuromantik8086 · on March 27, 2018

The Awesome Public Datasets Github repo [1] also constitutes a good effort at organizing all of the open data out there that people can play around with.

[1] https://github.com/awesomedata/awesome-public-datasets

metakermit · on March 26, 2018

Wonderful, thanks for sharing this! It's useful that the kernels people have submitted are there as well and that there is a HN-style upvoting mechanism.

As an aside – I'm really curious to explore the datasets with "fake" in the title :)

https://www.kaggle.com/datasets?sortBy=relevance&group=publi...

cosmic_ape · on March 26, 2018

It would help if the datasets were categorized by data type. Timeseries, multilabel, etc...

benhamner · on March 26, 2018

Not all the datasets are ML specific, but hopefully this helps:

- https://www.kaggle.com/tags

- https://www.kaggle.com/tags/linguistics

- https://www.kaggle.com/tags/multiclass-classification

- https://www.kaggle.com/tags/text-data

socksy · on March 26, 2018

Is there an announcement of some kind of change? Are they still owned by Google? Or is this the thing where sometimes existing solutions will hit the front page of HN? :)

benhamner · on March 26, 2018

I shared this because another public data portal that I don't think has changed in years ended up at the top of HN. Kaggle Datasets has grown by over an order of magnitude in the past year, and jumps in scale fundamentally changes the utility of community products like this

naushit · on March 26, 2018

Any plan to share same data/files using IPFS?