Udacity open-sources additional driving data (techcrunch.com)
208 points by olivercameron on Oct 6, 2016 | 29 comments



We have much, much more coming soon, in addition to LIDAR/radar data. You can join in with the fun here: https://udacity.com/self-driving-car

It's worth pointing out some other awesome datasets, such as:

http://research.comma.ai

https://devblogs.nvidia.com/parallelforall/deep-learning-sel...

http://data.selfracingcars.com

Any questions I can answer?


What is the core purpose of the new Self-Driving Nanodegree: to crowdsource knowledge, or to graduate students and funnel them into this sector? It seems like there are two distinct value propositions at play here, and the advantage is entirely on Udacity's end.


This was my first thought as well. Does Udacity own or have rights to the code developed by students taking this course?


What's the datasets' license? Public Domain?



Might be a good idea to mention that in the readme.md next to the datasets since they're hosted separately. The wording of the MIT license doesn't really help avoid confusion here.

edit: or in the .tar with the dataset


Public domain isn't a license. It's legally problematic to release something as such.


Sincerely curious -- why is that? I was under the impression that public domain was well-established & well-understood, legally…?


In many countries you cannot "release" works as public domain. E.g. here in Germany, a work can become public domain when its copyright expires, but creators have rights that cannot be removed (not even voluntarily) before that.

(In such countries, a court might recognize the intent behind a public-domain claim made from a country where such a release is possible, but that requires interpreting the law of the country the creator is from. And it's likely not an option for locals, since the concept does not exist in local law.)

Creative Commons created the CC0 license to work around this: it clearly lays out that the work is intended to be released into the public domain, and, failing that, grants all possible rights and states that the creator doesn't intend to limit them in any way with whatever rights remain.

https://creativecommons.org/publicdomain/zero/1.0/legalcode


That is exactly what I was curious to know -- thanks!


> creators have rights that can not be removed (not even voluntarily) before that.

Doesn't that defeat the purpose of licenses? The principle of a license is exactly giving up rights to a copyrighted work.


The purpose of employment contracts doesn't override laws that prevent you from signing yourself into slavery...

(The comparison is apt because that non-revocable subset of rights in Germany's Urheberrecht, and in other European countries and probably elsewhere, is based on a notion of human rights.)


Hi Oliver! I am getting started with this data. Please share pointers/directions on how I can start consuming it.

Update 1 - I see there are a lot of details here: https://medium.com/udacity/challenge-2-using-deep-learning-t...
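
As a minimal sketch of one way to start consuming it, assuming the release includes (or can be extracted to) a CSV log of image filenames and steering angles. The "filename" and "angle" column names here are guesses; check the actual release:

    # Hypothetical layout: a CSV with "filename" and "angle" columns.
    import csv

    def load_samples(csv_path):
        """Yield (image_path, steering_angle) pairs from the log."""
        with open(csv_path) as f:
            for row in csv.DictReader(f):
                yield row["filename"], float(row["angle"])

    for path, angle in load_samples("steering_log.csv"):
        print(path, angle)
        break  # peek at the first sample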


Is there a full curriculum posted somewhere?



This is just a database of normal driving. That's useful for learning how to follow the car in front and avoid stationary objects, but not much else. It's going to result in systems that drive like humans right up to the point where they do something really bad.

A more useful database would be the one Nexar is accumulating.[1] They collect dashcam imagery of events where the driver did a hard brake or the system detected some other hazardous condition. That database could be used to train a system which recognizes trouble before braking starts.

Both systems need a much wider field of view. Probably at least 160 degrees, so cross traffic shows up before the collision.

[1] http://spectrum.ieee.org/cars-that-think/transportation/sens...
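
For illustration, here's a rough sketch of how hard-brake events could be flagged from a speed trace, so the surrounding dashcam frames can be pulled as training examples. The 0.4 g threshold and the telemetry format are assumptions, not Nexar's actual pipeline:

    G = 9.81  # m/s^2

    def hard_brake_times(samples, threshold_g=0.4):
        """samples: time-ordered list of (t_seconds, speed_m_per_s)."""
        events = []
        for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
            decel = (v0 - v1) / (t1 - t0)
            if decel > threshold_g * G:
                events.append(t1)  # grab dashcam frames around t1
        return events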


Definitely good points. We have three cameras arranged collinearly across the full width of the windshield, so this dataset has a fairly large effective field of view. And while it is currently limited in a lot of ways, it's just the beginning of the types of data we will be releasing. Everything will scale up to cover more use cases; for now, this data is mainly meant to support training a visual network for steering wheel predictions. For the moment, we actually do want to train the networks to drive like "normal" humans in normal situations. Thanks for your thoughts!
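
For readers wondering what "a visual network for steering wheel predictions" looks like in practice, here's a minimal Keras sketch in the spirit of the NVIDIA post linked upthread. Layer sizes are illustrative, not Udacity's reference model:

    from tensorflow.keras import layers, models

    model = models.Sequential([
        layers.Conv2D(24, 5, strides=2, activation="relu",
                      input_shape=(66, 200, 3)),  # cropped, resized frame
        layers.Conv2D(36, 5, strides=2, activation="relu"),
        layers.Conv2D(48, 5, strides=2, activation="relu"),
        layers.Conv2D(64, 3, activation="relu"),
        layers.Flatten(),
        layers.Dense(100, activation="relu"),
        layers.Dense(1),  # predicted steering angle
    ])
    model.compile(optimizer="adam", loss="mse")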


Well, they just started. With a stated mission to build an open self-driving car, I expect them to release more data as they collect it.


I'm here as a core contributor to the project and will be answering questions as well (in between making cable extensions for our lidar units).


A great, widely used dataset that teams can benchmark against is a superb start. Kudos to Udacity on this. I'd love to see a blind test set as well that teams can test and rank against.


There will be a blind test set for the challenge itself, including a public leaderboard! We are asking the world to compete on building the best vision-based network for predictive steering.
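
For context, leaderboards for this kind of regression challenge are commonly ranked by RMSE between predicted and ground-truth steering angles on the held-out set; the exact metric here is an assumption on my part, not confirmed by Udacity:

    import math

    def rmse(predicted, actual):
        """Root-mean-square error over paired predictions and labels."""
        assert len(predicted) == len(actual)
        return math.sqrt(
            sum((p - a) ** 2 for p, a in zip(predicted, actual))
            / len(predicted))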


A friend of mine once commented that if you're not Google or Facebook, you don't have big data.

Is this adding a tier to that? A gigabyte a second, per car?

(I know, Google has self-driving cars, but forget about that for a moment)


The world is bigger than web companies.

For contrast, one experiment at CERN produces 40TB/sec of sensor data, before downsampling and filtering: https://en.wikipedia.org/wiki/Compact_Muon_Solenoid#Collecti...


Holy crap. And yes, it definitely is; if you haven't looked up Industry 4.0 or Industrial Internet, that entire sector is making a push to sensorize.

As a rearguard defense: Yes, but, what percent of the time is CERN running an experiment [that generates that data]?

According to a quick Google search, average time spent driving is 101 minutes / day.

Totally makes sense that CERN (and, likely, any large Science! effort) produces that level of burst data, but wouldn't these cars produce more data over time?
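
Back-of-the-envelope, using the hypothetical 1 GB/s per car from upthread and the 101 minutes/day average:

    car_rate = 1e9                # bytes/sec while driving
    daily_driving = 101 * 60      # seconds of driving per day
    per_car_daily = car_rate * daily_driving
    print(per_car_daily / 1e12)   # ~6.1 TB/day per car

    cern_burst = 40e12            # bytes/sec while the beam is live
    # CERN matches a car's whole day of data in about 0.15 seconds,
    # but only while an experiment is actually running.
    print(per_car_daily / cern_burst)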

As a different topic, a different friend of mine is of the opinion that AI is dependent on the throughput of data through the system; think about the amount of information your body feeds to your brain, and how much time it was doing that before you were capable of communication.


Similarly, I've worked on robotics platforms where we had to downsample incoming sensor data; otherwise the perception algorithms (which are very statistical) wouldn't be able to spin fast enough.
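
The simplest version of that downsampling is just a rate limiter that drops messages arriving too soon after the last one kept. A sketch, not any particular platform's code:

    def rate_limit(messages, max_hz=10.0):
        """messages: time-ordered iterable of (t_seconds, payload)."""
        min_gap = 1.0 / max_hz
        last_kept = float("-inf")
        for t, payload in messages:
            if t - last_kept >= min_gap:
                last_kept = t
                yield t, payload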


Genomics datasets can be several hundreds of petabytes. I've heard about an exabyte+ dataset too.


Could you explain what type of data it is? Is it camera, lidar, engine controls, input controls, or all combined?

How much driving time does the data cover? One hour? Two?

Do you have open map data related to the data set?


Great questions. It is lidar data.


Thanks for clearing that up! :)



