Just a nit, but the author keeps talking about object recognition while what he was actually doing is image classification. Object recognition actually consists of two tasks: one is classifying the object (this is a beer bottle) and the other is saying where in the image the object is. Additionally it can/should detect multiple objects in the image. This is more complex than classification, which only associates one category with the image.
I don't think there's a consistent terminology. In my computer vision class we called it "object recognition" when it was about recognizing one specific object (this particular car) and "object classification" when deciding the category of the object in the image (in general, like 'car', 'bottle').
One may also call the localization of the object "object detection", with classification as a subsequent step.
But I don't think it's too important how we call it as long as we understand what the task is.
Actually the tensorflow implementation he uses does both segmentation and classification and returns a probabilistic graph of objects. For his application, it's only returning the top result, so it looks more basic than it is.
No, it doesn't and there is no graph returned whatsoever. It's just a list of the top classification labels for the image (see example at the tutorial he cited https://github.com/tensorflow/tensorflow/tree/master/tensorf...). This is not the result of a segmentation but is rather a list of the top labels the model believes this could be. If you look at the top results you'll see they're usually similar/in the same family (again, refer to the example in the linked tutorial, the top 3 labels are: military uniform, suit, academic gown). This is literally the normalized output of the nodes of the last layer in the neural network (where each node corresponds to one category). If you added all probabilities together it'd sum to 1.
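To make that concrete, here's a rough sketch (not the tutorial's actual code; the logit values below are made up) of what that last step looks like: the final layer's raw outputs get pushed through a softmax, and the result is one probability per label, summing to 1.

    import numpy as np

    # Hypothetical raw outputs (logits) of the last layer, one node per category.
    logits = np.array([9.2, 7.9, 7.4, 2.1, 0.3])
    labels = ["military uniform", "suit", "academic gown", "bow tie", "beer bottle"]

    probs = np.exp(logits) / np.exp(logits).sum()    # softmax normalization
    top = sorted(zip(labels, probs), key=lambda x: x[1], reverse=True)
    for label, p in top[:3]:
        print(f"{label}: {p:.3f}")                   # the "top labels" list
    print("sum of all probabilities:", probs.sum())  # always 1.0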
That's my point. With these OTS modules they are only returning known classifiers.
The system has to segment before it classifies. That isn't returned to the user, but gradient descent is happening in the background. Like I said, it's a nitpick but important if you're trying to really build novel CV applications.
One of my gripes with people implementing pre-built modules from TF is that you don't really build any of the hard stuff, and it's pre-trained so not much learning is happening. You can't for example build RL systems with off the shelf TF implementations.
Do you understand how convolutional neural networks work? There is no segmentation involved here at all. The input is the raw pixels of the image. The output is the probability that the image belongs to each of the categories the network is capable of predicting.
Also gradient descent has nothing to do with segmentation at all, I don't understand what you're talking about. Gradient descent is used to find the set of weights that minimizes the error. This is standard in training neural networks of any kind using backpropagation.
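For the record, here's a minimal sketch of that in today's tf.keras API (not the model from the article; the input size and category count are placeholders): raw pixels go in, a softmax layer with one node per category comes out, and plain SGD (gradient descent via backpropagation) is what adjusts the weights. There is no segmentation step anywhere.

    import tensorflow as tf

    # Minimal convolutional classifier: raw pixels in, per-category probabilities out.
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(224, 224, 3)),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu"),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(1000, activation="softmax"),  # one output node per category
    ])

    # Training just runs gradient descent (via backpropagation) to minimize
    # the classification error -- no segmentation happens anywhere.
    model.compile(
        optimizer=tf.keras.optimizers.SGD(learning_rate=0.01),
        loss="sparse_categorical_crossentropy",
        metrics=["accuracy"],
    )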
Like so many other things in the field of AI, general object recognition was the "holy grail" because it was assumed that it required AGI. Now we've figured out a way to do general object recognition without AGI.
Is there a story somewhere of AI researchers concluding general object recognition was the holy grail of AI?
I get that a lot of people downplay achievements in machine learning by saying it's nothing like AGI, but it's almost a meme now that "once upon a time everyone thought that was the holy grail and they're moving the signposts" even when 1) nobody thought that, or 2) some people thought that and some people didn't think that.
In fact, the first thing you learn in an introductory computer vision class is that Marvin Minsky assigned "computer vision" to an undergrad as a summer project in 1966 (wire up a camera to a computer and write a program to understand images). It was the opposite of a holy grail.
Even today, it's hard to make laymen appreciate the advancements in computer vision because for them "seeing" doesn't seem like a difficult thing. Even stupid chickens can see. A chess program is much more impressive to laypeople.
I - for what it is worth - would still say general object recognition, with the emphasis on general, is indeed the holy grail.
The ability to recognize objects like people do is not properly represented by current benchmarks. I can imagine that you can build a perfect robotic "bird spotter", but if you put that in a self-driving car I would not be surprised if it stops for something that's just a shadow, or if you put it on a humanoid it's unable to distinguish its own hand from that of its clone. Imagine two of them cleaning out the dishwasher. :-)
A lot of AI is still working only in lab conditions or restricted application domains. That's why I consider robots and cars so important in driving AI towards the "general" dimension.
Well I can't find any specific references, but I definitely recall getting that impression from old machine vision work. Decades of work to get models that were incredibly complex and hand crafted and barely worked. I don't know if I thought it would require AGI, but it would definitely require significant progress towards general AI.
Whenever we figure out how to do something, we stop calling it AI or AGI. If the trend continues, will we eventually have a general AI, but won't consider it anything special? Will it have been just a small incremental step by then?
I think deep natural language processing will be unambiguous: if you create a machine that says "Yes, I am intelligent, thanks for asking" in a way indistinguishable from a human, it would be hard to disagree. On the other hand, it's entirely possible that that goal will take so much longer than others that we'll have incredibly strong AIs affecting our lives before we notice.
I think it will need some kind of breakthrough. Current advancements are probably incremental as you stated, but having an AGI might need some new theory we don't have currently.
Deep learning is the opposite of incremental. For a long time it was not clear whether/how we could train multi-layer networks efficiently. ImageNet changed everything.
Machine learning people basically agree that there weren't any big breakthroughs in deep learning. The success and the hype is mostly a combination of more computing power and more data. The algorithms (convolutional neural network etc.) were invented back in the 1980s and even earlier.
There have been some improvements but they are incremental indeed. More use of ReLU, dropout etc. But it's not a new paradigm at all.
Convnets follow pretty naturally from multilayer perceptrons. Perhaps backpropagation was a breakthrough, enabling the training of ANNs on data, instead of hand-tuning.
But the idea of neural nets is very old, going back to Rosenblatt and connectionism.
I think a "holy grail" could be understanding complicated intentions and social reasoning. Like "he's only doing that so that it seems that he thinks that the other girl doesn't know that he could otherwise not do the etc. etc."
> Now we've figured out a way to do general object recognition without AGI.
I'm pretty sure we'll eventually learn to do anything without AGI, as a narrow task.
The trick with AGI is putting all those little things together. Perhaps that's the actual recipe for it, somehow. Turtles all the way down, who knows how many levels.
Are you sure? I'm convinced this time is different. Sure, it's been applied to vision first, but I believe the techniques can be applied to almost any sensory input.
Just the recognition part won't cut it. Look at AlphaGo. It took several different techniques combined to beat the best human Go player in the world. That's a step beyond object classification, but still not enough for AGI.
Author here - I guess "the holy grail" is overstated, but recognizing arbitrary objects is a tough problem that people have put a ton of time and energy into. I think to really get it right in all cases you may need general AI.
This was amazing. I'm impressed by your command of both hardware and software technology. Even as a software engineer, I have a hard time trying to make TensorFlow do something for me.
It thinks for around 3 seconds. It doesn't use the Pi's GPU - I bet that if it did it could get a lot faster. You could probably also monkey around with the compiler flags and speed it up. If anyone working on TensorFlow has some ideas, I'd love to hear them.
This was the keyboard - https://www.amazon.com/gp/product/B0179N39KS/ref=oh_aui_sear... - it's incredibly light and cheap but one month later I'm not sure I'd recommend it. A bunch of the keys were randomly mapped incorrectly and it's started to flake out on me a little bit.
Great project. Locomotion and vision are pretty advanced compared to grasping and complex handling of objects. If we could have a workable arm, it would enable much more interesting applications.
There are some affordable robot arms. uArm has a hobbyist-level arm. I have one, and I made a 6DOF force sensor out of a 3Dconnexion SpaceNavigator on the end and got it to talk to ROS. I want to hook a classifier system up to it so it can use tools like end wrenches and get them onto a bolt by feel.
Robot manipulation in unstructured situations still sucks, though. Willow Garage was making progress, but didn't last.
Did the author publish a repo for this? It's easy getting tensorflow going for basic image classification, but the hard part is actually making the robot move in a way that makes sense - using the camera and the sonar data to make decisions and then drive the motors. Or is this not autonomous?
I made an autonomous mode using the sonar sensors and a driving mode. I also made a public repo, but I didn't want to link to it, since the code is so awful (it was just a fun hobby project and I wasn't expecting people to look at it).
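For anyone curious what that kind of autonomous mode can look like, here's a very rough sketch (not the author's code; the pin numbers and motor wiring are assumptions) of reading an HC-SR04 style sonar on a Pi and stopping when something is too close:

    import time
    import RPi.GPIO as GPIO

    # Hypothetical wiring -- adjust to your own setup.
    TRIG, ECHO = 23, 24
    LEFT_FWD, RIGHT_FWD = 17, 27

    GPIO.setmode(GPIO.BCM)
    GPIO.setup(TRIG, GPIO.OUT)
    GPIO.setup(ECHO, GPIO.IN)
    GPIO.setup([LEFT_FWD, RIGHT_FWD], GPIO.OUT)

    def sonar_distance_cm():
        """Ping the sonar sensor and return the distance in centimeters."""
        GPIO.output(TRIG, True)
        time.sleep(0.00001)               # 10 microsecond trigger pulse
        GPIO.output(TRIG, False)
        start = end = time.time()
        while GPIO.input(ECHO) == 0:
            start = time.time()
        while GPIO.input(ECHO) == 1:
            end = time.time()
        return (end - start) * 17150      # echo time -> distance (half the speed of sound)

    try:
        while True:
            if sonar_distance_cm() < 20:  # obstacle ahead: stop
                GPIO.output(LEFT_FWD, False)
                GPIO.output(RIGHT_FWD, False)
            else:                         # path clear: keep driving
                GPIO.output(LEFT_FWD, True)
                GPIO.output(RIGHT_FWD, True)
            time.sleep(0.1)
    finally:
        GPIO.cleanup()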
The mention of sonar sensors, used to prevent the car from running into obstacles, made me think the author was planning to make the robots autonomous too, but maybe just hasn't gotten that part worked out yet. I was also curious how the robot would know when to stop and try classifying something, without the operator explicitly telling it so.
Sorry for the off topic but is anyone else getting very high cpu usage from O'Reilly websites? Any known resolution or work around?
With Chrome developer tools I see one error:
"Uncaught SecurityError: Failed to read the 'localStorage' property from 'Window': Access is denied for this document."