How the backpropagation algorithm works

dave_sullivan · on April 14, 2014

This book is really coming together. It's been a while since I've put together a (100% not comprehensive) list of good places to start if you're looking to learn more and/or use deep learning in your projects.

Open source

Pylearn2 (used to win kaggle galaxies competition): http://deeplearning.net/software/pylearn2/

Theano (symbolic math library used by Pylearn2): http://deeplearning.net/software/theano/

Deep learning tutorials with theano (build your own neural networks): http://www.deeplearning.net/tutorial/

Demos

Convnet JS: http://cs.stanford.edu/people/karpathy/convnetjs/

Sentiment Analysis: http://nlp.stanford.edu:8080/sentiment/rntnDemo.html

3d word cloud (webgl): http://wordcloud.ersatz1.com/

Commercial

Ersatz (I'm a co-founder, it's a PaaS providing neural network software with cloud GPU servers): http://www.ersatzlabs.com

Good Reading

Deep learning of representations: looking forward http://arxiv.org/pdf/1305.0445v2.pdf

Zero-Shot Learning Through Cross-Modal Transfer: http://arxiv.org/pdf/1301.3666v2.pdf <-- C'mon, that's pretty amazing...

Solution for the Galaxy Zoo challenge: http://benanne.github.io/2014/04/05/galaxy-zoo.html

Pylearn2 in practice: http://fastml.com/pylearn2-in-practice/

ma2rten · on April 15, 2014

I while ago I posted a comment here on HN which got number of upvotes but no answer. Could you maybe take stab at it?

I've followed the developments in Neural Networks somewhat, but have never applied deep learning so far. This is seems like a good place to ask a couple of question I've been having for a while.

1. When does it make sense to apply deep learning? Could it potentially be applied successful applied to any difficult problem given enough data? Could it also be good at the type of problems that Random Forest, Gradient Boosting Machines are traditionally good at versus the problems that SVMs are traditionally good at (Computer Vision, NLP)? [1]

2. How much data is enough?

3. What degree of tuning is required to make it work? Are we at the point yet where deep learning works more or less out the box?

4. Is it fair to say that dropout and maxout always work better in practice? [2]

5. What is the computational effort? How long e.g. does it take to classify an ImageNet image (on a CPU / GPU)? How long does it take train a model like that?

6. How on earth does this fit into memory? Say in ImageNet your have (256 pixels * 256 pixels) * (10,000 classes) * 4 bytes = 2.4 GB, for a NN without any hidden layers.

[1] I am overgeneralizing somewhat, I know. It's my way to avoid overfitting.

[2] My lunch today was free.

dave_sullivan · on April 15, 2014

Sure, I'll give it a shot -- feel free to email me if you have further questions, email is in my profile.

1. I think it makes sense to try them for any classification, regression, or feature extraction problem. They don't work all the time, sometimes you really don't need the extra depth--one hidden layer can be fine, and they can be pretty slow to train (even with GPU). I've also seen people try to build their own, implement it wrong, get bad results, then complain NNs don't work. So test for yourself, just make sure you're not doing it wrong.

2. It really depends. More is almost always better.

3. Training a bunch of models using Bayesian optimization to optimize the model hyperparameters (so you don't have to pick them) and putting the last few in an ensemble and averaging results is pretty close to out of the box. This is the workflow we use with ersatz.

4. Despite lunches not being free, you should probably use dropout. It's ridiculously good at preventing over fitting but can take longer to train (although there's been some work w/ "fast dropout" to speed it up)

5. GPU gets you ~40x speed up over CPU. So if you're using CPU and I'm using GPU, I can do in 1 day what would take you a month and a half. And then I might train for a week or more on GPU (I think the imagenet models were trained for a week or two, but not sure how many GPUs used). Otherwise, computational effort varies.

6. You use mini batches, so you load on as many samples as fit in GPU memory (with the model params) and then pull those into smaller batches. You rotate the "large batch" periodically. Neural networks can continue taking in new data and updating their model (online learning) and are particularly attractive for very large data sets.

General points: use GPU, don't build your own unless as an academic exercise, use dropout, test empirically on your own data. And check out Bayesian optimization of hyperparameters, I'm becoming more and more convinced it's better at picking them than human experts anyway.

ma2rten · on April 15, 2014

Thanks.

agibsonccc · on April 14, 2014

This is a great explanation of backpropagation. For those who just want the formulas, my personal favorite has been stanford's ufldl resource:

http://ufldl.stanford.edu/wiki/index.php/Backpropagation_Alg...

The general intuition behind backprop is that, taking prediction error in to account (think how many labels it got wrong) How far off were the predictions? Based on that go back and penalize the weights that caused the error by that much.

Multi layer perceptrons (as well as multi layer deep nets) have multiple layers whereupon you send the input through the network and make a guess.

Then you basically keep updating the weights (iteratively via gradient descent, conjugate gradient, LBFGS,...) till it doesn't change much. It does this by conducting a search navigating using the cost: or objective function. For more in depth, obviously the above book covers this quite in depth.

For those who want to just use deep learning, I will be giving talks at both OSCon and Hadoop Summit this year on distributed deep learning using 2 different frameworks I commit to [1] and [2]. Happy to answer questions!

[1]: http://deeplearning4j.org/ [2]: http://github.com/jpatanooga/Metronome/

avaku · on April 14, 2014

Can I come up with a bit of criticism? This book does provide a great description of the details of the algorithm inner workings (very cute demons too). However, after reading this chapter (sorry I haven't looked at the other ones), there is still a feel of a bit of mystery about why it works, and even more why it might not work. Possibly is is covered in other parts of the book, so I apologise if this criticism is not justified. I am personally a big fan of Christopher Bishop's book Pattern Recognition and Machine Learning, where backprop is described as an architecture for efficient computation of multiple stochastic gradient descents... I was involved with NNs before, but only after understanding where the algorithm for individual neurons comes from, I could properly appreciate the benefits of backprop (and understand the drawbacks).

oskarth · on April 14, 2014

This is chapter two of Michael Nielsen's book on Neural Networks and Deep Learning [1].

If you haven't heard about it before, I highly suggest you check it out.

http://neuralnetworksanddeeplearning.com/about.html

zackmorris · on April 15, 2014

I can't emphasize enough how great it is the he actually provides CODE right from chapter one that implements a network with an arbitrary number of neurons at each layer using NumPy. I've read (arguably) better explanations of how neural networks function, but the code was always either archaic or nonexistent.

To me, the ability to learn artificial intelligence concepts but also pass them on to others in a way that they can be tinkered with signals a tipping point in the field.

I would like to see all of the basic building blocks of AI (such as Bayes classifiers, genetic algorithms, etc) packaged up this way into an API that is as approachable as OpenGL. Then I would like to see multiprocessing libraries like OpenCL/CUDA incorporated internally so that training can happen in milliseconds instead of minutes or hours. With enough eyeballs looking at these things, we might be able to get from heuristics and rules of thumb for training values to something more concrete. It seems like every time I learn a new paradigm it devolves into wishy-washiness because there are just not enough years in a researcher's life to discover the subtle rules at work behind the scenes. The progression seems to always be the same: failure to achieve success over 50%, then reaching 95% after some hours/days of tinkering, then finding the model hits a maximum of 99% and another model must be learned. Rinse, repeat. If that changes, and we’re able to link up various models without having to choose arbitrary constants, machine learning will have arrived IMHO. We could throw hardware at it and let an array of agents evolve in parallel without human intervention until we see which arrangements work best. Eventually that could lead to a theory of mind that actually works because it could learn anything a human could learn, for the most part unsupervised.

I’ve learned just enough about this stuff to wonder about the endgame. I have a hunch that it will involve something akin to version control, so that an AI can try different approaches until it finds a solution, but with the ability to roll back in case it goes off the rails. Does anyone have a starting point for things like imagination in AI, or trial runs that happen in simulation before the AI acts in real life? And maybe how to merge new solutions into existing ones?

michael_nielsen · on April 15, 2014

Author here. I'd love if you could give me pointers to your favourite explanations of how neural networks function (even better, if you can say what you particularly liked about them).

(I really enjoyed reading the remainder of your comment.)

zackmorris · on April 15, 2014

Wow cool, sorry I didn't mean to sound critical, I should have used (perhaps) instead of (arguably). I had to search my bookmarks but honestly can't remember if one of these was the explanation I remember (it could have been in a book when Borders bookstore was still around):

http://natureofcode.com/book/chapter-10-neural-networks/

http://zerkpage.tripod.com/index.htm

http://www.ai-junkie.com/ann/evolved/nnt1.html

Then I used to skim this compilation before it went away:

https://web.archive.org/web/20030601074826/http://www.emsl.p...

Mostly what I remember about the explanation that stood out was that it was concise. So the way it described backpropagation just "clicked" in a way that the previous articles that used tons of summation/matrix math/probability and calculus had not. I’ve read some positively atrocious papers where it was like the authors were showing their notation prowess rather than conveying information.

Your chapters have a good balance, although for someone who’s been out of college for 15 years like me, it might be nice to have a tiny refresher about derivatives before you dive in, especially over several variables/dimensions. I mean like a paragraph, to explain the notation, because the knowledge is in my head but foggy. Everything else was very good, for example when you explained how bias comes from the threshold factor on the other side but makes notation easier. I don’t remember seeing it explained quite that way before. Oh and the sigmoid function always seemed arbitrary too (and frankly turned me off from neural nets because it seemed too analog), but explaining how it simplifies derivatives makes perfect sense now.

Unfortunately with other books, I was also never able to find off-the-shelf code that was approachable or up to date, so I ended up forgetting everything I had learned and had to start over each time until I got tired of trying. I very much like your 72 line code example, where you provide a backpropagation function without explaining it yet. That’s ok, as programmers we encounter that all of the time, and it’s actually kind of fun to decipher the algorithm before reading how it works later. I believe it even allows for multiple hidden layers which is just fantastic.

As for the other stuff, well, I just look at it like this:

Ask a developer to write an iOS app that say, interacts with the twitter or facebook API and presents the user’s N closest friends in a list view and then submit that to the app store, and they can do it from start to finish with no mystery (other than a little help from stack overflow). Each of those steps replies on some pretty heavy lifting like REST APIs, possibly SQL, message passing in objective-c, even a rudimentary understanding of security and encryption for networking and app submission, yet those things have become mainstream.

But ask a developer to do even the simplest machine learning task, like a little data mining or spam filtering, heaven forbid weak AI like image recognition, and it’s a whole different can of worms. Why hasn’t anyone standardized this stuff in the languages that developers use every day, as opposed to MATLAB or prolog or whatnot? Why can’t I literally read a text file and pipe it through a shell script that does a Bayes classifier? Why doesn’t iOS have an AI.framework to go along with its Social.framework? To me these aren’t all that much to ask.

I really feel that we’re still working at the assembly language or hand rolled affine texture mapping level from the 80s and 90s with respect to AI today. We don’t have something like the web metaphor yet to catapult it into the mainstream. That said, I’m extremely optimistic that concurrent (or at least parallelized) languages like Octave/NumPy/Go, and the more approachable functional languages like Haskell/Scala/Clojure might help deliver APIs that we can interact with from more mainstream languages. To me, most weak AI problems have been solved now, and it’s time that programmers have an intuitive understanding of how the algorithms work so that they can combine them in novel ways and get to strong AI in, well, my lifetime hah. Thanks for your efforts and keep it up!

michael_nielsen · on April 17, 2014

Belated thanks, that's extremely helpful. I'll take a look at those links which are unfamiliar to me (I've already read the Nature of Code link, it's a great book.) Tidbits:

+ Yes, the code presented "works" for multiple hidden layers. But it converges too slowly to be useful, except for very small networks. Later chapters introduce new ideas which help backprop converge faster, and which start to make multiple hidden layers quite feasible.

+ On the broader machine learning front, I think scikit learn is a pretty nice general library: it's approachable, easy to use, and has reasonable docs. It'll be interesting to see how it develops.

+ For neural nets in particular, elsewhere in this thread Dave Sullivan has done a nice job distilling out a list of good libraries. I think it'll be a while before there's a real out-of-the-box solution for neural nets, though, since setting parameters for neural nets is something of an art.

cjauvin · on April 14, 2014

This seems like it will be a very interesting book. If anyone is interested, I have written a short and compressed intro to NNs, using very simple Python code:

http://cjauvin.blogspot.ca/2013/10/neural-network-101.html

jey · on April 15, 2014

What are some problems where Deep Learning shines? Does it outperform other algorithms on those problems? Is there an understanding of why?

Context: I have some background in statistical sorts of machine learning algorithms and am genuinely puzzled by this "deep learning" phenomenon and why it's catching on.

ma2rten · on April 15, 2014

Deep learning can expose hidden non-linear relationships in the data. It's state-of-the-art in applications such as object recognition from images and voice recognition. There have also been promising results in the natural language processing field. What all of these have in common is that they are "real" artificial intelligence, i.e. teaching computers things that they have been historically bad at compared to humans.

Search for a talk by Geoffrey Hinton if you are genuinely interested.

Houshalter · on April 15, 2014

Deep learning is the state of the art on image and sound data. It's taking over natural language processing as well.

Neural networks work well because they can learn complicated functions from the data. With image, sound and maybe text, the data has structure that can be exploited by NNs. Convolutional neural networks take advantage of this.

So instead of learning "pixel 4,773 correlates with the output", we can look at a smaller number of image patches and learn "features" from each patch that correlate with the output. We can go further and create multiple layers of these. Each small image patch extracts some features which are used by the layer above it, and so on to the final output.

turbolent · on April 15, 2014

There's a really nice explanation at http://galaxy.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html

plg · on April 14, 2014

Nice description of backprop for sure.

Aren't people using conjugate gradient descent to optimize NN weights now? Sure you need the partial derivatives but ... that's what GPUs are for, right? :)

nimish · on April 14, 2014

Backprop lets you calculate the gradient efficiently. What you do with that gradient is up to you (I would try L-BFGS or something akin to a stochastic variation). So you could use conjugate gradient or some other optimization method

j2kun · on April 15, 2014

Can someone explain to me why NN's are always in layers?

_ikke_ · on April 15, 2014

From the first chapter of this book, it is described that each layer adds more abstraction. The first layer (input) works on individual pixels, the next layer on a part of the image. Each deeper layer can make more high-level decissions.

j2kun · on April 15, 2014

I understand why more layers adds complexity, but that doesn't answer my question: why layers instead of an arbitrary acyclic graph?

jbooth · on April 15, 2014

I'm not an expert, but my guess is because it makes it really easy to materialize the weights between layers as a 2d matrix and we have some really good library code out there for dealing with matrices.

There's probably room for experimenting with more novel constructions and even some papers out there looking into the matter.

j2kun · on April 15, 2014

Efficiency is the expected answer. I'm just wondering if there's a more theoretical reason, such as "every function that can be computed by a non-layered acyclic network can be computed by a complete layered network using only a small number of extra nodes/layers."

jbooth · on April 15, 2014

I think that it can. With some weights of 0 and some weights of 1, you can trivially map 'jumps' that skip from a node in one layer to a node a couple layers distant, by means of some incorporate-no-other-inputs intermediate nodes, right? Sigmoid function on 1 is still 1? Once you have those, it's just a matter of how many layers you need for any acyclic structure, I think.

Although if you wanted to come up with difficult scenarios, it's not hard to think of structures that would make some of those middle layers really tall, or add a lot of middle layers.

j2kun · on April 16, 2014

As I mentioned in another branch of this thread, selectively choosing edges between nodes isn't an option, because in the standard model you have complete incidence between nodes in adjacent layers.

walrus · on April 15, 2014

Feedforward neural networks are acyclic; recurrent neural networks are cyclic. Recurrent NNs are harder to train in general.

j2kun · on April 15, 2014

Layered is much more specific than acyclic. I can come up with an example that is not layered by still acyclic. Just connect nodes across two or more layers.

walrus · on April 15, 2014

Functionally, layered and acyclic NNs are the same thing. An arbitrary acyclic NN acts the same as a layered NN with some of the weights fixed at 1 or 0. (Replace any edge crossing n layers in a non-layered acyclic NN with a chain of n nodes to get a layered NN that responds to input the same as the original non-layered NN.)

I suppose there may be some cases where the extra speed you get by omitting the intermediate nodes pays off. However, I can't imagine a situation where you'd know enough about the problem in advance that you could design the NN's graph in that level of detail.

j2kun · on April 15, 2014

Usually the adjacency between layers is complete, so your modification isn't quite without loss of generality.

I also cannot imagine a situation in which you'd know that much detail, but using a general network would allow one to, for example, dynamically modify the topology of the graph (as real neural networks do regularly).

EDIT: I guess what I'm asking for is a rigorous proof that the two models are equivalent with as little overhead as you say there should be.

argc · on April 15, 2014

Or "Why I wish I was better at math."