Microsoft Research: 'Each year Microsoft Research hosts hundreds of influential speakers from around the world including leading scientists, renowned experts in technology, book authors, and leading academics and makes videos of these lectures freely available.'
It's a pity Microsoft doesn't expend energy in figuring out how to make opening an email attachment or clicking on a URL safe.
I think it's important to note that all of these methods have been largely superseded by deep learning techniques. For example, we can now directly learn algorithms such as gradient descent[1] and classical inverse problems like superresolution are solvable with deep networks [2]. While there still may be a role for tools like CVX, I anticipate all future progress will come from end-to-end differentiable systems.
Largely superseded? This is a sarcastic comment, right?
In case it isn't: the assertion that these are superseded is categorically false. For any reasonably difficult computational problem, being able to capture the structure of the problem is hugely powerful, far more so than blindly throwing deep learning algorithms at it. Just because you can doesn't mean you should.
For example, algorithms for convex problems in particular can be orders of magnitude more efficient than naive nonlinear approaches. Also consider the case where the problem has some (possibly sparse) structure, where custom solvers can render trivial problems that would otherwise be computationally intractable.
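To make that concrete, here is a toy sketch (made-up sizes, nothing domain-specific): a tridiagonal system with a couple hundred thousand unknowns is easy for a solver that knows about the sparsity, while the same matrix stored densely would need hundreds of gigabytes before a generic O(n^3) solve even starts.

    import numpy as np
    import scipy.sparse as sp
    import scipy.sparse.linalg as spla

    # Toy sketch with made-up sizes: a 1-D Laplacian system is tridiagonal, so a
    # sparse solver handles n = 200,000 unknowns easily, while the same matrix
    # stored densely would take ~320 GB before a generic solve even begins.
    n = 200_000
    main = 2.0 * np.ones(n)
    off = -1.0 * np.ones(n - 1)
    A = sp.diags([off, main, off], offsets=[-1, 0, 1], format="csc")
    b = np.random.default_rng(0).standard_normal(n)

    x = spla.spsolve(A, b)  # exploits the tridiagonal structure
    print("residual norm:", np.linalg.norm(A @ x - b))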
Indeed. Few people talk about where the deep (or for that matter shallow) dirty laundry is. NNs require a fantastic amount of babysitting and experimentation with different configurations the first time around for a specific dataset/task. Once that's done, you do get great results.
I read about an application of neural networks to fluid dynamics that ended up being faster than the usual approaches. At least some mathematical solutions might eventually be superseded by pretrained NNs, at least in some contexts.
The attitude that "deep learning solves everything and we shouldn't bother with other techniques" is primarily one of laziness. There are many types of problems out there that call for many different types of approaches, but it's easier to just declare your favorite is best than it is to continue one's education and development.
I can think of reams of problems, typically handled with convex or heavily-priored approaches, that can't even be connected to a machine-learning formulation yet, and you would claim that somehow deep learning has superseded fields it isn't even connected to? This is unbridled arrogance.
Agreed, though sometimes "the man with a hammer" has just run out of ideas and then DL gives you an inefficient and expensive half-solution (which is better than nothing). Same thing happened with genetic algorithms etc., nothing new under the sun.
Perhaps the above comment can provide an occasion for a useful interchange on the relation of DL to other approaches. I'll try:
Last year a student of one of the authors credited in the video performed an interesting data modeling/structure discovery analysis for me. The problem had ~100 variables and ~300 observations. We used a latent-variable + sparse-interaction model and solved the resulting minimization problem with a convex optimizer, as described in the video.
This approach was preferable to a DL technique for several reasons: we wanted to preserve some interpretability of the discovered latent variables (just 2 or 3 for our problem); we had reason to believe the problem had a sparse-linear structure because of conservation of mass; we didn't have much data relative to the number of variables.
I don't see DL approaches (despite their many successes) as applicable to this kind of problem.
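For concreteness, here is a rough sketch of one common formulation of that kind of model: a sparse-plus-low-rank split of the precision matrix, in the style of Chandrasekaran, Parrilo and Willsky. It is not necessarily the exact model we used; the data here is random, the sizes are scaled down so it solves quickly, and CVXPY stands in for whatever solver the student actually ran.

    import numpy as np
    import cvxpy as cp

    # Hedged sketch, not necessarily our exact model: S captures sparse
    # interactions among the observed variables; L captures the effect of a few
    # latent variables (low rank). Random data, scaled down from ~300 x ~100.
    rng = np.random.default_rng(0)
    n_obs, p = 90, 30
    data = rng.standard_normal((n_obs, p))
    Sigma_hat = np.cov(data, rowvar=False)

    S = cp.Variable((p, p), symmetric=True)  # sparse interaction structure
    L = cp.Variable((p, p), PSD=True)        # low-rank latent-variable effect
    alpha, beta = 0.1, 0.5                   # penalty weights; tuned in practice

    objective = cp.Minimize(
        -cp.log_det(S - L) + cp.trace(Sigma_hat @ (S - L))
        + alpha * cp.sum(cp.abs(S)) + beta * cp.trace(L)
    )
    cp.Problem(objective).solve()

    print("latent variables suggested by L:",
          np.linalg.matrix_rank(L.value, tol=1e-3))

The l1 term keeps the interaction graph sparse and the trace term keeps the rank of L (the number of latent variables) small, which is the interpretability knob I mentioned above.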
The OP motivates his approach with recommender systems, where there are millions of outcome variables (customers), and thousands-to-millions of stimulus variables (products). That would result in a "lot" (considerable understatement) of connections for a DL model to learn.
Unless I'm missing something, the hyper-scale recommender setting is also not the best place for a conventional multi-layer DL model -- too many connections. On the other hand, the explicit sparsity control built into the OP's optimization is really helpful.
Absolutely not. Certainly, if you believe this to be the case, you can go make $1 billion selling that software to the oil industry, since full wave inversion for seismic imaging, one example of an inverse problem, is still an open and very difficult problem.
ML and inverse problems solve two completely different problems. Really, ML would more accurately be called empirical modeling, since we're creating a generic model from empirical data. In inverse problems, we choose an underlying model and then fit this model to our data. The difference between the two is that empirical models use generic models, whereas inverse problems typically use models based on physical laws like continuum mechanics. Because of this, we often don't care about the end model in an inverse problem, but about the variables that parameterize it, because they have physical meaning. Generally, in ML, the parameters don't have physical meaning. The similarity between the two is that we're using an optimization engine to match some kind of model to our data.
To reiterate, in an inverse problem, the variables we solve for typically have a physical meaning. For example, in full wave inversion, we often model the problem using the elastic equations and solve for the tensor that relates stress and strain. This is a 6x6 symmetric matrix (21 variables) at every point in a 3-D mesh. Side note: these meshes are large, so the resulting optimization problem has millions if not billions of variables. This matrix represents the material at that location. In this context, we don't need the end model with all these parameters plugged in. We're just going to look at the material directly, because it tells us things like where the oil is. In the context of the optimization, we will run the simulation, but, really, we don't need a simulator when we're done.
Now, in ML, imagine we did something simple like a multilayer perceptron. Yes, there are more complicated models, but it doesn't matter in this context. What is the physical meaning of the weight matrix and offsets? Saying it's the neurons in the brain is a lie. What if we're modeling acoustic data? Now, if we're just interested in creating a box that maps inputs to outputs, it doesn't matter. However, going back to the seismic world, mapping inputs to outputs just means mapping acoustic sources to travel times. No one cares about this. We want to know the rocks in the ground.
As such, ML and inverse problems use some similar machinery. Specifically, they both use optimization tools. However, they're used in very different places. Presentations like the one in the article are important because solving inverse problems at scale is really, really hard.
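If a toy analogy helps (made-up model and numbers, nothing to do with real seismic physics): in an inverse problem we fit a known forward model and keep the physically meaningful parameter, and the fitted curve itself is almost beside the point.

    import numpy as np
    from scipy.optimize import curve_fit

    # Toy stand-in for an inverse problem: the forward model is known physics
    # (here just exponential decay), and the quantity we care about is the decay
    # rate k, not the fitted curve. Model and numbers are invented for illustration.
    def forward(t, amplitude, k):
        return amplitude * np.exp(-k * t)

    t = np.linspace(0.0, 5.0, 50)
    k_true = 1.3
    data = forward(t, 2.0, k_true) + 0.05 * np.random.default_rng(0).standard_normal(t.size)

    (amplitude_hat, k_hat), _ = curve_fit(forward, t, data, p0=(1.0, 1.0))
    print(f"recovered decay rate k = {k_hat:.3f} (true value {k_true})")
    # A generic ML model trained on (t, data) could predict the curve just as
    # well, but it would never hand back a physically meaningful k.

In full wave inversion the "k" is the stress-strain tensor at every mesh point and the forward model is an elastic wave simulation, but the shape of the problem is the same.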
As an expert in both the umbrella of methods known as compressed sensing and, more recently, deep learning, my opinion is that these techniques are largely complementary. Deep learning is fantastic at, for instance, building appearance models and exploiting the hierarchical nature of natural images, and at hierarchical feature building in general where appropriate, while methods from compressed sensing remain quite useful general methods via which one can efficiently extract low-dimensional structure from noisy and at times even highly corrupted datasets.
It is true that deep learning is now state of the art in certain tasks such as super-resolution for natural images, which were previously the domain of linear inverse problems, and this is due to its ability to learn useful natural image priors, something that isn't possible with simpler linear or slightly non-linear models. Meanwhile, compressed-sensing-style methods still excel in situations where the data does not benefit from hierarchical compression, labeled data is not available, and/or the data is highly corrupted. Take for example the Netflix challenge problem, discussed in the video, for which deep learning is unlikely to offer substantial benefits, at least for the problem as stated (we just observe partial information about movie ratings). Where deep learning could potentially help in that situation is, for instance, in grouping movies according to high-level semantic information derived from text descriptions, other metadata, and even the video content of the films themselves. Those are still somewhat open problems, and they would not necessarily add value, depending on the validity of the low-rank assumption about movie preferences.
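For readers who haven't seen it, one standard convex formulation of that partial-observation setting is nuclear-norm matrix completion. The sketch below uses tiny made-up sizes and random data, with CVXPY as a stand-in solver; the real recommender-scale problem needs specialized first-order methods.

    import numpy as np
    import cvxpy as cp

    # Toy sketch with made-up sizes: recover a low-rank "ratings" matrix from a
    # fraction of observed entries by nuclear-norm minimization.
    rng = np.random.default_rng(0)
    R_true = rng.standard_normal((60, 5)) @ rng.standard_normal((5, 40))  # rank-5 preferences
    W = (rng.random(R_true.shape) < 0.4).astype(float)                    # ~40% of ratings observed

    X = cp.Variable(R_true.shape)
    problem = cp.Problem(
        cp.Minimize(cp.normNuc(X)),         # nuclear norm promotes low rank
        [cp.multiply(W, X - R_true) == 0],  # agree with the observed entries
    )
    problem.solve()
    print("rank of the recovered matrix:", np.linalg.matrix_rank(X.value, tol=1e-3))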
More recently studied problems such as phase retrieval, which are in a sense the most elementary non-linear inverse problems, are now well understood, and have in fact informed our understanding of how information propagates in deep neural networks (http://yann.lecun.com/exdb/publis/pdf/bruna-icml-14.pdf). More generally, the study of favorable outcomes in non-convex optimization, which is informed by recent developments in the umbrella field of compressed sensing, will help drive understanding of what makes training deep neural networks possible and thus how to improve it, the current empirical performance of deep learning being far ahead of any theoretical understanding.
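For anyone curious what "elementary non-linear inverse problem" means concretely, here is a rough real-valued phase-retrieval sketch in the Wirtinger-flow spirit: random toy data, an arbitrary step size and iteration count, and not a faithful reproduction of any particular paper.

    import numpy as np

    # Rough real-valued phase-retrieval sketch in the Wirtinger-flow spirit:
    # recover x from quadratic measurements y_i = (a_i^T x)^2.
    rng = np.random.default_rng(0)
    n, m = 50, 400
    A = rng.standard_normal((m, n))
    x_true = rng.standard_normal(n)
    y = (A @ x_true) ** 2

    # Spectral initialization: leading eigenvector of (1/m) * sum_i y_i a_i a_i^T.
    Y = (A.T * y) @ A / m
    _, eigvecs = np.linalg.eigh(Y)
    x = eigvecs[:, -1] * np.sqrt(y.mean())

    # Plain gradient descent on the non-convex least-squares objective.
    step = 0.2 / y.mean()
    for _ in range(2000):
        Ax = A @ x
        x -= step * (A.T @ ((Ax ** 2 - y) * Ax)) / m

    err = min(np.linalg.norm(x - x_true), np.linalg.norm(x + x_true)) / np.linalg.norm(x_true)
    print(f"relative error up to sign: {err:.3e}")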
Broadly speaking, as opposed to fighting about the relevance of one field or the other, we should strive to achieve better overall results by using both sets of techniques complementarily.
"Deep learning is fantastic at for instance building appearance models and exploiting hierarchical nature of natural images..."
Agreed. And contrast this success with the lack of success of first-principles latent-variable modeling for natural images. A lot of very good researchers spent decades building multi-layer probabilistic models for natural image structures - I'm thinking about the Grenander school, for example. The jury is still out (I think) on the ultimate value of that approach.
But for classification, it turns out to be much more tractable to use DL. You don't need all the semantic information the multi-layer model contains to tell a car from a truck.
As you say, it's better to view these approaches as complementary.