Bayesian Optimization for Collaborative Filtering with MLlib (sigopt.com)
42 points by Zephyr314 on Aug 9, 2016 | 30 comments



Wait, Spark has built-in model hyperparameter selection (http://spark.apache.org/docs/latest/ml-tuning.html), which was not mentioned in the article. What advantages does your service offer?
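
For reference, the built-in tuning from that page looks roughly like this (a sketch only; ratingsDF and its column names are placeholders):

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.ml.evaluation.RegressionEvaluator
    import org.apache.spark.ml.tuning.{CrossValidator, ParamGridBuilder}

    val als = new ALS().setUserCol("userId").setItemCol("movieId").setRatingCol("rating")

    // Exhaustive grid: every combination below gets trained and cross-validated
    val grid = new ParamGridBuilder()
      .addGrid(als.rank, Array(8, 12, 16))
      .addGrid(als.regParam, Array(0.01, 0.1, 1.0))
      .build()

    val cv = new CrossValidator()
      .setEstimator(als)
      .setEvaluator(new RegressionEvaluator().setLabelCol("rating").setMetricName("rmse"))
      .setEstimatorParamMaps(grid)
      .setNumFolds(3)

    val cvModel = cv.fit(ratingsDF)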

Relatedly, why are you advocating using MLLib/RDDs when they have been deprecated in favor of ML/DataFrames (http://spark.apache.org/docs/latest/ml-guide.html)?


> Relatedly, why are you advocating using MLLib/RDDs when they have been deprecated in favor of ML/DataFrames (http://spark.apache.org/docs/latest/ml-guide.html)?

Note that the DataFrame version of MLLib (which is still called MLLib btw, even if the package is different) isn't feature complete compared to the RDD version.

For example, the DataFrame version of LogisticRegression is binary only, while the RDD one allows multiple classes. I guess the suggested workaround is to use the OneVsRest classifier (in Spark 2.0), but that blows out the memory on fairly small tasks.
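
For reference, that workaround looks roughly like this (a sketch; trainingDF with "features"/"label" columns is a placeholder):

    import org.apache.spark.ml.classification.{LogisticRegression, OneVsRest}

    // One binary LogisticRegression is trained per class, which is where the
    // memory blow-up on multi-class problems comes from.
    val base  = new LogisticRegression().setMaxIter(10).setRegParam(0.01)
    val ovr   = new OneVsRest().setClassifier(base)
    val model = ovr.fit(trainingDF)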

I like Spark a lot, but MLLib isn't as strong as it appears at first glance.


Great questions!

The Spark hyperparameter selection uses an exhaustive grid search approach, which can take a long time (complexity grows exponentially with the number of parameters to tune) and produce poor results compared to other methods [1]. Bayesian optimization is a great way to tune time-consuming and expensive functions like ML pipelines, where finding a good configuration in a small number of total attempts is the only tractable way to tune the system.
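
To make the exponential growth concrete (illustrative numbers only):

    // With g candidate values per hyperparameter and p hyperparameters,
    // exhaustive grid search needs g^p full training runs.
    val g = 10
    for (p <- 1 to 6)
      println(s"$p parameters -> ${math.pow(g, p).toLong} training runs")
    // 1 -> 10, 2 -> 100, ..., 6 -> 1,000,000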

The ALS implementation in MLlib still requires ratings as an RDD and hasn't moved over to ML/DataFrames yet.

Edit: Looks like the original comment was edited, but this post does in fact use the built-in MLlib ALS implementation.

[1]: http://arxiv.org/abs/1603.09441


> ALS implementation in MLLib still requires ratings in RDD and hasn't moved over to ML/DataFrames yet.

ALS has been available in ML/DataFrames since 1.3.0, per the documentation. (http://spark.apache.org/docs/latest/api/scala/index.html#org...)
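
For reference, the two implementations being discussed live in different packages (the commented-out calls are sketches; ratingsRDD, ratingsDF, and the column names are placeholders):

    // RDD-based API (spark.mllib): takes an RDD[Rating]
    import org.apache.spark.mllib.recommendation.{ALS => MLlibALS, Rating}
    // val rddModel = MLlibALS.train(ratingsRDD, 10, 10, 0.1)   // rank, iterations, lambda

    // DataFrame-based API (spark.ml), available since 1.3.0: an Estimator over DataFrames
    import org.apache.spark.ml.recommendation.{ALS => MlALS}
    // val dfModel = new MlALS().setUserCol("userId").setItemCol("movieId")
    //                          .setRatingCol("rating").fit(ratingsDF)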

> Edit: Looks like the original comment was edited, but this post does in fact use the built in MLlib cross validation methods.

I edited the comment to correct my error that ALS was not mentioned as being native to Spark. However, for hyperparameter cross-validation, I looked at the code in the repository (https://github.com/sigopt/sigopt-examples/blob/master/spark/...), and while Spark's CrossValidator is imported at the beginning of the Scala file (from ML, not MLlib), it is not used in the code, in favor of SigOpt.

I'm confused.


Thanks for pointing out the switch away from RDDs. Looking at the master branch for MLlib on GitHub, it looks like ALS is still using RDDs? https://github.com/apache/spark/blob/master/mllib/src/main/s...

Admittedly I'm not a Spark expert, though. Thanks for bringing this up!


MLlib is for operating on RDDs; ML is for operating on DataFrames.

RDDs themselves, however, are obsolete compared to DataFrames, as DataFrames are faster and cover most of the functionality of RDDs for common use cases.


RDDs are not obsolete. The reason DataFrames are faster is precisely because they are more restrictive than RDDs. If you need to run arbitrary code, RDDs are still more flexible.


Since every DataFrame has a lazy underlying RDD, and several methods on DataFrames simply call the corresponding method on the RDD (e.g. foreach), I am not sure about the "faster" part of your assertion.


They are faster in the sense that many things you previously had to do in an RDD lambda are now DataFrame operations (which are optimized by the Catalyst optimizer).

So if you want to do one of the operations in the sql.functions package [1], then DataFrames (and Datasets) are very valuable.

If not, then they won't give you much benefit. However, you will get a little improvement from the Tachyon out-of-JVM-memory framework [2], which I don't think the RDD version has access to.

[1] http://spark.apache.org/docs/2.0.0/api/python/pyspark.sql.ht...

[2] https://dzone.com/articles/Accelerate-In-Memory-Processing-w...
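
A minimal contrast of the two styles (the "rating" column and its position are assumptions):

    import org.apache.spark.sql.functions.{avg, col}

    // RDD style: the lambda is opaque to Spark, so nothing can be optimized away
    val meanViaRdd = ratings.rdd.map(_.getDouble(2)).mean()

    // DataFrame style: the same aggregation via sql.functions, which Catalyst
    // can optimize before any work is done
    val meanViaDf = ratings.agg(avg(col("rating"))).first().getDouble(0)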


They absolutely are faster, because there are optimizations available on DataFrames that are impossible on RDDs.

Pushdown is the most obvious one. If I don't know what data store is underlying your RDD, I don't know your schema, and I don't know what column you're projecting, pushdown is impossible. I can't know that with an RDD, because all I know when you call map is that you're converting from type A to type B.

DataFrames make that class of optimization possible, because they have more information (your schema, the underlying store) and have more limited operations (select a column, not an arbitrary map operation).
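
A sketch of the difference (the paths, columns, and parseRating helper are made up for illustration):

    import org.apache.spark.sql.functions.col

    // RDD: Spark only sees opaque functions, so it must read and parse everything
    val ids1 = sc.textFile("ratings.csv").map(parseRating).filter(_.rating > 4.0).map(_.movieId)

    // DataFrame over a columnar source: the schema, filter, and projection are visible
    // to Catalyst, so only the needed columns (and row groups) are read
    val ids2 = spark.read.parquet("ratings.parquet").filter(col("rating") > 4.0).select("movieId")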


Thanks for the heads up! We'll make sure to use DataFrames in our next Spark example. As mentioned before, we thought this would be a cool application of Bayesian optimization (especially since the Spark docs advocate the exponentially slower grid search approach as you pointed out), but we are not Spark experts. We appreciate the feedback!


Under the hood, all the existing (as of 2.0) ML pipeline models immediately convert the DataFrame to an RDD and cache it, so there is a one-iteration slowdown for using the DataFrame models versus an equivalently implemented RDD model. None of them try to get a speedup by operating directly on the Tungsten data structures, but that is an obvious next step.


I could give a shit about the hyperparameter tuning (CV... it Works For Me) but your writeup of Gaussian processes and why they are called kriging in spatial stats is awesome.

http://blog.sigopt.com/post/130275376068/sigopt-fundamentals...


Dear apathy (love the name), I wrote that post and am really glad that you liked it. I had another more recent one that focused on a different topic but had a solid paragraph right at the beginning that also stepped through some of the history. It might be useful because it had a bunch of links trying to join content between GP stuff, kriging and RKHS. It is linked below. Have a great day.

http://blog.sigopt.com/post/147952139093/sigopt-fundamentals...


Great post. Why is the kernel trick so important for Gaussian processes? My understanding is that operating in an RKHS is primarily needed to enable that cute little maneuver.

Wait, never mind. If GPR is like an infinite-dimensional linear regression then doing it within an RKHS means you don't actually have to bother with the functional generalization and can get solutions to a potentially ill-behaved loss function along a grid/cube/hypercube/whatever. Is this part of why SigOpt works more efficiently than classical parameter space sampling designs?

Not-so-ninja edit: Time for me to re-read Rasmussen, I think.


There are a couple of different points there, so lemme see if I can answer each of them.

First off, the use of the term "kernel trick" appears, I think, primarily within the machine learning community. It refers to the idea that some other (likely more useful) representation of the data of interest exists, but that the representation might be in a much larger, or even infinite-dimensional, space. Fortunately, in the context of certain algorithms such as support vector machines, that representation never appears by itself ... it only appears when inner-producted (not a word) with another such representation associated with some other piece of data. For certain representations, that inner product can be represented by a reproducing (also called positive definite) kernel, and thus can be computed without ever forming the larger representation. This concept appears in Wikipedia https://en.wikipedia.org/wiki/Reproducing_kernel_Hilbert_spa..., both for the Hilbert space inner product and for the Mercer's series representation. Unfortunately, as is the case with a lot of higher math on Wikipedia, it's more useful as a reference for an expert than to get you rolling on a new topic.
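
In symbols: for a feature map φ into that larger (possibly infinite-dimensional) space, the kernel hands you the inner product directly, so φ(x) never has to be formed:

    k(x, x') = \langle \varphi(x), \varphi(x') \rangle_{\mathcal{H}}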

As far as kernel methods in general, those pop up all over the place, including in the context of Gaussian processes. Gaussian process regression (GPR), also called kernel-based approximation in the numerical analysis community, is one great example of that. I generally try to think of it not as an infinite dimensional linear regression method, but rather as a constrained optimization problem. There are infinitely many functions that pass through a given set of data, thus asking for the "one" is ill-posed ... I am interested in the most well-behaved one. We define the behavior of a function as its RKHS-norm, and the function that minimizes that norm, but still respects the observed data, is the solution to the GPR problem.
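
Written out (this is the standard statement, nothing SigOpt-specific): given data (x_i, y_i), i = 1, ..., n, a kernel K, and the kernel matrix \mathsf{K}_{ij} = K(x_i, x_j), the problem is

    \min_{f \in \mathcal{H}_K} \|f\|_{\mathcal{H}_K} \quad \text{subject to} \quad f(x_i) = y_i, \; i = 1, \dots, n,

and the representer theorem gives the solution

    f^*(x) = \sum_{i=1}^{n} c_i \, K(x, x_i), \qquad c = \mathsf{K}^{-1} y,

which coincides with the posterior mean of the noise-free Gaussian process with covariance K.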

Regarding why SigOpt performs better than a good old-fashioned grid, that can be for multiple reasons. Probably the most important one, at least in my mind, is that SigOpt is not trying to create some perfect model of the function at hand - SigOpt is only interested in optimizing that function. That drives every decision we make, and actually it is something I often need to remind myself of because I grew up in approximation theory. A basic design of experiments is interested in understanding how the function works everywhere, but we can save some expense by more swiftly discarding regions unlikely to contain the optimum.

As far as why RKHS in particular work very well - that deals with the optimality theorems underlying approximation using reproducing kernels. Assuming you have a reasonable function on a reasonable domain, it probably belongs to an RKHS - such functions can be represented very effectively by an appropriate kernel. Now, determining that kernel can be a complicated task (maximum likelihood estimation and cross-validation are common tools), but if you have an acceptable kernel you can make strong statements about the quality of the model. Because the quality of the model is constantly improving, the GP-backed optimization tool is constantly providing a better representation of the true function, and thus pointing towards the optimum more quickly.

There's another, more fundamental reason why searching on a grid in higher dimensions is trouble, and it deals with the fact that the ratio of the volume of a sphere to that of the cube enclosing it decreases as the dimension increases. This means that more and more of the volume sits away from the center of a box as the dimension grows. Using a grid to try to fill that space becomes unacceptably costly. Of course, Gaussian processes have their own issues for larger problems (>30 dimensions, maybe), but SigOpt has more than just Gaussian processes behind the scenes. Also, there have been improvements over the years in GP performance in higher dimensions (for example, http://epubs.siam.org/doi/abs/10.1137/10080138X).
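
Quantitatively, for the unit ball inside the cube [-1, 1]^d:

    \frac{V_{\mathrm{ball}}(d)}{V_{\mathrm{cube}}(d)} = \frac{\pi^{d/2} / \Gamma(d/2 + 1)}{2^d} \approx 0.52 \ (d = 3), \quad \approx 2.5 \times 10^{-3} \ (d = 10), \quad \to 0 \ \text{as} \ d \to \infty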

There are some pretty solid notes for graduate students on this topic at http://math.iit.edu/~fass/590/. That was the class I took to first learn this stuff as a student, and I've tried to contribute back to them over the years. It comes at the topic first from the math side, not the stats side, but there's some stats stuff in there as well. Chapter 8 of those slides contains that theorem regarding the minimum-norm interpolant.

Hope that helps.


Reading the slides now, my goodness a lot of things just became clear to me. What a wonderful collection of pearls. Thank you for "giving back" to the topic in this way.


This is fantastic. You should probably set the no procrastination feature on Hacker News or it'll suck away your writing time like it did mine.

Thanks! Also go Big Red. ;-)


So SigOpt was tuning rank (number of latent factors), number of iterations to run the algorithm (in my experience alternating least squares generally converges within 10-20 iterations, but there'd be no downside to running it longer unless it's overfitting), and the regularization strength.

What optimal parameters did it find for these?


Great question. As you point out, all of these parameters (rank, number of iterations, and the actual regularization term) can in some way contribute to the regularization of the reconstruction of the ratings matrix. We thought it would be interesting to include all of them, and here are the optimal parameters SigOpt found for this experiment:

    rank       = 36
    numIter    = 30
    log_lambda = -2.90405347693
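
For concreteness, plugging those back into the RDD-based MLlib trainer the post uses would look roughly like this (assuming log_lambda is a natural log, so lambda = exp(-2.904...) ≈ 0.055; ratingsRDD is a placeholder for the training data):

    import org.apache.spark.mllib.recommendation.ALS

    val lambda = math.exp(-2.90405347693)   // ≈ 0.055
    val model  = ALS.train(ratingsRDD, /* rank = */ 36, /* iterations = */ 30, lambda)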


I'm one of the co-founders of SigOpt (YC W15) and am happy to answer any questions about this post (or anything about SigOpt).

More info on the methods behind SigOpt can be found at https://sigopt.com/research.


Oh, also, for students: https://sigopt.com/edu

I'm worried this is going to be like good Scotch for me.


There is a package, mlrMBO, created by the great guys who created mlr (absolutely awesome for building pipelines; you will ditch caret in a second!). It's not on Spark, obviously, but I thought some might find it useful.

https://github.com/mlr-org/mlrMBO


How does SigOpt compare to GPs?


Great question. SigOpt is a hosted, scalable ensemble of different Bayesian optimization methods. Many of these methods use Gaussian Processes (GPs) as part of the modeling aspect of their Sequential Model Based Optimization (SMBO). We've written up a primer on Bayesian optimization [1] that goes over some of the different methods, with lots of citations for diving deeper.

We've found that SigOpt compares very favorably [2] to other Bayesian optimization approaches. In addition to this, our hosted platform allows people to harness the full power of GP backed Bayesian optimization with just a few lines of code [3] instead of the sometimes heavy administration required by other methods.

[1]: https://sigopt.ninja/1470077644/pdf/SigOpt_Bayesian_Optimiza...

[2]: http://arxiv.org/abs/1603.09441

[3]: https://sigopt.com/docs


Thanks. Do you guys do any meta-learning based on information from optimization runs on the many instances that SigOpt runs on? E.g., if Alice uses SigOpt for a neural net architecture and Bob does too, is the information from Alice's run used to improve Bob's?


We don't explicitly share information at the experiment level like that, for privacy reasons. All user data is explicitly isolated from other users, on top of the fact that, due to the black-box nature of our optimizer, we often do not know what the underlying method or system being optimized actually is.

We do, however, run a rigorous evaluation framework [1] over our methods as we iteratively improve (and compare to other techniques). This allows us to build up our ensemble of optimization strategies to most efficiently tackle problems that are most important to our users. As we see users leveraging our service for certain types of problems (like mixed continuous/categorical + failure regions) we do try to incorporate them more into our testing, roadmap, and ensemble, but only at the meta level.

[1]: http://arxiv.org/abs/1605.06170


https://sigopt.com/pricing

- Individual: $1,000/month

- Enterprise: Custom pricing

I am not a multi-million $ company, so I guess it's useless for me.


We have a free academic tier as well: https://sigopt.com/edu


Post author here, happy to answer any questions as well.



