For most of the problems for which I've tried to use Bayesian Optimization, I've had poor results because of unknown and heterogeneous noise in the underlying process that I'm trying to optimize.
I believe that modeling the noise directly using a 2nd Gaussian Process [1] could help, but I haven't gotten reliable results. I was hoping this topic would be addressed in the book, but don't see it.
I spent a long time trying to implement noisy Bayesian optimization [1], using both standard libraries and my own implementations, but ultimately I never got it to work very well.
It's a real pity, since a smart optimizer for very noisy functions would be really useful. I was trying to use it for chess engine tuning, since I know DeepMind used it for tuning AlphaZero. I really wonder how they got it to work well.
Were you trying to do optimization using a binary outcome? I am not familiar with many packages that do this out of the box. Your implementation looks like a good start for the binary case, but you will get better results by computing expected improvement on the probability of winning (i.e., on Phi(f(x)) rather than f(x), where Phi() is the standard normal CDF). For more on why, see http://proceedings.mlr.press/v28/tesch13.pdf and https://arxiv.org/pdf/2110.09361.pdf. AFAIK this should be as simple as defining an MCAcquisitionObjective (https://botorch.org/docs/objectives) that passes samples through a torch.distributions.Normal CDF. You can then just pass that objective into qNoisyExpectedImprovement. Feel free to open a botorch GH issue if you try this out or need help!
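A minimal sketch of what I have in mind (untested; it assumes a fitted GP model and its training inputs train_X already exist):

    import torch
    from botorch.acquisition.monte_carlo import qNoisyExpectedImprovement
    from botorch.acquisition.objective import GenericMCObjective

    # Map GP samples of the latent f(x) to win probabilities Phi(f(x)).
    # GenericMCObjective expects a callable mapping sample_shape x batch x q x m
    # tensors to sample_shape x batch x q, hence the squeeze of the output dim.
    win_prob = GenericMCObjective(
        lambda samples, X=None: torch.distributions.Normal(0.0, 1.0).cdf(
            samples.squeeze(-1)
        )
    )

    # model and train_X are assumed: a fitted GP and its training inputs.
    acqf = qNoisyExpectedImprovement(
        model=model,
        X_baseline=train_X,
        objective=win_prob,
    )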
I would also recommend using many more raw_samples and num_restarts in optimize_acqf(); 512 and 20, respectively, are good defaults. The current values are much too small to effectively optimize the acquisition function.
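For example (again a sketch; bounds is assumed to be a 2 x d tensor of box constraints on the parameters):

    from botorch.optim import optimize_acqf

    candidate, acq_value = optimize_acqf(
        acq_function=acqf,
        bounds=bounds,
        q=1,                # one candidate per iteration
        num_restarts=20,    # random restarts for the inner optimizer
        raw_samples=512,    # initial samples used to seed the restarts
    )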
Yes, I'm using a binary outcome, since that's what I get from playing a game. To get probabilities I'd have to play a lot of games with the same settings/features/point and take the mean, but that seems to defeat the point of Bayesian optimization finding the best point to evaluate at each iteration.
The SPSA (simultaneous perturbation stochastic approximation) method seems to work quite well with binary outcomes; this is what I was trying to beat. Unfortunately, I was never able to converge faster than SPSA (or even close to it), even when increasing the number of samples.
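For reference, the core of SPSA is only a few lines. A rough sketch, with a hypothetical play_game() that returns 1 for a win and 0 otherwise:

    import numpy as np

    # Rough SPSA sketch for maximizing a noisy binary objective.
    def spsa(theta, play_game, n_iters=1000, a=0.1, c=0.1,
             alpha=0.602, gamma=0.101):
        for k in range(1, n_iters + 1):
            ak = a / k ** alpha  # decaying step size
            ck = c / k ** gamma  # decaying perturbation size
            # Rademacher perturbation: +/-1 in each coordinate.
            delta = np.random.choice([-1.0, 1.0], size=theta.shape)
            # Two noisy evaluations give a gradient estimate in a random
            # direction; for Rademacher delta, 1/delta == delta elementwise.
            g_hat = ((play_game(theta + ck * delta)
                      - play_game(theta - ck * delta)) / (2 * ck)) * delta
            theta = theta + ak * g_hat  # ascend to maximize win rate
        return theta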
There is a pretty long thread of us trying to make it work here: https://www.talkchess.com/forum3/viewtopic.php?f=7&t=71650&h...
Replying as the author -- I do spend some time discussing heteroskedastic noise (beginning in §2.2 and intermittently throughout the following chapters), although you're right that I don't discuss this particular modeling approach. Personally I think that inferring heteroskedastic noise from data alone during Bayesian optimization is likely to be very difficult, as you'll need a lot of data and/or a very low-dimensional domain in order to identify the variable noise scale. (Note that the example in the hetGP writeup is only in one dimension.)
However, when the noise scale is either variable (but known) or can be modeled with a relatively simple (e.g., parametric) model, there may be some benefit to the added model complexity. Here you could include the parameters of the noise model among the model hyperparameters and proceed following the discussion in chapter 4. In doing so, I would be careful to ensure that the data actually support the heteroskedastic noise hypothesis.
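As a hypothetical illustration, one simple parametric choice is a log-linear noise model whose parameters theta are fit alongside the kernel hyperparameters by maximizing the (zero-mean) log marginal likelihood:

    % hypothetical log-linear heteroskedastic noise model
    \sigma^2(x; \theta) = \exp(\theta_0 + \theta_1^\top x)
    % with N_\theta = \mathrm{diag}\bigl(\sigma^2(x_i; \theta)\bigr):
    \log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top (K + N_\theta)^{-1} y
        - \tfrac{1}{2} \log \det (K + N_\theta) - \tfrac{n}{2} \log 2\pi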
Another approach that might be useful in some contexts is a heavy-tailed noise model such as Student-t errors (§§ 2.8, 11.9, 11.11).
Thanks for your suggestions. For my use case (tuning parameters of a financial market simulation), I'm essentially able to get good noise estimates for free by re-sampling a set of parameters multiple times.
So, for example, rather than simulate an entire month in one shot, I'll simulate a day 30 times, which gives me a decent estimate of the noise for that result and lets me clearly distinguish the noise from the covariance of the Gaussian process.
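In code this amounts to something like the following sketch, with a hypothetical simulate_day() and the parameter vector params assumed to be defined:

    import numpy as np

    # Replicate-based noise estimate: run the same parameters 30 times
    # and attach the variance of the mean to the observation.
    results = np.array([simulate_day(params) for _ in range(30)])
    y_mean = results.mean()                         # value given to the GP
    noise_var = results.var(ddof=1) / len(results)  # variance of the mean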
The noise in these simulations can vary dramatically in parameter space (easily 10-100x), so it seems like it would be important to model.
That's a fortunate scenario! If you have good noise estimates available then you can sidestep the need to infer the noise scale and instead simply proceed with "typical" heteroskedastic inference. When the observation noise variances are known, you only need to modify the typical GP inference equations to replace the σ²I term that appears in the homoskedastic case (where σ² is the constant noise scale) with a diagonal matrix N whose diagonal contains the noise variance associated with each observation.
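Concretely, assuming a zero prior mean, the posterior mean and variance at a test point x_* become:

    \mu(x_*) = k_*^\top (K + N)^{-1} y
    \sigma^2(x_*) = k(x_*, x_*) - k_*^\top (K + N)^{-1} k_*
    \text{where } N = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)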
(One might imagine a slightly more flexible model including a scaling parameter, replacing N with c²N and inferring c from data.)
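In BoTorch, I believe this corresponds to a fixed-noise GP. A sketch, assuming tensors train_X (n x d), train_Y (n x 1), and the known per-observation noise_var (n x 1) already exist:

    from botorch.fit import fit_gpytorch_mll
    from botorch.models import FixedNoiseGP
    from gpytorch.mlls import ExactMarginalLogLikelihood

    # diag(noise_var) plays the role of N in place of sigma^2 * I.
    model = FixedNoiseGP(train_X, train_Y, train_Yvar=noise_var)
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_mll(mll)  # fits kernel hyperparameters; noise stays fixed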
Bayesian optimization and Bayesian data analysis are two different things. Your suggestion seems off topic to me, and it feels like you didn't even check the linked content before posting.
AFAIK, the nicest, simplest book that gives a taste of Bayesian probabilistic reasoning is "Model-Based Machine Learning" by John Winn & co. -- http://mbmlbook.com/
It consists of friendly and readable chapters, each based on an example. Note that the book is very basic, and the mathematics & methods are much less sophisticated than other texts (such as the one posted above), but the concepts are elucidated very nicely.
Bayesian optimisation can be used in cases where gradient evaluation is expensive. In neural networks, this can be applied to selecting the hyper-parameters of a model being trained.
I think of Bayes-opt as addressing multiple challenges: (a) the function being a black box, i.e. gradient info isn't available (b) tolerance to some level of noise - the author addresses this in one of his comments here (c) economic # of calls to the black box function because it might be expensive to evaluate (like in the case of hyperparameter optimization, where a function evaluation means retraining the model).
Powell's methods - COBYLA et al. - address only (a), I think. But I might be wrong, not having worked with them extensively. Do they also address the other challenges?
Bayesian methods are also preferred due to their well-understood theoretical properties. This is one of the things you really want when you build safety-critical models.
I'm having a hard time imagining how using Bayesian optimization would improve safety during the search process, especially since in practice it often tends to just give approximately uniform sampling of a space anyway. It really likes the unexplored territory and is likely to try things far away from the optimum that are more risky.
Some research has explicitly considered the question of safety during exploration; see for example https://arxiv.org/pdf/1602.04450.pdf which includes theoretical analysis of the resulting algorithm.
Interesting, although the existence of this research actually suggests that normal Bayesian opt by itself is not safe and needs to be modified to be so.
A distinctive trait of Bayesian opt is the modelling of unexplored space. The attraction to exploring that space actually makes it less safe than other methods that do not explicitly care.
One could go through similar steps to model and generate safe steps in other methods; it doesn't seem specific to Bayesian opt. It might even work better, since it's less likely to be as computationally expensive as the auxiliary optimization within Bayesian opt tends to be.
Indeed -- safety in exploration is not a relevant concern in all (or even in most) scenarios. However, in domains such as robotics where certain configurations may have unforeseen and severe consequences, you may want to slow down exploration and proceed more cautiously. As you suggest, you can optimize more efficiently if you can explore the domain without any safety concerns.
You're correct that the approach taken in the linked paper could be adapted to increase the safety of other sequential design settings (when needed), assuming you have access to a model that can quantify uncertainty.
I mean it in terms of explainability. The point is that Bayesian methods can be used to investigate theoretical properties of machine learning models that might otherwise be opaque. They can also provide estimates of model uncertainty.
Looks really promising, will give it a read through!
For those looking for an easier entry into Bayesian analysis, as a related topic and a starting place before embarking on Bayesian optimization, I would highly recommend "Statistical Rethinking" by Richard McElreath: https://xcelab.net/rm/statistical-rethinking/. What I really like about Richard's book is that it bypasses a lot of the heavy mathematical/integral work and goes straight into sampling -- in my experience, you generally can't do integrals by hand; you describe your model as a hierarchy of probability distributions and let an MCMC sampler take care of the rest. Richard's book also touches upon causality (an important and often overlooked topic in ML!), and you can follow his course online: https://github.com/rmcelreath/stat_rethinking_2022
EDIT: made my recommendation more explicit as a related topic.
Maybe you meant your recommended book as related reading, but the linked text is about Bayesian optimization, which is different from Bayesian analysis. Sure, the optimization uses some Bayesian ideas, but the majority of the challenges are different.
[1] https://rdrr.io/cran/hetGP/