For most of the problems for which I've tried to use Bayesian Optimization, I've had poor results because of unknown and heterogeneous noise in the underlying process that I'm trying to optimize.
I believe that modeling the noise directly using a 2nd Gaussian Process [1] could help, but I haven't gotten reliable results. I was hoping this topic would be addressed in the book, but don't see it.
I spent a long time trying to implement noisy Bayesian optimization [1], using both standard libraries and my own implementations, but ultimately I never got it to work very well.
It's a real pity, since a smart optimizer for very noisy functions would be really useful. I was trying to use it for chess engine tuning, since I know DeepMind used it for tuning AlphaZero. I really wonder how they got it to work well.
Were you trying to do optimization using a binary outcome? I am not familiar with many packages that do this out of the box. Your implementation looks like a good start for the binary case, but you will get better results by computing expected improvement on the probability of winning (i.e., on Phi(f(x)) rather than f(x), where Phi() is the standard normal CDF). For more on why, see http://proceedings.mlr.press/v28/tesch13.pdf and https://arxiv.org/pdf/2110.09361.pdf. AFAIK this should be as simple as defining an MCAcquisitionObjective (https://botorch.org/docs/objectives) that passes samples through a torch.distributions.Normal CDF. You can then just pass that objective into qNoisyExpectedImprovement. Feel free to open a botorch GH issue if you try this out or need help!
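A minimal sketch of what I have in mind (untested; it assumes a fitted GP model and its training inputs train_X already exist):

    import torch
    from botorch.acquisition.monte_carlo import qNoisyExpectedImprovement
    from botorch.acquisition.objective import GenericMCObjective

    # Map GP samples of the latent f(x) to win probabilities Phi(f(x)).
    # GenericMCObjective expects a callable mapping sample_shape x batch x q x m
    # tensors to sample_shape x batch x q, hence the squeeze of the output dim.
    win_prob = GenericMCObjective(
        lambda samples, X=None: torch.distributions.Normal(0.0, 1.0).cdf(
            samples.squeeze(-1)
        )
    )

    # model and train_X are assumed: a fitted GP and its training inputs.
    acqf = qNoisyExpectedImprovement(
        model=model,
        X_baseline=train_X,
        objective=win_prob,
    )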
I would also recommend using many more raw_samples and num_restarts in optimize_acqf(); 512 and 20, respectively, are good defaults. The current values are much too small to effectively optimize the acquisition function.
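For example (again a sketch; bounds is assumed to be a 2 x d tensor of box constraints on the parameters):

    from botorch.optim import optimize_acqf

    candidate, acq_value = optimize_acqf(
        acq_function=acqf,
        bounds=bounds,
        q=1,                # one candidate per iteration
        num_restarts=20,    # random restarts for the inner optimizer
        raw_samples=512,    # initial samples used to seed the restarts
    )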
Yes, I'm using a binary outcome, since that's what I get from playing a game. To get probabilities I'd have to play a lot of games with the same settings/features/point and take the mean, but that seems to defeat the point of Bayesian optimization finding the best point to evaluate at each iteration.
The SPSA (simultaneous perturbation stochastic approximation) method seems to work quite well with binary outcomes; this is what I was trying to beat. Unfortunately, I was never able to converge faster than SPSA (or even close to it), even when increasing the number of samples.
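For reference, the core of SPSA is only a few lines. A rough sketch, with a hypothetical play_game() that returns 1 for a win and 0 otherwise:

    import numpy as np

    # Rough SPSA sketch for maximizing a noisy binary objective.
    def spsa(theta, play_game, n_iters=1000, a=0.1, c=0.1,
             alpha=0.602, gamma=0.101):
        for k in range(1, n_iters + 1):
            ak = a / k ** alpha  # decaying step size
            ck = c / k ** gamma  # decaying perturbation size
            # Rademacher perturbation: +/-1 in each coordinate.
            delta = np.random.choice([-1.0, 1.0], size=theta.shape)
            # Two noisy evaluations give a gradient estimate in a random
            # direction; for Rademacher delta, 1/delta == delta elementwise.
            g_hat = ((play_game(theta + ck * delta)
                      - play_game(theta - ck * delta)) / (2 * ck)) * delta
            theta = theta + ak * g_hat  # ascend to maximize win rate
        return theta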
There is a pretty long thread of us trying to make it work here: https://www.talkchess.com/forum3/viewtopic.php?f=7&t=71650&h...
Replying as the author -- I do spend some time discussing heteroskedastic noise (beginning in §2.2 and intermittently throughout the following chapters), although you're right that I don't discuss this particular modeling approach. Personally I think that inferring heteroskedastic noise from data alone during Bayesian optimization is likely to be very difficult, as you'll need a lot of data and/or a very low-dimensional domain in order to identify the variable noise scale. (Note that the example in the hetGP writeup is only in one dimension.)
However, when the noise scale is either variable (but known) or can be modeled with a relatively simple (e.g., parametric) model, there may be some benefit to the added model complexity. Here you could include the parameters of the noise model among the model hyperparameters and proceed following the discussion in chapter 4. In doing so, I would be careful to ensure that the data actually support the heteroskedastic noise hypothesis.
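As a hypothetical illustration, one simple parametric choice is a log-linear noise model whose parameters theta are fit alongside the kernel hyperparameters by maximizing the (zero-mean) log marginal likelihood:

    % hypothetical log-linear heteroskedastic noise model
    \sigma^2(x; \theta) = \exp(\theta_0 + \theta_1^\top x)
    % with N_\theta = \mathrm{diag}\bigl(\sigma^2(x_i; \theta)\bigr):
    \log p(y \mid X, \theta) = -\tfrac{1}{2} y^\top (K + N_\theta)^{-1} y
        - \tfrac{1}{2} \log \det (K + N_\theta) - \tfrac{n}{2} \log 2\pi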
Another approach that might be useful in some contexts is a heavy-tailed noise model such as Student-t errors (§§ 2.8, 11.9, 11.11).
Thanks for your suggestions. For my use case (tuning parameters of a financial market simulation), I'm essentially able to get good noise estimates for free by re-sampling a set of parameters multiple times.
So, for example, rather than simulate an entire month in one shot, I'll simulate a day 30 times, which gives me a decent estimate of the noise for that result and lets me clearly distinguish the noise from the covariance of the Gaussian process.
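In code this amounts to something like the following sketch, with a hypothetical simulate_day() and the parameter vector params assumed to be defined:

    import numpy as np

    # Replicate-based noise estimate: run the same parameters 30 times
    # and attach the variance of the mean to the observation.
    results = np.array([simulate_day(params) for _ in range(30)])
    y_mean = results.mean()                         # value given to the GP
    noise_var = results.var(ddof=1) / len(results)  # variance of the mean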
The noise in these simulations can vary dramatically in parameter space (easily 10-100x), so it seems like it would be important to model.
That's a fortunate scenario! If you have good noise estimates available then you can sidestep the need to infer the noise scale and instead simply proceed with "typical" heteroskedastic inference. When the observation noise variances are known, you only need to modify the typical GP inference equations to replace the σ²I term that appears in the homoskedastic case (where σ² is the constant noise scale) with a diagonal matrix N whose diagonal contains the noise variance associated with each observation.
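Concretely, assuming a zero prior mean, the posterior mean and variance at a test point x_* become:

    \mu(x_*) = k_*^\top (K + N)^{-1} y
    \sigma^2(x_*) = k(x_*, x_*) - k_*^\top (K + N)^{-1} k_*
    \text{where } N = \mathrm{diag}(\sigma_1^2, \dots, \sigma_n^2)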
(One might imagine a slightly more flexible model including a scaling parameter, replacing N with c²N and inferring c from data.)
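In BoTorch, I believe this corresponds to a fixed-noise GP. A sketch, assuming tensors train_X (n x d), train_Y (n x 1), and the known per-observation noise_var (n x 1) already exist:

    from botorch.fit import fit_gpytorch_mll
    from botorch.models import FixedNoiseGP
    from gpytorch.mlls import ExactMarginalLogLikelihood

    # diag(noise_var) plays the role of N in place of sigma^2 * I.
    model = FixedNoiseGP(train_X, train_Y, train_Yvar=noise_var)
    mll = ExactMarginalLogLikelihood(model.likelihood, model)
    fit_gpytorch_mll(mll)  # fits kernel hyperparameters; noise stays fixed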
Bayesian optimization and Bayesian data analysis are two different things. Your suggestion seems off topic to me, and it feels like you didn't even check the linked content before posting.
AFAIK, the nicest, simplest book that gives a taste of Bayesian probabilistic reasoning is "Model-Based Machine Learning" by John Winn & co. -- http://mbmlbook.com/
It consists of friendly and readable chapters, each based on an example. Note that the book is very basic, and the mathematics & methods are much less sophisticated than other texts (such as the one posted above), but the concepts are elucidated very nicely.
Bayesian optimisation can be used in cases where gradient evaluation is expensive. In neural networks, this can be applied to selecting the hyper-parameters of a model being trained.
I think of Bayes-opt as addressing multiple challenges: (a) the function being a black box, i.e. gradient info isn't available (b) tolerance to some level of noise - the author addresses this in one of his comments here (c) economic # of calls to the black box function because it might be expensive to evaluate (like in the case of hyperparameter optimization, where a function evaluation means retraining the model).
Powell's methods - COBYLA et al. - address only (a), I think. But I might be wrong, not having worked with them extensively. Do they also address the other challenges?
Bayesian methods are also preferred due to their well-understood theoretical properties. This is one of the things you really want when you build safety-critical models.
I'm having a hard time imagining how using Bayesian optimization would improve safety during the search process, especially since in practice it often tends to just give approximately uniform sampling of a space anyway. It really likes the unexplored territory and is likely to try things far away from the optimum that are more risky.
Some research has explicitly considered the question of safety during exploration; see for example https://arxiv.org/pdf/1602.04450.pdf which includes theoretical analysis of the resulting algorithm.
Interesting, although the existence of this research actually suggests that normal Bayesian opt by itself is not safe and needs to be modified to be so.
A distinctive trait of Bayesian opt is the modelling of unexplored space. The attraction to exploring that space actually makes it less safe than other methods that do not explicitly care.
One could go through similar steps to model and generate safe steps in other methods; it doesn't seem specific to Bayesian opt. It might even work better, since it's less likely to be as computationally expensive as the auxiliary optimization within Bayesian opt tends to be.
Indeed -- safety in exploration is not a relevant concern in all (or even in most) scenarios. However, in domains such as robotics where certain configurations may have unforeseen and severe consequences, you may want to slow down exploration and proceed more cautiously. As you suggest, you can optimize more efficiently if you can explore the domain without any safety concerns.
You're correct that the approach taken in the linked paper could be adapted to increase the safety of other sequential design settings (when needed), assuming you have access to a model that can quantify uncertainty.
I mean it in terms of explainability. The point is that Bayesian methods can be used to investigate theoretical properties of machine learning models that might otherwise be opaque. They can also provide estimates of model uncertainty.
Looks really promising, will give it a read through!
For those looking for an easier entry into Bayesian analysis, as a related topic and a starting place before embarking on Bayesian optimization, I would highly recommend "Statistical Rethinking" by Richard McElreath: https://xcelab.net/rm/statistical-rethinking/. What I really like about Richard's book is that it bypasses a lot of the heavy mathematical/integral work and goes straight into sampling -- in my experience, you generally can't do integrals by hand; you describe your model as a hierarchy of probability distributions and let an MCMC sampler take care of the rest. Richard's book also touches upon causality (an important and often overlooked topic in ML!), and you can follow his course online: https://github.com/rmcelreath/stat_rethinking_2022
EDIT: made my recommendation more explicit as a related topic.
Maybe you meant your recommended book as related reading, but the linked text is about Bayesian optimization, which is different from Bayesian analysis. Sure, the optimization uses some Bayesian ideas, but the majority of the challenges are different.
[1] https://rdrr.io/cran/hetGP/