Replying as the author -- I do spend some time discussing heteroskedastic noise (beginning in §2.2 and intermittently throughout the following chapters), although you're right that I don't discuss this particular modeling approach. Personally I think that inferring heteroskedastic noise from data alone during Bayesian optimization is likely to be very difficult, as you'll need either a lot of data and/or a very low-dimensional domain in order to identify the variable noise scale. (Note that the example in the hetGP writeup is only in one dimension.)
However, when the noise scale is either variable (but known) or can be modeled with a relatively simple (e.g., parametric) model, there may be some benefit to the added model complexity. Here you could include the parameters of the noise model into the model hyperparameters and proceed following the discussion in chapter 4. In doing so, I would be careful to ensure that the data actually support the heteroskedastic noise hypothesis.
Another approach that might be useful in some contexts is a heavy-tailed noise model such as Student-t errors (§§ 2.8, 11.9, 11.11).
Thanks for your suggestions. For my use case (tuning parameters of a financial market simulation), I'm essentially able to get good noise estimates for free by re-sampling a set of parameters multiple times.
So for example, rather than simulate an entire month in one shot, I'll simulate a day 30 times and therefore have a decent estimate of the noise for that result and be able to clearly distinguish the noise from the covariance of the Gaussian process.
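A minimal sketch of that replicate-based noise estimate (`simulate_day` and its parameters are hypothetical stand-ins for the market simulation): average the replicates to get the observation, and use the sample variance divided by the number of replicates as the variance of that averaged observation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in for simulating one day of the market at
# parameters theta; returns a single noisy scalar metric.
def simulate_day(theta, rng):
    return np.sin(theta) + 0.1 * (1.0 + theta**2) * rng.standard_normal()

theta = 0.5
replicates = np.array([simulate_day(theta, rng) for _ in range(30)])

y = replicates.mean()  # the averaged observation fed to the GP
# Variance of the mean: sample variance divided by the replicate count.
noise_var = replicates.var(ddof=1) / len(replicates)
```

Each evaluated parameter setting then contributes one `(theta, y, noise_var)` triple, so the per-observation noise variances are known rather than inferred.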
The noise in these simulations can vary dramatically in parameter space (easily 10-100x), so it seems like it would be important to model.
That's a fortunate scenario! If you have good noise estimates available, then you can sidestep the need to infer the noise scale and instead simply proceed with "typical" heteroskedastic inference. When the observation noise variances are known, you only need to modify the usual GP inference equations to replace the σ²I term that appears in the homoskedastic case (where σ² is the constant noise scale) with a diagonal matrix N containing the noise variance of each observation along its diagonal.
(One might imagine a slightly more flexible model including a scaling parameter, replacing N with c²N and inferring c from data.)
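A sketch of this modified inference, assuming a squared-exponential kernel for concreteness (the function names and the optional scaling parameter `c` follow the parenthetical above; everything else is standard GP posterior algebra):

```python
import numpy as np

def rbf_kernel(A, B, length_scale=1.0):
    # Squared-exponential covariance between 1-d point sets A and B.
    d2 = (A[:, None] - B[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale**2)

def gp_posterior(X, y, noise_vars, X_star, c=1.0):
    """GP posterior mean/variance with known per-observation noise.

    The homoskedastic term sigma^2 * I is replaced by c^2 * N, where N
    is diagonal with each observation's noise variance; c is the
    optional scaling parameter one could infer from data.
    """
    K = rbf_kernel(X, X)
    N = np.diag(noise_vars)
    K_noisy = K + c**2 * N  # instead of K + sigma**2 * np.eye(len(X))
    K_s = rbf_kernel(X_star, X)
    K_ss = rbf_kernel(X_star, X_star)
    # Cholesky-based solve for numerical stability.
    L = np.linalg.cholesky(K_noisy)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))
    mean = K_s @ alpha
    v = np.linalg.solve(L, K_s.T)
    var = np.diag(K_ss) - np.sum(v**2, axis=0)
    return mean, var

# Example: three observations with very different noise levels.
X = np.array([0.0, 1.0, 2.0])
y = np.sin(X)
noise_vars = np.array([0.01, 0.5, 0.05])  # known, varying 10-50x
mean, var = gp_posterior(X, y, noise_vars, np.array([0.5, 1.5]))
```

Observations with large entries in `noise_vars` are automatically down-weighted in the posterior, which is exactly the behavior you want when the noise varies 10-100x across parameter space.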