A Formula for Bayesian A/B Testing (evanmiller.org)
93 points by bufo on June 4, 2014 | 21 comments



Very nice, much faster than simulating it. I guess if you're using an informative prior, instead of adding 1 to the number of successes and failures, you add the corresponding parameters of your beta prior?

A pretty good shortcut (if you don't have a log-beta function, for example) is to approximate both A and B with the normal distribution. Then their difference is also normal, so you can just check the probability that a normal random variable is greater than zero.

Specifically, the mean μ of a beta distribution is α/(α+β) and the variance σ^2 is αβ/((α+β)^2(α+β+1)). Use these as the parameters of your normal approximations, and the difference D = p_A - p_B is approximately N(μ_A-μ_B, σ_A^2+σ_B^2). The probability that B beats A is then just the CDF of D evaluated at 0.

In Python:

  from scipy.stats import norm

  def beta_mean(a, b):
      # Mean of a Beta(a, b) distribution.
      return a / (a + b)

  def beta_var(a, b):
      # Variance of a Beta(a, b) distribution.
      return a * b / ((a + b)**2 * (a + b + 1))

  def probability_B_beats_A(α_A, β_A, α_B, β_B):
      # Normal approximation: D = p_A - p_B, so P(p_B > p_A) = P(D < 0).
      mu = beta_mean(α_A, β_A) - beta_mean(α_B, β_B)
      sigma = (beta_var(α_A, β_A) + beta_var(α_B, β_B))**0.5
      return norm.cdf(0, mu, sigma)
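
For instance (the counts here are made up for illustration), with 20/100 conversions on A and 30/100 on B under uniform Beta(1, 1) priors:

  # Hypothetical data: 20/100 conversions on A, 30/100 on B, uniform priors.
  print(probability_B_beats_A(1 + 20, 1 + 80, 1 + 30, 1 + 70))
  # Prints a value above 0.5, since B's observed rate is higher.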


For those of us stuck in Java/Scala land, with attached results: https://gist.github.com/krishnanraman/4be978cbe8979e4c97fa


Great job. Tks


I'm pretty sure this formula is correct, but I haven't seen it published anywhere. John Cook has some veiled references to a closed-form solution when one of the parameters is an integer:

http://www.mdanderson.org/education-and-research/departments... [pdf]

But he doesn't really say what that closed form is, so I think his version must have been pretty hairy. (My version requires all four parameters to be integers, so I doubt we were talking about the same thing.)

Sadly I couldn't get the math to work out for producing a confidence interval on |p_B - p_A| so for now you're stuck with Monte Carlo for confidence bounds.

Thanks to prodding from Steven Noble over at Stripe, I'll have another formula up soon for asking the same question using count data instead of binary data. Stay tuned!


The key background here that is somewhat brushed over in the post is that the Beta distribution is conjugate to the Binomial, in the sense that if your parameter of interest theta has a Beta(a, b) prior, and you observe s successes and f failures from Binomial trials with success probability theta, your posterior is Beta(a + s, b + f).

The 'non-informative' prior can be interpreted as having no prior information about theta, so you just take theta uniform over [0, 1], which is identical to the Beta(1, 1) distribution. Thus you obtain the update equations @EvanMiller derived, and it becomes clearer how to generalize to the informative-prior case.
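
A concrete sketch of that update (the prior parameters and counts below are made up for illustration):

  # Beta-Binomial conjugacy: prior Beta(a, b) plus s successes and f failures
  # gives posterior Beta(a + s, b + f).
  prior_a, prior_b = 1, 1        # uniform (non-informative) prior, Beta(1, 1)
  successes, failures = 30, 70   # hypothetical observed data

  post_a = prior_a + successes   # 31
  post_b = prior_b + failures    # 71

  # An informative prior just changes the starting point, e.g. Beta(5, 45)
  # if you believe the baseline conversion rate is around 10%.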

In general though, I would have thought one would care more about distributional properties of the posterior p_B - p_A, which are more easily computed by MC methods - the closed form solution for P(p_B > p_A) is nice and fast, but it's not like hypothesis testing is in the critical path of a program, so I wouldn't shy away from MC methods here for a more nuanced view of your posteriors.


I don't do this stuff, so this may be a stupid question. But it looks like you have p_A and p_B being independent in the posterior distribution. How does that work? I would assume that they're pretty correlated IRL --- we're not changing that much in A/B testing.

This must be coming from the likelihood function, so you may want to be more specific about the exact experiment you have in mind. I'm not sure off the top of my head whether imposing independence is going to overstate or understate the precision of the estimate...


This paper by John Cook has the inequalities: http://biostats.bepress.com/mdandersonbiostat/paper54/


Thanks for doing this, Evan! Much appreciated - I've already implemented it to compare with our traditional confidence score based A/B metrics.

I'm not sure what you mean by Monte Carlo for confidence bounds - is there some reading I can do to brush up on that?

edit: here is your solution implemented in Ruby: https://gist.github.com/nickelser/6dfa0f2737c502dae267


Confidence interval means: if the expected conversion rate for A is 0.5 and for B is 0.4, what is the actual spread? In other words, does the conversion rate fall, say, 95% of the time in the 0.3-0.5 range or in the 0.35-0.45 range? The more data you collect, the narrower this range will be. Confidence bounds for the individual conversion rates can be calculated from the formula, since they follow beta distributions. What Evan is saying is that the confidence bound for the difference in conversion rates would need to be calculated through repeated simulations, taking (say) the 5th and 95th percentiles of the values that come out.

For example, given that you know the probabilities for A and B (say 0.5 and 0.4) and that the number of trials was 100, you can simply run thousands of simulations, see what the conversion rates come out to be in each simulation, and aggregate the differences in conversion rates at the end. That gives you a confidence bound.
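
A minimal sketch of that simulation (the counts are made up; uniform Beta(1, 1) priors assumed):

  import numpy as np

  rng = np.random.default_rng(0)
  n = 100_000

  # Hypothetical data: 40/100 conversions on A, 50/100 on B.
  p_A = rng.beta(1 + 40, 1 + 60, n)
  p_B = rng.beta(1 + 50, 1 + 50, n)

  diff = p_B - p_A
  lower, upper = np.percentile(diff, [5, 95])  # 5th and 95th percentiles, as above
  print(lower, upper)           # a 90% credible interval for the difference
  print((diff > 0).mean())      # Monte Carlo estimate of P(p_B > p_A)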


Gotcha - thanks so much for writing this up. I'll see about actually implementing this :)


Where do you use the fact that all the parameters are integers? I can only see the requirement for the one parameter you sum over.


If you are unable to find a log-gamma function lying around, rewrite the above equation in terms of factorials using Γ(z)=(z−1)!, and notice that there are an equal number of multiplicative terms in the numerator and denominator. If you alternately multiply and divide one term at a time, you should be able to arrive at an answer without encountering numerical overflow.

Something about the right side of the equation immediately preceding this quote seems to indicate that many of the terms in the numerator would cancel with equivalents in the denominator. I'm not really familiar with CAS systems, but is this the sort of thing they could do? Doing this simplification once when one writes the code seems to be a win over calculating the original expression every time the code runs.


I had the same thought, and went down the simplification rabbit-hole. Here's where I ended up: https://gist.github.com/avibryant/545e63b446a840951d5e


Yes, but numerically approximating the beta function is extremely fast, much faster than running a CAS optimization routine and possibly faster than going through the factorials!

Not really worth it unless this is in an inner-loop.
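
For reference, here's a sketch of the closed form in Python using SciPy's log-beta function (this assumes the summation form from the article, with the sum running over i = 0 .. α_B - 1 and α_B a positive integer):

  from math import exp, log
  from scipy.special import betaln

  def prob_B_beats_A(alpha_A, beta_A, alpha_B, beta_B):
      # Exact P(p_B > p_A) for independent Beta posteriors.
      # Working in log space avoids the overflow the article warns about.
      total = 0.0
      for i in range(alpha_B):
          total += exp(betaln(alpha_A + i, beta_B + beta_A)
                       - log(beta_B + i)
                       - betaln(1 + i, beta_B)
                       - betaln(alpha_A, beta_A))
      return total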


Evan, does this formula account for multiple comparisons (if you have multiple goals and multiple variations)? I guess it would suffer from the same problem: if you have 100 variations and 10 goals, some of them are bound to produce a significant result just by random chance, wouldn't they? In classical testing, you can fix it by making your calculated alpha smaller than the real alpha, so you need much more data if there are multiple comparisons. What happens in the Bayesian case?

Edit: I did some Googling and found this http://www.stat.columbia.edu/~gelman/research/unpublished/mu...


That paper is a very different setup --- the hierarchical model Gelman, Hill, and Yajima use implicitly builds multiple comparisons into their process. You could do the same thing here in principle by drawing alpha and beta for different experiments from an underlying distribution, but I'd be cautious about applying that approach mindlessly.

If you haven't read it, Gelman and Hill's book is excellent: http://www.stat.columbia.edu/~gelman/arm/ and so is Gelman's blog.


Great work! However, an additional testing section would be nice (right below the implementation section).

That section should provide two or three lists of example input values, and the expected output value (up to some accuracy).

As the author notes, although this formula looks pretty simple, you can make a lot of numerical mistakes when implementing it. A test suite would help implementors to catch those mistakes early.


Assuming the Julia implementation is correct, you could write tests based on output from that.
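
A few property-based checks are also possible without any reference output. A sketch (prob_fn is a stand-in for whatever your implementation's function is called):

  import math

  def check(prob_fn):
      # prob_fn(alpha_A, beta_A, alpha_B, beta_B) -> P(p_B > p_A); hypothetical name.
      # Identical posteriors: by symmetry the answer must be 1/2.
      assert math.isclose(prob_fn(10, 20, 10, 20), 0.5, abs_tol=1e-9)
      # Swapping A and B must give complementary probabilities
      # (a tie has probability zero for continuous distributions).
      p = prob_fn(48, 952, 56, 944)
      q = prob_fn(56, 944, 48, 952)
      assert math.isclose(p + q, 1.0, abs_tol=1e-9)
      # More evidence in favor of B should push the answer above 1/2.
      assert prob_fn(10, 20, 20, 10) > 0.5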


Indeed, much faster than Monte Carlo integration:

  5.778  evaluation.py:20(evaluate) # Sampling version
  0.043  evaluation.py:64(evaluate) # Closed formula

(See my A/B testing library https://github.com/bogdan-kulynych/trials)


Here's the Mathematica version:

  PrBbeatsA[αa_, βa_, αb_, βb_] :=
    Sum[Beta[αa + i, βb + βa]/((βb + i) Beta[1 + i, βb] Beta[αa, βa]), {i, 0, αb - 1}]


A minor complaint: that formula calculates a probability (a number in the range [0,1]), rather than odds (a ratio of probabilities).



