Very nice, much faster than simulating it. I guess if you're using an informative prior, instead of adding 1 to the number of successes and failures, you add the corresponding parameters of your beta prior?
A pretty good shortcut (if you don't have a log-beta function, for example) is to approximate both A and B with the normal distribution. Then their difference is also normal, so you can just check the probability that a normal random variable is greater than zero.
Specifically, the mean μ of a beta distribution is α/(α+β) and the variance σ^2 is αβ/((α+β)^2(α+β+1)). Use these as the parameters for your normal approximations, and the difference D = p_A - p_B is approximately N(μ_A-μ_B, σ_A^2+σ_B^2). The probability that B beats A is then just the CDF of D evaluated at 0.
I'm pretty sure this formula is correct, but I haven't seen it published anywhere. John Cook has some veiled references to a closed-form solution when one of the parameters is an integer:
But he doesn't really say what that closed form is, so I think his version must have been pretty hairy. (My version requires all four parameters to be integers, so I doubt we were talking about the same thing.)
Sadly I couldn't get the math to work out for producing a confidence interval on |p_B - p_A| so for now you're stuck with Monte Carlo for confidence bounds.
Thanks to prodding from Steven Noble over at Stripe, I'll have another formula up soon for asking the same question using count data instead of binary data. Stay tuned!
The key background here, which is somewhat brushed over in the post, is that the Beta distribution is conjugate to the Binomial, in the sense that if your parameter of interest theta has a Beta(a, b) prior, and you observe n successes and r failures from Binomial trials with success probability theta (i.e., n + r trials in total), your posterior is now Beta(a + n, b + r).
The 'non-informative' prior can be interpreted as having no special prior information about theta, so you just take theta to be uniform over [0, 1], which is identical to the Beta(1, 1) distribution. Thus you obtain the update equations @EvanMiller derived, and it becomes clearer how to generalize to the informative-prior case.
In general, though, I would have thought one would care more about the distributional properties of the posterior of p_B - p_A, which are more easily computed by MC methods. The closed-form solution for P(p_B > p_A) is nice and fast, but it's not as if hypothesis testing is in the critical path of a program, so I wouldn't shy away from MC methods here if you want a more nuanced view of your posteriors.
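For what it's worth, here is a rough sketch of that MC approach, assuming NumPy; the success/failure counts are made up for illustration, and the prior parameters (a, b) are where an informative prior would go instead of the uniform Beta(1, 1).

    import numpy as np

    # Hypothetical data: successes and failures for each variant (illustrative only)
    succ_A, fail_A = 120, 880
    succ_B, fail_B = 145, 855

    # Conjugate update: Beta(a, b) prior + data -> Beta(a + successes, b + failures).
    # a = b = 1 is the uniform prior from the post; swap in your own values for an
    # informative prior.
    a, b = 1, 1

    rng = np.random.default_rng(0)
    draws = 200_000
    p_A = rng.beta(a + succ_A, b + fail_A, draws)
    p_B = rng.beta(a + succ_B, b + fail_B, draws)

    print("P(p_B > p_A) ~=", (p_B > p_A).mean())
    print("posterior mean of p_B - p_A ~=", (p_B - p_A).mean())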
I don't do this stuff, so this may be a stupid question. But it looks like you have p_A and p_B being independent in the posterior distribution. How does that work? I would assume that they're pretty correlated IRL --- we're not changing that much in A/B testing.
This must be coming from the likelihood function, so you may want to be more specific about the exact experiment you have in mind. I'm not sure off the top of my head whether imposing independence is going to overstate or understate the precision of the estimate...
A confidence interval means: if the expected conversion rate of A is 0.5 and B's is 0.4, what is the actual spread? In other words, 95% of the time, does the conversion rate fall in the 0.3-0.5 range or the 0.35-0.45 range? The more data you collect, the narrower this range will be. Confidence bounds for the individual conversion rates can be calculated from the formula, since they follow beta distributions. What Evan is saying is that the confidence bound for the difference in conversion rates would need to be calculated through repeated simulations, taking (say) the 5th and 95th percentiles of the values that come out.
For example, given that you know the probabilities for A and B (say 0.5 and 0.4) and that the number of trials was 100, you can simply run thousands of simulations, see what the conversion rates come out to be in each simulation, and aggregate the differences in conversion rates at the end. That would give you a confidence bound.
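A quick sketch of that simulation, assuming NumPy; the rates, trial count, and percentiles are just the illustrative numbers from above.

    import numpy as np

    rng = np.random.default_rng(1)
    n_trials, n_sims = 100, 10_000
    p_A, p_B = 0.5, 0.4  # assumed "true" conversion rates for the example

    # Simulate the observed conversion rate of each variant many times
    rate_A = rng.binomial(n_trials, p_A, n_sims) / n_trials
    rate_B = rng.binomial(n_trials, p_B, n_sims) / n_trials

    # The 5th and 95th percentiles of the difference give a 90% bound
    lower, upper = np.percentile(rate_B - rate_A, [5, 95])
    print(f"90% bound on the difference: [{lower:.3f}, {upper:.3f}]")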
If you are unable to find a log-gamma function lying around, rewrite the above equation in terms of factorials using Γ(z)=(z−1)!, and notice that there are an equal number of multiplicative terms in the numerator and denominator. If you alternately multiply and divide one term at a time, you should be able to arrive at an answer without encountering numerical overflow.
Something about the right side of the equation immediately preceding this quote seems to indicate that many of the terms in the numerator would cancel with equivalents in the denominator. I'm not really familiar with CAS systems, but is this the sort of thing they could do? Doing this simplification once when one writes the code seems to be a win over calculating the original expression every time the code runs.
Yes but numerically approximating the beta function is extremely fast, much faster than running a CAS optimization routine and possibly faster than going through the factorials!
Not really worth it unless this is in an inner-loop.
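To make the quoted trick concrete, here is one way the alternating multiply/divide idea could be sketched in Python. This is a generic ratio-of-products helper applied to a ratio of Beta functions with integer arguments, not the post's exact formula, and for very lopsided inputs you would still want log-gamma.

    import math
    from itertools import zip_longest

    def ratio_of_products(numerators, denominators):
        """prod(numerators) / prod(denominators), alternating one multiplication
        and one division at a time so the running value stays moderate instead
        of overflowing inside an intermediate factorial."""
        result = 1.0
        for num, den in zip_longest(numerators, denominators, fillvalue=1.0):
            result = result * num / den
        return result

    def beta_ratio(a1, b1, a2, b2):
        """B(a1, b1) / B(a2, b2) for positive integer arguments, expanded into
        factorial terms via Gamma(z) = (z - 1)!; the numerator and denominator
        end up with the same number of terms."""
        numerators = list(range(1, a1)) + list(range(1, b1)) + list(range(1, a2 + b2))
        denominators = list(range(1, a1 + b1)) + list(range(1, a2)) + list(range(1, b2))
        return ratio_of_products(numerators, denominators)

    # Sanity check against a log-gamma implementation
    log_beta = lambda a, b: math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    print(beta_ratio(130, 880, 45, 900))
    print(math.exp(log_beta(130, 880) - log_beta(45, 900)))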
Evan, does this formula account for multiple comparisons (if you have multiple goals and multiple variations)? I guess it would suffer from the same problem: if you have 100 variations and 10 goals, some of them are bound to produce a significant result just by random chance, aren't they? In classical testing, you can fix this by making your calculated alpha smaller than the real alpha, so you need much more data when there are multiple comparisons. What happens in the Bayesian case?
That paper is a very different setup --- the hierarchical model Gelman, Hill, and Yajima use implicitly builds multiple comparisons into their process. You could do the same thing here in principle by drawing alpha and beta for different experiments from an underlying distribution, but I'd be cautious about applying that approach mindlessly.
Great work! However, an additional testing section would be nice (right below the implementation section).
That section should provide two or three sets of example input values and their expected output values (to some stated accuracy).
As the author notes, although this formula looks pretty simple, you can make a lot of numerical mistakes when implementing it. A test suite would help implementors to catch those mistakes early.
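For example, a few parameter-free sanity checks could be written around whatever function implements the formula. I'm assuming here that it's called probability_B_beats_A and takes (alpha_A, beta_A, alpha_B, beta_B); adjust the name and argument order to your own implementation.

    import math

    def sanity_check(probability_B_beats_A):
        # Identical posteriors: by symmetry the probability must be exactly 1/2
        assert math.isclose(probability_B_beats_A(10, 20, 10, 20), 0.5, abs_tol=1e-6)

        # Swapping A and B must give complementary probabilities (ties have
        # probability zero for continuous distributions)
        p = probability_B_beats_A(5, 15, 9, 11)
        q = probability_B_beats_A(9, 11, 5, 15)
        assert math.isclose(p + q, 1.0, abs_tol=1e-6)

        # More successes for B out of the same number of trials should only
        # increase the probability that B beats A
        assert probability_B_beats_A(10, 90, 20, 80) > probability_B_beats_A(10, 90, 15, 85)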
In Python:
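Here is a rough sketch of the normal-approximation shortcut described earlier in the thread, using only the standard library; the function name is just illustrative.

    from math import erf, sqrt

    def prob_B_beats_A_normal(alpha_A, beta_A, alpha_B, beta_B):
        """Approximate P(p_B > p_A) when p_A ~ Beta(alpha_A, beta_A) and
        p_B ~ Beta(alpha_B, beta_B), using a normal approximation to each Beta."""
        mu_A = alpha_A / (alpha_A + beta_A)
        mu_B = alpha_B / (alpha_B + beta_B)
        var_A = alpha_A * beta_A / ((alpha_A + beta_A) ** 2 * (alpha_A + beta_A + 1))
        var_B = alpha_B * beta_B / ((alpha_B + beta_B) ** 2 * (alpha_B + beta_B + 1))

        # D = p_A - p_B is approximately N(mu_A - mu_B, var_A + var_B);
        # P(p_B > p_A) = P(D < 0), i.e. the CDF of D evaluated at 0.
        mean_diff = mu_A - mu_B
        sd_diff = sqrt(var_A + var_B)
        return 0.5 * (1.0 + erf((0.0 - mean_diff) / (sd_diff * sqrt(2.0))))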