This is yet another article that ignores the fact that there is a MUCH better approach to this problem.
Thompson sampling avoids the problems of multiple testing, power, early stopping and so on by starting with a proper Bayesian approach. The idea is that the question we want to answer is closer to "Which alternative is nearly as good as the best with pretty high probability?". This is very different from the question being answered by a classical test of significance. Moreover, it would be good if we could answer the question partially by decreasing the number of times we sample options that are clearly worse than the best. What we want to solve is the multi-armed bandit problem, not the retrospective analysis of experimental results problem.
The really good news is that Thompson sampling is both much simpler than hypothesis testing and can be done in far more complex situations. It is known to be an asymptotically optimal solution to the multi-armed bandit problem and often takes only a few lines of very simple code to implement.
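For concreteness, here is a minimal sketch of Beta-Bernoulli Thompson sampling in Python; the class and parameter names are illustrative, not taken from any particular library:

```python
import random

# Minimal Beta-Bernoulli Thompson sampling sketch. Each arm keeps a Beta
# posterior over its conversion rate; we sample one plausible rate per
# arm and play the arm whose sampled rate is highest.
class ThompsonSampler:
    def __init__(self, n_arms):
        self.successes = [1] * n_arms  # Beta(1, 1) uniform prior
        self.failures = [1] * n_arms

    def choose_arm(self):
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def update(self, arm, converted):
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```

Arms that keep losing get sampled less and less often, which is exactly the "decreasing the number of times we sample options that are clearly worse" behaviour described above.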
Thompson sampling is a great tool. I've used it to make reasonably large amounts of money. But it does not solve the same problem as A/B testing.
Thompson Sampling (at least the standard approach) assumes that conversion rates do not change. In reality they vary significantly over a week, and this fundamentally breaks bandit algorithms.
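To make the objection concrete, here is a toy simulation sketch (all rates and numbers invented) in which the best arm flips between weekdays and weekends; any sampler exposing the choose_arm()/update() interface of the sketch above can be dropped in to see how it copes with the flip:

```python
import random

# Toy non-stationary environment: arm 1 converts better on weekdays,
# arm 0 converts better on weekends. All rates are made up.
TRUE_RATES = {
    "weekday": [0.04, 0.06],
    "weekend": [0.07, 0.05],
}

def simulate(sampler, days=28, visitors_per_day=500):
    """Run any sampler with choose_arm() and update(arm, converted)."""
    conversions = 0
    for day in range(days):
        regime = "weekend" if day % 7 >= 5 else "weekday"
        for _ in range(visitors_per_day):
            arm = sampler.choose_arm()
            converted = random.random() < TRUE_RATES[regime][arm]
            sampler.update(arm, converted)
            conversions += converted
    return conversions
```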
Furthermore, you do not need to use Thompson Sampling to have a proper Bayesian approach. At VWO we also use a proper Bayesian approach, but we use A/B testing to avoid the various pitfalls that Thompson Sampling has. Google Optimize uses an approach very similar to ours (although it may be flawed [1]), and so does A/B Tasty (probably not flawed).
Note: I'm the Director of Data Science at VWO. Obviously I'm biased, etc. However my post critiquing bandits was published before I took on this role. It was a followup to a previous post of mine which led people to accidentally misuse bandits: https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs...
> Depending on what your website is selling, people will have a different propensity to purchase on Saturday than they have on Tuesday.
This affects multi-armed bandits and fixed tests alike. If you run a fixed A/B test on Tuesday, your results will also be wrong. Either way, you have to decide what kind of seasonality your data has, and avoid making adjustments until the period is complete.
If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.
> Delayed response is a big problem when A/B testing the response to an email campaign.
This affects multi-armed bandits and fixed tests alike. If you include immature data in your significance test, your results will be wrong. Either way, you have to decide how long it takes to declare an individual success or failure.
> You don't get samples for free by counting visits instead of users
This affects multi-armed bandits and fixed tests alike. Focusing on relevant data increases the power of your experiment.
---
For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-armed bandit tests are harder because of these design decisions, despite the fact that they affect any experimental process.
> If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.
It can, but the time it takes to adapt grows exponentially with the number of samples already collected.
You can improve this by using a non-stationary Bayesian model (i.e. one that assumes conversion rates change over time) but this usually involves solving PDEs or something equally difficult.
> For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-armed bandit tests are harder because of these design decisions, despite the fact that they affect any experimental process.
The point the author (me) is trying to make is not that bandits are fundamentally flawed. The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.
For bandits, the fixes are not nearly as simple. It usually involves non-simple math, or at the very least non-intuitive things (for instance not actually running a bandit until 1 week has passed).
At VWO we realized that most of our customers are not sophisticated enough to get all this stuff right, which is why we didn't switch to bandits.
So what I'm proposing to do is run A/B with a 50/50 split for a full week, then when B wins shift to 0/100 in favor of B.
You seem to be proposing to run A/B with a 50/50 split for a full week, then when B does a lot better shift to 10/90 in favor of B and maybe a few weeks later shift to 1/99.
What practical benefit do you see to this approach? From my perspective this just slows down the experimental process and keeps losing variations (and associated code complexity) around for a lot longer.
First, Google Analytics (for example) runs content experiments for a minimum of two weeks regardless of results. It's hardly an unrealistic timeframe for reliable conclusions.
> What practical benefit do you see to this approach?
Statistically rigorous results, with minimal regret.
In your example, you reach the end of the week, and your 50/50 split has one-sided p=0.10, double the usual p<0.05 criterion (a sketch of this calculation follows the options below). What do you do?
(a) Call it in favor of B, despite being uncertain about the outcome.
(b) Keep running the test. This compromises the statistical rigor of your test.
(c) Keep running the test, but use sequential hypothesis testing, e.g. http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigo... This significantly increases the time to reach a conclusion, and costs you conversions in the meantime.
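For reference, the p=0.10 number in a scenario like this comes from a one-sided two-proportion test; here is a sketch with invented visitor and conversion counts:

```python
from math import erf, sqrt

# One-sided two-proportion z-test (normal approximation).
def one_sided_p(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(Z >= z) under the null hypothesis of equal conversion rates.
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

# Invented counts that land near the p = 0.10 case discussed above.
print(one_sided_p(100, 2000, 118, 2000))  # ~0.10
```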
The essential difference between the approaches is that a 50/50 split optimizes for the shortest time to a conclusion, while a multi-armed bandit optimizes for the fewest failures.
There are times when the former is more important, e.g. marketing wants to know how to brand a product that is being released next month. These are the clinical-like experiments that frequentist approaches were formulated for.
> Statistically rigorous results, with minimal regret.
The results are only statistically rigorous provided your bandit obeys relatively strong assumptions.
As another example, suppose you ran a 2-week test. Suppose that from week 1 to week 2, both conversion rates changed, but the delta between them remained roughly the same. A 50/50 A/B split doesn't mind this, and in fact still returns the right answer. Bandits do mind.
I don't do p-values. I do Bayesian testing, same as you. I just recognize that in the real world, weaker assumptions are more robust to experimenter or model error, both of which are generally the dominant error mode.
In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how clever your scheme.
This is simply not true. The Gittins Index beats Thompson sampling, subject again to the same strong assumptions.
Look, I know the theoretical advantages of bandits and I advocate their use under some limited circumstances. I just find that the stronger assumptions they require (or alternatively the much heavier math requirements) mean they aren't a great replacement for A/B tests, which are much simpler and easier to get right.
Thompson sampling does not need to assume stability. You can inject time features into the model if you want to model seasonality (or, more accurately, ignorance of seasonality) and you can also have a hidden random-walk variable.
Yes, if you assume stability and things vary, you will not have good results. That is true of any statistical method.
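One cheap approximation of such a hidden random-walk model, sketched here purely for illustration (the discount factor gamma is an invented tuning knob, not something from this thread), is to exponentially discount old evidence so the posterior can track a drifting rate:

```python
import random

# Thompson sampling with exponential forgetting: every update shrinks
# all counts toward the Beta(1, 1) prior, so stale evidence decays and
# the posterior can follow a slowly drifting conversion rate.
class DiscountedThompsonSampler:
    def __init__(self, n_arms, gamma=0.999):
        self.gamma = gamma  # closer to 1.0 means longer memory
        self.successes = [1.0] * n_arms
        self.failures = [1.0] * n_arms

    def choose_arm(self):
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def update(self, arm, converted):
        for i in range(len(self.successes)):
            self.successes[i] = 1.0 + self.gamma * (self.successes[i] - 1.0)
            self.failures[i] = 1.0 + self.gamma * (self.failures[i] - 1.0)
        if converted:
            self.successes[arm] += 1.0
        else:
            self.failures[arm] += 1.0
```

The full state-space treatment is harder, as noted above; discounting merely trades the heavier math for one extra hyperparameter to get wrong.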
However, with an A/B test you don't need to change the math or eliminate the stability assumption from it. You just need to choose a good test duration.
As I pointed out to pauldraper in the other thread, when you start fixing bandits by only changing the split every season, suddenly bandits start to look a lot like A/B testing.
Actually, Chris, I think you misunderstand my comment.
Thompson sampling (and Bayesian bandits in general) can be applied with a model for estimating conversion that is more complex than P(conversion|A). It can include parameters for time of day and day of week, and can even be non-parametric.
If you do this, the standard Thompson sampling framework with *no changes* whatsoever will still kill losers quickly (if the loss is big enough that seasonality cannot save them) and will also wait until it has enough seasonal data to make more refined decisions. This is very different from simply waiting for a season boundary to make decisions.
You do need more data to understand a more complex world, but having an optimal inference engine to help is better than having a severely sub-optimal engine.
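One very simple way to realize "the model includes day of week", sketched with no claim that this is what any vendor ships, is a separate Beta posterior per (arm, weekday) cell; a real system would share strength across cells with a hierarchical model:

```python
import random

# Thompson sampling with a per-(arm, weekday) conversion model.
class SeasonalThompsonSampler:
    def __init__(self, n_arms):
        # counts[arm][weekday] = [successes+1, failures+1], a Beta(1, 1) prior
        self.counts = [[[1, 1] for _ in range(7)] for _ in range(n_arms)]

    def choose_arm(self, weekday):
        samples = []
        for arm in range(len(self.counts)):
            s, f = self.counts[arm][weekday]
            samples.append(random.betavariate(s, f))
        return samples.index(max(samples))

    def update(self, arm, weekday, converted):
        self.counts[arm][weekday][0 if converted else 1] += 1
```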
I understand that. The blog post I linked to describes doing exactly that.
But the point I'm making is different. This is a lot of stuff to get right and most people aren't that sophisticated. Getting A/B tests right is a lot easier, mainly because they are significantly more robust to model error.
This ignores my experience, which is that marketing teams, composed of the people who dictate what A/B tests the business should run, have little to no background in statistics, let alone any interest whatsoever in actually performing legitimate A/B tests.
It's often the case that the decision maker has already decided to move ahead with option A, but performs a minimal "fake" A/B test to put in their report as a way to justify their choice. I've seen A/B tests deployed at 10am and taken down at 1pm with less than a dozen data points collected. The A/B test "owner" is happy to see that option A resulted in 7 conversions, with option B only having 5. Not statistically significant whatsoever, but hey, let's waste developers' time and energy for two days implementing an A/B test in order to help someone else try to nab their quarterly marketing bonus.
Join us, comrade, in the fight against the statistical blight!
Move your decision process to multi-armed bandit and you never have to decide when to end an A/B test -- math does it for you, in a provably optimal way.
I'm not sure this solves it, because you have to have a really strong sense of the loss function to pull it off. That's much easier to intuit and use to guide experiments than to actually build into the bandit algo.
> That's much easier to intuit and use to guide experiments than to actually build into the bandit algo.
IDK about your intuition, but for most other people, it gets in the way of statistics.
The "loss function" is just as easy to calculate for A/B tests as for multi-armed bandit. The value of user doing A is $X, the value of B is $Y, and the value of C is $Z.
This is yet another comment claiming that Thompson sampling is the answer to all of our statistical problems!
Naive Thompson sampling (like the code you linked to) will result in problems equally disastrous to those I wrote about in the Qubit whitepaper. Other comments have highlighted a key problem with simple bandit algorithms - reward distributions which change over time will render their results worthless. You can model these dynamics but not in 'a few lines of very simple code'. It is verging on the irresponsible to suggest otherwise.
I personally favour a Bayesian state-space model to elegantly take care of these things - but that's outside the remit of the whitepaper and outside the skill set of most non-statisticians. Frequentist testing, when done properly, is simple to implement and has statistical guarantees that are very attractive in practice.
I'm not a Bayesian fanatic, but given how perfectly A/B test optimization fits in the Bayesian approach, it's a shame it's not yet the de facto standard.
I think the primary reasons are (1) it's not as intuitive (especially for the uninitiated), (2) it's harder to implement an automated feedback mechanism, (3) FUD. E.g. https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm... lists devastating complications with correct multi-armed bandit tests, and then in fine print admits that traditional tests have all the same complications.
Is it really less intuitive? I would've said less familiar. Null-hypothesis significance testing is unintuitive: nonexperts seem to explain it wrongly more often than not. (Like "the p-value is the probability the result was random chance".) Probably both approaches are unintuitive to humans unless they're explained really well.
You're right. True understanding of null hypothesis and p-value isn't easier than Bayesian. Heck, just try to figure out if you need a one-sided or two-sided test.
But it's easier to try to black box it and say this number cruncher tells me if these two things are really different. Of course, if you don't understand what's going on, you start p-hacking.
Actually, telling you whether things are different isn't the goal.
The goal is to make practical business decisions with reasonably high likelihood of being right and, if wrong, having only limited impact.
The output of a well done Bayesian analysis can be very easy for business people to understand. Statements like "you have a 60% chance of making just the right decision and a 35% chance of making a decision that is within 3% of right" are easy for most business stakeholders to understand because that is close to how they frame their own decision making process.
Laboring to reject a null hypothesis is an unnatural act for most people.
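A sketch of how statements like those above can be produced (the counts here are invented): draw from each variant's Beta posterior and count how often B wins outright, and how often B is at least within 3% of A:

```python
import random

# Monte Carlo readout from Beta posteriors: P(B beats A) and
# P(B is within 3% relative of A, or better).
def decision_summary(a_succ, a_fail, b_succ, b_fail, n_draws=100_000):
    b_wins = b_close = 0
    for _ in range(n_draws):
        rate_a = random.betavariate(a_succ + 1, a_fail + 1)
        rate_b = random.betavariate(b_succ + 1, b_fail + 1)
        b_wins += rate_b > rate_a
        b_close += rate_b >= rate_a * 0.97
    return b_wins / n_draws, b_close / n_draws

print(decision_summary(100, 1900, 118, 1882))
```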
I'd guess the majority of p-hacking is not intentionally fraudulent, yeah. The first few intros to stats I tried to read had trouble sinking in -- it felt opaque and authoritarian somehow. Easy to treat it as a cookbook until I learned of the Bayesian approach.
Given that most people don't get A/B testing, it's a stretch for me to believe that someone who does would know about more complex approaches that require more skill.
I firmly believe that there's far more to be gained from more people understanding how to use A/B testing than from more complex solutions.
This is a great read, but it left me thinking I'm missing something in the fundamentals. Can you recommend a book (or other posts) on statistics fundamentals for programmers?
I agree with you (and love your blog, btw), but I think you're skipping over at least a few benefits you can get out of a mature / well built a/b framework that are hard to build into a bandit approach. The biggest one I've found personally useful is days-in analysis; for example, quantifying the impact of a signup-time experiment on one-week retention. This doesn't really apply to learning ranking functions or other transactional (short-feedback loop) optimization.
That being said, building a "proper" a/b harness is really hard and will be a constant source of bugs / FUD around decision-making (don't believe me? try running an a/a experiment and see how many false positives you get). I've personally built a dead-simple bandit system when starting greenfield and would recommend the same to anyone else.
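For what the "days-in" analysis mentioned above might look like, here is a sketch with a hypothetical record layout ("variant", "signup_at", "last_seen_at" are invented field names): group users by assigned variant and compare one-week retention:

```python
from datetime import timedelta

# Day-7 retention by variant. `users` is a list of dicts with the
# hypothetical keys "variant", "signup_at", "last_seen_at" (datetimes).
def day7_retention_by_variant(users):
    stats = {}
    for u in users:
        n, retained = stats.get(u["variant"], (0, 0))
        retained += u["last_seen_at"] >= u["signup_at"] + timedelta(days=7)
        stats[u["variant"]] = (n + 1, retained)
    return {v: r / n for v, (n, r) in stats.items()}
```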
Probably worth mentioning that the Google Content Experiments framework is in the process of being replaced with Google Optimize (currently in a private beta), which does NOT make use of multi-armed bandits, much to my confusion and disappointment.
Huh. So do you know if they do anything to help with repeat testing/peeking?
Optimizely takes an interesting approach: they apply repeat testing methods, segmenting the tests by user views of the results. Like 30x more complicated than multi-bandit, but they don't need a feedback mechanism.
Thank you Ted for bringing sanity to this conversation. Terrific point and post.
By the way, I doubt you remember me, but thank you for inviting me on a tour of Veoh ten years ago when I was a young college sophomore. I enjoyed the opportunity as well as our brief chat about Bayesianism.
See http://tdunning.blogspot.com/2012/02/bayesian-bandits.html for an essay and see https://github.com/tdunning/bandit-ranking for an example applied to ranking.