This is yet another article that ignores the fact that there is a MUCH better approach to this problem.
Thompson sampling avoids the problems of multiple testing, power, early stopping and so on by starting with a proper Bayesian approach. The idea is that the question we want to answer is closer to "Which alternative is nearly as good as the best with pretty high probability?". This is very different from the question being answered by a classical test of significance. Moreover, it would be good if we could answer the question partially by decreasing the number of times we sample options that are clearly worse than the best. What we want to solve is the multi-armed bandit problem, not the retrospective analysis of experimental results problem.
The really good news is that Thompson sampling is both much simpler than hypothesis testing and can be done in far more complex situations. It is known to be an asymptotically optimal solution to the multi-armed bandit problem and often takes only a few lines of very simple code to implement.
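For concreteness, here is a minimal sketch of Beta-Bernoulli Thompson sampling in Python; the class and parameter names are illustrative, not taken from any particular library:

```python
import random

# Minimal Beta-Bernoulli Thompson sampling sketch. Each arm keeps a Beta
# posterior over its conversion rate; we sample one plausible rate per
# arm and play the arm whose sampled rate is highest.
class ThompsonSampler:
    def __init__(self, n_arms):
        self.successes = [1] * n_arms  # Beta(1, 1) uniform prior
        self.failures = [1] * n_arms

    def choose_arm(self):
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def update(self, arm, converted):
        if converted:
            self.successes[arm] += 1
        else:
            self.failures[arm] += 1
```

Arms that keep losing get sampled less and less often, which is exactly the "decreasing the number of times we sample options that are clearly worse" behaviour described above.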
Thompson sampling is a great tool. I've used it to make reasonably large amounts of money. But it does not solve the same problem as A/B testing.
Thompson Sampling (at least the standard approach) assumes that conversion rates do not change. In reality they vary significantly over a week, and this fundamentally breaks bandit algorithms.
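To make the objection concrete, here is a toy simulation sketch (all rates and numbers invented) in which the best arm flips between weekdays and weekends; any sampler exposing the choose_arm()/update() interface of the sketch above can be dropped in to see how it copes with the flip:

```python
import random

# Toy non-stationary environment: arm 1 converts better on weekdays,
# arm 0 converts better on weekends. All rates are made up.
TRUE_RATES = {
    "weekday": [0.04, 0.06],
    "weekend": [0.07, 0.05],
}

def simulate(sampler, days=28, visitors_per_day=500):
    """Run any sampler with choose_arm() and update(arm, converted)."""
    conversions = 0
    for day in range(days):
        regime = "weekend" if day % 7 >= 5 else "weekday"
        for _ in range(visitors_per_day):
            arm = sampler.choose_arm()
            converted = random.random() < TRUE_RATES[regime][arm]
            sampler.update(arm, converted)
            conversions += converted
    return conversions
```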
Furthermore, you do not need to use Thompson Sampling to have a proper Bayesian approach. At VWO we also use a proper Bayesian approach, but we use A/B testing to avoid the various pitfalls that Thompson Sampling has. Google Optimize uses an approach very similar to ours (although it may be flawed [1]), and so does A/B Tasty (probably not flawed).
Note: I'm the Director of Data Science at VWO. Obviously I'm biased, etc. However my post critiquing bandits was published before I took on this role. It was a followup to a previous post of mine which led people to accidentally misuse bandits: https://www.chrisstucchio.com/blog/2012/bandit_algorithms_vs...
> Depending on what your website is selling, people will have a different propensity to purchase on Saturday than they have on Tuesday.
This affects multi-armed bandits and fixed tests alike. If you run a fixed A/B test on Tuesday, your results will also be wrong. Either way, you have to decide what kind of seasonality your data has, and avoid making adjustments until the period is complete.
If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.
> Delayed response is a big problem when A/B testing the response to an email campaign.
This affects multi-armed bandits and fixed tests alike. If you include immature data in your significance test, your results will be wrong. Either way, you have to decide how long it takes to declare an individual success or failure.
> You don't get samples for free by counting visits instead of users
This affects multi-armed bandits and fixed tests alike. Focusing on relevant data increases the power of your experiment.
---
For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-armed bandit tests are harder because of these design decisions, despite the fact that they affect any experimental process.
> If anything, multi-armed bandit shines because it can adapt to trends you don't anticipate.
It can, but the time it takes to adapt grows exponentially with the number of samples already collected.
You can improve this by using a non-stationary Bayesian model (i.e. one that assumes conversion rates change over time) but this usually involves solving PDEs or something equally difficult.
> For every single problem, the author admits "A/B tests have the same problem", and then somehow concludes that multi-armed bandit tests are harder because of these design decisions, despite the fact that they affect any experimental process.
The point the author (me) is trying to make is not that bandits are fundamentally flawed. The point is that for A/B tests, all these problems have simple fixes: make sure to run the A/B test for long enough.
For bandits, the fixes are not nearly as simple. It usually involves non-simple math, or at the very least non-intuitive things (for instance not actually running a bandit until 1 week has passed).
At VWO we realized that most of our customers are not sophisticated enough to get all this stuff right, which is why we didn't switch to bandits.
So what I'm proposing to do is run A/B with a 50/50 split for a full week, then when B wins shift to 0/100 in favor of B.
You seem to be proposing to run A/B with a 50/50 split for a full week, then when B does a lot better shift to 10/90 in favor of B and maybe a few weeks later shift to 1/99.
What practical benefit do you see to this approach? From my perspective this just slows down the experimental process and keeps losing variations (and associated code complexity) around for a lot longer.
First, Google Analytics (for example) runs content experiments for a minimum of two weeks regardless of results. It's hardly an unrealistic timeframe for reliable conclusions.
> What practical benefit do you see to this approach?
Statistically rigorous results, with minimal regret.
In your example, you reach the end of the week, and your 50/50 split has one-sided p=0.10, double the usual p<0.05 criterion (a sketch of this calculation follows the options below). What do you do?
(a) Call it in favor of B, despite being uncertain about the outcome.
(b) Keep running the test. This compromises the statistical rigor of your test.
(c) Keep running the test, but use sequential hypothesis testing, e.g. http://elem.com/~btilly/ab-testing-multiple-looks/part1-rigo... This significantly increases the time to reach a conclusion, and costs you conversions in the meantime.
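For reference, the p=0.10 number in a scenario like this comes from a one-sided two-proportion test; here is a sketch with invented visitor and conversion counts:

```python
from math import erf, sqrt

# One-sided two-proportion z-test (normal approximation).
def one_sided_p(conv_a, n_a, conv_b, n_b):
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # P(Z >= z) under the null hypothesis of equal conversion rates.
    return 1 - 0.5 * (1 + erf(z / sqrt(2)))

# Invented counts that land near the p = 0.10 case discussed above.
print(one_sided_p(100, 2000, 118, 2000))  # ~0.10
```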
The essential difference between the approaches is that a 50/50 split optimizes for the shortest time to a conclusion, while a multi-armed bandit optimizes for the fewest failures.
There are times when the former is more important, e.g. marketing wants to know how to brand a product that is being released next month. These are the clinical-like experiments that frequentist approaches were formulated for.
> Statistically rigorous results, with minimal regret.
The results are only statistically rigorous provided your bandit obeys relatively strong assumptions.
As another example, suppose you ran a 2-week test. Suppose that from week 1 to week 2, both conversion rates changed, but the delta between them remained roughly the same. A 50/50 A/B split doesn't mind this, and in fact still returns the right answer. Bandits do mind.
I don't do p-values. I do Bayesian testing, same as you. I just recognize that in the real world, weaker assumptions are more robust to experimenter or model error, both of which are generally the dominant error mode.
In web A/B testing, the latter is usually the most applicable, and for that, you cannot beat Thompson sampling on the average, no matter how clever your scheme.
This is simply not true. The Gittins Index beats Thompson sampling, subject again to the same strong assumptions.
Look, I know the theoretical advantages of bandits and I advocate their use under some limited circumstances. I just find that the stronger assumptions they require (or alternatively the much heavier math requirements) mean they aren't a great replacement for A/B tests, which are much simpler and easier to get right.
Thompson sampling does not need to assume stability. You can inject time features into the model if you want to model seasonality (or, more accurately, ignorance of seasonality) and you can also have a hidden random-walk variable.
Yes, if you assume stability and things vary, you will not have good results. That is true of any statistical method.
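One cheap approximation of such a hidden random-walk model, sketched here purely for illustration (the discount factor gamma is an invented tuning knob, not something from this thread), is to exponentially discount old evidence so the posterior can track a drifting rate:

```python
import random

# Thompson sampling with exponential forgetting: every update shrinks
# all counts toward the Beta(1, 1) prior, so stale evidence decays and
# the posterior can follow a slowly drifting conversion rate.
class DiscountedThompsonSampler:
    def __init__(self, n_arms, gamma=0.999):
        self.gamma = gamma  # closer to 1.0 means longer memory
        self.successes = [1.0] * n_arms
        self.failures = [1.0] * n_arms

    def choose_arm(self):
        samples = [random.betavariate(s, f)
                   for s, f in zip(self.successes, self.failures)]
        return samples.index(max(samples))

    def update(self, arm, converted):
        for i in range(len(self.successes)):
            self.successes[i] = 1.0 + self.gamma * (self.successes[i] - 1.0)
            self.failures[i] = 1.0 + self.gamma * (self.failures[i] - 1.0)
        if converted:
            self.successes[arm] += 1.0
        else:
            self.failures[arm] += 1.0
```

The full state-space treatment is harder, as noted above; discounting merely trades the heavier math for one extra hyperparameter to get wrong.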
However, with an A/B test you don't need to change the math or eliminate the stability assumption from it. You just need to choose a good test duration.
As I pointed out to pauldraper in the other thread, when you start fixing bandits by only changing the split every season, suddenly bandits start to look a lot like A/B testing.
Actually, Chris, I think you misunderstand my comment.
Thompson sampling (and Bayesian bandits in general) can be applied with a model for estimating conversion that is more complex than P(conversion|A). It can include parameters for time of day and day of week, and can even be non-parametric.
If you do this, the standard Thompson sampling framework with *no changes* whatsoever will still kill losers quickly (if the loss is big enough that seasonality cannot save them) and will also wait until it has enough seasonal data to make more refined decisions. This is very different from simply waiting for a season boundary to make decisions.
You do need more data to understand a more complex world, but having an optimal inference engine to help is better than having a severely sub-optimal engine.
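One very simple way to realize "the model includes day of week", sketched with no claim that this is what any vendor ships, is a separate Beta posterior per (arm, weekday) cell; a real system would share strength across cells with a hierarchical model:

```python
import random

# Thompson sampling with a per-(arm, weekday) conversion model.
class SeasonalThompsonSampler:
    def __init__(self, n_arms):
        # counts[arm][weekday] = [successes+1, failures+1], a Beta(1, 1) prior
        self.counts = [[[1, 1] for _ in range(7)] for _ in range(n_arms)]

    def choose_arm(self, weekday):
        samples = []
        for arm in range(len(self.counts)):
            s, f = self.counts[arm][weekday]
            samples.append(random.betavariate(s, f))
        return samples.index(max(samples))

    def update(self, arm, weekday, converted):
        self.counts[arm][weekday][0 if converted else 1] += 1
```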
I understand that. The blog post I linked to describes doing exactly that.
But the point I'm making is different. This is a lot of stuff to get right and most people aren't that sophisticated. Getting A/B tests right is a lot easier, mainly because they are significantly more robust to model error.
This ignores my experience, which is that marketing teams, composed of the people who dictate what A/B tests the business should run, have little to no background in statistics, let alone any interest whatsoever in actually performing legitimate A/B tests.
It's often the case that the decision maker has already decided to move ahead with option A, but performs a minimal "fake" A/B test to put in their report as a way to justify their choice. I've seen A/B tests deployed at 10am and taken down at 1pm with less than a dozen data points collected. The A/B test "owner" is happy to see that option A resulted in 7 conversions, with option B only having 5. Not statistically significant whatsoever, but hey, let's waste developers' time and energy for two days implementing an A/B test in order to help someone else try to nab their quarterly marketing bonus.
Join us, comrade, in the fight against the statistical blight!
Move your decision process to multi-armed bandit and you never have to decide when to end an A/B test -- math does it for you, in a provably optimal way.
I'm not sure this solves it, because you have to have a really strong sense of the loss function to pull it off. That's much easier to intuit and use to guide experiments than to actually build into the bandit algo.
> That's much easier to intuit and use to guide experiments than to actually build into the bandit algo.
IDK about your intuition, but for most other people, it gets in the way of statistics.
The "loss function" is just as easy to calculate for A/B tests as for multi-armed bandit. The value of user doing A is $X, the value of B is $Y, and the value of C is $Z.
This is yet another comment claiming that Thompson sampling is the answer to all of our statistical problems!
Naive Thompson sampling (like the code you linked to) will result in problems equally disastrous to those I wrote about in the Qubit whitepaper. Other comments have highlighted a key problem with simple bandit algorithms - reward distributions which change over time will render their results worthless. You can model these dynamics but not in 'a few lines of very simple code'. It is verging on the irresponsible to suggest otherwise.
I personally favour a Bayesian state-space model to elegantly take care of these things - but that's outside the remit of the whitepaper and outside the skill set of most non-statisticians. Frequentist testing, when done properly, is simple to implement and has statistical guarantees that are very attractive in practice.
I'm not a Bayesian fanatic, but given how perfectly A/B test optimization fits in the Bayesian approach, it's a shame it's not yet the de facto standard.
I think the primary reasons are (1) it's not as intuitive (especially for the uninitiated), (2) it's harder to implement an automated feedback mechanism, (3) FUD. E.g. https://www.chrisstucchio.com/blog/2015/dont_use_bandits.htm... lists devastating complications with correct multi-armed bandit tests, and then in fine print admits that traditional tests have all the same complications.
Is it really less intuitive? I would've said less familiar. Null-hypothesis significance testing is unintuitive: nonexperts seem to explain it wrongly more often than not. (Like "the p-value is the probability the result was random chance".) Probably both approaches are unintuitive to humans unless they're explained really well.
You're right. True understanding of null hypothesis and p-value isn't easier than Bayesian. Heck, just try to figure out if you need a one-sided or two-sided test.
But it's easier to try to black box it and say this number cruncher tells me if these two things are really different. Of course, if you don't understand what's going on, you start p-hacking.
Actually, telling you whether things are different isn't the goal.
The goal is to make practical business decisions with reasonably high likelihood of being right and, if wrong, having only limited impact.
The output of a well done Bayesian analysis can be very easy for business people to understand. Statements like "you have a 60% chance of making just the right decision and a 35% chance of making a decision that is within 3% of right" are easy for most business stakeholders to understand because that is close to how they frame their own decision making process.
Laboring to reject a null hypothesis is an unnatural act for most people.
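A sketch of how statements like those above can be produced (the counts here are invented): draw from each variant's Beta posterior and count how often B wins outright, and how often B is at least within 3% of A:

```python
import random

# Monte Carlo readout from Beta posteriors: P(B beats A) and
# P(B is within 3% relative of A, or better).
def decision_summary(a_succ, a_fail, b_succ, b_fail, n_draws=100_000):
    b_wins = b_close = 0
    for _ in range(n_draws):
        rate_a = random.betavariate(a_succ + 1, a_fail + 1)
        rate_b = random.betavariate(b_succ + 1, b_fail + 1)
        b_wins += rate_b > rate_a
        b_close += rate_b >= rate_a * 0.97
    return b_wins / n_draws, b_close / n_draws

print(decision_summary(100, 1900, 118, 1882))
```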
I'd guess the majority of p-hacking is not intentionally fraudulent, yeah. The first few intros to stats I tried to read had trouble sinking in -- it felt opaque and authoritarian somehow. Easy to treat it as a cookbook until I learned of the Bayesian approach.
Given that most people don't get A/B testing, it's a stretch for me to believe that someone who does would know about more complex approaches that require more skill.
I firmly believe that there's far more to be gained from more people understanding how to use A/B testing than from more complex solutions.
This is a great read, but it left me thinking I'm missing something in the fundamentals. Can you recommend a book (or other posts) on statistics fundamentals for programmers?
I agree with you (and love your blog, btw), but I think you're skipping over at least a few benefits you can get out of a mature / well built a/b framework that are hard to build into a bandit approach. The biggest one I've found personally useful is days-in analysis; for example, quantifying the impact of a signup-time experiment on one-week retention. This doesn't really apply to learning ranking functions or other transactional (short-feedback loop) optimization.
That being said, building a "proper" a/b harness is really hard and will be a constant source of bugs / FUD around decision-making (don't believe me? try running an a/a experiment and see how many false positives you get). I've personally built a dead-simple bandit system when starting greenfield and would recommend the same to anyone else.
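For what the "days-in" analysis mentioned above might look like, here is a sketch with a hypothetical record layout ("variant", "signup_at", "last_seen_at" are invented field names): group users by assigned variant and compare one-week retention:

```python
from datetime import timedelta

# Day-7 retention by variant. `users` is a list of dicts with the
# hypothetical keys "variant", "signup_at", "last_seen_at" (datetimes).
def day7_retention_by_variant(users):
    stats = {}
    for u in users:
        n, retained = stats.get(u["variant"], (0, 0))
        retained += u["last_seen_at"] >= u["signup_at"] + timedelta(days=7)
        stats[u["variant"]] = (n + 1, retained)
    return {v: r / n for v, (n, r) in stats.items()}
```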
Probably worth mentioning that the Google Content Experiments framework is in the process of being replaced with Google Optimize (currently in a private beta), which does NOT make use of multi-armed bandits, much to my confusion and disappointment.
Huh. So do you know if they do anything to help with repeat testing/peeking?
Optimizely takes an interesting approach: they apply repeat testing methods, segmenting the tests by user views of the results. Like 30x more complicated than multi-bandit, but they don't need a feedback mechanism.
Thank you Ted for bringing sanity to this conversation. Terrific point and post.
By the way, I doubt you remember me, but thank you for inviting me on a tour of Veoh ten years ago when I was a young college sophomore. I enjoyed the opportunity as well as our brief chat about Bayesianism.
See http://tdunning.blogspot.com/2012/02/bayesian-bandits.html for an essay and see https://github.com/tdunning/bandit-ranking for an example applied to ranking.