I want an A/B test framework that automatically optimizes the size of the groups to maximize revenue.
At first, it would pick, say, a 50/50 split. Then, as data rolls in showing that group A is more likely to convert, it would shift more users over to group A, keeping a few users on B to keep gathering data. Eventually, when enough data has come in, it might turn out that flow A doesn't work at all for users in France - so the ideal would be for most users in France to end up in group B, while the rest of the world is in group A.
I want the framework to do all this behind the scenes - preferably with statistical rigor - and then to tell me which groups have diminished to near zero (so I can remove the associated code).
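To make it concrete, here's a rough sketch of the kind of allocator I'm picturing: Beta-Bernoulli Thompson sampling with a separate posterior per (country, variant), so France can settle on B while everyone else settles on A. All the names here are placeholders, not any real library:

    import random
    from collections import defaultdict

    class ThompsonAllocator:
        def __init__(self, variants):
            self.variants = list(variants)
            # Beta(1, 1) prior per (segment, variant),
            # stored as [successes + 1, failures + 1].
            self.counts = defaultdict(lambda: [1, 1])

        def assign(self, segment):
            # Sample a plausible conversion rate for each variant
            # and show this user the best draw.
            draws = {v: random.betavariate(*self.counts[(segment, v)])
                     for v in self.variants}
            return max(draws, key=draws.get)

        def record(self, segment, variant, converted):
            # Update the posterior for the variant this user actually saw.
            self.counts[(segment, variant)][0 if converted else 1] += 1

        def near_zero_variants(self, segment, n=10_000, threshold=0.01):
            # Variants that almost never win the sampling step are
            # candidates for deleting the associated code path.
            wins = defaultdict(int)
            for _ in range(n):
                wins[self.assign(segment)] += 1
            return [v for v in self.variants if wins[v] / n < threshold]

Assignment would look like variant = allocator.assign(user_country), with allocator.record(user_country, variant, converted) called once the conversion outcome is known; the near-zero check is what would tell me a code path is safe to delete.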
Sadly, there is an issue with the Novelty Effect[1]. If you push traffic to the current winner, the test probably won't validate that it's the actual winner. So you may trade more conversions now for higher churn than you can tolerate later.
For example, you run two campaigns:
1. Get my widgets, one year only 19.99!
2. Get my widgets, first year only 19.99!
The first one may win, but those customers all cancel in the second year because they thought the price applied for one year only. They all leave reviews complaining that you scammed them.
So, I would venture that this idea is a bad one, but sounds good on paper.
PS: A/B tests don't just provide evidence that one solution might be better than another; they also provide some protection, in that a number of participants still get the status quo.
> So, I would venture that this idea is a bad one, but sounds good on paper.
It's a great idea; it's just vulnerable to non-stationary effects (the novelty effect, seasonality, etc.). But it's actually no worse than fixed-horizon testing for your example if you run the test for less than a year: you A/B test that copy for a month, push everyone to A, and you still won't realize it's actually worse.
Yeah. If churn is part of the experiment, then even after you stop the A/B test of the treatment, you may have to wait at least a year before you have the final results.
As others have mentioned, you're referring to Thompson sampling and plenty of testing providers offer this (and if you have any DS on staff, they'll be more than happy to implement it).
My experience is that there's a good reason why this hasn't taken off: the returns for this degree of optimization are far lower than you think.
I once worked with a very eager but junior DS who thought that we should build out a massive internal framework for doing this. He didn't quite understand the math behind it, so I built him a demo to cover the basics. What we realized from running the demo under various conditions was that the total return on this extra optimization machinery was negligible at the scale we were operating at, and it required much more complexity than our existing setup.
This pattern repeats across a lot of DS-related optimization in my experience. The difference between a close guess and perfectly optimal is often surprisingly small. Many DS teams perform optimizations on business processes that yield a smaller improvement in revenue than the salary of the DS who built them.
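A demo along these lines is easy to reproduce: simulate a stream of users under assumed "true" conversion rates and compare total conversions from a fixed 50/50 split against Thompson sampling. The rates and traffic volume below are made up purely for illustration:

    import random

    TRUE_RATES = {"A": 0.050, "B": 0.045}   # hypothetical true conversion rates
    N_USERS = 100_000

    def fixed_split():
        conversions = 0
        for _ in range(N_USERS):
            variant = random.choice(["A", "B"])
            conversions += random.random() < TRUE_RATES[variant]
        return conversions

    def thompson():
        counts = {v: [1, 1] for v in TRUE_RATES}   # Beta(1, 1) priors
        conversions = 0
        for _ in range(N_USERS):
            # Show each user the variant with the best sampled rate.
            variant = max(TRUE_RATES, key=lambda v: random.betavariate(*counts[v]))
            converted = random.random() < TRUE_RATES[variant]
            counts[variant][0 if converted else 1] += 1
            conversions += converted
        return conversions

    print("fixed 50/50      :", fixed_split())
    print("thompson sampling:", thompson())

The interesting number is the gap in total conversions between the two runs, compared against what building and operating the adaptive system would cost.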
Small nit: it’s a bad idea if NPV of future returns is less than the cost. If someone making $100k/yr can produce one $50k/yr optimization that NPV’s out to $260k, it’s worth it. I suspect you meant that, just a battle I have at work a lot with people who only look at single-year returns.
Besides complexity, a price you pay with multi-armed bandits is that you learn less about the non-optimal options (because as your confidence grows that an option is not the best, you run fewer samples through it). It turns out the people running these experiments are often not satisfied to learn "A is better than B." They want to know "A is 7% better than B," but a MAB system will only run enough B samples to make the first statement.
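To see why, look at the standard error of the estimated lift between two independent proportions: it's dominated by whichever arm has fewer samples. A quick back-of-the-envelope, with rates and counts made up for illustration:

    from math import sqrt

    def lift_standard_error(p_a, n_a, p_b, n_b):
        # Standard error of the difference of two independent binomial proportions.
        return sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

    # Equal split: 50k users per arm.
    print(lift_standard_error(0.050, 50_000, 0.045, 50_000))
    # Bandit-style allocation: 98k users on A, only 2k left on B.
    print(lift_standard_error(0.050, 98_000, 0.045, 2_000))

With the losing arm starved of traffic, the interval around "A is X% better than B" stays wide no matter how much total traffic you run.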
I'd expect to be running tens of experiments at any one time. Some of those experiments might be variations in wording or color schemes; others might be entirely different signup flows.
I'd let the experiment framework decide (i.e., optimize) who gets shown what.
Over time, the maintenance burden of tens of experiments (with every user potentially in any combination of them) would exceed the benefits, so I'd want to end some experiments, keeping just whichever variant performs best. And I'd be creating new experiments with new ideas.
There might be a particular situation where B is more effective than A and therefore should be kept, if only for that specific situation.
There might also be a cutoff point where maintaining B costs more than it's worth, but that's a parameter you will have to determine for each test.