I want an A/B test framework that automatically optimizes the size of the groups to maximize revenue.
At first, it would pick, say, a 50/50 split. Then, as data rolls in showing that group A is more likely to convert, it would shift more users over to group A, keeping a few users on B to keep gathering data. Eventually, when enough data has come in, it might turn out that flow A doesn't work at all for users in France - so the ideal would be for most users in France to end up in group B, while the rest of the world is in group A.
I want the framework to do all this behind the scenes - preferably with statistical rigor - and then to tell me which groups have diminished to near zero (so I can remove the associated code).
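To make it concrete, here's a rough sketch of the kind of allocator I'm picturing: Beta-Bernoulli Thompson sampling with a separate posterior per (country, variant), so France can settle on B while everyone else settles on A. All the names here are placeholders, not any real library:

    import random
    from collections import defaultdict

    class ThompsonAllocator:
        def __init__(self, variants):
            self.variants = list(variants)
            # Beta(1, 1) prior per (segment, variant),
            # stored as [successes + 1, failures + 1].
            self.counts = defaultdict(lambda: [1, 1])

        def assign(self, segment):
            # Sample a plausible conversion rate for each variant
            # and show this user the best draw.
            draws = {v: random.betavariate(*self.counts[(segment, v)])
                     for v in self.variants}
            return max(draws, key=draws.get)

        def record(self, segment, variant, converted):
            # Update the posterior for the variant this user actually saw.
            self.counts[(segment, variant)][0 if converted else 1] += 1

        def near_zero_variants(self, segment, n=10_000, threshold=0.01):
            # Variants that almost never win the sampling step are
            # candidates for deleting the associated code path.
            wins = defaultdict(int)
            for _ in range(n):
                wins[self.assign(segment)] += 1
            return [v for v in self.variants if wins[v] / n < threshold]

Assignment would look like variant = allocator.assign(user_country), with allocator.record(user_country, variant, converted) called once the conversion outcome is known; the near-zero check is what would tell me a code path is safe to delete.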
Sadly, there is an issue with the Novelty Effect[1]. If you push traffic to the current winner, the test probably won't validate that it's the actual winner. So you may trade more conversions now for higher churn than you can tolerate later.
For example, you run two campaigns:
1. Get my widgets, one year only 19.99!
2. Get my widgets, first year only 19.99!
The first one may win, but those customers all cancel in the second year because they thought the price applied for one year only. They all leave reviews complaining that you scammed them.
So, I would venture that this idea is a bad one, but sounds good on paper.
PS: A/B tests don't just provide evidence that one solution might be better than another; they also provide some protection, in that a number of participants still get the status quo.
> So, I would venture that this idea is a bad one, but sounds good on paper.
It's a great idea; it's just vulnerable to non-stationary effects (the novelty effect, seasonality, etc.). But it's actually no worse than fixed-horizon testing for your example if you run the test for less than a year: you A/B test that copy for a month, push everyone to A, and you still won't realize it's actually worse.
Yeah. If churn is part of the experiment, then even after you stop the A/B test of the treatment, you may have to wait at least a year before you have the final results.
As others have mentioned, you're referring to Thompson sampling and plenty of testing providers offer this (and if you have any DS on staff, they'll be more than happy to implement it).
My experience is that there's a good reason why this hasn't taken off: the returns for this degree of optimization are far lower than you think.
I once worked with a very eager but junior DS who thought that we should build out a massive internal framework for doing this. He didn't quite understand the math behind it, so I built him a demo to cover the basics. What we realized from running the demo under various conditions was that the total return on this extra optimization machinery was negligible at the scale we were operating at, and it required much more complexity than our existing setup.
This pattern repeats across a lot of DS-related optimization in my experience. The difference between a close guess and perfectly optimal is often surprisingly small. Many DS teams perform optimizations on business processes that yield a smaller improvement in revenue than the salary of the DS who built them.
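A demo along these lines is easy to reproduce: simulate a stream of users under assumed "true" conversion rates and compare total conversions from a fixed 50/50 split against Thompson sampling. The rates and traffic volume below are made up purely for illustration:

    import random

    TRUE_RATES = {"A": 0.050, "B": 0.045}   # hypothetical true conversion rates
    N_USERS = 100_000

    def fixed_split():
        conversions = 0
        for _ in range(N_USERS):
            variant = random.choice(["A", "B"])
            conversions += random.random() < TRUE_RATES[variant]
        return conversions

    def thompson():
        counts = {v: [1, 1] for v in TRUE_RATES}   # Beta(1, 1) priors
        conversions = 0
        for _ in range(N_USERS):
            # Show each user the variant with the best sampled rate.
            variant = max(TRUE_RATES, key=lambda v: random.betavariate(*counts[v]))
            converted = random.random() < TRUE_RATES[variant]
            counts[variant][0 if converted else 1] += 1
            conversions += converted
        return conversions

    print("fixed 50/50      :", fixed_split())
    print("thompson sampling:", thompson())

The interesting number is the gap in total conversions between the two runs, compared against what building and operating the adaptive system would cost.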
Small nit: it’s a bad idea if NPV of future returns is less than the cost. If someone making $100k/yr can produce one $50k/yr optimization that NPV’s out to $260k, it’s worth it. I suspect you meant that, just a battle I have at work a lot with people who only look at single-year returns.
Besides complexity, a price you pay with multi-armed bandits is that you learn less about the non-optimal options (because as your confidence grows that an option is not the best, you run fewer samples through it). It turns out the people running these experiments are often not satisfied to learn "A is better than B." They want to know "A is 7% better than B," but a MAB system will only run enough B samples to make the first statement.
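To see why, look at the standard error of the estimated lift between two independent proportions: it's dominated by whichever arm has fewer samples. A quick back-of-the-envelope, with rates and counts made up for illustration:

    from math import sqrt

    def lift_standard_error(p_a, n_a, p_b, n_b):
        # Standard error of the difference of two independent binomial proportions.
        return sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)

    # Equal split: 50k users per arm.
    print(lift_standard_error(0.050, 50_000, 0.045, 50_000))
    # Bandit-style allocation: 98k users on A, only 2k left on B.
    print(lift_standard_error(0.050, 98_000, 0.045, 2_000))

With the losing arm starved of traffic, the interval around "A is X% better than B" stays wide no matter how much total traffic you run.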
I'd expect to be running tens of experiments at any one time. Some of those experiments might be variations in wording or color schemes; others might be entirely different signup flows.
I'd let the experiment framework decide (i.e., optimize) who gets shown what.
Over time, the maintenance burden of tens of experiments (with every user potentially in any combination of them) would exceed the benefits, so I'd want to end some experiments, keeping just whichever variant performs best. And I'd be creating new experiments with new ideas.
There might be a particular situation where B is more effective than A and therefore should be kept, if only for that specific situation.
There might also be a cutoff point where maintaining B costs more than it's worth, but that's a parameter you will have to determine for each test.