You can probably come up with a proper formula, but you do actually need to put some effort into it to get it right.
Of course there's the generic Bonferroni correction, where you just divide the significance threshold by the number of tests. But you'll end up with a lot more false negatives (where you just miss the result entirely), and if you want to keep running the test for arbitrary lengths of time it gets even harder, because you need to keep lowering the threshold. Then again, it does give strong guarantees and might work well enough if you don't check the results often.
Basically, to get the best results you'll need to balance the false negative rate against the false positive rate and the expected length of the test, which is a rather complicated trade-off. But I expect someone will have done at least some of the calculations for simple A/B testing.
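For a rough feel of the trade-off, here's a small simulation sketch (mine, in Python with numpy/scipy; the conversion rate and look schedule are made up): on an A/A test with no real difference, peeking after every batch at a flat 0.05 threshold fires far more often than 5% of the time, while the Bonferroni-style threshold of 0.05 divided by the number of looks keeps it in check, at the cost of power on real effects.

    import numpy as np
    from scipy import stats

    rng = np.random.default_rng(0)
    n_sims, n_looks, batch = 2000, 10, 200
    alpha = 0.05

    def false_positive_rate(threshold):
        hits = 0
        for _ in range(n_sims):
            a = rng.binomial(1, 0.10, n_looks * batch)  # arm A conversions (10% rate)
            b = rng.binomial(1, 0.10, n_looks * batch)  # arm B conversions (same rate)
            for k in range(1, n_looks + 1):
                n = k * batch
                # two-proportion z-test at this interim look
                p_pool = (a[:n].sum() + b[:n].sum()) / (2 * n)
                se = np.sqrt(2 * p_pool * (1 - p_pool) / n)
                if se == 0:
                    continue
                z = (a[:n].mean() - b[:n].mean()) / se
                if 2 * (1 - stats.norm.cdf(abs(z))) < threshold:
                    hits += 1
                    break  # peek, "significant", stop
        return hits / n_sims

    print("peek at alpha:          ", false_positive_rate(alpha))            # well above 0.05
    print("peek at alpha / n_looks:", false_positive_rate(alpha / n_looks))  # at or below 0.05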
It is a problem. You are "spending" some of your error "budget" whenever you peek prior to an endpoint. This is why, if you're going to do interim analyses in clinical trials, you have to include them in the design.
People got tired of trialists playing games with statistics and patients dying or ineffective drugs making it to market. Now if you want to run a trial as evidence for approval, you need to specify an endpoint, how it will be tested, when, with what alpha (false positive) threshold, and what minimum effect size you need to detect.
If you are doing interim analyses, futility stopping, or non-inferiority testing, you have to write that into the design, too.
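To make "include them in the design" concrete, here's a small illustration (my sketch, not from the parent; the look schedule is hypothetical) of an O'Brien-Fleming-type alpha spending function in the Lan-DeMets form, one standard way a trial design budgets its 0.05 across interim analyses: almost nothing is spent early, and most of it is saved for the final analysis.

    from scipy.stats import norm

    alpha = 0.05
    looks = [0.25, 0.50, 0.75, 1.00]  # information fractions at each planned analysis

    z = norm.ppf(1 - alpha / 2)
    spent_so_far = 0.0
    for t in looks:
        cumulative = 2 * (1 - norm.cdf(z / t ** 0.5))  # cumulative alpha spent by time t
        print(f"look at t={t:.2f}: spend {cumulative - spent_so_far:.4f} "
              f"(cumulative {cumulative:.4f})")
        spent_so_far = cumulative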
People can jerk around with subgroup analyses in publications but the FDA won't accept that sort of horse shit for actual approval. And thank heavens for that.
He ran some simulated A/B tests using my Bayesian A/B testing technique (which is now powering VWO). He showed that while peeking does keep the expected loss below the specified threshold, if you don't peek the loss will be even lower.
So although peeking is still valid, that validity does come at a cost.
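For anyone curious, the core of the expected-loss approach looks roughly like this (a minimal sketch of the idea as I understand it, not necessarily the exact VWO implementation; the priors, numbers, and threshold are placeholders): keep Beta posteriors on each arm's conversion rate and stop when the expected loss of shipping the current leader drops below a threshold you chose in advance.

    import numpy as np

    rng = np.random.default_rng(1)

    def expected_loss_of_b(conv_a, n_a, conv_b, n_b, draws=100_000):
        # Beta(1, 1) priors updated with conversions / non-conversions
        pa = rng.beta(1 + conv_a, 1 + n_a - conv_a, draws)
        pb = rng.beta(1 + conv_b, 1 + n_b - conv_b, draws)
        # expected conversion rate you give up by shipping B if A is actually better
        return np.mean(np.maximum(pa - pb, 0))

    # e.g. ship B once the expected loss falls below 0.001 (the threshold is a design choice)
    print(expected_loss_of_b(conv_a=120, n_a=1000, conv_b=140, n_b=1000))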
In a world with rational actors and free computation, there shouldn't ever be a penalty for having more information about reality. Therefore, the only reason not to peek is that actors are irrational and/or computation is expensive.
Honestly, if the first 100 patients in a 1,000-patient clinical trial die, I have zero qualms about making the judgment to stop early, even if it wasn't written into the design. I'm not going to kill 900 people by religiously following bad statistical principles.
I think we should be open-minded and understand that sometimes peeking is ok and sometimes it isn't.
When the effect is large, you can end earlier. There's no reason to cling to a formula and procedure that requires a fixed number of samples when other methods exist that lack that drawback.
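One such method, sketched here for illustration (my example, with made-up hypotheses and thresholds), is Wald's sequential probability ratio test for a binary outcome: the sample size isn't fixed in advance, you stop as soon as the accumulated evidence crosses a boundary, and large effects end the test early by construction.

    import math

    def sprt(outcomes, p0=0.5, p1=0.7, alpha=0.05, beta=0.20):
        upper = math.log((1 - beta) / alpha)  # cross this: accept H1 (p = p1)
        lower = math.log(beta / (1 - alpha))  # cross this: accept H0 (p = p0)
        llr = 0.0
        for n, x in enumerate(outcomes, start=1):
            # log likelihood ratio contribution of one Bernoulli observation
            llr += math.log(p1 / p0) if x else math.log((1 - p1) / (1 - p0))
            if llr >= upper:
                return "accept H1", n
            if llr <= lower:
                return "accept H0", n
        return "keep sampling", len(outcomes)

    print(sprt([1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1]))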
These comments generally read to me along the lines of:
"It's totally reasonable to roll your own cryptography, as long as you're an expert in the field and do it correctly".
The rule of "Don't Stop and Peek" is general advice that is given out because the vast majority of people who conduct these kinds of trials are not statisticians and will not do the math. Those who attempt the math will likely get it wrong.
So, you're totally correct that if you know exactly what you're doing as a statistician and apply the math correctly, it can be possible to peek at results while still drawing a valid conclusion. But that ignores the fact that, for most people receiving the advice, peeking is terrible advice.
So, general rule: "Don't Stop and Peek". Advanced rule: "Unless you are a trained statistician, Don't Stop and Peek".
There's no problem with aborting a test early, for whatever reason. However, that doesn't mean you can still draw conclusions from such a test. If you plan to do a trial with 1,000 patients and you stop midway because you've reached statistical significance, you run a big risk of claiming a treatment works when it doesn't.
Similarly, every test you do has a small probability of giving a false positive, and the more tests you do, the bigger the total chance that you'll be jumping to conclusions (see the rough numbers below).
Also, the size of the effect is irrelevant since that should already be accounted for by whatever test you do.
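To put rough numbers on the "more tests" point (mine, and a deliberate simplification: interim looks at accumulating data are correlated, so this independence-based figure overstates the inflation somewhat, but the direction is what matters):

    alpha = 0.05
    for k in (1, 2, 5, 10, 20):
        fwer = 1 - (1 - alpha) ** k
        print(f"{k:2d} independent tests at alpha=0.05: "
              f"P(at least one false positive) = {fwer:.2f}")

That climbs from 0.05 for a single test to roughly 0.40 at ten tests.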
Any Bayesian analysis should still be valid regardless of the stopping rule. You can still draw conclusions from an aborted test. You just have to use valid formulas, and not formulas that assume incorrect counterfactual scenarios. I think it's dogmatic to say stopping is bad and you can't do analysis. In my mind, you totally can. It just needs to be the appropriate analysis.
Anyway, I don't even think false positives are the right way to think about this. The framing should be continuous, not binary. The goal is to maximize success, not maximize the odds of picking the better of A or B.
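One way to operationalize "maximize success" instead of "pick a winner" (my framing, not necessarily the parent's) is a bandit approach such as Thompson sampling: traffic is allocated by sampling from each arm's Beta posterior, so the better arm soaks up more traffic as evidence accumulates and there's no fixed stopping point to agonize over.

    import numpy as np

    rng = np.random.default_rng(2)
    true_rates = {"A": 0.10, "B": 0.12}  # unknown in practice; assumed here just to simulate
    successes = {"A": 0, "B": 0}
    failures = {"A": 0, "B": 0}

    for _ in range(10_000):
        # sample a plausible conversion rate for each arm from its posterior
        draws = {arm: rng.beta(1 + successes[arm], 1 + failures[arm]) for arm in true_rates}
        arm = max(draws, key=draws.get)  # play the arm that looks best this round
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1

    print(successes, failures)  # most of the traffic should have gone to B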
A Bayesian analysis tells you Bayesian things: specifically, this is the most reasonable conclusion one can draw from this data, right now. A frequentist analysis also tells you things, but they are different things. Specifically, frequentists are concerned with...frequencies: if we were to run this procedure many times, here's what we can say about the possible outcomes.
There's this persistent meme that all Bayesian methods protect you from multiple comparisons/multiple looks problems. That's not really true. Bayesian methods don't offer any Type I/Type II error control--and why would they, when the notion of these errors doesn't make much sense in a Bayesian framework?
You can certainly use Bayesian methods to estimate a parameter's value. However, you cannot repeatedly test your estimates--Bayesian or otherwise--in an NHST-like framework and expect things to work out correctly.
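A rough demonstration of that last point (my sketch; all numbers made up): under a true null where both arms are identical, repeatedly checking whether P(B > A | data) has crossed 0.95 and stopping when it does still "finds a winner" far more than 5% of the time. The posterior is perfectly reasonable at every look; it just doesn't buy you frequentist Type I error control.

    import numpy as np

    rng = np.random.default_rng(3)
    n_sims, n_looks, batch, draws = 500, 20, 100, 4000
    rate = 0.10  # both arms share this conversion rate, so any "winner" is a false positive
    hits = 0

    for _ in range(n_sims):
        sa = fa = sb = fb = 0
        for _ in range(n_looks):
            xa = rng.binomial(batch, rate)
            xb = rng.binomial(batch, rate)
            sa, fa = sa + xa, fa + batch - xa
            sb, fb = sb + xb, fb + batch - xb
            pa = rng.beta(1 + sa, 1 + fa, draws)
            pb = rng.beta(1 + sb, 1 + fb, draws)
            if np.mean(pb > pa) > 0.95:  # "B is better with 95% probability" -- stop
                hits += 1
                break

    print("declared a winner under the null:", hits / n_sims)  # well above 0.05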