
Is it really a failure mode? I think it should be fine to stop whenever you want during a test. You just have to be smart enough not to use a t-test and misinterpret it to mean that A is 95% likelier than B.


Yes. It really is a failure mode. No, it is not fine to stop whenever you want during a test. It doesn't matter what statistical test you are using -- you need to gather enough data to have sufficient statistical power.

Also, the more interim tests you do, the lower the p-value threshold for each one needs to be, or your overall error rate creeps up.

If you stop as soon as you hit your target p-value, then you are misinterpreting the results if you claim they mean anything.
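
Here's a quick simulation of that failure mode, just as a sketch: two truly identical variants (so any "win" is a false positive), a plain two-sample t-test, and a rule that stops as soon as p < 0.05 at any interim look. The look interval, sample sizes, and number of runs are arbitrary choices on my part.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(1)
    N, LOOK_EVERY, RUNS = 5_000, 100, 2_000
    false_positives = 0

    for _ in range(RUNS):
        # A and B are identical, so any rejection is a false positive
        a, b = rng.normal(size=N), rng.normal(size=N)
        for n in range(LOOK_EVERY, N + 1, LOOK_EVERY):
            if ttest_ind(a[:n], b[:n]).pvalue < 0.05:   # peek and stop
                false_positives += 1
                break

    # With a single fixed-size test this would be ~5%; with repeated
    # peeking it comes out several times higher.
    print(false_positives / RUNS)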


This seems terribly dogmatic to me. Take the example of clinical trials. Suppose you're testing a new cancer drug. You design an experiment to test the new drug, named B, versus an established chemotherapy treatment named A. You expect B's performance to be similar to A's performance in controlling the cancer, so to make sure your trial has high power, you plan to test the drugs on 2,000 patients (with each drug administered to 1,000).

Now consider the following two scenarios:

(1) After giving drug B to 100 patients, all 100 patients are dead. Do you continue the trial, giving the (apparently) deadly drug B to 900 more patients?

(2) After giving drug B to 100 patients, all 100 patients are totally cured (vs A curing 3 in 100). Do you continue the trial, withholding the (apparent) cure for cancer from 900 more patients?

In either case, since you have a strong effect, it seems to me there is logical justification to end the trial early.

Obviously the stakes are higher in clinical trials than website design, but in both cases, data acquisition has costs and intermediate results may inform changes to your experiment design.

I honestly cannot see how anyone could flatly assert that stopping a test is always wrong. There are certainly circumstances where you do want to stop early. You just have to make sure you aren't misinterpreting a statistic when you do so.


You can arrange a trial (either a clinical one or an A/B test) so that it stops early. However, the analysis plan needs to take that into account: you absolutely cannot just peek at the p-values willy-nilly and use that information to make go/no-go decisions; doing this makes them uninterpretable.

One of the simplest ways to end a trial early is through curtailment--you stop the trial when additional data can't change its outcome. Imagine you have a big box of 100 items, each of which can be either Item A or Item B. You want to test whether the box contains an equal proportion of As and Bs, so you pull an item from the box, unwrap it, and record what is inside. Naively, one might think it is necessary to unwrap all 100 items, but you can actually stop as soon as you find 61 of the same type, because the 61:39 split--and all more extreme imbalances--rejects the 50:50 hypothesis (two-sided, at the usual 5% level).

Curtailment is "exact" and fairly easy to compute if you have a finite sample size in which each observation contributes a bounded amount to your result. This would certainly happen in your extreme examples. There are also more complicated approaches that allow you to stop even earlier if either a) you're willing to sometimes make a different decision than if you ran to completion and/or b) you're willing to stop earlier on average, even if it means running for longer in some cases.
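
For what it's worth, here's a little sketch of how the boundary in the box example can be computed, assuming a two-sided exact binomial test of the 50:50 hypothesis at the 5% level (the significance level is my assumption; pick a different one and the boundary moves):

    from scipy.stats import binom

    N, ALPHA = 100, 0.05

    def rejects(k, n=N, p=0.5):
        # Two-sided exact p-value for k of one type out of n,
        # using the "double the smaller tail" convention
        tail = min(binom.cdf(k, n, p), binom.sf(k - 1, n, p))
        return 2 * tail <= ALPHA

    # Smallest count of one type that guarantees rejection no matter
    # how the remaining unopened items turn out
    boundary = next(k for k in range(N // 2, N + 1) if rejects(k))
    print(boundary)   # 61 -- once you've seen 61 of one type, stop unwrapping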


Obviously it depends on how strong the effect is. The point is that you specify the effect size and the power up front. If you get to 1,600 people and you are seeing a >10% effect, then sure -- you can stop, as long as you have sufficient power. You must know what the statistical power is and know where your break points are. You absolutely cannot stop just on seeing a 10% effect -- which could happen by chance in the first 10 samples. That is not dogmatic; it is good statistics.

Take your example: 100 patients all dead, or 100 patients all cured -- you have demonstrated an infinite effect (edit: ok, no need to be hyperbolic, a 99+% effect), and you have probably covered statistical rigor. If your drug is that deadly, you are probably criminally liable long before you reach 100 dead patients. Unfortunately, actual medical (and A/B) studies do not mimic make-believe scenarios.


Totally agree. Stopping at 10% and then claiming the effect size is 10% would be silly. But seeing a giant effect and stopping is totally cool in my book. The bigger the effect difference, the fewer samples you need to judge it. So I think it can be fine to peek and halt. Nothing forces us to use a static number of samples other than an old statistics formula.


The point is, you don't know what that number is unless you do the math. It's not a matter of "judging it". It is a matter of calculating it.

If you "peek and halt" without doing the math. You might as well have a random good result in the first 10 and say "look! Positive results!". You recognize that is ridiculous. So when is peeking and stopping not ridiculous?

A: When the statistical power is sufficient for the observed effect. In the examples, that's 1,600 or 6,000 samples for a 10% or 5% effect, respectively -- and far fewer for a 20% or 40% effect! -- but you don't know the number required unless you do the math.
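
For the "do the math" part, the standard two-proportion formula is a few lines of code. The baseline conversion rate, alpha, and power below are assumptions of mine for illustration -- they won't reproduce the exact 1,600/6,000 figures -- but the point is visible: bigger effects need far fewer samples.

    from scipy.stats import norm

    def n_per_arm(p1, p2, alpha=0.05, power=0.80):
        # Approximate sample size per arm to detect p1 vs p2 with a
        # two-sided z-test at the given alpha and power
        z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
        var = p1 * (1 - p1) + p2 * (1 - p2)
        return (z_a + z_b) ** 2 * var / (p1 - p2) ** 2

    base = 0.20                               # assumed baseline conversion rate
    for lift in (0.05, 0.10, 0.20, 0.40):     # relative effect sizes
        print(f"{lift:.0%} lift -> {n_per_arm(base, base * (1 + lift)):,.0f} per arm")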


Again, this limitation doesn't apply to good bandit approaches. You see big effects quickly and smaller effects more slowly and don't need to do any pre-computation about power at all.

You can even get an economic estimate of the expected amount of value you are leaving on the table by stopping at any point.
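
As a concrete sketch of the kind of thing I mean -- Beta-Bernoulli Thompson sampling on two variants; the "true" rates, the horizon, and the way I estimate the value left on the table are all illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    true_rates = [0.10, 0.13]   # assumed, unknown to the algorithm
    alpha = np.ones(2)          # Beta posterior parameters (successes + 1)
    beta = np.ones(2)           # Beta posterior parameters (failures + 1)

    for _ in range(20_000):
        samples = rng.beta(alpha, beta)    # one draw per arm from its posterior
        arm = int(np.argmax(samples))      # play the arm that currently looks best
        reward = rng.random() < true_rates[arm]
        alpha[arm] += reward
        beta[arm] += 1 - reward

    # Monte Carlo estimate of the value left on the table if we stop now and
    # ship the current leader: expected conversion-rate shortfall vs. the best arm.
    post = rng.beta(alpha, beta, size=(100_000, 2))
    leader = int(np.argmax(alpha / (alpha + beta)))
    regret = np.mean(post.max(axis=1) - post[:, leader])
    print(f"leader: {leader}, expected regret of stopping now: {regret:.5f}")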


"Wrong" isn't the word we're looking for here, I don't think. But your above example is bullshit -- nobody puts 1000 patients at risk in a phase I (safety) trial, and if the dose isn't reasonably well calibrated by the phase III study you're describing above, someone's going to jail. In Phase II we will often have stopping rules for exactly this reason, just in case the sampling was biased in the small Phase I sample.

Above there are a number of things to notice:

1) The phasing approximates Thompson sampling to a degree, in that large late-phase trials MUST follow smaller early phase trials. Nobody is going to waste patients on SuperMab (look it up).

2) The endpoints are hard, fast, and pre-specified:

IFF we have N adverse events in M patients, we shut down the trial for toxicity.

IFF we have X or more complete responses in Y patients, we shut down the trial because it would be unethical to deprive the control arm.

IFF we have Z or fewer responses in the treatment arm, given our ultimate accrual goal (total sample size), it will be impossible to conclude (using the test we have selected and preregistered) that the new drug isn't WORSE than the standard, so we'll shut it down for futility. Those patients will be better served by another trial.

You are massively oversimplifying a well-understood problem. Decision theory is a thing, and it's been a thing for 100 years. Instead of lighting your strawman on fire, how about reframing it?

Stopping isn't "always" wrong, but stopping because you've managed to hit some extremal value is pretty much always biased. The winner's curse, regression to the mean -- all of these things happen because people forget about sampling variability. It's also why point estimates (even test statistics), as opposed to full posterior distributions, are misleading. If you're going to stop at an uncertain time or for unspecified reasons, you need to include the "slop" in your estimates.

"We estimate that the new page is 2x (95% CI, 1.0001x-10x) more likely to result in a conversion"... hey, you stopped early and at least you're being honest about it... but if we leave out the uncertainty then it's just misleading.

All of the above is taken into account when designing trials because not only do we not like killing people, we don't like going to jail for stupid avoidable mistakes.
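
Incidentally, the plain fixed-sample version of a "2x (95% CI ...)" style interval above is only a few lines (the counts below are made up; an analysis that actually stopped early would need to adjust this further, which is exactly the "slop" point):

    import math
    from scipy.stats import norm

    conv_b, n_b = 30, 1000    # made-up: conversions / visitors on the new page
    conv_a, n_a = 15, 1000    # made-up: conversions / visitors on the old page

    rr = (conv_b / n_b) / (conv_a / n_a)
    # Standard error of log(risk ratio), the usual log method
    se_log = math.sqrt(1 / conv_b - 1 / n_b + 1 / conv_a - 1 / n_a)
    z = norm.ppf(0.975)
    lo, hi = rr * math.exp(-z * se_log), rr * math.exp(z * se_log)
    print(f"{rr:.2f}x more likely (95% CI {lo:.2f}x-{hi:.2f}x)")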


> But your above example is bullshit — nobody puts 1000 patients at risk in a phase I (safety) trial

It isn't bullshit - at least, not for that reason. It was a thought experiment, using an extreme example to test your assertions.


It is an example that tends to bring in lots of irrelevant detail and ethical concerns that distract from the actual point.


My point is that you can still extract useful information when your stop is dynamic rather than static. One typical scenario is when your effect size ends up being larger than you originally guessed. There's little reason to continue if the difference becomes obvious.

In the future, I would appreciate it if you steelmanned my comments or asked for clarification instead of insulting me. It hurt my feelings. I wish I had written a better comment that hadn't incited such a reaction from you. Best wishes.


You are right, I shot from the hip. Sorry about that.

I also have noprocrast set in my profile so I couldn't go back and edit it (something I thought about doing). I probably would have toned it down if I hadn't requested that Hacker News kick me off after 15 minutes.

Your line of discussion is productive. It's just important that people understand the difference between degrees of belief and degrees of evidence from a specific study and never confuse the two. Trouble is, lots of folks confuse them, and lots of other folks prey on that confusion.

Anyways, sorry for being a jerk.


No worries and thanks for the apology. I apologize for my own comments in this thread, which were lower than the quality I aspire to. I had pulled an all-nighter for work and was sitting grumpily with my phone at an airport.


Your intuition about effects bigger than you expected is right on target.

But it applies at all scales of effect. Stop when you have a big enough effect or have high confidence that you won't care.


P-values are just the wrong metric here. With good sequential testing, you would stop

a) when the challenger strategy is much worse than the best alternative with reasonable probability,

or

b) when it is clear that some strategy is very close to the best with high probability (remember, you may still not know how good the best is)

Note that waiting for significance kind of handles the first case, but in the case of ties, it completely fails on the second case. That is, when there are nearly tied options, you will wait nearly forever. It is better to pick one of the tied options and say that it is very likely as good as the best.
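
A rough sketch of what (a) and (b) look like in code, with independent Beta posteriors on two conversion rates -- the epsilon, the probability cutoff, and the example counts are all arbitrary choices of mine:

    import numpy as np

    rng = np.random.default_rng(2)

    def stopping_decision(succ, fail, eps=0.005, cutoff=0.95, draws=100_000):
        # succ/fail: arrays with successes and failures per variant
        post = rng.beta(succ + 1, fail + 1, size=(draws, len(succ)))
        best = post.max(axis=1)
        for i in range(len(succ)):
            # rule (b): variant i is within eps of the best with high probability
            if np.mean(best - post[:, i] < eps) > cutoff:
                return f"stop: ship variant {i} (very likely close enough to the best)"
            # rule (a): variant i is clearly worse than the best alternative
            best_other = np.delete(post, i, axis=1).max(axis=1)
            if np.mean(post[:, i] + eps < best_other) > cutoff:
                return f"stop: drop variant {i} (clearly worse than the best alternative)"
        return "keep collecting data"

    # Two nearly tied variants: waiting for "significance" would take forever,
    # but rule (b) lets us stop and ship one of them.
    print(stopping_decision(np.array([1900, 1950]), np.array([18100, 18050])))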


I mean.

(1) would absolutely not be continued, obviously.

(2) would be continued to completion.

In science one is maximally conservative with positive results. Negative results can be accepted rather quickly but positive results need to be maximally investigated. False positives are unacceptable, false negatives are often less damaging.

I'd question how scenario #1 made it through animal trials preceding experiments in humans.


Actually, trials are sometimes stopped early for "efficacy." This is done on the grounds that--past some point--incremental gains in statistical certainty are not worth depriving the subjects in the control arms of the drugs' benefits. For example, this happened in the PARADIGM-HF trial in 2014: http://www.forbes.com/sites/larryhusten/2014/03/31/novartis-...

This isn't totally uncontroversial (e.g., https://www.ncbi.nlm.nih.gov/pubmed/18226746 ) but it's not obvious to me how to balance statistical and ethical concerns.

As for the animal studies, it sometimes happens. Someone up-thread mentioned SuperMAB, a potential drug which seemed safe in monkeys. However, when first tested in humans (using a dose 500x lower than the one administered to the monkeys), it generated a cytokine storm that put all six test subjects in the ICU. Here's a fairly decent summary: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2964774/


Using Bayesian methods and some relatively rarely used frequentist methods, you can achieve what you suggest. Maybe you'll take it from a statistician that you're being grossly naive about experimentation.


You are advocating for something called sequential testing. I'm not going to go into the details, but you don't use the same metrics with sequential testing if you want to control your final error rates. That is, you can't look at the p-value on your t-test and say that's your false positive rate for the sequential test. It's useful for the reasons you describe, but it's not exactly idiot-proof.
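
To make that concrete (a sketch with an arbitrary look schedule and simulation count): if you check a t-test at several interim looks, a nominal 0.05 per look blows well past a 5% overall false positive rate, and you need something like a Pocock-style per-look threshold (~0.016 for five looks) to get back to it.

    import numpy as np
    from scipy.stats import ttest_ind

    rng = np.random.default_rng(3)
    LOOKS = [200, 400, 600, 800, 1000]   # samples per arm at each interim look
    RUNS = 4_000

    def overall_fpr(per_look_alpha):
        # Simulate the null (no real difference) and count how often the rule
        # "stop if p < per_look_alpha at any look" fires
        hits = 0
        for _ in range(RUNS):
            a, b = rng.normal(size=LOOKS[-1]), rng.normal(size=LOOKS[-1])
            if any(ttest_ind(a[:n], b[:n]).pvalue < per_look_alpha for n in LOOKS):
                hits += 1
        return hits / RUNS

    print(overall_fpr(0.05))    # well above 0.05 because of the repeated looks
    print(overall_fpr(0.016))   # Pocock-style boundary for 5 looks; ~0.05 overall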


Math does tend to be a bit dogmatic, yes.


If you decide to stop as soon as a test reports a 5% error rate (i.e., p < 0.05), your actual false positive rate will be higher than 5%. If you don't want to call that a failure mode, it's at least a misuse of the metric.


Yes, stopping at 95% is BAD. First of all, you need to reach a large enough sample size; otherwise you are just lying to yourself.

Source: I have a degree in analytics.



