It seems that the main issue here is that with pre-registration, study authors have to pick a single measure of primary benefit at the outset, whereas before, they might have made that choice after getting results back. The original study is at PLOS ONE, and it is not a difficult read [1]. From that source:
>>Prior to 2000, investigators had a greater opportunity to measure a range of variables and to select the most successful outcomes when reporting their results... Among the 25 preregistered trials published in 2000 or later, 12 reported significant, positive effects for cardiovascular-related variables other than the primary outcome.
That is, in most cases, there are large effects for some outcome, and if they get to choose the primary outcome after looking at some results, they could have been cherry-picking the outcome variables.
> It seems that the main issue here is that with pre-registration, study authors have to pick a single measure of primary benefit at the outset, whereas before, they might have made that choice after getting results back
Right. And then you'd have to use proper statistical reasoning for that state of affairs. Which nobody ever does, because they're not statisticians, it's complicated, and it would reduce the chance of 'statistical significance'.
So they just use a standard calculation of statistical significance -- which is based on the assumption that you have picked a single hypothesis in advance and then done your test. So it's completely invalid to use it how everyone typically does.
Imagine you flip a coin 50 times. Then you check: okay, did I ever get 10 heads in a row? Nope? Okay, how about 5 heads followed by 5 tails? Nope. Okay... try a couple dozen other things -- oh, look, I got exactly 3 tails followed by exactly 3 heads followed by exactly 3 tails again! Let's run my test of statistical significance to see if that was just chance -- oh hey, it's significant, this must be a magic coin, not random at all!
Nope. If you test everything you can think of, _something_ will come up as 'statistically significant', but it isn't really. Those tests of statistical significance -- which estimate how likely it is that the results you got arose by random chance rather than from a real, repeatable effect -- are no longer valid if you go hunting for significance like that.
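To make that concrete, here is a minimal simulation (my own sketch, not from the paper or the comment above): a fake "trial" with 25 unrelated outcome measures and no real effect anywhere still produces at least one nominally significant result most of the time.

```python
# Sketch: how often does a null "trial" with many outcomes look significant?
# Assumes 25 independent outcomes, two arms of 50 patients, no true effect.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
N_TRIALS, N_OUTCOMES, N_PER_ARM = 2_000, 25, 50

hits = 0
for _ in range(N_TRIALS):
    # For each simulated trial, run 25 unrelated treatment-vs-control tests
    # where both arms are drawn from the same distribution (pure noise).
    p_values = [
        ttest_ind(rng.normal(size=N_PER_ARM), rng.normal(size=N_PER_ARM)).pvalue
        for _ in range(N_OUTCOMES)
    ]
    if min(p_values) < 0.05:
        hits += 1

print(f"At least one 'significant' outcome in {hits / N_TRIALS:.0%} of null trials")
# This lands near 1 - 0.95**25, about 72%, even though nothing is real.
```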
> if they get to choose the primary outcome after looking at some results, they could have been cherry-picking the outcome variables
Not just could, would. Choosing your hypothesis after you run the experiment is (or at least should be) a cardinal sin in science, for good reason. At the standard p-value cutoff of .05, even when there's absolutely no effect going on, the probability of getting at least one spurious positive result when you do n comparisons is 1 - (.95^n).
So that 5% chance of a type I error if you only look at one test statistic jumps to 40% if you look at ten, and to 72% if you look at 25.
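Spelled out as a quick check of those numbers, nothing more:

```python
# Family-wise false-positive rate across n independent tests at alpha = .05
for n in (1, 10, 25):
    fwer = 1 - 0.95 ** n
    print(f"n = {n:2d}: {fwer:.0%}")
# n =  1: 5%
# n = 10: 40%
# n = 25: 72%
```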
I have a question. Would I be on firm statistical footing if I started looking for effects after the study, as long as I choose a p-value such that (1-((1-p)^n) < 0.05?
In other words, I run a study to determine if jellybeans cause acne[1]. The result is inconclusive (p > .05). Now--after the results are collected--I wish to check the correlation between color and acne. There are 20 colors. Would it be statistically sound to "correct" for my cherry-picking by setting p = 0.0025? That would result in a family-wise error rate of 1 - (1 - 0.0025)^20 ≈ 0.049, which stays under 0.05.
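For what it's worth, the arithmetic in that question can be sketched out like this (a Bonferroni/Šidák-style calculation only, not an endorsement; the reply below explains why knowing n is the real problem):

```python
# Sketch of the correction the question proposes, for 20 colour tests
alpha_family, n_colors = 0.05, 20

bonferroni = alpha_family / n_colors              # 0.0025: split the 5% budget evenly
sidak = 1 - (1 - alpha_family) ** (1 / n_colors)  # ~0.00256: exact for independent tests

# Family-wise error rate if every colour test uses the 0.0025 cutoff
fwer = 1 - (1 - bonferroni) ** n_colors
print(f"Bonferroni per-test alpha: {bonferroni}")
print(f"Sidak per-test alpha:      {sidak:.5f}")
print(f"FWER at p = 0.0025/test:   {fwer:.4f}")   # ~0.0488, just under 0.05
```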
> Would I be on firm statistical footing if I started looking for effects after the study, as long as I choose a p-value such that (1-((1-p)^n) < 0.05?
No. For one thing, how do you actually know n? If in your analysis you can make c independent binary choices, n would be on the order of 2^c. For most reasonable sequences of choices, n would be impractically large. And if you think you know n, how would you convince everyone else that your value of n is reasonable/trustworthy?
The xkcd case (which is solved by applying the Bonferroni correction) is special because it's a single known test run in each of 20 experiments, so the correction is more straightforward.
There is a good paper about this garden of forking paths by Gelman and Loken [1].
> At the standard p-value cutoff of .05, even when there's absolutely no effect going on the probability of getting a spurious positive result when you do n comparisons is equal to 1 - (.95^n).
That's only the probability of getting a positive result by chance; the total probability of getting a positive result even if there is no effect is much higher.
"Probability of getting a positive result by chance" and "probability of getting a positive result when there is no effect" are the same thing. This is what the p-value measures. (The full phrase is generally a combination of the two: "probability of getting a positive result by chance when there is no effect"; but that's so long that people shorten it.)
There are lots of reasons you could get a false positive result that have nothing to do with chance: e.g. non-representative sample, miscalibrated equipment, biased methodology, researcher fraud, flawed statistical analysis, etc.
Indeed, but the comment that you quoted was referring to the case where the methodology and p-values are sound, and the only issue is testing multiple hypotheses without correction.
Cardinals and sin are the realm of religion. Not sure why it's being brought up here.
Choosing a hypothesis after the experiment is run is perfectly valid as long as your experiment is valid for that hypothesis. Besides, you would always run a new experiment again anyway.
So perhaps it's worthwhile to trot out the idea of exploratory vs. confirmatory research.
In exploratory research you collect a bunch of data, and then mine it for interesting associations that might merit further study. It's an essential part of the scientific process, but anything you find from doing it needs to be treated as extremely tentative because it's liable to produce spurious results at least as often as it finds genuine effects.
But that's not what this paper's talking about. It's talking about experiments that are being used to support the approval of new treatments and drugs. That's confirmatory research. In that realm you absolutely must paint the target on the wall before you throw your darts.
Those are ways of looking at it, but experiments have definite structure, and that structure can be exploited to design experiments that potentially reveal more information than another experiment would.
If the experiment is constructed correctly, it can support validation of multiple hypotheses.
I would completely believe that most experimentation being performed currently is not structured to make further exploitation possible.
I should have clarified that statement as I don't think it's quite correct.
It should read:
I would completely believe that most experimentation being performed currently is not structured to make further exploitation, of the type desired/wished, possible.
There are almost certainly facts available that are not discovered/discussed from past experiments. Many of them are likely trivial and/or not what researchers would wish or hope that their data could tell them. However, they can still be mined.
In any case though, you would still re-run experiments to further validate/reject the hypotheses. That is simply basic science.
Yes, you can always choose any hypothesis you want. It's largely irrelevant.
Every experiment will support analysis through a set of hypotheses. Just because you didn't select all of those hypotheses before the experiment ran doesn't mean you can't select it after the experiment.
Imagine that an experiment has been run, but you do not know the results (or even what was done). Now you select a hypothesis; if the experiment required to test that hypothesis is the same as the one that was actually run, you can now look at and use the results.
A hypothesis is like running a query against a database. Many queries are valid, even though the data may not have changed.
>Otherwise, why are you re-running it?
Science requires it. Doctrine from one-off experimentation is religion (hard to dump).
It's fine if you run a separate experiment to justify the hypothesis. If you choose the hypothesis after conducting the experiment, then the p-values you obtain for that hypothesis are invalid.
To be fair, Viagra was originally being developed as a high blood pressure medication. They decided to switch to erectile dysfunction when they found out why study participants were hoarding the pills.
Pre-registration would not have hindered them from making that discovery, but they would have had to do another registration and study to get the drug to market for the new use. They would likely still be able to use the outcome of the original study to assess the safety of the drug.
Which actually makes an important but subtle point: just because the rate of positive effects identified went down doesn't mean the effects that would otherwise have been identified were all false. It just means we don't know. There may be extremely strong evidence of effects that would overcome any amount of multiple-testing correction, but they still aren't allowed to use it. It might take years or even decades for them to do a follow-up study to validate the result, which means a significant number of people are harmed by not having access to the drug in the interim. Just playing devil's advocate to make the point that it's not a given that we're getting a better overall outcome by being this stringent.
> [...], which means a significant number of people are harmed by not having access to the drug in the interim.
Yes, that might happen. The more likely outcome though is that we are saved from a lot of drugs that don't work any better than chance (or even worse).
Fair enough... but the issue here is about finding an effect versus no effect. It's about statistical significance. In order for single-comparison p-values and the like to be valid, there has to be exactly one comparison. There is a way to do 'any-of-k' testing, but the required effect sizes get larger.
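One rough way to see why the bar rises (my sketch, assuming k independent two-sided z-tests with a Šidák-style correction, not anything from the paper): the critical value each individual test must clear grows with k, so you need a larger true effect or a bigger sample to clear it.

```python
# How the per-test bar rises when any of k outcomes may be declared the win
from scipy.stats import norm

alpha_family = 0.05
for k in (1, 5, 10, 25):
    # Sidak-style per-test alpha that keeps the family-wise rate at 5%
    alpha_per_test = 1 - (1 - alpha_family) ** (1 / k)
    # Two-sided critical z-value each individual test must exceed
    z_crit = norm.ppf(1 - alpha_per_test / 2)
    print(f"k = {k:2d}: per-test alpha = {alpha_per_test:.5f}, |z| needed = {z_crit:.2f}")
# k = 1 needs |z| > 1.96; k = 25 needs roughly |z| > 3.08.
```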
It's hard to explain briefly, and I don't know the details. However, I know that my own antidepressant has no significant difference versus a placebo effect with respect to depression. I'm not claiming that they don't work for people, just that they were rushed to market given current studies.
Reboxetine is one example. It's an antidepressant pushed by Pfizer. Initial studies showed it was effective versus placebo or SSRIs, but later research found serious publication bias -- Pfizer suppressed evidence suggesting it was not effective. Only after the German Institute for Quality and Efficiency in Health Care obtained unpublished data from Pfizer did its lack of efficacy become obvious.
I was hoping for some sort of accessible synthesis of the available literature, since I don't have the time or inclination to get eyeballs deep in this issue. But thank you for the links.
Everyone knows what the problem is. It is not that the drugs don't work (well, most of them), it is that they only work for a small subset of people. It is possible these days to figure out which drugs work for which patients, but doing this makes the drug non-viable commercially. We are stuck with a system where the way a drug is tested is totally different to the way it will be used in practice - a drug has to work in a large percentage of people in the trial, while once approved it just becomes another drug through which people get cycled while their doctor tries to figure out which one works for the individual.
This isn't what "the problem" is. There are lots of problems, but one of the big ones this combats is systematic deliberate fraud. Companies were cherry-picking the trials that got results (and if you run enough trials, some will have results just through chance) in order to get drugs approved that were not any better than placebo.
If there is actual data to suggest who the drug works on best, and empirical tests to determine who that is, it can absolutely be tested within those limits and approved for use on that population.
The reason the companies are cherry picking the results is they know this basic problem exists. Only drugs that have evidence of working at stage I and II progress to stage III.
What happens is the drug companies decide at stage III what patients to let into the trial. What they are trying to avoid is anything that will limit the market size once approved. This means they try to skate as close as they can to the effectiveness limit and still get the drug approved.
For example, if they have a new cancer drug that they know only works on cancers with certain mutations, what they will try to do is test the drug in a broad population of patients, including those without the mutation. Since for the approval process it does not really matter how effective the drug is (as long as it is statistically significant overall), and since you can only market to the population you tested in, you will want to shove in as many "filler patients" as you think the drug can support. Due to the difficulty of designing large trials, they often get the filler number wrong. I don't condone this activity, but this is the commercial reality.
More fundamentally the pharmaceutical industry is being crushed between two very powerful forces - the cost of developing new drugs continues to get more expensive every year (mostly thanks to regulations), while the real market for each drug shrinks as sub-populations are identified for each disease. Combine this with a refusal of society and insurance companies to pay ever increasing drug prices and the outlook for the industry is not great.
This is not a conspiracy theory. Who to include in a stage III trial is one of the most difficult problems facing pharma management. They aim to make these as broad as possible and lots of drugs have died because the stage III trials were made too open.
If you want to get a good idea of what really goes on in the pharma industry, have a read of the "In The Pipeline" blog [1]. Most of the real value is in the comments.
"Everyone knows what the problem is. It is not that the drugs don't work (well most of them), it is that they only work for a small subset of people."
1. I honestly don't credit drug companies with that much integrity.
2. Sadly, I feel they systematically lied to the FDA in order to get their product to market. No, I don't think the average pharmaceutical research scientist wanted to poison people, but I have a horrid feeling a lot of them went along with the charade in order to make a living. I picture too many scientists thinking, "Well, it doesn't work as we thought, but I'm really not hurting anyone because the placebo effect was huge. So, in order to keep my job, I will not object to these results/statistics." I have found that when it comes down to a person's livelihood, life, money: people don't raise their hands. They go into this weird, socially acceptable denial ("Hey, it's legal!"). That's why, even here on HN, I make sure to read the blurry statements. (I'm usually disappointed, but I want to hear the dissenters and critics.)
3. I'll admit I threw the baby out with the bathwater years ago, and I don't have hard evidence (collated and in front of me, ready to provide objective examples) to back up my intuition and assertions, but I don't think one has to look too far for examples.
4. I grew up not questioning scientific integrity. I thought these institutions/people were the pillars of honesty; what I have learned in the last decade, specifically about the pharmaceutical industry, has hurt me on so many levels--I literally feel like crying sometimes.
5. My understanding of statistics is fuzzy. For those who need a refresher, I found two books helpful: How to Lie with Statistics by Darrell Huff, and The Comic Book Guide to Statistics--don't know the author.
Yes, but as noted in the article some of that may be because of other factors that have changed in the meantime. I.e. if hospital treatment for heart disease has advanced and noticeably affects outcomes at that level, it could change the overall outcome of a study about heart disease.
I myself don't anticipate that accounting for much though.
Without intimate familiarity in the field it's really hard to figure out cause and effect in something so complicated.
It's been noted that it's become much harder to find new drugs. Conventional explanations are that we're pretty good at this by now and all the low-hanging fruit has been picked (there are also fields, like antibiotics, where few are even trying, for unrelated reasons).
More rigor in studies could be a cause, and/or there may be fewer successes because it's become that much harder and more iffy attempts are being made.
If anyone would like (a lot) more information about why this is a significant change, Ben Goldacre's book Bad Pharma is both excellent and terrifying. I think much harder about taking medicine unless it's really required since reading it.
"Results for all cause mortality were similar. Prior to 2000, 24 trials reported all cause-mortality and 5 reported significant reductions in total mortality (25%), 18 were null (71%) and one (CAST) reported significant harm (Table 3). Following the year 2000, no study showed a significant benefit for total mortality."
So 24/30 pre-2000 trials they looked at reported all cause mortality. This is important because all cause mortality is the gold standard endpoint in a clinical trial. If you were cherry picking an outcome measure, this is the last one you would choose as it's the hardest one to get a positive result for.
This casts doubt on the paper's central thesis. How to explain the difference in pre and post registration all cause mortality? The possibilities are:
1. Trials performed after 2000 are genuinely less likely to be positive because of altered funding, lack of low-hanging fruit, etc. It is a well-known problem that estimates of control-group survival in a randomised trial based on historical data usually turn out to be underestimates, which messes up the power calculation and the chance of a positive result. Beyond a certain point, it becomes infeasible to do a clinical trial to show a benefit over a very healthy control group, because you would need 100,000 patients.
2. After registration became mandatory, fraudulent investigators who make up their results stopped doing clinical trials.
The other point is that in most areas, there are only a few acceptable clinical trial endpoints. For example in cancer studies, there are really only 3: overall survival (time alive), event free survival (time until the cancer comes back or starts growing) and response rate (% of cases where the tumour shrinks by 20% or more). Cardiology is a bit different because there is less consensus about valid endpoints. Nevertheless, clinicians and regulatory bodies are pretty strict about which endpoints they consider meaningful.
So in my opinion the authors chose the subject area that would give them the most 'shocking' result (Cardiology, due to having more options for endpoints), glossed over the interpretation (why such a big difference in the gold standard endpoint that is supposed to be immune to manipulation?), and over-hyped the significance of the result.
Wow! In 25 trials 2 treatments were beneficial and 1 was detrimental. It would seem that the process of finding new treatments is horribly inefficient and perhaps just completely broken given that so few treatments were found to work. I mean weren't most of these treatments found to work on animals first? What's going on???
We humans are biologically quite different from a lot of the species we experiment on. To take a salient example, we've cured cancer (all sorts of cancers) in mice hundreds of times. It's unlikely you'll find a tumor in a mouse we can't fix. But the same things don't work in humans.
And although I feel like a weird contrarian saying this, it's entirely possible that by selecting for treatments that are effective in other species, we may be inadvertently missing out on things that might work in humans.
Keep in mind that the typical life span of a lab mouse is about 2 years. If your criterion for "curing cancer" in humans is based on the cancer not returning for just 2 years after treatment, then we've also "cured cancer" in many humans.
While I agree with you that we are quite different to mice, as far as I know we can't cure cancer in mice. What we can cure is artificial cancers (i.e. those injected into mice), but I know of no study where anyone has tried and succeeded in curing mouse cancer that replicates human cancer (i.e. let the mice age and develop cancer naturally, then treat).
If anyone knows of a published study that cures naturally occurring cancer in mice or rats, please get in contact with me (or post here), as I have been looking for one for years.
> we've cured cancer (all sorts of cancers) in mice hundreds of times. It's unlikely you'll find a tumor in a mouse we can't fix.
Wait, what? Really?
If that's true, I can see only one explanation - with mice models, we can do far more aggressive research, and therefore we cover a bigger chunk of the parameter space. Some of the same research strategies may not be ethical with human models.
Well, there's a whole bunch of other stuff too. Most of the time with animal models, we induce cancer by a variety of means - examples include using breeds with certain cancer-suppressing genes knocked out, heavy exposure to carcinogens, or direct grafting/injection. These are all mechanisms of getting cancer that are extremely atypical in the human beings we are trying to cure, which may (I think it's super likely, especially in the case of knockout genes and grafting) influence cancer-treatment interaction.
Exactly. I know of no mouse study that actually tries to replicate what we want to do in humans.
When it comes to looking for a cure for cancer, we are like the drunk who lost his keys in the dark alley but searches under the street lamp because the light is better. Until we actually start doing the right studies, we will really struggle to get anywhere.
I too am absolutely fascinated by this. There seems to be this behaviour where humans mine a particular area until it's completely exhausted - you can see it in capitalism, in art, in fossil fuel consumption/extraction, in mouse and animal models for drug trials, silicon computer chips, Internet businesses, search engines and everything else.
We never seem to realise while we are on the part of the curve that's accelerating (and things are great) that it's actually a bell curve. There are probably people writing things about this sort of stuff, how we can maybe start to overlap our understanding, maybe even measure what part of the curve we are on? Maybe get governments to invest money into new technologies just at the right points...
> Seems to be this behaviour that humans have to mine a particular area until it's completely exhausted [...] in fossil fuel consumption/extraction [...]
Not actually true. There are lots of examples of mines closing down because exploiting them further is not economical. Some of them restart operations when prices for the dug up commodity rise.
(A Google search for 'mine reopening' easily finds a few examples in the news around the globe.)
One example that goes in the opposite direction is the difficulties researchers have had in coming up with an animal model for studying the risks of tobacco exposure:
http://www.ncbi.nlm.nih.gov/pubmed/17661225
Actually, most of them worked in humans too, since these are stage III trials and drugs only progress to stage III if they have shown some sign of effectiveness in stages I and II. Pharmaceutical companies just don't spend tens of millions of dollars on stage III trials unless they have evidence that the drug works.
For those interested, I did a little meta-study of my own on the distribution of p-values. https://news.ycombinator.com/item?id=10077042 It would be pretty interesting to build a larger p-value database. It could be a first step towards measuring how bad non-registered, statistics-based science really is and allow us to measure the benefits of registration.
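As a sketch of the kind of check such a p-value database would enable (the list of p-values below is invented purely for illustration), one crude test is to compare how many reported values land just below versus just above .05:

```python
# Crude "caliper" check for selective reporting around the .05 threshold.
# The reported_p list is made up for illustration; a real database would
# hold p-values extracted from published trial reports.
reported_p = [0.049, 0.032, 0.051, 0.048, 0.003, 0.047, 0.044, 0.062,
              0.021, 0.049, 0.046, 0.055, 0.012, 0.043, 0.050, 0.038]

just_below = sum(1 for p in reported_p if 0.045 <= p < 0.050)
just_above = sum(1 for p in reported_p if 0.050 <= p < 0.055)

print(f"p in [.045, .050): {just_below}")
print(f"p in [.050, .055): {just_above}")
# With honest reporting the two narrow bins should hold similar counts;
# a large excess just below .05 hints at selective reporting or p-hacking.
```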
[1] http://journals.plos.org/plosone/article?id=10.1371/journal....