Leakage and the reproducibility crisis in ML-based science (princeton.edu)
80 points by randomwalker on July 15, 2022 | 37 comments


this is sort of one of the weird problems that shows up at the intersection between science in the public interest and a market-driven system of production.

pure science that is publicly funded in the public interest would publish all raw data along with re-runnable processing pipelines that will literally reproduce the figures of interest.

but, the funding is often provided by governments with the aim of producing commercializable new technology that can make life better for society.

the problem is that if you do the science in the open, then it can be literally picked off by large incumbents before smaller inventors have a chance to try and spin up commercialization of their life's work.

so we have this system today where science is semi-closed in order to protect the inventors, but sometimes to the detriment of the science itself.


I think you're missing two fatal problems in this "publish all raw data and code" mindset. I don't think the desire for commercialization is high on the list of fatal problems preventing people from publishing data+software.

1) How do you handle research in domains where the data is about people, so that releasing it harms their privacy? Healthcare, web activity, finances. Sure, you can try to anonymize it, but anonymization is imperfect, and even fully anonymized data can be joined to other data sources to re-identify people; k-anonymity only works in a closed ecosystem. If we live in a world where search engine companies don't publish their research because of this constraint, that seems worse than the current system.

2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?


> 1) How do you handle research in domains where the data is about people, so that releasing it harms their privacy?

that's an interesting problem that i have not thought about.

i think maybe that this is not a technical problem, but more an ethical one. under the open data approach, if you want to study humans you probably would need to get express informed consent that indicates that their data will be public and that it could be linked back to them.

> 2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?

one defines it by building a specialized system for the purpose of reproducible research computing. i would envision this as a sort of distributed abstract virtual machine and source code packaging standard where the entire environment that was used to process the data is packaged and shipped with the paper. the success of this system would depend on the designers getting it right such that researchers _wouldn't_ have to worry about weird systems-level kludges like docker. as it would behave as a hermetically sealed virtual machine (or cluster of virtual machines), there would be no concerns about bitrot unless one needed to make changes or build a new image based on an existing one.

the good news is that most data processing and simulation code is pretty well suited to this sort of paradigm. often it just does cpu/gpu computations and file i/o. internet connectivity or outside dependencies are pretty much out of scope.
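a very rough sketch of the packaging piece (nowhere near the hermetic vm idea, and with hypothetical file names and a placeholder hash) could be a machine-checkable manifest shipped with the paper, something like:

    # sketch only: hash-pinned inputs plus a fixed entry point, standing in
    # for the full hermetic environment described above. file names and the
    # expected hash are hypothetical placeholders.
    import hashlib
    import subprocess
    import sys

    MANIFEST = {
        "inputs": {"data/raw_measurements.csv": "sha256:<expected digest>"},
        "entry_point": [sys.executable, "analysis/make_figures.py"],
    }

    def digest(path):
        # hash the file in chunks so large raw datasets don't need to fit in memory
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(1 << 20), b""):
                h.update(chunk)
        return "sha256:" + h.hexdigest()

    def reproduce(manifest):
        # refuse to re-run if the raw data differs from what the authors published
        for path, expected in manifest["inputs"].items():
            if digest(path) != expected:
                raise SystemExit(f"{path} does not match the published version")
        subprocess.run(manifest["entry_point"], check=True)

    if __name__ == "__main__":
        reproduce(MANIFEST)

the real thing would have to pin the whole toolchain, not just the data, but the principle is the same: the paper ships everything needed to re-derive its figures.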

i don't think it's hard... there just hasn't been the will or financial backing to build this out right and therefore it does not exist.


> i think maybe that this is not a technical problem, but more an ethical one. under the open data approach, if you want to study humans you probably would need to get express informed consent that indicates that their data will be public and that it could be linked back to them.

As someone who wants science to advance, I want highly trusted researchers to be able to do studies that involve my private, personal data that I would not consent to being made public and linked back to me.

It is highly important to me that we allow these studies to not use open data.

A great example of this is the US College Scorecard, which uses very private tax returns to measure how much college degrees and majors contribute to income (not the only value of college education, but certainly an important one):

https://collegescorecard.ed.gov/

Only a high degree of trust allowed research on this extremely private information to be published, and I think that makes for a better world. I am pro-open data, but research on non-open data should absolutely exist.

For instance, should any research about mental health for transgender people be abolished? Because anything on that subject is not going to be open, or at least those who would be open to their data being public are probably a non-representative subset.


> get express informed consent that indicates that their data will be public and that it could be linked back to them.

10~20 years ago I could see it. Nowadays it's a tough ask that would severely limit the number of people participating. This could also steer away most minority groups, which would make the research not only limited, but also misleading (we'd still draw conclusions from it, and decide policies accordingly, even though it comes from grossly biased participant pools).

Aside from just the public aspect of having one's data in the open, there are also second- and third-order discoveries that would happen from there (e.g. knowing someone's cooking habits could be enough to deduce overall health status, potentially chronic illness, ethnicity/religion, relationship status, etc.).


It does exist. It's called GNU Guix.


> 2) How does one define "re-runnable processing"? Software rots, dependencies disappear, operating systems become incompatible with software, permission models change. Does every researcher now need a docker expert to publish? Who verifies that something is re-runnable, and how are they paid for it?

This is always a problem even with some of the most open scientific code.

Requiring that the code be published, and perhaps that a peer reviewer run it just once (with a bit of support) to ensure the submitters aren't completely bullshitting before the paper gets approved for publication, might be a good start.


From my experience in the digital health sector, concern for privacy is always the reason given for not sharing anything valuable and/or useful to others. But it's just a convenient way of hiding the 'desire for commercialisation'.


this is also true, and it also runs within science itself. if someone spends two years collecting some data that is very hard to collect and it has a few papers' worth of insights within it, they're going to want to keep that data private until they can get those papers out themselves, lest someone else come along, download their data, and scoop them before they have a chance to see the fruits of their hard labor.

while it's not great for science at large, i don't blame them either.


It's solvable if publishing the dataset counts as a paper, and citations of the dataset (which should be required) count as citations for, e.g., tenure.

For example, ImageNet for machine learning is a very expensive and difficult data set to produce that has resulted in revolutionary advances in machine learning. And people build models on it, cite their results as evidence their models are good, and cite the paper.


This is an interesting idea. Although I am afraid that publishing a dataset, even a good one, will not be considered "real science" by our (broken) institutions.


You have a valid point here. It's probably utopian, but to me the only reasonable answer to this is to acknowledge that science is a collective process. Of course, this goes against the (stupid) idea that some extremely deserving geniuses are the ones that make science...


Even in areas where commercialization isn't super relevant I've known few academics particularly interested in doing all the tedious process crap to make it easy to publish their data and their calculations in a "reproducible pipeline" sort of way.

I think in large part it's a tooling and awareness problem - like, historically, awareness and use of source control have been very low in the space, but for data, even industry tools tend to be less sophisticated than the ones for code.


That's how it has to be. This is a fairly open system in comparison to what America's founders were objecting to, this really closed guild system where you only find out how to actually make the contraption work for real when the master [of the craft, who teaches the other journeymen and apprentices] is e.g. getting death threats on a fluke happenstance and finally telling you...the tricks of the trade. On his deathbed. Other times valuable knowledge was just forgotten. And from that point on, they didn't make them like they used to.

That's how you get "limited protection" of the arts and sciences, and patents that were democratic, i.e. I can file on my own for a provisional patent. It's really hard in cases like the Wright Brothers--and some others, and there are crackpots--but at least you didn't need to lobby to get one like in England, where it was an act of Parliament to get an inventor his patent, like James Watt. [Don't have visibility on how the two systems evolved when they broke apart.]


So what you're saying is, intellectual property rights and the profit potential of excluding certain actors from using certain techniques is the source of our woes.

And there are those who say progress proceeds optimally in this arrangement! Hardly: only "innovation", which is distinct from progress, is optimized. Progress is the act of cleaning up the mess made by ineffective innovations (i.e. "minimum publishable units"), distilling them into a basis for the next paradigm shift.

Progress, everyone's ostensible goal, is hampered by silos and IP hoarding, which multiply the efforts of many who should otherwise be working in tandem by hiding them each in their own little silo, so they all learn the same lessons completely separately.


...also, if a technique appears in a paper, an expert on that technique should be a reviewer and/or a standard rubric should be applied (i think Nature and Science have gotten much more rigorous about this in recent years in the wake of the psychology replication crisis).


Not a bad checklist (“model info sheet”)

But rather than stand-alone, it should be incorporated into publications.

In my experience, only a minority of applied machine learning papers provide even a minority of the info requested by the info sheet.

Meaning, you really have no proper idea how cross-validation was done, what preprocessing was done, etc., in actually published papers.


OP here. I totally agree that ideally authors should report most of this information in the paper itself. One advantage of a standalone document (we suggest putting it in an appendix) is that it's easy for reviewers to check that all of this information has been reported. Of course, authors could answer some of the questions by pointing to the sections of the paper in which they have been answered.


What is "data leakage"? Do the authors define it? They reference Kaufman et al, but that makes it sound just like "errors". But what are the errors?


When you evaluate an ML approach, you should use one part of the data to train your model and a completely separate part to evaluate it. Otherwise, your model can just memorize parts of the data (or overfit in some other way), resulting in artificially high performance. Data leakage is when there is a problem in this separation and you somehow use information about the evaluation dataset in the model training process. The table in the article lists various examples. The simplest would be to just not have a separate evaluation set. A more subtle one is if you normalize your input data based on both the training and evaluation sets; this way the normalization will be better suited to the evaluation set than it would be if you had no knowledge of it, resulting in artificially high performance.
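To make the normalization example concrete, here is a minimal sketch using scikit-learn and made-up data (nothing from the article itself):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import train_test_split
    from sklearn.preprocessing import StandardScaler

    # synthetic data, purely for illustration
    rng = np.random.default_rng(0)
    X = rng.normal(size=(1000, 20))
    y = (X[:, 0] + 0.5 * rng.normal(size=1000) > 0).astype(int)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # leaky: the scaler is fit on all the data, so statistics from the
    # evaluation rows influence how the training data is transformed
    leaky_scaler = StandardScaler().fit(X)
    leaky_model = LogisticRegression().fit(leaky_scaler.transform(X_train), y_train)

    # correct: fit the preprocessing on the training split only,
    # then apply it unchanged to the evaluation split
    scaler = StandardScaler().fit(X_train)
    model = LogisticRegression().fit(scaler.transform(X_train), y_train)
    print(model.score(scaler.transform(X_test), y_test))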


Great, now could you please email the authors and explain to them how to explain "leakage" in their draft paper, so the rest of us can read it also?


I'm reading the draft paper linked from the article and "data leakage" is used multiple times without any attempt at defining it.

Oh well, I guess this is only meant for insiders who understand the in-group jargon. The rest of us need not be interested at all.

Edit: for context, here is the paragraph titled "Leakage" from the draft paper:

Leakage. Data leakage has long been recognized as a leading cause of errors in ML applications (Nisbet et al., 2009). In formative work on leakage, Kaufman et al. (2012) provide an overview of different types of errors and give several recommendations for mitigating these errors. Since this paper was published, the ML community has investigated leakage in several engineering applications and modeling competitions (Fraser, 2016; Ghani et al., 2020; Becker, 2018; Brownlee, 2016; Collins-Thompson). However, leakage occurring in ML-based science has not been comprehensively investigated. As a result, mitigations for data leakage in scientific applications of ML remain understudied.

https://arxiv.org/pdf/2207.07048.pdf

Yes, alright. But what the flying fuck is "leakage"? Am I supposed to go read the "formative work on leakage" cited? What if that work also leaves it undefined and points to an earlier source? What the hell is this paper about? How hard is it to explain what your main subject is, so I can read your paper knowing what you're talking about?

How frustrating.


Looks like you only read the first page of the draft.

There is a whole one-and-a-half-page section (2.4. Towards a solution: A taxonomy of data leakage) that describes every kind of leakage the authors considered for this work.


Yes, of course I only read the beginning of the draft paper, because it was very frustrating to read and I had no particularly strong reason to keep on reading it.

This is what will happen to most people who read the draft paper. They will start with the abstract, feel confused, get annoyed, and leave before they get to the "taxonomy".

In fact I did glance at the "taxonomy" as I was scanning through the paper dejectedly in the last few seconds I bothered with it, and it still didn't seem to explain what the different kinds of data leakage are different kinds of. That's basically when I stopped reading.

And this is why you don't write papers like that, because it makes them less likely to be read. And because, as an author in a field where people put tens of thousands of preprints on the internet, you want to make absolutely sure that your paper is as likely to be read as possible, by as many people as possible.

Basically, if you want people to read your papers, the absolute worst thing you can do is to expect your readers to "keep reading to see what we mean". Most people will give up and reviewers will be so annoyed you're not respecting their time that they'll skim your paper looking for reasons to skewer it.

The next worst thing is to expect your readers to try and guess what you mean. I could very easily make an educated guess at what "data leakage" means, but then I would have to read the paper while never being sure I know what exactly I'm reading and whether I have it subtly wrong. Again, you don't want to cause that kind of stress to your readers. You want everyone who reads your paper to feel happy and calm and admire your bright, squeaky clean ideas.

So make things as clear as possible, as hassle-free as possible, as friction-free as possible. Otherwise, you're kicking yourself in the butt.

That's my free advice.


There is even tech that claims to solve the train-test split “under the hood”. You are also surprised by the small number of data points some of these ML people think is sufficient, far from what you learn in basic statistics classes.

Not providing an accurate way of reproducing something claimed in a paper means that the paper is invalid.


I think the most significant scientific result in which machine learning is playing a major role is computational protein folding, and that at least doesn't seem to have these reproducibility problems:

https://alphafold.ebi.ac.uk/

It's a very well-defined problem and the datasets it uses are very well-characterized (the protein crystallography database, maybe some NMR structures as well), so perhaps that helps.


Recently, I saw that people were tagging their input records (test records in git repos) specifically so that later data loaders would reject those records in appropriate conditions. I forget what the tech was called but it was interesting.


Some people are just adding a certain well-known string to their data so that it will not be used: https://news.ycombinator.com/item?id=30927569.
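The consuming side presumably looks something like this sketch; the marker value below is a placeholder for illustration, not the actual well-known string:

    # hypothetical placeholder, not the real canary string
    CANARY_MARKER = "DO-NOT-TRAIN-CANARY"

    def load_training_records(lines):
        """Yield records for training, dropping anything tagged with the canary."""
        for line in lines:
            if CANARY_MARKER in line:
                continue  # tagged as benchmark/test data; keep it out of training sets
            yield line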


How about, instead of calling it "data leakage", referring to it as the "Clever Hans effect"?


A useful sounding of the alarm about the expertise crisis, but:

"we advocate for a standard where bugs and other errors in data analysis that change or challenge a paper's findings constitute irreproducibility."

Please no. One of the biggest problems I face when talking to people about bad science in any context is the belief that "peer reviewed and reproducible" is a synonym for correct. The phrase the authors are looking for here is not irreproducible, but rather something like: flawed, incorrect, biased, pseudo-scientific, or even intellectually fraudulent. The danger of trying to redefine the word reproducible to mean more than "do people get the same results" is threefold:

1. Researchers will reject claims their work is not reproducible by saying "actually they didn't do the same things we did so of course they didn't get the same results" and that will be a convincing rebuttal to outsiders who don't dig into the details.

2. It would further undermine trust in academic research. Way too frequently, I read a paper that makes an interesting claim, only to discover that their paper or maybe entire field has redefined common words in ways that make the claims misleading.

3. It doubles down on the unhelpful and probably counter-productive "reproducibility crisis" framing.

Why unhelpful? Well, there isn't really a reproducibility crisis. What we have here is actually an intellectual fraud crisis. After reading a ton of papers from outside CS in the past few years, it became impossible to avoid the uncomfortable conclusion that in many fields the majority of observable errors are not really errors at all, but are actually deliberate. Or at least, they are deliberately fooling themselves, which amounts to the same thing.

To highlight just a few examples of mistakes where you think, how can nobody have noticed this:

• (from the linked paper) "a recent study included the use of anti-hypertensive drugs as a feature for predicting hypertension. Such a feature could lead to leakage because the model would not have access to this information when predicting the health outcome for a new patient. Further, if the fact that a patient uses anti-hypertensive drugs is already known at prediction time, the prediction of hypertension becomes a trivial task"

• A widely used ML model from social science that claimed to predict whether a Twitter account is a bot was put online and found to have a false-positive rate of over 50%. When this was pointed out by third parties, the response was to claim the testers were "academic trolls". Nothing was ever retracted and the model continued to be used in new papers across the field.

• A COVID modelling paper blithely computed that the average Brit lives with 7 people. This was obviously wrong in absolute terms and a nonsensical design to begin with (that should be an input taken from census data, not an output), and the peer reviewer even noticed this but approved the paper anyway.

• The Ferguson Report 9 model, which directly led to lockdowns in the UK and other countries, was full of computational bugs like typos in its custom PRNG, out-of-bounds memory reads, bugs in a custom Fisher-Yates shuffle, thread-safety errors, and more. Nobody in the academic world appeared to care about this.

The paper authors suggest making researchers fill out more paperwork to get published. I find it impossible to believe that this will work. Mandatory signed data sharing statements didn't work: there was a study posted on HN in the past few months in which someone tested this to see if the data was genuinely made available on request and something like >90% of scientists refused (in epidemiology I think). Similar results were found in other fields. In this light the mass adoption of ever more opaque and buggy statistical/computational techniques is not merely an accidental drift, correctable with a minor bit of bureaucratic oversight. These techniques seem to be popular exactly because they grant so many angles of freedom to get away with scientific murder.


Machine learning isn't really science, since it's only statistical methods. It doesn't provide insight into what intelligence is. It's only techniques, so it's just engineering. It's brute force hacking at best, and when it sort of works, it's impossible to figure out why it does because it's black boxes all the way down.

So of course there are cool things like GPT, but it's not like it's scientific progress. It doesn't really help us understand how brains work, or what general intelligence really is.


It's possible you may have misunderstood the title of the post. It isn't about the science of ML, or GPT-3, or brains. Rather, it's about using ML as a tool to do actual science, like medicine or political science or chemistry or whatnot. The first sentence of the post explains this.


Machine learning is not about getting insight into what intelligence is (it might do so as a byproduct but very few people are using it with that goal in mind).

However, ML is useful to science in general as long as you are aware of its shortcomings and are not just replacing something with ML without thinking about it.

To give you an example I worked on (to be published): I worked with some physicists who use an incredibly slow and expensive iterative solver to get information on particles. We introduced a machine learning algorithm that predicts the end result. It does not replace the solver (you could not trust its results, contrary to a physics-based numerical algorithm) but, using its guess as a starting point for the iterative solver, you can make the overall solving process orders of magnitude faster.
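A rough sketch of that warm-starting scheme, with a toy stand-in for the solver and made-up names (not the actual published code):

    import numpy as np
    from sklearn.ensemble import RandomForestRegressor

    def slow_solver(params, x0, tol=1e-10, max_iter=10_000):
        # stand-in for the expensive physics-based iterative solver; it converges
        # to the true answer from any starting point, just faster from a good one
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            x_new = 0.5 * (x + params / np.maximum(x, 1e-12))  # toy fixed-point step
            if np.max(np.abs(x_new - x)) < tol:
                break
            x = x_new
        return x

    # offline: train the surrogate on (parameters, converged solution) pairs
    params_train = np.random.rand(500, 3) + 0.5
    solutions = np.array([slow_solver(p, np.ones(3)) for p in params_train])
    surrogate = RandomForestRegressor(n_estimators=100).fit(params_train, solutions)

    # online: the ML guess only seeds the solver; the trusted iteration still
    # produces the final answer, so its guarantees are preserved
    new_params = np.random.rand(3) + 0.5
    x0 = surrogate.predict(new_params.reshape(1, -1))[0]
    solution = slow_solver(new_params, x0)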


> It does not replace the solver (you could not trust its results, contrary to a physics-based numerical algorithm)

And I guess the outcome variable in the train set for the ML model was produced by the solver?


By an unmodified solver, yes (we did have a test/train split in case people are wondering after having read the above article).


Statistics are the backbone of many natural sciences.

It is also valid to make scientific progress just inside of a field and not in the grand scheme of things.


I feel that the deterministic computing theologists are going to be in for a rude awakening over time. Computing need not be perfect to work, and the thing about recent advances in ML is that they scale extremely well.



