I worked in a cancer lab as an undergrad. New PI had just come over to the school. He had us working on a protocol he'd developed in his last lab to culture a specific type of glandular tissue in a specific way.
I and two other students spent months trying to recreate his results, at his behest. When we couldn't get it to work, he'd shrug and say some shit like "keep plugging away at the variables," or "you didn't pipette it right." I don't even know what the fuck the second one means.
Then an experienced grad student joined the lab. He spent, I don't know, a week at it? and was successfully reproducing the culture.
I still don't know how he did it. I just know that he wasn't carrying a magic wand, and the PI certainly wasn't perpetrating a fraud against himself. It was just, I guess, experience and skill.
A novice was trying to fix a broken Lisp machine by turning the power off and on.
Knight, seeing what the student was doing, spoke sternly: "You cannot fix a machine by just power-cycling it with no understanding of what is going wrong." Knight turned the machine off and on. The machine worked.
I have a similar story involving myself and my sister. She was power cycling our family computer over and over, becoming more and more frustrated as it kept going into its 'safe mode' equivalent. I allowed it to fully power on and then rebooted it, and it was 'fixed'.
When this situation occurs, I always pause to pet the computer and say something nice to it, so as to make the other person believe the server is listening.
Lab technique is definitely a thing. Mostly it's about knowing what not to pay attention to so that you can give extra attention to the things that matter. It's also about recognizing the intermediate signals that say "Yep, still good." (e.g., are the bubbles the right size? Is the foam the right color? Does it smell right/wrong? Did a layer of glop form at the correct step?)
However, I rarely see anyone taking good enough notes to reproduce an experiment. I had a Physics Lab taught by a professor who would randomly take a student's notebook for an assignment and reproduce the experiment using their notes--and fail them on that assignment if he couldn't.
Very few students ever passed that. You really have to be excessively meticulous. However, when you are debugging a real experiment without a known correct answer, if you aren't that meticulous you will never find your own bugs.
I managed to establish that flask headspace oxygen concentration played a critical role in synthesizing silver nanocubes after spending a year trying to reproduce a paper which didn't mention it at all.
Emailing the original paper's authors, I managed to extract the comment "we place the lid lightly on top of the vial" (which only stood out in a photo as something that didn't make sense), and then filmed a bunch of my vials and noted that whether it worked depended on how the evaporating/condensing water sealed against the loose-fitting lid.
(This was after a lot of effort spent drying ethylene glycol, too.)
Turns out, you can get a perfect result every time if you drop the reaction vial's headspace oxygen to 17-19%.
People just don't monitor what they should (or they accept a 1-in-10 success rate, since for most nanoparticulate research one vial goes a long way).
I'm still really proud of that but damn does it explain replication problems to me.
I have a textbook of cookery [1] written for students intending to become professional cooks; it emphasizes precision measurements and repeatable processes. Where it expects adjustments, it explains what they mean and what constitutes "it looks right". It stands head and shoulders above pretty much all other cookery books I have read. But of course it lacks glossy pictures and is not backed up by a video of Nigella Lawson in a low-cut dress (not that I'm objecting to Nigella and her cleavage).
[1] Professional Cookery: The Process Approach by Daniel R. Stevenson
Is it an actual "reproduction" if the specific experimental steps cannot be reproduced? I think what people do not understand is that there is a difference between "result" and "process". The existence of one is not a validation of the other. As programmers, we know this: there are many ways of implementing the same algorithm.
So, it really depends on what is being "reproduced". Is it the outcome? Or is it the process? If it is just the outcome, then the original authors could merely store an example of it as the only proof necessary. But, my intuition says that it is much more often the process that is needing to be reproduced/proven.
Therefore, "lab technique" should have nothing to do with a proof of process. Either it can or it cannot be done as it has been defined.
> Either it can or it cannot be done as it has been defined.
You can't define everything precisely--and inaccuracy may not lead to failure--just suboptimal results.
My favorite example of this was a semiconductor fabrication lab in college. There were 4 or 5 masks, each with the step "Align mask on lithography machine using alignment marks on silicon and expose photoresist".
Okaaaaay. No big deal, you're putting lines and crosses in the center of other lines and crosses visually through a stereo microscope. Humans do that well.
Erm, that lithography machine is possibly older than the professor and likely hasn't been maintained in about the same timeframe.
So, the vernier screws have an enormous amount of backlash. If you've used shitty guitar tuners, you know how to deal with this. If you haven't, you spend a lot of time being frustrated figuring out what backlash is and how to deal with it. Okay, that's lab technique.
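For the software-minded, the trick is the same one you would code for any motorized stage: always take up the slack from the same direction. Here's a minimal sketch, assuming a purely hypothetical `stage` object with `position()` and `move_to()` methods (not any real lab's API):

```python
# A sketch of backlash compensation: always approach the target from the
# same side, overshooting first if the move would otherwise come from the
# "wrong" direction. The `stage` interface is hypothetical.
def move_with_backlash_compensation(stage, target, overshoot=0.5):
    """Approach `target` from below only, so the screw's slack is always
    taken up in the same direction."""
    if stage.position() > target:
        # Coming from above: overshoot past the target first...
        stage.move_to(target - overshoot)
    # ...then make the final approach from below.
    stage.move_to(target)
```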
However, even if you do know about backlash, this wastes time on a piece of equipment which is already time-constrained. If you actually know something about semiconductors and think a bit about the process, you realize that Mask 2 is the crucial alignment because it defines your critical dimensions and screwing it up will haunt you, while you can be a little more cavalier about the other masks. So, you spend way more time fighting the machine on Mask 2 to get it right (because it's important) and you spend less time on the rest. That's also lab technique.
Finally, if you're paranoid, you test the measurement equipment and calibrate it every single time you enter the lab. You note the serial numbers of the broken ones and only use the ones that reliably give correct measurements.
Your prize for this level of lab technique is that your transistor gives gorgeous graphs and works exactly like the lab claims it should while everybody else gets mushy results, at best.
I had the same kind of experience making holograms in the physics lab. My transmission hologram was a blurry, faded mess. I suspected old chemicals, but to be honest we had no idea why. My lab partner and I tried about three times, with three total failures.
The reflection hologram turned out to be the best one the lab supervisor had seen in a decade. I have no idea why it worked at all, let alone why it turned out so good!
Both processes are simple conceptually, but incredibly finicky in practice. The interference fringes can be ruined by wind outside the lab shaking the walls, which then shakes the optical table through the floor. You can't feel it, but the hologram is ruined. The laser could have a short coherence length, and you wouldn't know it. If it warms up too fast, it'll shift the fringes. Don't cough. Don't bump the table. Don't bang the door. So on, and so forth.
It turns out for most complex analytical procedures there are a million more ways to do something wrong than right. That's why as much as I like the idea of publishing failures along with successes, it just doesn't make sense. However, as you mention, that doesn't mean that something is irreproducible... get the right grad student to show you how (hopefully they can actually comment on your procedure and help, but pharma/bio-chem is notoriously touchy). Like cooking, good experimental method and helpful documentation (recipes) take both temperament and training.
"Electron Band Structure In Germanium, My Ass", is a great example of this... I'm pretty sure they had a bad contact (using indium solder?) when they ran the experiment. When you're getting garbage like that... stop and see if you can reproduce a single result, and until you can, something about your method/equipment/model sucks.
> It turns out for most complex analytical procedures there are a million more ways to do something wrong than right. That's why as much as I like the idea of publishing failures along with successes, it just doesn't make sense. However, as you mention, that doesn't mean that something is irreproducible... get the right grad student to show you how (hopefully they can actually comment on your procedure and help, but pharma/bio-chem is notoriously touchy). Like cooking, good experimental method and helpful documentation (recipes) take both temperament and training.
This sounds like a complicated way of saying the procedures for some scientific research are underspecified.
Well, yes... the problem is that if you're not reasonably proficient and observant then entire books' worth of specification will be insufficient. Intel deals with this in their fabs with Copy Exactly! Not just the equipment and recipes, but the floor wax, the paint, the roof, soda machines, cafeteria... everything except the people, who don't go inside except to fix stuff.
So, if you're in software enjoy the fact that you have the wonder of a compiler, an OS, and a CPU you can (mostly) trust. The rest of the physical world dreams of that level of control and repeatability for anything like the tiny incremental cost. Other than math and theoretical physics that's as close as you'll get to fully specified.
That's why I've immensely enjoyed the content on the NileRed YouTube channel. I don't even study chemistry, I mostly have a physics background!
The way the host shows mistakes, troubleshooting processes, recovery attempts, and even the outright failures is so refreshing after endless canned presentations focused on the final success, ignoring years of trial and error.
Often, it's the trial and error part that's actually important, the end result not so much. The journey, not the destination, if you know what I mean.
PS: I actually did the semiconductor conductivity experiment at University, and it was a blast! It was supposed to be a 12-hour experiment, most of which was collecting the data by hand and then analysing the curve. Fitting two nonlinear functions using fancy graph paper sounded awfully dull. Two hours of playing with liquid nitrogen followed by ten hours of paperwork? No thanks! Like the guy complaining in that blog, I swiped the (at the time) fancy digital oscilloscope from the senior lab, output the data to a floppy, and did a nonlinear curve fit in Mathematica. Because it could fit both regimes at once, I got about one digit more accuracy than the reference text provided. The professor gave me 6 out of 5 for my score, and I finished the 12-hour lab in 4 hours. It was my first real-life lesson in the power of computers to solve physical problems.
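For the curious, here is roughly what that single-pass, two-regime fit can look like today, sketched in Python/scipy rather than Mathematica. The two-term conductance model (a roughly constant extrinsic term plus a thermally activated intrinsic term) and the synthetic "measurements" are illustrative assumptions, not the original lab's exact model or data.

```python
import numpy as np
from scipy.optimize import curve_fit

K_B = 8.617e-5  # Boltzmann constant, eV/K

def conductance(T, g_ext, a, e_gap):
    """Extrinsic (roughly constant) plus intrinsic (thermally activated) term."""
    return g_ext + a * np.exp(-e_gap / (2.0 * K_B * T))

# Synthetic data standing in for the oscilloscope dump.
T = np.linspace(80.0, 400.0, 60)
rng = np.random.default_rng(0)
g_true = conductance(T, 1e-4, 5.0, 0.67)   # 0.67 eV is roughly germanium's gap
g_meas = g_true * (1.0 + 0.02 * rng.standard_normal(T.size))

# Fit both regimes at once instead of two straight lines on log paper.
popt, pcov = curve_fit(conductance, T, g_meas, p0=[1e-4, 1.0, 0.5])
print(f"fitted band gap: {popt[2]:.3f} eV (+/- {np.sqrt(pcov[2, 2]):.3f})")
```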
I had a protocol fail repeatedly, and expensively, because of a combination of a contaminated reagent and a tiny amount of sample. Took me ages to even approach being able to debug that one as the protocol worked for everyone else. Turned out everyone else using the same materials had so much input sample that the contamination didn't matter, or used the non-contaminated part of the output.
Almost to the day I worked out the issue, I saw an immensely more experienced lab person running an almost identical experiment to my setup. They always added another reagent "in case there's contamination".
A similar thing sometimes happens in the deep learning field. Just a couple of weeks ago I helped a coworker reproduce a result from a paper. He had been debugging it for several days. It took me about 5 minutes to make a correct guess that his batch size was too small for batchnorm to be effective. He trained the model on 8 GPUs, using a decent batch size, but he didn't realize that the default batchnorm op in PyTorch doesn't synchronize batch statistics across GPUs, so the effective batch size was in fact 8x smaller. How was I able to see it so quickly? Experience.
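For anyone hitting the same wall: a minimal sketch of the fix, assuming a standard PyTorch DistributedDataParallel setup, is to convert the model's BatchNorm layers to SyncBatchNorm so statistics are aggregated across GPUs instead of being computed per device.

```python
import torch
import torch.nn as nn

model = nn.Sequential(            # stand-in for the real network
    nn.Conv2d(3, 64, 3, padding=1),
    nn.BatchNorm2d(64),
    nn.ReLU(),
)

# Replace every BatchNorm*d layer with its synchronized counterpart.
model = nn.SyncBatchNorm.convert_sync_batchnorm(model)

# Then wrap in DDP as usual (one process per GPU); SyncBatchNorm only has an
# effect inside an initialized distributed process group.
# model = nn.parallel.DistributedDataParallel(model.cuda(), device_ids=[local_rank])
```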
I think this type of non-rational intuition shows a combination of intelligence and experience. “It was obvious” gets said by a smart person for things that are opaque to someone that lacks a correctly functioning intuition.
I'd say it's a product of unconscious competence. "I can't articulate why, but this bit here is the problem." I do maintain, though, that if you can't explain your reasoning for something then you don't understand it. Doesn't mean you're wrong, though!
But, this is only one kind of reproduction ineptitude. The process description/definition was not explicit enough. Either a process can or it cannot be reproduced as specified. There is no middle ground there. This is literally what a proof is. A sequence of reproducible steps.
You made an important point in noting that causal factors other than fraud can be behind science's reproduction problems. But it reads as though this was meant to suggest that fraud is not as serious a problem in the reproduction crisis as it actually is.
If more emphasis is placed in the scientific community on the necessity for proofs, then fraud would stand out like a sore thumb. Until then, it can easily be masked within the lax atmosphere of (a lack of) explicit reproduction specifications.
Typically you don't take a single result at face value; you hope to see validation in the literature from another source, or you set up your own assay and see for yourself. If you were to ask my advisor whether the sky is blue, he would respond that he doesn't believe in anything, and that is a pretty reliable worldview to hold as a scientist. I've had to sometimes step up concentrations quite a bit or veer way off protocol to reproduce a result, and even then it won't look as pretty, but I don't blame the authors for things like that. Maybe their undergrad mixed a higher concentration than instructed that day. The results that get published are the prettiest pictures; your very ugly initial result before optimization is only shown in lab meetings.
It does get laid bare eventually. I've heard of a well known PI being investigated by their funding agencies due to a post doc doctoring a western blot result. I heard a story about a grad student who tried to repeat a recently graduated student's protocol. They tried and tried for years without much success, initially blaming their own ignorance given they were a first-year when they started their efforts on the experiment. Eventually, after going back and forth with the former student, the PI, and collaborators, it was revealed the original protocol was outright wrong. In fact, more digging was done and that former student's thesis research turned out to be fraudulent. The former student lost their degree and their current job, and the current student wasted two years of their life trying to make something impossible work. It's awful for everyone involved when fraud happens, whether intentionally or not. Fraud does tend to shake itself out, but it can take painful time.
I had the same experience in physics as an undergrad. Our lab got some early graphene transistors; they were meant to work around the 5 kelvin mark. No matter how slowly I cooled them or how I set things up, I couldn't get them going. The PhD student had no issues with the exact same transistors, cryostat, electronics, etc.
I had a similar experience as a beginning grad student. A postdoc in the lab had developed a radioligand binding assay, which we were using to purify a small protein. And I could just not get it to work, no matter how careful I was.
TL;DR - I was being too careful, and so too slow. Because the binding was transient. So once I focused on speed instead of precision, the assay worked fine.
I distinctly recall a situation where speed made a huge difference. I took an undergraduate lab at UCSC where you were given raw bacterial cells and had to use the original literature to purify restriction endonucleases with it. The course ran for a quarter, and on the first day the professor said: "Most of you are going to spend a few days planning, then running your purifications for a few weeks. If I were going to do this, I'd spend a couple weeks planning and then do the whole experiment in a day, because the moment you break the cells open, all the proteases will start eating up your restriction endonucleases and you will lose activity in a couple days."
I've also found a few situations where I couldn't get something to work and it turned out the person who had it "working" wasn't running controls (positive or negative), so you didn't really know if it worked.
It was a matter of ejecting the pipette contents at the right moment while removing the pipette from the medium at the correct speed (I exaggerated my ignorance a little bit for rhetorical effect.) But there was no way to verbally teach the right "feel" for it - you just had to do it over and over until you learnt how pipetting "right" felt. You could (and I did) observe my PI do it the "right" way, but you can only get so close to hitting a baseball by watching a pro - at some point, you pick up the bat and learn by swinging.
As an outsider to Academia, this looks like a long array of excuses and saddens me. A huge number of hours are wasted every year by undergrads fighting with incomplete papers when we could do better. We could enforce higher standards. Everything in a paper should be explainable and reproducible. Have a look at efforts by PapersWithCode, Arxiv Sanity, colah's blog, or 3blue1brown in either curating content or explaining concepts. I couldn't find a single excuse in this blog post for which we can't come up with a solution, if we have the consensus to enforce it.
Enforcing higher standards is way more difficult than you think, even impractical most of the time. Having consensus and enforcing it is not trivial. Requiring from every PhD student the level of quality of 3blue1brown is extremely naive.
Your comment is the equivalent of saying there should be no armies and instead we should be peaceful and love each other. Great idea. And those excuses of why I cannot connect A to B? Let's just reach consensus to use the same connector everywhere!
I'd like to live in that world, but unfortunately it is not the one we got.
You're making a strawman here. What I want is for prestigious journals/conferences not to accept papers unless they come with the code, dataset and weights required to replicate the result, and for every formula to come with an explanation of its terms. If the terms are not widely used, a more thorough explanation should be requested by reviewers.
It's not that difficult. There's already a Distill Prize for Clarity in Machine Learning. A great spark would be if a company like DeepMind or OpenAI would enforce standards like these internally and host a conference that rewards papers at that standard. It would be a great PR move, at great benefit for humanity.
There are many good papers that would not fulfill your requirements. I agree that ideally they would go through several revisions until every detail is fixed, but rejection levels would skyrocket. We would not have many great papers.
I am not too familiar with the AI field, but I know that at least in my field the quality is very often too poor. We have to try to improve it. But if I rejected every paper I review because not every detail needed to replicate it is there, I would have rejected 99% of them. And some of the ones that eventually got accepted were very valid and very useful papers, and replication turned out not to be so important after all.
Publishing the code isn’t a panacea. Someone still has to find the bug. It hopefully would make it easier to come to a consensus faster when there is a dispute. In practice I think ultimately there is little glamour and fruitfulness for exploration. What might be impactful is setting up prizes for papers to be disproven or corrected. Ultimately though we can produce, as a species, far more information than we could ever hope to verify.
I agree to a small extent, but this is very field dependent.
Social sciences, where you're taking a bunch of data and running a Python, R, or Julia script against it? Sure. Require the PI to archive the data (or anonymized version where PII is an issue) along with the script and a quick description of the method.
When you cross over to more STEM oriented fields, tacit knowledge (e.g., the 'experience' or stuff you pick up from your colleagues) becomes much more important. Reproducibility is still important, but you can only expect the PI to provide so much knowledge before it becomes untenable.
People do run experiments in social science too. There's just as much, if not more, tacit knowledge required to reproduce those experiments correctly (even after accounting for differential participant populations).
I think there is one good reason: eventually the student (or postdoc) is supposed to come up with a new process (trying to implement a theoretical idea/possibility), and this is when they need all the experience they gained when re-establishing a 'known' process (like riding a bike with training wheels).
If they don't have the experience, then it is too much. A metaphor would be: if you make summit trails too easy, none of these mountaineers will be able to successfully challenge a new path.
But... but... there is ample room to literally experiment, change variables, optimize methods, eliminate smaller problems, and more importantly to test other hypotheses.
It should not depend on mindless/random trial-and-error guesswork and faith-based replication.
Yes, it's hard to write everything down well. Look at the recent DARPA parallel replication study. But it gives very high quality research/science for a little overhead.
Large parts of this post are naked apologism for grave aberrations in computer science research, the equivalent to code smells that have calcified and are now "how we've always done things" because nobody has the courage to do anything to fix them, or even knows how. That these are addressed to new PhDs as "this is the real world and you must get used to it" is just tragic. The surest way for all this garbage to remain the way it is, is for new PhDs to accept it as it is and not try to do anything to change it.
My comment is not asking anything of anyone. It's pointing out that trying to convince new PhDs, who might see the distortions in their field for what they are, that it's all in their mind and everything is OK, is only perpetuating the awful situation.
Anyway, it's more complicated than you make it sound. For instance, if you don't like the culture at Google, you have a choice not to work for Google. And if you don't like the culture in some field of research, you also have the option not to join that field of research. For sure, you don't know _exactly_ what you're getting into before you get into academia, but students who are interested in doing a PhD already have some experience - often substantial experience - of how things go, and even of the behaviour of the professors they choose as their advisers, with whom they may have done a Master's project etc. Some have published papers, presented posters and attended conferences already during their Master's or undergraduate degree. And so on. Very few people just stumble into a PhD by mistake.
I did a chemE grad program and my matlab coding was horrendous, my versioning and organization were also bad. I’d had one intro matlab class and an intro Mathematica class and then zero coding classes. For the longest time I thought if only I’d taken more programming classes I would be able to do better.
Wiser now, I know I was wrong about the classes. As this article shows academic code experience does not necessarily mean anything in terms of code smells. That author sounds like how I used to code. I guess some never realized there are better ways, that their code doesn’t have to be so Byzantine and hacky.
Yes, that bit about horrible code that caused such distress to an MSc student was particularly bad. Personally, I wouldn't be caught dead releasing that kind of code. It's not like there's a deadline for releasing code, most venues don't require it and it usually takes months for the review process to complete at conferences (years even in journals). There's nothing stopping one from working on the code at a leisurely pace once a paper is submitted, then releasing the code whenever it's fit for public presentation. There is no social contract that says you have to publish exactly the code that was used to produce certain results, just show that there is some code that does what your paper says it should. And if a precise record of what was done when is needed- well, that's called "git".
But I guess part of the problems with "publish or perish" is that there is very little incentive to keep working on code after a paper is submitted.
Edit: So, I spent a few years in the industry before starting my PhD so code quality was important to me especially because my first proper job was in a shop that did code reviews and all. How did you get to appreciate the need for good coding standards? And how did you learn how to do it right? If you don't mind me prying! :)
Because my code and project structures were causing me to do a lot of extra work, especially when I had to refer to older analyses I'd been doing. I knew there had to be a better way, so I developed a habit of paying attention to how the job was done, not just getting the job done.
Switching to Python and seeing enough python code helped too, but I think other widely used languages would have been about the same like JavaScript. Seeing a lot more code than just googled matlab snippets was a big part of it.
Anyway essentially it had become increasingly apparent that “there’s got to be a better way!”
There's a rumor going around that a lot of published economic results from the 1990s are invalid because of an issue in an old version of STATA (or a STATA plugin, idk).
Whether it's true or not, I don't know, but it wouldn't surprise me.
It's a form of technical debt for the research culture. It's also a prisoner's dilemma: putting in the extra effort to be different just puts you at a competitive disadvantage b/c the system doesn't offer any rewards for going beyond the bare minimum. (Heck, you open yourself up to liability by publishing your code -- people misuse/misunderstand it and blame the poor results on you, people expect you to support it, people find bugs and it reflects poorly on you...) The system selects for people who either don't see the issue or are not bothered enough by it to insist on behaving differently.
It's not just a computer science problem btw -- it encompasses physics too.
Coupled with a lot of academics viewing coding as an unfortunate requirement of their craft. I can probably count on one hand the folks I've worked with who know what a linter even is, or who keep even minimally abreast of new versions of the language they work in every day.
Not even a requirement -- some of the elderly scientists I've worked with get by on spreadsheets! I did a tutorial on Git for my lab group in grad school -- there was some code under SVN at the time, but most folks didn't really use version control at all. (One nerve-wracking instance, I had to revert some months-old changes to our instrument control code, with only my memory to go on, because a bug I'd introduced was holding up someone else's experiment.)
Most people in my field use either IDL or Python (if they're doing data analysis) or C/Fortran (for heavy-duty simulations). Some also know Matlab or Mathematica. I've been trying to convert people to Julia -- it's gonna supplant Python for data analysis and maybe even C/Fortran for simulations.
I imagine this is all due to the more-than-obvious pyramid-esque scheme of academia: offer 12 professorship positions to 200 PhD students. I mean, the "smartest people in the world" are apparently the ones falling for this scam. It is just a breeding ground for brown-nosing, politics, and even straight corruption. It is laughable but it is also doing a massive disservice to the potential of human civilization.
Not GP, but everything I've heard of PhD programs is that you're basically destined for academia. From my (admittedly ignorant!) perspective, it seems like only 10% of PhD students don't desire professorships.
In the life sciences from the people I know in the field, probably 20% of PhDs are considering academic positions. You can do a lot in the private sector with a PhD even well outside of your field, especially if you manage to pick up some data science skills along the way.
To be fair, in my field (deep learning, computer vision), the papers often do not contain enough information to reproduce the results. To take a recent example, Google's EfficientDet paper did not contain enough detail to be able to implement BiFPN, so nobody could replicate their results until official implementation was released. And even then, to the best of my knowledge, nobody has been able to train the models to the same accuracy in PyTorch - the results matching Google's merely port the TensorFlow weights.
Much of the recent "efficient" DL work is like that. Efficient models are notoriously difficult to train, and all manner of secret sauce is simply not mentioned, and without it you won't get the same result. At higher levels of precision, a single percentage point of a metric can mean 10% increase in error rate, so this is not negligible.
To the authors' credit though, a lot of this work does get released in full source code form, so even if you can't achieve the same result on your own hardware, you can at least test the results using the provided weights, and see that they _are_ in fact achievable.
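As a sketch of that kind of sanity check with released weights (the `build_model()` factory, `val_loader`, and checkpoint filename below are placeholders, not any particular paper's API):

```python
import torch

model = build_model()                                  # placeholder: however the repo builds its network
state = torch.load("released_weights.pth", map_location="cpu")
model.load_state_dict(state)
model.eval()

correct = total = 0
with torch.no_grad():
    for images, labels in val_loader:                  # placeholder: your copy of the validation set
        preds = model(images).argmax(dim=1)
        correct += (preds == labels).sum().item()
        total += labels.numel()
print(f"accuracy with the authors' weights: {correct / total:.4f}")
```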
This is why science needs to get rid of "papers" and make everything a public repository of information. GitHub, etc. could be used to file "Issues" (like your anecdote) to ensure a recorded history of what the paper might be missing or how it could be better expressed, etc.
In fact, I really do not understand why in 2020 we are writing papers/PDFs anymore. We have more than enough software and software tooling to have superior formats for sharing and discussing information.
Is our biggest hurdle to this that the most experienced scientists/professors/researchers in our day are not as technically competent as they would need to be? Otherwise, why is this not already the norm? We do not need ArXiv or journals, we need GitXiv.
That papers are unnecessary is a common misconception shared by those not in the field. Papers are _extremely_ useful. I read a couple carefully every week and skim many others. But I will concede that a paper is not sufficient in most cases, at least in my field. And researchers have only very limited amount of time which they can devote to engaging in "discussions". Arguably their time would be much better spent creating the next great thing rather than fixing bugs in their latest paper, as frustrating as that is for everyone else.
I think the author of that comment was saying a paper alone is typically not sufficient. A researcher should summarize their work in a written paper AND provide resources, code, media, etc.
I don't think that's what's being requested. It's OK if resources are dumped and left "as is". We just need more than we are getting, and it's not like they don't have it; it's just not expected of them to publish it.
My personal experience with trying to recreate tech from research papers (real-time mobile computer vision in my case):
- assuming highly specialized Math skills (e.g. manifolds), which wasn't me at the time, so it made testing and bug hunting harder
- code missing, or code is a long C file vomited out by Matlab, with some pieces switched to (Desktop) SSE instructions for speed gains
- papers were missing vital parameters to reproduce the experiment (e.g. exact setting for a variable that influences an optimization loop precision vs speed)
- the experiment was very constrained and the whole algorithm would have never worked in real life, which is what I had supposed for the first few months (meh)
- most papers, as the article says, are just a little bump over the previous ones, so now you have to read and implement a tree of papers
- sometimes there would be the need of a dataset to train a model, but the dataset is closed source and incredibly expensive, so not a good avenue
At the time I was also working with the first version of (Apple) Metal which, after going crazy on why my algo wasn't working, I discovered had a precision bug with the division operator. FML
Still, it was a very instructive experience. The biggest takeaway if you do something similar: don't be certain that once you've implemented an algorithm, it will work as advertised. It's totally different from, say, writing an API; it's not a well-constrained problem.
> - the experiment was very constrained and the whole algorithm would have never worked in real life, which is what I had supposed for the first few months (meh)
This has been my experience with a large number of papers that make all sorts of wild claims. Whilst technically true, they are more often than not not actually very useful for any practical application.
In my experience of the same field, the only lab you can consistently trust the outputs from is Microsoft Research. I don't know if they have some kind of blind QA or something, but everything from them has worked as advertised.
Yeah, Microsoft Research, Disney Research and ETH are the most reliable. You learn very fast to read only papers from the biggest universities and companies.
Yes I agree, nine times out of ten you've made a typo, connected the red wire where the black wire should be.
But I disagree with the overall sentiment of the article, which fails to highlight that - when all else has failed and you have double-checked things on your end - sometimes there IS a typo, an incorrect equation, a simple error IN THE ORIGINAL research. Sure, blame yourself as a first instinct (which is a good thing to do), BUT there is indeed a replication crisis currently in all fields of study and research. Forgive the link to a totally non-tech field, but it illustrates my counter-point as well as any other.
"The Role of Replication (The Replication Crisis in Psychology)"[1]
The openness of psychological research – sharing our methodology and data via publication – is a key to the effectiveness of the scientific method. This allows other psychologists to know exactly how data were gathered, and it allows them to potentially use the same methods to test new hypotheses.
In an ideal world, this openness also allows other researchers to check whether a study was valid by replication – essentially, using the same methods to see if they yield the same results. The ability to replicate allows us to hold researchers accountable for their work.
It is saddening to see so much pessimism in the thread.
Tight regulations on reproducibility are the first thing academia needs. Academia is a rat race with so much dishonesty and optimizing for metrics (citation counts, etc.) these days. I have seen too many professors producing low-quality papers for the sake of producing papers - which creates a lot of noise, and a cargo-cult PhD culture. Without reproducibility, how can you even trust the results?
While academic code doesn't adhere to the code quality standards of software development, I don't think many software engineers would ridicule an author for publishing their code, let alone other academics.
Btw, I submitted a link (a very comprehensive article by a CS academic) but it was lost in the HN noise. Maybe someone with higher karma points could repost it? I found that article through the nofreeview-noreview manifesto website and it's very well written; it covers a number of problems like this (http://a3nm.net/work/research/wrong). PS: I am in no way affiliated with the author. Just mentioning it because that article deserves to be on the HN front page.
If you run their code on their data you haven't reproduced their result. You have merely copied it. If their code works on another very similar data set to produce a very similar result, then you have something. Or if you're trying to understand the technique, you have to write your code and see if it reproduces their result on their data.
What nonsense!
Copying something is reproducing it.
If it cannot be copied then it cannot be reproduced.
If it can be copied then it can be investigated.
I ran into exactly this problem with:
White's Reality Check (WRC), described in White's (2000) Econometrica paper titled "A Reality Check for Data Snooping".
The algorithms that were being tested were so poorly described that less than half could be reproduced.
Once we got there for what we could, we found two fundamental problems with "White's Reality Check": one theoretical, which is blindingly obvious once you see it, and the other that it ignored trading costs. When we measured the levels of trading costs that would permit the algorithms to work, it was clear they could not.
That we could not deduce without reproducing the results. If we could have copied the original code it would have shaved about eight weeks from my research.
BTW "Whites Reality Check" was at the time pone of the most cited paper in its, rather narrow, field.
But running their code, presumably, would get their result on their data, unless it was a fraud or weird misconfiguration. We’re not talking about fraud, we’re talking about reproducibility supporting a theory.
You suspected the result was wrong, so your interest was in debugging their code to contradict their result. That has value but also is not a reproduction of the study.
Irreproducibility can result from irreplicability or many other reasons. Publishing the code & data helps rule out the trivial replicability failure: that the authors made an implementation error, so the result is not reproducible. Establishing replicability is essential, but that won't prevent all irreproducible results. Attempts to reproduce results under different conditions are still necessary for the functioning of the scientific process - but they are a waste of time if the original result is not replicable.
This is the main reason why we are building Nextjournal [0]. By using the immutability of Datomic + Clojure, the code (and the results) are immutable and permanent.
Irreproducible results should have zero place in science or academia. Hell, they have no place in politics, the media, etc. Irreproducible results defeat the purpose of proving that something is not bullshit or a farce.
Even academic papers with open-source code can be infuriating to get working. Usually it's a mess of hidden dependencies, specific versions of global libraries, hard coded paths, undocumented compiler settings, specific OS versions...
Usually, the highly specific experience and knowledge of the author is assumed in the reader...
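One low-effort habit that would at least help with the hidden-dependency part: dump the interpreter and package versions next to the results, so the exact environment is on record. A minimal sketch, assuming a Python-based project (the output filename is arbitrary):

```python
# Record the Python version, platform, and every installed package version
# alongside experiment outputs, so the environment can be reconstructed later.
import json
import platform
import sys
from importlib.metadata import distributions

def snapshot_environment(path="environment.json"):
    env = {
        "python": sys.version,
        "platform": platform.platform(),
        "packages": sorted(
            f"{d.metadata['Name']}=={d.version}" for d in distributions()
        ),
    }
    with open(path, "w") as f:
        json.dump(env, f, indent=2)

if __name__ == "__main__":
    snapshot_environment()
```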
Just as a neat bit of trivia that isn't mentioned in the article, the inventor of the equals sign was Robert Recorde[1]. Dijkstra provides some additional background at [2].
I do wonder if standardized equipment with digital controls (or digitally monitored controls) to record all salient values throughout an experiment would 'solve' this.
Obviously some cutting edge stuff can't use standardized equipment, but can you standardize a lot of other stuff?
I don't think you need to go as far as standardizing the equipment, capturing the control settings as well as the output data in a standard form would do.
I'm currently trying to push the use of an existing standard [1] for capturing engineering experimental results.
To expand, you'd have petri dishes that monitored temperature and humidity levels, maybe a couple of other values. Gas chambers would record various pressure values, etc.
I obviously don't have the experience to enumerate what is important, haha, but maybe it's still a worthwhile idea.
In the case of AI: just train again. Good or bad luck with the random weight initialization can have a huge influence on your results. Nobody really talks about it, but many "pro" papers use a global seed and deterministic randomness to avoid the issue.
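For reference, the global seeding those papers rely on usually looks something like this minimal PyTorch/NumPy sketch; note that seeds alone don't guarantee bit-for-bit reproducibility across library versions, hardware, or data-loading order.

```python
import os
import random
import numpy as np
import torch

def set_global_seed(seed: int = 42):
    # Seed every RNG the training loop is likely to touch.
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)
    # Trade some speed for deterministic cuDNN convolution algorithms.
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False
    os.environ["PYTHONHASHSEED"] = str(seed)

set_global_seed(42)
```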
> I'm about to tell you something which can sometimes be harder to believe than conspiracy theories about academia: you've got a bug in your code.
If this is the case in a majority of the instances where someone fails to reproduce the software backing a paper, then there may be another issue at play. Someone re-implementing a paper is "just" taking a written description of something and transcribing it to code. If a mistake that can completely ruin the results of the work can be made this easily, it should be fairly straightforward to see that the original implementer could also have made a mistake that could have thrown off their results.
Does the world of academia have any tools to prevent this? It seems like this could affect a lot of the research being done today. Given the following:
1. Most research being done today utilizes some software to generate their results.
2. This software often encodes some novel way of implementing some analysis method.
3. The published paper from research with a novel method of analysis will need to describe their method of analysis.
4. Future researchers will find this paper, attempt to implement the described analysis method, and publish new results from this implementation.
We can see we are wasting a lot of resources reimplementing already written code. We may also not be implementing this code correctly and may be skewing results in different ways.
> Debugging research code is extremely difficult, and requires a different mind set - the attention to detail you need to adopt will be beyond anything you've done before. With research code, and numerical or data-driven code in particular, bugs will not manifest themselves in crashes. Often bugged code will not only run, but produce some kind of broken result. It is up to you to validate that every line of code you write is correct. No one is going to do that for you. And yes, sometimes that means doing things the hard way: examining data by hand, and staring at lines of code one by one until you spot what is wrong.
This is a very common mindset that many academics have, but I don't understand why this is the case. A paper seems like a fantastic opportunity to define an interface boundary. If your paper describes a new method to take A and transform it into B, then it should be possible for you to write and publish your `a_to_b` method alongside your paper. You could even write unit tests on your `a_to_b`. If a future researcher comes along and finds a way to apply your `a_to_b` to more things, they could modify your code and, just by rerunning your tests, verify that their implementation actually works.
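A minimal sketch of that idea, with a hypothetical `a_to_b` published next to its paper and a unit test pinning down the edge cases (the transformation here is a stand-in normalization, just so the tests run):

```python
import unittest

def a_to_b(a):
    """Hypothetical method from the paper: transform input A into output B
    (here, a stand-in peak normalization so the example is runnable)."""
    if not a:
        return []
    peak = max(abs(x) for x in a)
    return [x / peak for x in a] if peak else list(a)

class TestAToB(unittest.TestCase):
    def test_empty_input(self):
        self.assertEqual(a_to_b([]), [])

    def test_peak_is_normalized_to_one(self):
        self.assertEqual(max(a_to_b([1.0, 2.0, 4.0])), 1.0)

    def test_all_zeros_left_unchanged(self):
        self.assertEqual(a_to_b([0.0, 0.0]), [0.0, 0.0])

if __name__ == "__main__":
    unittest.main()
```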
If a future researcher decided to use your `a_to_b`, you could write some code to automatically generate a list of papers to reference.
If we are seriously spending this much time treading the same water then it should be possible to dramatically improve the quality and throughput of academics by providing some tool like this to them.
I know someone will say "but you gain so much knowledge re-implementing XYZ" and to them I'd say that you don't need to read the code while reading the paper. You could write the code yourself and just utilize the unit tests provided by the author to make sure you fully understand each edge case.
There have been pushes to publish more code in recent years. Most journal article formats at least allow for a "supplemental material" section like here: https://journals.aps.org/prc/authors
Publishing code in a journal article is analogous to publishing key equations. Equations are just functions themselves, after all. However, articles have to be succinct, and journals still exist in a hybrid physical/digital space. It isn't very useful to physically ink thousands of lines of code onto a page, especially if it represents a vastly different level of technical detail than the rest of the paper. In practice, people who write comprehensive software typically make it available through GitHub or similar, and put a reference in the article. If not that, people will send you their code if you contact them. If they're stingy you may not get source, though. I don't know of any funding agency requirements that source code be made available, though I think that might be a good thing to try.
I think the biggest difficulty is with medium-sized pieces of code. Small enough that the bookkeeping/maintenance needed to make it available gets skipped, but large enough that it isn't possible to provide details in the article.
Let me make a statement and let you judge it. (It's below, marked "STATEMENT".)
BACKGROUND
I have been working for 15 years in industry doing hardcore ML. I had the fortunate drive and background to get my master's degree from an R1 school while working full time. No watered-down online degree, no certificate. I would drive to class twice a week for 4 years and did a full thesis, which was published. Since then, I have published 6 papers, all peer reviewed. I even did a sabbatical with another research lab, which I was invited to join.
After 15 years, I decided to go back and get my PhD, all while continuing to work full time. My thought was that it would be easy to get a PhD with all the technical experience and math chops I've developed over the last 15 years. I have essentially been doing math 5 days a week for 15 years. Here's what happened...
Coursework was a breeze. I barely put any time into it and I can easily get a B+. This is really helpful because I am working 40-50 hours a week at my full-time job and managing my family. I passed my candidacy exam on the first try with little issue (this is rare for my department).
The biggest hangup I have about the PhD process is what my advisor wants me to do when writing papers. He is the youngest full professor in the department and is from a well-known and well-respected graduate research university. But the way he has me slant my papers is absurd. Results which I feel are very important for the reader to decide whether they should use the method, he has me remove because the results are "too subtle." He is constantly beating on me to think about "the casual reviewer."
Students in the lab produce papers which are very brittle and overfit to the test data. His lab uses the same dataset paper after paper. My advisor was so proud of a method his top student produced that he offered the code for my workplace to use. It didn't work as well as a much simpler method we used. Eventually we gave the student our data so there could be the fairest possible shake at getting the student's method to work. The student never got the method to work nearly as well as in his published paper, despite telling my company and me over and over that "it will work." The student is now at Amazon Lab126.
STATEMENT: Academia is peer-review driven, but the peers are other academics, and so the system of innovation is dead; academics have very little understanding of what actually works in practice. Great example: it's no surprise that Google has such a hard time using ML on MRI datasets. The groups working on this are made up of PhDs from my grad lab!
TL;DR - worked for 15 years, went back for phd, here's what i hear:
"think of the casual reviewer"
"fiddle with your net so that your results are better than X"
"you have to sell your results so that its clear your method has merit"
"can you get me results that are 1% better? use the tricks from blog Y"
"As long as your results are 1% better, you are fine"
Edit 1: boasts are given to avoid "your experience doesn't count because you X" strawmen, where X = {are lazy, are a young student, are inexperienced, went to an easy school, are in an easy program, are naive to the peer review process}.
It's unfortunate that this is getting downvoted. Perhaps it comes across as a little boastful, but it matches my experience in academia.
When I was doing my PhD (Statistics) I spent several weeks running simulation studies to compare our new statistical model to 2-3 existing models that did similar things, across a wide range of data sets both real and generated with varying parameters. It was an unsupervised learning problem so there wasn't a "correct" answer to compare to on the real data sets, but on the simulated data we outperformed the existing models over about 60% of the parameter space that we tested, including the part of the parameter space that I thought was most useful/likely in practice.
Rather than present all the simulation results and have a nuanced discussion of where we did better and where we did worse, my advisor had me remove the 40% of the parameter space where we did worse from the results, so it looked like we always did as good or better.
The scientific contribution would undoubtedly have been better if we had presented the results where we did worse and discussed why we believed that to be the case, but getting the paper past reviewers and into a top journal (as we eventually did) was considered more important.
I find this sad as well, particularly since, in my experience, explaining that nuance makes for excellent papers. I have found reviewers actually respond positively to being upfront about limitations, along with discussions about why. Because that is not the norm, it ends up being novel. But, because it requires nuance, it depends on an ability to write well.
Of course you have the other side of the coin where those same ML students take a pre-trained VGG model, fine tune it on a couple thousand pictures of hotdogs or whatever and raise millions in VC money for their "AI" company.
Don't these companies usually end up failing though? I'm not very familiar with the startup space but in academia it feels like these system-gaming labs receive perpetual encouragement.
In a very large number of cases, the startup closing its doors after two years is not considered a failure by the VC fund. The startup successfully spent the money the VC firm was contractually obligated to invest, may have employed the VC's choices of officers for a significant period (providing them income and experience), maybe purchased a great deal of tech from suppliers the VC officers are themselves invested in, and may have left valuable assets that could be snapped up.
The biggest problem a VC firm has is the five billion dollars of others' money they have to "place" in 30/60/90 days. What happens after placement is much less their problem. They know most of the placements are duds, but they and the actual investors knew that up front. Once the money is "placed", though, much of it can be siphoned off for the benefit of the VCs' cronies or one or other non-dud. Maybe a non-dud or non-startup buys up assets of a dud for pennies on the dollar, and extracts something usable, like patents or equipment. Sure, the investor lost that money, but somebody got it, and somebody got what it bought.
None of this is good for most people who do a startup, unless they are chosen as a non-dud. The chosen duds are valuable for money laundering, which few startup principals really meant to sign up to be. Some did.
>Don't these companies usually end up failing though?
Not always, it's of course in the VC's interest to push the company forward and secure further VC rounds. You could have a failed initial product but play the "product market fit" or Pivot game essentially until you run out of money or find something that sticks. You can easily raise money as an AI startup with lots of hype only to be shipping a half baked marketing platform a few years later and still raise more money from VCs.
Not always; it's of course in the VC's interest to push the company forward and secure further VC rounds. You could have a failed initial product but play the "product-market fit" or pivot game, essentially, until you run out of money or find something that sticks. You can easily raise money as an AI startup with lots of hype, only to be shipping a half-baked marketing platform a few years later and still raise more money from VCs.
I’ve noticed a similar situation in biotech. I think a big reason is when real money and government regulators are involved, the stakes are higher and the definition of success is less about “truth” and more about “Truth”.