If you want reproducible science, the software needs to be open source

rgejman · on Feb 26, 2012

A Github for scientific code doesn't go nearly far enough. The transition from paper journals to electronic publications has only converted dead-paper into "electronic" paper. With some exceptions, e.g. video recordings, animations and supplementary figures/documents/spreadsheets/code, the document that you download from any major science publisher is a PDF that looks almost exactly like the printed publication. Most don't even include links to referenced publications[1]!

Today, we know a lot about how to make documents that have complex formatting (think micro formats, links) and even more about making abstract document formats that can be presented and styled in different ways (think XML and stylesheet type separation of data and presentation). Having a standardized scientific publication format (with open-source user or publisher generated extensions as needed) would completely change the way we produce and consume the literature. Imagine the possibilities for meta-analysis!

Yes, code should (in most cases) be released together with a paper. But even better would be if the code were released as part of a standardized data format that would allow you to, for instance, selectively download raw data and re-run the computational experiments on your own computer (think: re-running simulations in Mekentosj's Papers as you read the paper).

Even simpler (and possibly more useful): provide both low and original (high) resolution versions of figures that can be examined separately from the main document. I can't tell you how many times I've been annoyed by the low quality of the published images and wished I could zoom in to the level of detail I know was in the original image. Even more frustrating: why should I have to take screenshots of images in Preview to add to a figure in my lab meeting. Separate the presentation and the data!

[1]although some now include intra-document links from a citation in the text to the reference in the coda

toufka · on Feb 27, 2012

biochemist here...

I would love to release my data in an open format. But the software I'm using is proprietary - and the format is therefore closed.

Here's the problem: Scientific software sucks. It's stuck 20 years ago. Usually a pain in the ass. Barely works. Crashes. Proprietary. Hacked together. And worst of all - ridiculously expensive. And if it's not proprietary, then it was written by me. The script is just good enough to do exactly what I need it to. I've hardcoded file directories and tab-delimiters.

If I use a microscope to take images it can be in a Nikon file, a Canon file, a Zeiss file, or a number of other files. None of which are interoperable. TIF, in general, doesn't cut it - it doesn't communicate with the scope and generally can't store all the meta-data that must be kept with the image itself. If I were to dump my raw Nikon files it would help nothing - no one could read them without the $10k software that comes with the scope.

Open standards, even for the more commonly used scientific formats, simply don't really exist. And when they do, they're still hacky and ugly.

If someone would write some good open-source microscopy, genetic, and basic mathematical software it would make thousands of grad-students' lives easier. Someone PLEASE write a good plasmid viewer (an editor if you're feeling kind) for genbank files. An iPhoto/iTunes for microscopy images. A GUI for basic/common perl/python scripts. Some mathematical software that's not impenetrable (a la MatLab/Mathematica). If these softwares were out there and used, it'd be trivial to dump the raw data and allow anyone to use them.

(I'd love to have you guys make me some sweet software. It would take you a relatively short time - these things are standard and already spec'ed out. But sorry - we have no money - we can't really pay you.)

rflrob · on Feb 27, 2012

>I would love to release my data in an open format. But the software I'm using is proprietary - and the format is therefore closed. > >Here's the problem: Scientific software sucks. It's stuck 20 years ago. Usually a pain in the ass. Barely works. Crashes. Proprietary. Hacked together. And worst of all - ridiculously expensive. And if it's not proprietary, then it was written by me. The script is just good enough to do exactly what I need it to. I've hardcoded file directories and tab-delimiters.

Release the data and code anyways. I feel your pain with the proprietary formats, but ImageJ is pretty good about dealing with bizarre formats (and other scientists are more likely to have access to the $10k software, or at least know someone who does).

I feel your pain with the hacked together scripts. After sharing an analysis program with some collaborators, I had to go through about three rounds of testing to figure out all* of the unreliable assumptions I had made based on one particularly well-formed dataset. This process will make your code better, though, and it will make you a better coder. It will also help other people who are, like you and me, muddling through, trying to find any kind of usable starting point.

[*] Okay, "some"

eric_bullington · on Feb 27, 2012

As a practicing scientist interested in open source scientific software, you should consider learning Python if you don't know it already. If you do know some Python, then take a hard look at projects like:

[1] Sage: http://www.sagemath.org/ (Sage is scientific software that bundles many of the packages outlined below)

[2] SciPy: http://www.scipy.org/ Open-source software for mathematics, science, and engineering.

[3] NumPy: http://numpy.scipy.org/ Linear algebra for Python.

[4] Scikit-learn: http://scikit-learn.org/stable/ Machine learning in Python.

[5] Matplotlib: http://matplotlib.sourceforge.net/ Python graphing and plotting library.

[6] RPy: http://rpy.sourceforge.net/ Python bindings to the open source R statistical package (modeled after and compatible with S statistical software), which opens up the vast world of statistical computing to Python.

As a chemist, you might also be interested in some of the following Python packages, mentioned in a blog post of Python for chemists [1]:

Cheminformatics

OpenBabel (Pybel), RDKit, OEChem, Daylight (PyDaylight), Cambios Molecular Toolkit, Frowns, PyBabel and MolKit (both part of MGLTools)

Computational chemistry

OpenBabel, cclib, QMForge, GaussSum, PyQuante, NWChem, Maestro/Jaguar, MMTK

Visualisation

CCP1GUI, PyMOL, PMV, Zeobuilder, Chimera, VMD, Avogadro

Most of the packages listed above feature performance-critical code written in optimized C or Fortran, so they run fast -- much faster than most equivalent proprietary platforms. Really, if you've not looked closely at what Python has to offer, please do yourself a favor and take a close look. If you already know a programming language, you could probably be using Python comfortably in a week or two.

1. Python -- the scripting language of chemistry. http://baoilleach.blogspot.com/2008/03/python-scripting-lang...

yummyfajitas · on Feb 27, 2012

Just curious, some of these seem like really interesting potential OSS projects. In particular, I'd consider writing a couple of them just for practice - i.e., I might write your plasmid viewer/editor just as a project to learn clojure. I imagine I'm not the only one.

Why don't you post a more detailed list? The OSS community might surprise you.

toufka · on Feb 27, 2012

I guarantee that the best open-source plasmid editor/viewer will become a ubiquitous piece of software in every university. And eventually, even, homes.

http://www.mekentosj.com/science/enzymex - guys who made papers made part of it. Nicest interface, but missing lots of features.

http://biologylabs.utah.edu/jorgensen/wayned/ape/ - is what everyone uses. It's old. Functional, but missing SO much of what's possible.

https://www.dna20.com/genedesigner2/ - made by the best in the industry. Look at that shit.

http://bioclipse.net/screenshots - opensource, but look how difficult. How not right.

The rules are straightforward. It's a lot like writing out an API. A good manipulator of well-documented data structures.

yummyfajitas · on March 12, 2012

Please write to me, I'd love to discuss this with you in more detail. Contact info is in my profile.

mironathetin · on Feb 27, 2012

"these things are standard and already spec'ed out"

Oh my, forget it. I am a scientist (astronomy) and I write software for my project. Its all standard, but EVERY single astronomer that you ask, has a different interpretation of the standards. And everybody has his/her own special reason, why he/she prefers some hacked code from the internet to our beautifully perfect software that even provides guis for standard work. Either it is crappy, or it requires too many resources.

And then, scientists don't talk. Our colleagues are sitting in their office, waiting for software developers who can read their minds (it's all standard). It is incredibly hard to get good requirements that are accepted by more than one scientist as good requirements.

kaarlo_n · on Feb 27, 2012

> Some mathematical software that's not impenetrable (a la MatLab/Mathematica).

The Sage project is what you're after: http://www.sagemath.org/

RyanMcGreal · on Feb 27, 2012

> Scientific software sucks. It's stuck 20 years ago. Usually a pain in the ass. Barely works. Crashes. Proprietary. Hacked together. And worst of all - ridiculously expensive.

All excellent reasons to move to open source!

w1ntermute · on Feb 26, 2012

This is a great idea, but one of the issues is that scientists don't like to release raw data - it puts their conclusions at risk of reinterpretation by other scientists. They'd much rather release figures that are formatted such that they only allow one to draw conclusions that support their research.

cscheid · on Feb 26, 2012

Your point is a bit inflammatory, but it's at least actionable. If what you're saying is indeed true, then we're not looking at a technical problem; we're looking at a social problem.

Social problems cannot be solved by technical solutions. Not directly, at least.

During my PhD, I worked on a system whose idea was essentially to make the process of getting to the raw data from the figures easy. Scientists would be hooked by the ability to quickly create data analysis one-off scripts, but the system also kept track of the data (and the scripts required to get to the data) transparently, so it was as easy to create a "reproducible figure" than it was to create a "figure". Think 'integrated, transparent version control for data analysis and visualization'.

It works well, and some people like it. But the social problem remains: you really want network effects to kick in.

apl · on Feb 27, 2012

  >  If what you're saying is indeed true, then we're not
  > looking at a technical problem; we're looking at a social
  > problem.

Thinking otherwise is, quite frankly, delusional. The technology has been available for years. It's purely a matter of culture, and I'm frustrated by the persistent belief that science methodology needs better engineering.

aangjie · on Feb 27, 2012

Agreed.. Though, am very hesitant blame it upon scientists. Having been through a masters by research graduate course, and having seen the internal politics, i can empathize with scientists' reluctance to publish raw data.

kijin · on Feb 26, 2012

Also, the data could have been collected by unsound methods in the first place, so running the same data through the same computer program won't really amount to reproducing a result.

_csoz · on Feb 26, 2012

So, it appears this is a fundamental dishonesty with "publishable" science: that it becomes an ego trip for some scientists who like to overstate their results. While a groundbreaking discovery is sure to be scrutinized to the most minute detail by the community, an average "incremental science" paper will not, and will probably be published even if it contains errors. There are many published papers with unintented but grave computational errors that render even their main findings invalid, and oftentimes they are not retracted. This attitude should change, with people admitting that sometimes mistakes are only human, and they do not diminish the contribution of the research.

Complete data sharing would be a huge leap for science: Imagine all scientists digging into every experiment ever made for, say, cancer or HIV research and discovering hitherto unknown correlations, new interpretations of the data etc. It would provide huge shortcuts, given how many experiments get essentially repeated over the years.

w1ntermute · on Feb 26, 2012

> it becomes an ego trip for some scientists who like to overstate their results

I think you're looking at this the wrong way - while some scientists might indeed be doing it for the "ego trip," the vast majority are just trying to avoid being swallowed whole by the vicious academic research environment. Reaching the position of tenured professor at a major research university is extremely challenging.

Now, I'm not suggesting that what they're doing is right. I'm just saying that one needs to dig further to find the root causes of these problems.

Retric · on Feb 26, 2012

Complete data sharing is only a good idea if you wait a significant amount of time before doing so. Scientists are lazy animals and it's much easier to data-mine than do a new experiment, however scientifically speaking data-mining is literally worthless. When it becomes generally accepted it DESTROYS the credibility of entire disciplines see: the latest nutrition of the week fads, Economics, and a host of others.

The only other practice that's almost as bad requires three separate errors, working with small sample sizes, not publishing all experiments, and accepting significant statistical noise (P>.01 I am talking to you.)

_csoz · on Feb 26, 2012

It makes the field a lot more noisy, yes, but how can empirical studies ever destroy a field? If only we had some data to talk about, or even some papers to talk about. For example, i 'd rather be arguing about some papers that i 've read recently, but, alas, not only are they behind paywalls, they don't even have any meaningful comment boards.

Retric · on Feb 27, 2012

There are fields where well over 30% of published papers are contradicted by the next paper on the same subject. Getting into why this happens with empirical studies of large data-sets is complected, but boils down to looking at enough things until signal is indistinguishable from noise. For a simpler example assume this was actually done and they published their findings: http://xkcd.com/882/ Now assume other than the P>.05 there methods where impeccable what information have you gained?

A) The actually probability that green Jelly beans are actually linked to ACNE is impossible to tell. (http://en.wikipedia.org/wiki/Bayes%27_theorem) You might shift your expectations, but if you do the probability's the shift in expectations is tiny, because there is so much noise.

Now fill a field with that junk and suddenly reading a paper provides vary little information which slows everything down. You can discuss such things but it's about as meaningful as talking about who won the world cup. http://xkcd.com/904/ Worse yet, people rarely publish false results which means even reading a well done study is only meaningful if you can find some other logic to back it up. At which point it might be worth investigating, but the reason it's worth investigating your expectations and has next nothing to do with the paper you just read, and even if you find some deep truth the glory goes to the guy who was publishing noise.

PS: It get's worse. Because, contradicting a study is worth publishing and publishing is a numbers game, you have many people who simply reproduce research to pad their numbers and cut down on clutter. But, if your tolerances are loose enough say P >.05 and you have enough random crap in the hopper every 400 completely random papers can service two rounds of this get a lot of attention only to be discredited.

Edit: This is also why it takes a huge body of background reading and a deep understanding of statistics before you have the context to meaningfully discus a recent paper with a scientist.

kijin · on Feb 26, 2012

> It would provide huge shortcuts, given how many experiments get essentially repeated over the years.

Isn't that the whole point of insisting on reproducibility? Scientists are supposed to repeat experiments as many times as it takes to convince the rest of the scientific community that their results are valid. Reproducibility, not publication, is the final QA mechanism for science. Shortcuts are sometimes desirable (e.g. if people are dying right now), but they're the exception, not the norm.

_csoz · on Feb 27, 2012

I 'm not talking about reproducibility as validation, but about the fact that multiple studies can be combined to discover new regularities. An example from neuroscience: hundreds of labs recording from the brains of genetically similar mice doing very similar tasks under slightly different conditions, but we have no way of looking at the raw recordings in a systematic way.

bo1024 · on Feb 26, 2012

I completely agree that these things need to be open, but it's difficult to solve all these problems at once.

I'd tend to think that the original published pdf should remain as-is, a stand alone document.

However, I think the journal (and more importantly, the author's own website) ought to additionally provide a .zip download containing any supplemental materials. It would be nice if these generally included all figures in high-res, all nontrivial source code, and either the data or a link to the data online if it is large.

rgejman · on Feb 26, 2012

Right, but you can provide a platform for solving these problems over time if you make publishing work off of an extensible, data-presentation separate document model. PDF isn't it.

More directly to your point: I agree that the "telling a story" via a x-number page document is important. I'm not arguing for changing the general method by which scientific results are conveyed from an author to the audience (i.e. description of problem, description of approach, description of results, analysis of results and conclusion). It's just that once I've digested your results, I want to analyze them critically, re-use them, integrate them into my work, etc. Publishers should facilitate that.

Why can't I easily copy a list of genes from a table in YOUR paper so that I can check if they are present in MY work. Why don't terms and gene names and genic loci show up as links that take me to the genome browser of my choice when I click on them? These are simple problems to solve once you have a richer document format.

bo1024 · on Feb 26, 2012

I guess I just much prefer the idea of a "paper" as published being self-contained, and "supplementary" material being separate. A paper shouldn't become incomplete if it is printed on paper or if the links are broken.

> It's just that once I've digested your results, I want to analyze them critically, re-use them, integrate them into my work, etc. Publishers should facilitate that.

I completely agree! But I think this is a separate process from just reading the results, and is best done with separate tools. I think all the things in your last paragraph should be easy to do, but I'm not sure if the best way to do them is to embed them in the same document that's published in the journal.

For example, suppose I publish the results of an analysis on genomes of 100 fruit flies I raised in the lab. Here are some things you might be interested in acquiring:

- the text I published (say as pdf, or some other format) with included figures, references, etc (the thing that appears in the journal) -- say 4 megabytes

- the document used to generate the above document (e.g. LaTeX file) -- less than 1 MB

- high-resolution pictures of the flies' eyes up close -- say 10 more megabytes

- high-resolution plots of data -- say 2 more megabytes

- raw data used for these plots and analogous plots not presented -- lists of genes and statistics -- call it a megabyte.

- source code used to generate this data from the genomes and plot it -- not more than a megabyte.

- raw genomes of all flies in my study -- in the 10s of gigabytes.

Now how much of this data do we want to bundle in as part of the original document? Keep in mind that 95% of readers are only interested in the first item on the list -- the 4-MB document that was published in the journal. Should we also embed in it an additional 15 MB of data for those few who might be interested, and make the document somehow interactive so that this data is accessible by clicking? (Of course, we definitely can't embed the gigabytes of raw genome data, so we'll need a separate solution for distributing that anyway.)

I'd argue that a better solution is to simply bundle all of this "supplementary" stuff separately in a .zip and make it available for download separately. Again, very large files or datasets will still need their own solution -- for instance, I might host the genomes on my website and just provide you with the link.

This isn't to say that pdf itself is the be-all and end-all of portable document formats (although I think it's very good), but I do want to argue against bloating the published report document with what I view as supplementary information, because it's hard to tell where to draw the line, and this imposes a large memory cost on lots of people who don't need it. (As a sidenote, if I want someone's code and data, getting it from within the document by clicking links also seems a bit odd to me -- where in a published article should these links go?) So I really feel like separate downloads is the best solution.

rgejman · on Feb 26, 2012

Sorry, I didn't mean to propose that all of this data would be included when you download the document. A downloaded document could be as simple as a hash or a document identification # (or a magnet link :)). You open this document in a program like Papers and, after downloading the main text, presentation style sheet and main figures, could then selectively download whatever you'd like. So, if you want to zoom in on those fly eyes, you right on the image and get a context menu that allows you to see the original image.

These are implementation issues. The document format does not have to contain any supplementary or additional whatsoever. It could contain references to where the ancillary content is found. It is then up to the interpreting program to decide how it wants to allow you to download that additional content (e.g. automatically, selectively, based on heuristics, etc).

The practical upshot of this is that you get to keep all the data related to the paper in one place with one organizational tool, with relational information intact.

DennisP · on Feb 28, 2012

Hmm. Sorta like hypertext. We have a pretty popular implementation of that already, what's missing?

bo1024 · on Feb 27, 2012

Sounds tough to achieve, but that would certainly be a nice goal to shoot for!

AkThhhpppt · on Feb 27, 2012

I would kill for ubiquitous LaTeX file availability - my e-reader does not reflow PDF well, especially multi-columned. Being able to generate an ePub instead would be worth paying for. To be vaguely on-topic, producing better open-source software tools is exactly the kind of thing the folk at http://www.sciencehackday.com do. Encouraging making the produced tools publicly available might well help.

reitblatt · on Feb 26, 2012

There is a difference between reproducibility and repeatability. Reproduction is an independent experiment producing commensurate results. Repetition is the same lab repeating the experiment and finding the same results. Sharing code actually reduces the independence of experiments. Worse, sharing buggy code introduces systematic errors across "independent experiments". Scientists already deal with similar issues due to a small number of vendors of various tools, but software is pretty different. Systematic measuring biases can be detected and calibrated, but software bugs rarely lend themselves to such corrections. Because science depends upon independent reproducibility and NOT repeatability, there's an argument to be made that blindly sharing code is actually detrimental to scientific reproducibility.

The real question we should be asking is whether opening and sharing these code bases will result in an increase in quality that offsets the loss of experimental independence.

jgrahamc · on Feb 27, 2012

Nice to see that my (co-authored) paper is top news on Hacker News. Direct link to the paper: http://www.nature.com/nature/journal/v482/n7386/full/nature1...

rflrob · on Feb 26, 2012

There is the factor that a lot of the code scientists write is hacky, one-off, and fragile. The kinds of people who care about releasing their code also feel at least a little embarassed about the code quality. There's at least one license that recognizes and embraces this fact: http://matt.might.net/articles/crapl/

_delirium · on Feb 26, 2012

A more serious problem with that than embarrassment is that such code sometimes really shouldn't be uncritically reused, if we care about reproducibility. An independent reimplementation that reaches the same results is more convincing to me than a 2nd scientist getting the same results when they re-run the 1st scientist's hacky code. It's even more of a problem in the case of code that gets passed around and slowly accumulates ad-hoc additions because nobody wants to reimplement it.

mjwalshe · on Feb 26, 2012

Precisely you have to be able to reproduce the experiment with out using the first ones code

bo1024 · on Feb 26, 2012

This is true on its own, but seems like a bad standard to hold science to. With normal experiments, it's not enough to say "we ran an experiment that tested the pliability of different materials and found X is most pliable", then say, "you should be able to find that X is most pliable without reusing our method."

The point of publishing a method is so it can be critiqued; I think the same should hold with source code. This should not at all excuse people from trying to reproduce simulations with separate code.

Also, source code lies kind of halfway between experimental measures and mathematical proofs. Again, you are usually expected to give proofs of non-obvious mathematical results, at least in the supplementary section. Similarly, saying "there exists code which produces this result" shouldn't be sufficient unless it's very obvious.

jonhendry · on Feb 26, 2012

"The point of publishing a method is so it can be critiqued; I think the same should hold with source code. "

Except that source code can sometimes obfuscate the intent.

It's probably better to provide pseudocode. Don't provide source code for your binary sort, say you sorted the data and say on what it was sorted, and let other people use their own preferred sort implementation.

Especially since other labs may not use the same equipment, libraries, languages, etc, so source code may be useless.

rflrob · on Feb 27, 2012

"Except that source code can sometimes obfuscate the intent."

The source code, no matter how opaque and poorly written, can never really make things less clear. That's because it must be able to be interpreted by a computer. Good pseudocode and high level descriptions can help illuminate the code, but, as the saying goes, "If the code and the comments disagree, then both are probably wrong."

bo1024 · on Feb 26, 2012

I definitely think it would be great to require/expect pseudocode. I'm not sure if it would be better or worse than providing the original code (hey, why not both?), but it would be a very good standard to adopt.

Craiggybear · on Feb 27, 2012

"There is the factor that a lot of the code scientists write is hacky, one-off, and fragile"

Oh, absolutely. But the same goes for expensive, commercial software, too. In all walks of life, not just scientific software.

thomasballinger · on Feb 26, 2012

After three years of writing CRAPL-worthy code at an academic institution, I'm convinced this needs to be required of academic research. I've made plenty of mistakes that could have dramatically upset experimental conclusions - I assert that I've caught all the important bugs, but the odds will always say I haven't.

luriel · on Feb 27, 2012

And the journals it is published in should be open too.

I know it is offtopic, but it makes my blood boil that we allow scientific research, in great part paid for with tax dollars, to be locked up in what basically are proprietary journals only a few privileged have access to while they should be freely accessible to absolutely everyone.

Irishsteve · on Feb 26, 2012

I've a few publications out there and if I had to release my code I would. However the reason I don't instantly publish the code is because its kinda embarrassing. My code works and it has some level of unit test coverage to make sure numbers make sense etc. But the code itself has a number of inefficiencies or ridiculous variable names... or in some cases serious example of breaking DRY.

However if everyone had to publish their code, I know the elements of my code which cause me distress would be nothing compared to a variety of other implementations people create.

Oh also trying to reproduce someone else's algorithm from a paper is so painful. There are a number of experimental values that exist which aren't really mentioned in papers as they are deemed trivial so you've to do no amount of tinkering to get similar results.

AkThhhpppt · on Feb 27, 2012

Linked to above, and sounds perfect for you: the CRAPL licence. ",)

http://matt.might.net/articles/crapl/

larsberg · on Feb 26, 2012

The soon-to-be-released data retention policies for the NSF's CISE (basically, the arm of the National Science Foundation that funds all research) will most likely require complete and free access to not only the code for your implementation but also all scripts, input data, and configuration settings required to completely reproduce the experiments.

I can't wait. I've been doing some GPGPU research, and less than 10% of the authors of _published_ papers are willing to release their code or even a binary for benchmark comparisons.

eykanal · on Feb 27, 2012

It's worth noting that there already are many open-source research packages. My graduate and postdoc work was using magnetoencephalography in neuroscience, and the majority of the packages are open source. The authors were happy to welcome bug reports and source code contributions, and any code used for an analysis can be easily re-used.

By way of example, my postdoc work was all completed using FieldTrip (http://fieldtrip.fcdonders.nl/), free for both MATLAB or Octave. All the source code is on Github (https://github.com/eykanal/EEGexperiment), and anyone could reproduce the majority of my analysis on their dataset.

spitfire · on Feb 27, 2012

Fantastic! I'll do that just as soon as someone gives me an open source Mathematica, ansys, risk modelling packages.

On a serious note, I agree source should be available. But it isn't, because these sorts of specialized packages are very, very hard to write.

ec429 · on Feb 27, 2012

Try R. It's a powerful open source statistical language, also fairly good at linear algebra.

altxwally · on Feb 27, 2012

I'm not a researcher myself, but I have seen the efforts done in this field by the org-babel project very interesting. It is a literate programming mode for Emacs that attempts to make conducting this style of research more straightforward.

I attach here some links and example works done in the reproducible research style of org-mode.

"A Multi-Language Computing Environment for Literate Programming and Reproducible Research"

http://www.jstatsoft.org/v46/i03/paper

"THE EMACS ORG-MODE" Reproducible Research and Beyond

http://www.warwick.ac.uk/statsdept/user-2011/TalkSlides/Cont...

Example work: https://github.com/tsdye/hawaii-colonization

Org-babel wiki: http://orgmode.org/worg/org-contrib/babel/uses.html

bipolarla · on Feb 27, 2012

Health and codes are much like any business. The talent wants to maintain their edge. Many health companies, universities and even non profits find value in holding onto what they create. I understand we want everyone to be healthy but it will never work that way. So much cost and competition is involved that it will never be a "free" open source world. I would bet if Bill Gates and Warren Buffett each put 5 billion dollars toward find a cure or sharing code and paying creators there would be more people willing to share. Try asking Coca Cola for their secret recipe. Oh and tell them you won't pay them and will be using this recipe to make your own sodas to compete against them. I believe you will be waiting a long time for them to call with the info.

docmarionum1 · on Feb 27, 2012

Open-sourcing the code would only be the first step. To make the experiments truly reproducible, you also need to know the hardware and software configurations used to run it. Different package versions could lead to different results. And, for instance, if you're running your code on one of those old Pentium chips with an error in the FPU, that needs to be know.

I'm currently working on a platform for scientific programming. One of the ultimate goals is to include a provenance system which will be able to tell you everything about what generated the final results, including that of input data if it was derived on the system. That way you might be able to have a complete history of where a particular result comes from.

antirez · on Feb 27, 2012

Same code == less reliable independent verification. So open code is good but independent verifiers should try to reimplement the software needed to verify an experiment.

snowwrestler · on Feb 27, 2012

Totally agree, and I'll add that in my experience non-scientists frequently conflate "reproducible" with "verifiable". Simply downloading a data set and code, and rerunning it, is not really a scientific endeavor. To verify a scientific conclusion, other scientists need to design and run their own, independent experiments aimed at testing the same hypothesis. That said, it seems true that open source code can go a long way toward making that possible by reducing ambiguity about what, exactly, was being tested, and how.

_csoz · on Feb 26, 2012

A successful example of code sharing: ModelDB (http://senselab.med.yale.edu/modeldb/) , a database of neuronal models and mechanisms. It contains lots of validated, reviewed simulations that are now commonly shared in the comp-neuro community, making it extremely valuable.

Create · on Feb 26, 2012

some of it is available, like https://svnweb.cern.ch/trac/ or sometimes locally, as a http://gitorious.org/ instance combined with other tools, like http://dtk.inria.fr/

singingfish · on Feb 27, 2012

I deal with this problem in the social sciences, where the problem is even worse. Data analysis by convention with an overwhelming reliance on expensive propietary software ... I'm actually talking to a bunch of academics on this topic later this week, so this article is very timely.

disgruntledphd2 · on Feb 27, 2012

I work in the social sciences, and I have to say: no one cares about reproducibility or replication.

I write all my papers in LaTeX and R, using Sweave to ensure that my code matches my analysis. Typically, when I send PDF's or tex files to anyone else, they ask me for word files. No one ever cares about the code (even though I send it every time).

In fact, I (and other colleagues) have been asked to replicate our analyses done in R in SPSS as (apparently) R is open source, so it can't possibly be right. The sad part is that i started using R because many of the most useful psychometric models are not available in SPSS (and probably never will be).

To the second point, no one cares about replications. They aren't sexy enough, so they don't get published. If you find something strange, you'll get published in a good journal. The ten failed replications won't be published anywhere nearly as good, so scientists don't bother to replicate.

bryanh · on Feb 26, 2012

Is there a niche out there for the GitHub of science? My cofounder (mikeknoop on HN) puts a lot of his scholarly research code stuffs on GitHub, but perhaps a more specialized place with emphasis on peer review would be more appropriate.

bbgm · on Feb 26, 2012

There are many scientific groups where even today, no form of version control is used, even for internal work, so Github is way ahead of what is current practice in many places. There is a lot of good scientific code in various repositories and I don't see why anything special is required. As computation becomes even more important across many scientific areas, there is a lot of need for discipline. If scientists learn how to use version control and repositories just by default, that will go a long way towards reproducibility.

The Galaxy project does a great job of trying to foster such an environment:

http://genomebiology.com/2010/11/8/R86

http://galaxy.psu.edu/

https://bitbucket.org/galaxy

http://wiki.g2.bx.psu.edu/Tool%20Shed

kaarlo_n · on Feb 27, 2012

> There are many scientific groups where even today, no form of version control is used, even for internal work ...

Where I work (government research lab) people think of Subversion as a sporadic backup target, typically doing a commit every month or so, despite making frequent changes to operational code.

Paradoxically, one scientist I spoke to was scared about overwriting code if he did an incorrect commit, but he's perfectly happy to have mycode.py, mycode2.py, mycode_this_one_works.py, mycode_this_one_works3.py, ...

patrickod · on Feb 26, 2012

I would have thought things like the Issues and Pull requests features of GitHub would be perfect for such discussion of projects. What in GitHub is lacking that you think would merit a different site?

RK · on Feb 26, 2012

A general project wiki would be nice. I might be missing where the equivalent feature is because I am not a very heavy github user.

patrickod · on Feb 26, 2012

GH has wiki pages but having not used them myself I'm not sure if you can have them open for public edits.

cwhittle · on Feb 27, 2012

Speaking as a scientist who deals with genomic data, I wholeheartedly agree with many of the comments here. Code and raw data should be available at publication. I shouldn't have to try and figure out what you did from the three lines of text and poorly documented software you mention (that has been updated several times since you used it (no mention of version). Personally, I think pseudo-code would be most useful for reproducibility and for illustrating exactly what your program does.

Let me add to a few points here about the practical obstacles to this.

1) Journals don't support this data (raw data or software).

* You can barely include the directly relevant data in your paper let alone anything additional you might have done. Methods are fairly restricted and there is no format for supplemental data/methods. Unless your paper is about a tool, then they don't want the details, they just want benchmarks. Yes, you can put it on your website, but websites change; there are so many broken links to data/software in even relatively new articles.

* As many people have said, lots of scientific processing is one-off type scripting. I need this value or format or transform, so I write a script to get that.

2) Science turns over fast or faster than the lifetimes of most development projects.

* A postdoc or grad student wrote something to deal with their dataset at the time. Both the person and the data have since moved on. The sequencing data has become higher resolution or changed chemistry and output, so its all obsolete. The publication timeline of the linked article illustrates this. For an just an editorial article it took 8 1/2 months from submission to publication. Now add the time it took to handle the data and write the paper prior to that and you're several years back. The languages and libraries that were used have all been through multiple updates and your program only works with Python 2.6 with some library that is no longer maintained. Even data repositories such as GEO (http://www.ncbi.nlm.nih.gov/geo/) are constantly playing catch-up for the newest datatypes. Even their required descriptions for methodology for data-processing are lacking.

3) Many scientists (and their journals and funding institutions, which drive most changes) don't respect the time or resources it takes to be better coders and release that data/code in a digestible format.

* Why should I make my little program accept dynamic input or properly version with commentary if that work is just seen as a means to an end rather than as an integral part of the conclusions drawn. The current model of science encourages these problems. This last point might be specific to the biology-CS gap.

mjwalshe · on Feb 26, 2012

erm as an ex technical programmer and research assistant for a world leading rnd organization not sure I buy this for all experiments - an experiment needs to be reproducible yes but…

Most science is based on physical observation of the experiment the code is just a offshoot of the test equipment.

In the case where you are modelling some thing you do experiments to prove your mathematical model. I once spent a sweltering afternoon in a bunny suit and rubber gloves and mask helping prepare a dummy fuel rod from a Breeder Reactor so that we would do experiments to see if our model of two-phase flow was valid.

And surly saying you can reproduce my experiment but only using my code can everyone not see the danger here - you would want to repeat the experiment and implement ones own version of the maths behind it.

polemic · on Feb 26, 2012

Exactly what I was thinking. It is important that the original code is vetted - for example during a peer review process - but to say that the entire code source is necessary to reproduce the result seems self contradictory.

Surely it should be reproducible especially without the original source code.

(EDIT: that's not to say that you shouldn't provide source code - it probably depends on the experiment)

bo1024 · on Feb 26, 2012

> Most science is based on physical observation of the experiment the code is just a offshoot of the test equipment.

Even if we accept this as true, I don't see why it's an argument against publishing code for that science which does directly depend on the simulations you run.

> you would want to repeat the experiment and implement ones own version of the maths behind it.

That's a good point, but only valid if the exact mathematics and methods are clearly explained elsewhere. But as the article states, usually there's ambiguity. And if I try to reproduce your simulation and get different results, it's very difficult for me to get enough confidence to call you out on it (perhaps I'm the one who screwed up). If I find a bug in your code, it's easy.

rgejman · on Feb 26, 2012

This argument quickly tends towards absurdum. Even with all of the materials, equipment and original experimenters, you cannot reproduce the context of the original results. But we can do our best to ensure that we can try to reproduce the original results.

So, while I might not be able to reproduce your raw data, I can at least make sure your analysis is error-free (we can argue about its correctness). Going further, if I have the necessary equipement, you can send me the raw materials (especially in the life sciences) so that I can try to reproduce your raw data in my lab. Etc as remains practical (obviously I can't ask you to send me samples from your breeder cores).

Craiggybear · on Feb 27, 2012

All scientific software should be totally open and transparent -- indeed a lot of it already is. For years, for example, people were erroneously making the mistake of trusting Excel's statistical functions without being aware they were deeply flawed.

Software that is in use for scientific purposes must be open to review and assumptions about their efficacy or correctness should not just be taken for granted. They need to be checked and their outputs verified for correctness.

Even when flaws in commercial proprietary code are found it can take years (or never) before they are corrected. Chances are that if the same flaws show up in OS software they be fixed sooner. Failing that, you can fix 'em yourself -- or at least be in a position to potentially detect them and alert other users.