>> Oh, and I think there are some people who use version control for the text
of their papers (almost certainly a proper subset of those who are for some
reason writing their papers in Markdown or LaTeX). Unless your paper has a lot
of math in it, I have no idea why anyone would subject themselves to this form
of torture. Let me be the one to tell you that you are no less smart or tough
if you use Google Docs. In fact, some might say you’re more smart, because you
don’t let command-line ethos/ideology get in the way of actually getting
things done… :)
I really don't understand where this is coming from. It sounds like the reaction of someone forced to do something difficult and time-intensive for no good reason other than someone else's insistence that it's "the right way" - a sentiment I could get behind. But, from what I can tell, it's just doing a "git add" and "git commit" every once in a while. Does that really require "command-line ethos"?
This may be because I don't really understand how using git to track changes to
your code or text helps reproducibility. Once you have your experiment and
plotting scripts where you want them, reproducibility should be a matter of
making those scripts available to others. Before you are at that point, using
git is useful because you can un-screw mistakes caused by big changes or
changes two or three versions of your code ago. It doesn't help reproduce your
work, just organise it so you don't end up making a big, steaming mess of it.
So maybe I don't understand what the author means by "reproducibility"?
Finally, all the same points the author makes against using version control apply to software development too, or at least I don't see why they shouldn't. Yet, an entire industry seems to agree that
version control saves butts much more than it wastes time and one would
probably be very hard pressed to convince software developers that they don't
need to use version control. So either there's something that entire industry
is getting awfully wrong, or software developers like to "torture" themselves,
or the author is missing something, maybe?
Google Docs has document history. Also, while I use LaTeX and git, the combination is awkward for lots of reasons -- in particular, git is line-based, so if two people each edit a comma in the same paragraph, you get a conflict. (I know some people put one sentence per line, but I hate reading that.)
I confess I haven't used Google Docs - as per another comment, I prefer not to tie down my research and my code in proprietary formats, especially ones living on someone else's computer (the "cloud")!
About git's awkwardness: I recognise that people who are not trained as software developers may find the command line off-putting, but on the other hand, having to keep track of changes to a couple dozen documents (scripts and article versions) by hand is, for me, the stuff of nightmares. The "awkwardness" pays for itself in spades when a script you wrote at the start of the project, and haven't tested since, suddenly starts raising an error and you have no idea at which point it broke. This comes up in my projects all the time. Git even has a special command to hunt down the breaking change for you: "git bisect". I can't imagine what a pain in the arse it would be to try to roll back maybe months of changes by hand - by copy-pasting backup files, I guess, or by making new untracked changes - until you have nothing left that actually works to go back to.
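For anyone curious, a bisect session is shorter than it sounds. A minimal sketch (the tag name and test script are made up for illustration):

    git bisect start
    git bisect bad                # the current commit is broken
    git bisect good v1.0          # last version known to work
    # git now checks out commits in between; test each one and
    # answer with "git bisect good" or "git bisect bad" until it
    # names the first bad commit, then:
    git bisect reset

And if you have a test script that exits non-zero on failure, "git bisect run ./test.sh" automates the whole loop.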
I've been in situations like this, like I say. Having a bedrock of committed code that I can go back to anytime I like is a time saver of epic proportions. A bit of "awkwardness", or even worse once in a while, is totally worth it.
And conflicts happen for a reason. How do you merge the work of two or three members of a team without something like version control? That sounds like an even worse nightmare.
Google docs handles conflicts by requiring an always-on connection, and showing each author a live view of the document. You can use overleaf, which provides a similar multi-author live editor for latex.
This might not be a nice way to write code, but it is a great way to write large multi-author text documents, and it requires no teaching of new tools -- git is (in my experience) far too hard to teach non-coders when you are just doing one project together. Even with a GUI, we still have to teach them about merging and conflicts.
Overleaf is certainly very convenient and so is Google Docs, but I don't think they are meant to replace git, or vice versa. Once you write your LaTeX and your code, you can then commit it to git so you can benefit from git's version control. That is, unless it's acceptable for the text or code of a paper to live permanently on Overleaf's or Google's servers - for me that's not acceptable at all. So I'll keep my local copies under version control and update them from Overleaf once in a while.
I understand that git appears forbidding and hostile to people who aren't trained as programmers. Again, all I have to say is that I don't find it that hard, and I expect a researcher in a scientific field to be able to cope with it.
It is not just "having to use the command line". It is also that git's UI is unintuitive and awkward even by command-line standards.
Look at the git commands for merging and conflict resolution. The tutorials are typically bonkers complicated and don't even answer the basic questions beginners have. Or try to list all the changes in one file only. In addition, you are way more likely to forget to add a file. In code, this shows up as an error on Jenkins or somewhere; in a document, it won't.
All I can say is that I find the complexity manageable, which is obviously because I've spent enough time using the tool that it doesn't scare me. Perhaps that is the solution? Use the tool until one is comfortable with it? It seems to me the alternative is a bunch of ad-hoc, half-baked procedures that are going to cause a lot more pain down the line.
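(For what it's worth, the "list all changes in one file" case the parent mentions is a one-liner - the path here is just an example:

    git log -p --follow -- scripts/analysis.py

That prints every commit that touched the file, along with the diffs.)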
Or, of course, sitting down and writing a better tool than git (or svn, mercurial, etc.). Personally, that's too much hassle given that someone has already done the work and all I have to do is learn how to use the program, which is much easier and much less time-consuming.
In general, that's why I use other people's software, even when I disagree with the logic behind it and would do a better job if I had a go at it myself (which is, of course, always).
I don't find the argument against version control very convincing. They basically say that remembering to git commit and push is too difficult. I don't think it is actually that bad. They also claim that they do not need version control because they never go back. Well, the thing is that you don't need to go back until you do. And when you do and you don't have version control, you can start redoing things from scratch. Overall, they really did not convince me that putting all your code, data, figures etc. on Dropbox is that great of a deal. I think they should instead hire someone to create proper git-based processes for them and then train their people to use them.
This article complains that it takes too much time to make your research reproducible. I would instead suggest that if your research is not reproducible it is invalid. Anyone doing research without reproducibility in mind is doing invalid research.
Oh - come on - it starts with: "It's no secret that biomedical research is requiring more and more computational analyses these days, and with that has come some welcome discussion of how to make those analyses reproducible. On some level, I guess it's a no-brainer: if it's not reproducible, it's not science, right? And on a practical level, I think there are a lot of good things about making your analysis reproducible,"
It does not say making research reproducible is not worth it - it just wants one exception, about figure layout (fonts, colors, etc.), and then it argues against using version control, which is orthogonal. I don't agree with him that version control is too complicated - but I agree that version control is not really needed for reproducibility.
The article starts with "if it's not reproducible, it's not science, right?" And then it goes on to talk about a limited idea of reproducibility, i.e. plain duplication and recipe-style rerunning an analysis.
I think there's a more important idea of reproducibility: your ideas have to be reproducible in different settings. We expect scientific rules to apply across a range of settings. If they don't, it's a recipe, it's not science. So the real burden of reproducibility is communicating your ideas so that other people can see how to apply them to other settings, and moreover so that they want to.
All the fetishizing of "give me a repo where I can just run 'make'" is, I think, distracting people from this. I wonder if making such a big deal of reproducibility-of-analyses is actually harmful: maybe it'd be a better test of reproducibility-of-science if other researchers had to reproduce your findings based on how you communicated your ideas, i.e. based on your writeup.
I have multiple disagreements with the article.
1) I think versioning has value outside of reproducibility and outside the cases where multiple people are doing development. As a scientist who started working before git/svn were available, I know there is a tendency to keep multiple program versions like prog.py, prog_new.py, prog_supernew.py etc. There was also a constant fear of breaking things and of not remembering what was changed and why. Now that my code is in git, I know I can retrace my steps (and I very rarely do, but knowing that I can is important). I am certainly annoyed by the complexities of branches/cherry-picking/merges, which I try to avoid as much as I can, but overall there is still a net benefit.
2) Regarding plots and scripts, that's what I teach all my students -- make the plots as separate scripts. Ideally you want the whole project, from the beginning to the final plots, to be executable as one big script. Having that makes it easy to fix issues early in the pipeline, because you are less tempted to sweep them under the rug or dismiss them as probably unimportant just because it is too hard to remake everything from scratch.
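A minimal sketch of that "one big script" idea - the stage names are invented, and each stage is its own script:

    #!/bin/sh
    set -e                       # stop at the first failing stage
    python clean_data.py         # raw data -> cleaned tables
    python run_experiment.py     # cleaned tables -> results
    python make_plots.py         # results -> final figures

Rerunning everything from scratch is then a single command, so there is no excuse not to do it.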
First, reproducibility is a spectrum, from a simple recipe all the way up to running `make` and getting a byte-for-byte reproduction.
A cookie recipe is reproducible - follow these steps and get tasty cookies. But will they be the tastiest cookies? For that, you need more detail, but the write-ups almost never have the full story.
Having put dozens of academic ML algos into production, I'd say anything short of a `make reproduce` that, at bare minimum, gives you a table of results on some validation subset is merely reproducible-ish. Not having VCS because it's "hard" is straight-up slacking (use svn if git is too hard).
I'm gonna disagree with this. I worked in a non-academic research setting, quant trading, where you have a big data set and want to show that some trading strategy is useful. It's kinda like research in the sense that you're discovering things by doing a lot of data-fu. But you're not in the business of sharing the results with peers at other institutions.
What you end up with is endless versions, stemming from endless "what-ifs".
The argument that you won't go back is wrong. Often it's not that you want to reset everything to a specific version, it's that you forgot what the assumptions were at a certain time. You want to know why you were doing what you were doing as much as how you implemented it. Often it's only the comments you need.
What you also need is commit hooks to test that your assumptions continue to hold. Your dataset contained 1000 stocks? Boom, something changed and now for whatever reason it went down to 10. Now you know, and you didn't waste a bunch of time picking apart everything until you found it. Or mundane things like whether some data file has been downloaded, or whether the OS has enough free space to do the analysis. The more you have of this, the more you benefit in the long run.
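A hedged sketch of such a hook, with invented file names and thresholds - saved as .git/hooks/pre-commit and made executable, it blocks any commit made while an assumption is broken:

    #!/bin/sh
    # Assumption: the universe file still lists roughly 1000 stocks.
    n=$(wc -l < data/stock_universe.csv)
    if [ "$n" -lt 900 ]; then
        echo "pre-commit: stock universe shrank to $n rows, aborting" >&2
        exit 1
    fi
    # Assumption: the raw data file has actually been downloaded.
    [ -s data/raw_prices.csv ] || { echo "pre-commit: missing raw_prices.csv" >&2; exit 1; }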
The "git is hard" argument is a big red flag for me. Git is one of those things where I also had loads of weird issues when I was new to it. It's worth reading some books about how it works, and trying a few apps that help with it. However, I also get a strong "I just need it to work" vibe from this piece. It's unfortunate, but I think a couple of weeks worth of reading on how git works is well worth it. I'm bu no means a git guru, but I have read books about it, and I mostly stick to the simple commands.
The thing about the graphs seems a bit strange to me. There are libraries for making graphs that you can customize to your liking. Once the graph (or any binary blob) is made, you can hash it and check the hash in a unit test. I've done the same with video outputs; it tells me if something changed.
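A minimal shell version of that check, with assumed file names:

    # once, record the hash of the known-good figure:
    sha256sum figures/fig1.png > figures/fig1.png.sha256
    # later, as a regression test, verify it hasn't changed:
    sha256sum -c figures/fig1.png.sha256

(This assumes the plotting pipeline is byte-for-byte deterministic; if it isn't, the hash will flag harmless changes too.)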
Dropbox seems like the cloud version of folder4/version8_final_last_ultimate_2.xls to me. I wouldn't touch it; I would do things in a script.
> These rationales don’t really apply, though, to code that people will write for analyzing data for a particular project. In our lab, and I suspect most others, this code is typically written by one or two people, and if two, they’re typically working in very close contact.
This part is problematic. What are you gonna do when one of those people leaves the team? And what's different about analyzing data? If anything, analyzing data for a project ought to be totally reproducible, since there are no moving parts. If you need a random seed, save that as well. When you move on, someone else should be able to clone your repo and, after slogging through all the test cases, get the same numbers as you.