
This is why I believe NixOS is so important. It allows one to completely freeze the entire development environment on one hand, but also does so in a declarative, well-abstracted manner (vs. a VM image, say), so tweaking/porting is actually feasible.

Until we get to a point where building/installing/administering is not hours of bullshit, research (and free software) will suffer.




As an academic researcher I find it absolutely hilarious that you think the complex social problem of incentive structure and competition will be solved by some Unix OS.

If you are interested, just take a look at the complexity of licensing/ownership of code written by a PhD student at a research university in the United States.

If you look at most of my open-source code, I use AWS AMIs to share both the data and the OS + code; however, I can do that only for side projects. The main thesis projects are typically very high value, and the consequences of sharing them are far more complex to understand.

https://github.com/akshayubhat/


No, that is not what I think at all; see my follow-up comments below. I just think the combination of shitty tools + incentive structure is even harder to surmount. This is a tough problem that should be attacked from as many fronts as possible.

> The main thesis projects are typically very high value, and the consequences of sharing them are far more complex to understand.

Is it commercial value, is the university just more stringent about the licensing/ownership restrictions, or is it something else?


There are several factors.

1. Commercial value.

2. Future grant applications (a competing group that doesn't share code will have a better chance of winning the grant).

3. The future of other students and collaborators in the group. If two PhD students write a paper, the junior student might wish to write extension papers without getting scooped.

And many more. Yet if a paper is important enough, independent researchers will often attempt a replication; nowadays this routinely happens in machine learning and vision due to the huge amount of interest. Also, in several cases replication is fundamentally impossible, e.g. consider a machine learning paper that uses proprietary data from a hospital attached to the university, etc.


I totally get that researchers' incentives are not aligned toward publishing it, so no need to explain that further. There are costs and downsides and probably not enough benefit to them. That's fine. Everyone works within their system of incentives.

If it's paid for by public dollars, then the code and data belong in the public domain eventually. I understand there are exceptions, like hospital data affected by patient confidentiality - that's fine. However, the code released by that researcher should be capable of reproducing their results with that data set plugged in (such as by someone else who has access to it).

As a taxpayer, my concern for publicly funded research is maximizing benefit to the public good. I understand your point about follow-on research, and I'm not saying that I'd expect the code and data to be made available immediately with publication, but that deserves to be the case some reasonable time afterward (like a year). I understand that researchers' incentives are not necessarily aligned toward making it public; I am saying that people who fund research (including taxpayers through the political process) should require and expect it. Keeping it private indefinitely is a degree of self-centeredness that does not strike an appropriate balance between benefit to the researcher and to the public in my opinion.


I never understood the meme about "public funding" translating into "public domain". Just because research is "publicly funded" does not mean that the "public" owns it or even has a right to ownership. Public education is publicly funded, but that does not mean the government can demand that every drawing made by a 9-year-old in a classroom be in the public domain :). In fact it's actually the opposite (https://en.wikipedia.org/wiki/Bayh%E2%80%93Dole_Act), given that universities can and do patent inventions from publicly funded research.

Further, funding arrangements themselves are very complex: a professor typically procures funding from the university, NSF, NIH, private companies, donors, etc. In such cases, if the NSF adopted a hard-line approach that any research touching its dollars must release code under, say, the GPL, it would make collaboration impossible. Finally, all requirements aside, one can always release intentionally poorly written code in the form of MATLAB .m files and compiled MEX files. I have observed several such cases, where the code can demonstrate a concept but is intentionally crippled.

Finally, graduate students graduate, and they are paid for doing research, which means publishing and presenting papers at peer-reviewed conferences and journals. If what funding agencies really seek is ready-made software, they ought to fund/pay at the same level as software developers (as many companies do).


> Just because research is "publicly funded" does not mean that the "public" owns it or even has a right to ownership.

I didn't make the argument that the public owns it or has a right to ownership, though I suppose some people might, so I can see why you would touch on that point.

I would describe my view like this: public funding is subject to the political process, and to voting by taxpayers (directly, or indirectly through the votes of politicians or their appointees). As a taxpayer, I prefer to make public-domain publication a requirement of publicly funded research, and I think every taxpayer should too. I consider the goal of public funding of science to be the benefit of the public good, and believe that the public good will best be served by public-domain publication of all data, code, and analysis methods. (Whew, there's a lot of "pub" and "public" in there!)

One might reach my position by working backwards from, "Why do we as taxpayers agree to fund science with government money?" It's certainly not to give researchers prestige or jobs! (Those may be necessary parts of achieving the public good, but they're not the goal, which is the public good, and if they're in tension with the public good then the public good probably needs to win.)

I don't seek ready-made software; not at all. I only seek disclosure of data and analysis methods adequate for others to easily verify the work and build on it. See for example the attempt at replication in http://www.peri.umass.edu/236/hash/31e2ff374b6377b2ddec04dea...

> In such cases, if the NSF adopted a hard-line approach that any research touching its dollars must release code under, say, the GPL, it would make collaboration impossible.

I will need to think more about this issue. I might be willing to accept the downside as a taxpayer; I'm not sure I understand well enough, at the moment, what the friction to collaboration would be. If you're referring to the GPL specifically, then yes, I agree that's probably the wrong license - public domain would be more appropriate.

I would be OK if this were simply an electronic log of the data, as well as of all machine commands that have been run on it - something that is recorded automatically by the operating environment. I am truly not looking for "working production code". But that sequence of commands should be reproducible if someone "replays" them: a verifiable digital journal. Publishing an article that's difficult to reproduce feels like producing the least possible public good while still getting credit. Publishing an article that's fully and automatically reproducible, because it contains references to all of the data and code that yield the results as an executable virtual machine with source code, provides the maximum public good, and that's what I want science funded with public money (and ultimately all science) to work toward. (I realize that this is just like, my opinion man :)


You are correct to expect a return on public investment. Actually, the NIH has a policy that explicitly favors "basic scientific research" over applied research. According to the NSF and NIH, the primary goal of government-funded research is the advancement of science, and this is done by conducting experiments and publishing results at peer-reviewed venues. Peer review, both during the grant application and at the publication stage, factors heavily into assessment by funding agencies. If tomorrow the NSF were to give significant weight to the availability of source code (it actually does, to a small extent), it might set up perverse incentives. A small percentage of federally funded research goes into computer science, and an even smaller fraction involves results for which there is enough demand for software. Another aspect of academic funding that people don't get is that research grants, unlike, say, contracts, carry a significantly different set of expectations. E.g., a student can get an NSF fellowship claiming he wants to cure diseases using machine learning, only to later spend three years working on a music recommendation system (true story!).

Regarding the economics study you linked to, I am very familiar with it, having seen the interview with the graduate student on the Colbert Report. For non-CS fields the quality of code is in any case so bad that reproduction is much more difficult. Further, several researchers rely on proprietary tools, which only makes the task harder.

In my opinion the correct way is not to have the NSF impose rules, but rather to have the venues that accept papers (conferences & journals) insist on software being provided. However, this is easier said than done, since it's a competitive two-sided market.

Regarding actual licensing issues, I can assure you that the GPL is the second-favorite license of university IP departments, the first being "all rights reserved, with modification explicitly forbidden, except for reproduction of experiments."


Ah, so theses are special insofar as they spawn more derivative commercial and paper-writing opportunities, and aren't singled out simply by virtue of being called a "thesis".


Usually the university owns the code in some way and you have to check whether there are any IP issues.


The problem here is (EDIT: IMHO) a social one. The question is not "Why does nobody USE NixOS?", it's "Why does nobody WANT NixOS?"

And the answer to that is that the incentives are set up such that reproducibility is a waste of time. As a CS researcher, I want to be idealistic. But the field is competitive and I'm not sure how much idealism I can afford.

There's considerable effort to bring artifact evaluation into the academic mainstream (I'm actually helping out at OOPSLA this year [1]), and I think this is a good way forward.

[1] http://2016.splashcon.org/track/splash-2016-artifacts


I'm not trying to argue that the wrong-headed institutional incentives are irrelevant. It is true that once one gets over the learning curve, Nix* gives you artifact reproducibility for free, but that still leaves the problem of the learning curve, and of caring about reproducibility in the first place.

Thank you for your work, and that of others, in making artifact evaluation a priority. Concurrently, people (including myself) are trying to do something about that learning curve. Hopefully both efforts will massively succeed within a decade :).


Sorry if I've come on too strong. I might've gotten a bit defensive. I've done work in the past that is hard to reproduce and that's not an aspect of my work I'm proud of -- to say the least.

Btw, I'm excited about NixOS. It sounds like you're involved. Thank you. I haven't used it, but I'm hoping to find the time soon.


No, you definitely didn't :). I'd say a common HN bias is probably to ignore institutional forces when it's convenient. Glad to hear you are excited about NixOS!


I'm currently in the process of rewriting some of my code for doing certain simulations via probabilistic programming in Haskell. I expected this to be a pain in the ass, but to maybe make the code neater. I've actually found the unexpected benefit to be that the code runs deterministically and produces the same results each time, so I know that an apparent result is not going to go away with another run and a fresh PRNG stream.
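To make the determinism concrete, here is a minimal sketch of the idea (my own toy example, not the parent's code; estimatePi, the Monte Carlo setup, and the seed 42 are all made up for illustration, assuming only the standard random package): seed the generator explicitly instead of pulling one from the system, and the whole simulation becomes a pure function of the seed, so every run prints the same numbers.

    import System.Random (mkStdGen, randoms)

    -- Toy Monte Carlo estimate of pi, standing in for a real simulation.
    -- Because the generator comes from a fixed seed rather than the system,
    -- the result is a pure function of (seed, n) and is identical on every
    -- run (given the same version of the random package).
    estimatePi :: Int -> Int -> Double
    estimatePi seed n =
      let samples = take (2 * n) (randoms (mkStdGen seed) :: [Double])
          hits    = length [() | (x, y) <- pairs samples, x * x + y * y <= 1]
      in  4 * fromIntegral hits / fromIntegral n
      where
        pairs (a:b:rest) = (a, b) : pairs rest
        pairs _          = []

    main :: IO ()
    main = do
      print (estimatePi 42 100000)
      print (estimatePi 42 100000)  -- prints exactly the same value again

The same trick applies to any simulation: pass the generator (or seed) in explicitly rather than reaching for getStdGen or randomIO, and reruns become exact replays.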



