I think this almost every time I read a paper. It’s like Linus’ “show me the code.” I just want papers now to “show me the data and the code.” And include a discussion about why these results are important. I think it’s a great time for the scientific community to improve transparency on these fronts.
Sincerely, someone who reads a lot of research but contributes none because I’m an amateur.
> I just want papers now to “show me the data and the code.”
But the code is secondary to the idea. The idea and the discussion around how it was arrived at and what it means is the key thing. The code is just there to implement it. You could code the same idea ten different ways.
I translated what you said in the context of mathematics:
“But the proof is secondary to the theorem. The theorem and the discussion around how it was arrived at and what it means is the key thing. The proof is just there to show it’s true. You could write a proof for the same theorem ten different ways.”
Which is all true. But man it wastes so much time having to re-prove everything. Also some lemmas/theorems are so hard to prove. It’s much easier when you see some incredible statement and can’t believe it’s true to look at the proof and see where the mistake / contentious part is.
Yes! There is "reproducibility" in the sense that you can run the code and get the same answer, and this can be very useful. But I don't see that as the core domain of the research paper, which is to describe a new discovery. The paper's job should be to explain the discovery and give appropriate supporting evidence. This leads to a stronger form of reproducibility, like you say, which is "I understand the discovery and I can do it too". That's not the same as generating the exact result the authors show. And a paper that is reproducible in the first sense but not in the second is of limited scientific value.
If anything, I'd like to see more focus on giving evidence of generality of a result, vs just sharing everything needed to get back the same specific result.
Yes, but chances are it only appears to work because the analysis code has bugs. So first I want to check the code and that it works before I put effort into understanding the idea.
Those aren’t CS papers. Social science and biology research have constraints that CS does not. I haven’t seen any evidence that the problem in CS is anywhere close to that level. A couple of conferences have adopted artifact review, where an independent reviewer attempts to reproduce the experiments listed in the paper. Nearly all papers that participate end up passing.
One case where this happens a lot is papers that pick bad comparisons as state of the art. If you have the code, you can run it vs better configurations of existing tools to see if the promises still hold up.
Yes, that's what I tried to say. I can easily try a piece of code with my own data to verify that its results are plausible. I can't do that for a paper. So if I don't have the code, I might waste a significant amount of time trying to reproduce a fake paper.
Some papers are about an idea, but others (perhaps most others) are about results. And the results are very much dependent on the data and how you analyzed it.
If I've learned anything in my career it is that no, ideas are not valuable. There are vastly more bad ideas than good ideas. What makes an idea valuable is validation. Papers aren't to present ideas: papers are to present ideas that have been validated. We proposed an idea, we went and ran some experiments or gathered data some other way, and we concluded the idea was valid (or not valid). The point of this discussion here is that ideas that require huge amounts of computer effort to validate are prone to bugs. The conclusions cannot be relied upon to be validated without having the software available so that it, too, can be validated.
Really? I've read a lot of SIGGRAPH papers, and sure, they didn't provide all the code. But you know that the code exists. And certainly for those there's a lot of trust. I think we're talking here about something different. Not "hey, you can use quaternions for animation" but "If you factor in hippopotamus usage, young adults experience 22% more flange per square doughnut, and we got all this data, and we ran it through 25 separate computer programs written in lolcode, and look, proof!"
> But there are so many valuable papers in CS that just presented an idea. If you ignored them you’d be ignorant of how to do 90% of modern engineering.
You are right. But that is why we are having this discussion, so we can improve the situation.
Having even bad code (and corresponding data) available is always better than not. You can always just ignore it, and read the papers like today.
Honestly, I am OK with just a zip file of the project directory that you have anyway, hopefully with a list of the versions of the OS, libs and programs used.
We could do a lot better than just a zip file, but that would be a nice start.
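As a rough idea of what that "list of versions" could look like in practice, here is a minimal Python sketch that dumps the OS, interpreter, and installed package versions into a JSON file you could drop into the zip. The function names and the `environment.json` filename are my own invention for illustration, and it only covers Python packages, not system libraries or other programs.

```python
import json
import platform
import sys
from importlib import metadata


def capture_environment():
    """Collect OS, interpreter, and installed-package versions as a plain dict."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip broken installs that report no name
            packages[name] = dist.version
    return {
        "os": platform.platform(),
        "machine": platform.machine(),
        "python": sys.version.split()[0],
        "packages": dict(sorted(packages.items())),
    }


def write_manifest(path="environment.json"):
    """Write the environment description next to the code (hypothetical filename)."""
    env = capture_environment()
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(env, fh, indent=2)
    return env
```

Calling `write_manifest()` right before zipping the project directory would be enough for a reader to see roughly what the code ran against, even if they can't rebuild the environment exactly.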
For some papers. Others make a claim based on some statistically significant look at data that might not have any basis in reality because the code is wrong. A famous example is the R&R paper in economics, where a second look showed massive mistakes in the Excel document they were using, invalidating the central thesis. Unfortunately, that was not before it had been used by the World Bank for years as a metric for forcing austerity on countries.
Not if the code is wrong, and therefore the conclusion may be wrong. I'm no scientist, but I don't think the point of scientific papers is to get unfounded ideas out into the world.
I can list many major influential papers in computer science that described an idea and didn't really give any concrete code, where we're still using the idea today.
For example the paper on polymorphic inline caching, which is the key idea for the performance of many programming languages today, just described the idea, and didn't present any code. How was it evaluated? People sat and thought about it. Holds up today.
You can reason about an idea through other things than concrete code. Code is transient and incidental. Ideas persist.
I think you're talking past each other. Both are true under different circumstances. In some cases an abstract idea is the important takeaway. In other cases the central point of a paper is to present conclusions that were arrived at based on analysis of some dataset. If the code used to generate or analyze the dataset is wrong, then conclusions based on it are likely worthless.
A lot of times the idea is wrong (and thus not valuable), and that can’t be proven either way without the code and data. So an idea that depends on code without the code is less valuable.
We can only possibly gain from publishing the code, and lose by not publishing.
It's not like it takes a whole lot of time to just dump your code in a GitHub repo once you're done and link it somewhere in the paper (if you wrote code at all while working on the paper).
Sometimes I did just want to run my own experiments with different datasets, and those algorithms aren't always trivial to implement :|
Yep, but we should still show that we actually did simulate our idea, and the methodology that gave rise to the simulation. Not for the sake of the code, but to test whether the simulation we describe actually outputs what we propose.
Not everyone is a programmer, but they could find one to confirm, or better yet, invalidate the code my team relied on.
I agree that they should ideally come with raw data along with all code that was used to process it to produce the results as presented.
> but contributes none because I’m an amateur
I don't mean to be rude but it seems relevant to point out. Papers aren't written for the benefit of amateurs. They're written for experts who actively work in that specific field. I don't think there's anything wrong with that.
Yeah I agree, but I read papers mainly in domains I do have university level degrees in. So while I’m not as expert as a lifetime professor, I do know the fields relatively well.
And I don’t think it’s rude, that’s why I included that statement!
There are also legal and privacy concerns. I've worked on a few research papers where exactly one researcher had access to the data under a very strict NDA. And even they did not get full access to the raw data, only the ability to run vetted code against it and some subsets for development.
This is because the datasets were subscriber logs from mobile operators. They are both highly privacy sensitive and contain sensitive business knowledge. There is no way they will ever get published, even in some anonymized form.
Ultimately it always comes down to trust. You need to convince your peer reviewers to trust you that you have correctly done what you have claimed to have done. Of course, even when you publish datasets, you need to convince the peer reviewers to trust you that you didn't fake the data.
It doesn’t really work like that. For instance, imagine you have a simulation with billions of particles in it. To construct reduced data you may need to use many fields (position, temperature, composition) of all particles over many outputs (usually at different times).
In that case you shouldn’t need to ship the data at all. Just include the code for the simulation and let the researchers run it to generate the data themselves.
Sorry I'm a bit late to this, but those simulations take 10s - 100s of millions of CPU hours (i.e. costs of millions - 10s of millions of dollars), so that's not practical.
I think in astronomy they generate tens of terabytes per night and an experiment may involve automatically searching through the data for instances of something rare, like one star almost exactly behind another star, or an imminent supernova, or whatever. To test the program that does the searching you need the raw data, which until recently, at least, was stored on magnetic tape because they don't need random access to it: they read through all the archived data once per month (say) and apply all current experiments to it, so whenever you submit a new experiment you get the results back one month later.
I like the idea of publishing the data with the paper but it's not feasible in every case.
The GP is making a completely legitimate point here that broad sharing of large raw datasets is pretty hard, but I don't think anyone is arguing we should give up. Here's a few thoughts, though they're more directed at the general thread than the parent.
In my case I'm currently finishing up a paper where the raw data it's derived from comes to 1.5 PB. It is not impossible to share that, but it costs time and money (which academia is rarely flush with), and even if it was easy at our end, very few groups that could reproduce it have the spare capacity to ingest that. We do plan to publicly release it, but those plans have a lot of questions.
Alternatively we could try to share summary statistics (as suggested by a post above), but then we need to figure out at what level is appropriate. In our case we have a relevant summary statistic of our data that comes to about 1 TB that is now far easier to share (1 TB really isn't a problem these days, though you're not embedding it in a notebook). But a large amount of data processing was applied to produce that, and if I give you that summary I'm implicitly telling you to trust me that what we did at that stage was exactly what we said we'd done and was done correctly. Is that reproducibility?
You could also argue this the other way. What we've called "raw data" is just the first thing we're able to archive, but our acquisition system that generates it is a large pile of FPGAs and GPUs running 50k lines of custom C++. Without the input voltage streams you could never reproduce exactly what it did, so do you trust that? Then you're into the realm of is our test suite correct, and does it have good enough coverage?
I think we have a pretty good handle on one aspect of this: is our analysis internally reproducible? I.e., with access to the raw data, can I reproduce everything you see in the paper? That's a mixture of systems (e.g. configs and git repo hashes being automatically embedded into output files) and culture (e.g. making sure no one thinks it's a good idea to insert some derived data into our analysis pipeline that doesn't have that description embedded; data naming and versioning).
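To make the "systems" half concrete, here is a minimal Python sketch of the general idea of stamping every output with the commit hash and config that produced it. This is not our actual pipeline; the function names and the layout of the `provenance` record are assumptions made up for illustration.

```python
import hashlib
import json
import subprocess
from datetime import datetime, timezone


def git_commit(repo_dir="."):
    """Return the current commit hash, or "unknown" if git is not usable here."""
    try:
        result = subprocess.run(
            ["git", "rev-parse", "HEAD"],
            cwd=repo_dir, capture_output=True, text=True, check=True,
        )
        return result.stdout.strip()
    except (OSError, subprocess.CalledProcessError):
        return "unknown"


def config_digest(config):
    """Hash the config so that key order does not change the digest."""
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def with_provenance(results, config, repo_dir="."):
    """Wrap analysis results with the code version and config that produced them."""
    return {
        "provenance": {
            "git_commit": git_commit(repo_dir),
            "config": config,
            "config_sha256": config_digest(config),
            "created_utc": datetime.now(timezone.utc).isoformat(),
        },
        "results": results,
    }
```

A derived product written as `json.dump(with_provenance(results, config), fh)` then carries its own description, which is what lets you notice when a file entering the pipeline is missing one.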
But the external reproducibility question is still challenging, and I think it's better to think about it as being more of a spectrum with some optimal point balancing practicality and how much an external person could reasonably reproduce. Probably with some weighting for how likely is it that someone will actually want to attempt a reproduction from that level. This seems like the question that could do with useful debate in the field.
Edit: when I say data, I mean the raw data.