
The GP is making a completely legitimate point here that broad sharing of large raw datasets is pretty hard, but I don't think anyone is arguing we should give up. Here are a few thoughts, though they're more directed at the general thread than at the parent.

In my case I'm currently finishing up a paper where the raw data it's derived from comes to 1.5 PB. It is not impossible to share that, but it costs time and money (which academia is rarely flush with), and even if it were easy at our end, very few groups that could reproduce the work have the spare capacity to ingest that much data. We do plan to release it publicly, but those plans come with a lot of open questions.

Alternatively we could try to share summary statistics (as suggested by a post above), but then we need to figure out what level of processing is appropriate. In our case we have a relevant summary statistic of our data that comes to about 1 TB, which is far easier to share (1 TB really isn't a problem these days, though you're not embedding it in a notebook). But a large amount of data processing was applied to produce it, and if I give you that summary I'm implicitly asking you to trust that what we did at that stage was exactly what we said we'd done, and that it was done correctly. Is that reproducibility?

You could also argue this the other way. What we've called "raw data" is just the first thing we're able to archive, but the acquisition system that generates it is a large pile of FPGAs and GPUs running 50k lines of custom C++. Without the input voltage streams you could never reproduce exactly what it did, so do you trust that? Then you're into the realm of whether our test suite is correct, and whether it has good enough coverage.

I think we have a pretty good handle on one aspect of this: is our analysis internally reproducible? i.e. with access to the raw data, can I reproduce everything you see in the paper? That's a mixture of systems (e.g. configs and git repo hashes being automatically embedded into output files) and culture (e.g. making sure no one thinks it's a good idea to insert some derived data into our analysis pipeline that doesn't have that description embedded; data naming and versioning).
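
For concreteness, the kind of provenance stamping I mean looks roughly like the sketch below. This is illustrative Python assuming HDF5 outputs, not our actual pipeline, and the function names are made up:

    # Minimal sketch: stamp every output file with the git commit and the
    # config that produced it, so derived data always carries its provenance.
    # (Illustrative only; h5py/HDF5 and these function names are assumptions.)
    import json
    import subprocess
    import h5py

    def provenance():
        # Record the exact code version and whether the working tree was dirty.
        commit = subprocess.check_output(
            ["git", "rev-parse", "HEAD"], text=True).strip()
        dirty = subprocess.run(["git", "diff", "--quiet"]).returncode != 0
        return {"git_commit": commit, "git_dirty": dirty}

    def write_output(path, data, config):
        # Write the data plus a JSON blob describing how it was produced.
        with h5py.File(path, "w") as f:
            f.create_dataset("data", data=data)
            f.attrs["provenance"] = json.dumps({**provenance(), "config": config})

The point is less the specific format and more that any file that doesn't carry this metadata is treated as untrusted by the rest of the pipeline.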

But the external reproducibility question is still challenging, and I think it's better to treat it as a spectrum, with some optimal point balancing practicality against how much an external person could reasonably reproduce. Probably with some weighting for how likely it is that someone will actually want to attempt a reproduction from that level. This seems like the question that could do with useful debate in the field.

Why not purchase a sufficient number of tapes or drives to capture the data and deposit it at the university library?

Certainly sharing the apparatus is hard, but you could release the schematics, board designs, and BOMs of the electronics involved.

The problem now is that 1) very few even try to reproduce results, and 2) very little money is available for reproduction.

Fixing those incentives would help a lot.
