Why not just anonymize the user data part? There are startups selling *health da...

eli · on Dec 13, 2012

I don't think Reddit users would go for that. Also, it's surprisingly difficult to anonymize data effectively without removing nearly all of it.

otakucode · on Dec 13, 2012

I'm not so sure about that. I believe the problem has been solved; the solution just isn't widely known yet. I read a paper on arXiv probably a year ago that describes a method that seems pretty straightforward and secure, but I've never seen anything about it since. It involved, essentially, throwing out any records which could actually contribute to a change in a statistical measure. You basically end up finding what aspects of the data are actually identifiable, and throw out any records that contain them. It's guaranteed not to screw up your observations because, by definition, if something is statistically significant it has to show up often enough that it CAN'T be used to single out a source.
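The paper isn't identified here, so the following is not its method. It is only a minimal sketch of the general idea of dropping records that are rare enough to single someone out, in the style of k-anonymity suppression. The function name `suppress_rare`, the field names, and the threshold `k` are all made up for illustration.

```python
from collections import Counter

def suppress_rare(records, quasi_identifiers, k):
    """Drop every record whose combination of quasi-identifier values
    appears fewer than k times in the data set (hypothetical helper)."""
    def key(record):
        return tuple(record[q] for q in quasi_identifiers)

    counts = Counter(key(r) for r in records)
    return [r for r in records if counts[key(r)] >= k]

# Toy data: the field names and values are invented for this example.
records = [
    {"zip": "02139", "age": 34, "sub": "python"},
    {"zip": "02139", "age": 34, "sub": "python"},
    {"zip": "02139", "age": 34, "sub": "golang"},
    {"zip": "90210", "age": 71, "sub": "knitting"},
]

# The lone (90210, 71) record is unique on (zip, age), so it is removed.
kept = suppress_rare(records, ["zip", "age"], k=2)
```

Note that the result depends entirely on which fields you decide to treat as quasi-identifiers, and that choice is the hard part.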

gwern · on Dec 14, 2012

Given the dismal history of anonymization, a paper on arXiv is roughly up there with a blogger saying 'I've proven P != NP'...

> It's guaranteed not to screw up your observations because, by definition, if something is statistically significant it has to show up often enough that it CAN'T be used to single out a source.

What's 'statistically significant' here? The usual p<0.05 convention? You realize that there can be multiple measurements or pieces of data all of which individually have p>0.05 but together have p<<0.05... Information leakage should be measured in bits, not p-values.

(This kind of aggregation is one of the benefits of approaches like meta-analysis.)
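To make the aggregation point concrete, here is a rough sketch using Fisher's method, one standard way to combine p-values from independent tests. The independence assumption and the example inputs are mine, as are the helper names `fisher_combined_p` and `bits`. The second helper illustrates the "measure it in bits" view: an attribute shared by a fraction f of the population leaks log2(1/f) bits, and roughly 33 bits are enough to single out one person among about 8 billion.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null; for even degrees of freedom
    the tail probability has the closed form used below.
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2
    return math.exp(-half_stat) * sum(
        half_stat ** i / math.factorial(i) for i in range(k)
    )

def bits(fraction):
    """Bits of identifying information in an attribute that a given
    fraction of the population shares."""
    return -math.log2(fraction)

# Ten measurements, none of them "significant" on its own (each p = 0.1).
individual = [0.1] * 10
combined = fisher_combined_p(individual)

# Three hypothetical attributes, assumed independent: their bits add up.
leaked = bits(0.5) + bits(0.01) + bits(0.001)
```

Here `combined` comes out well below 0.05 even though every input is above it, and `leaked` is already about half of the roughly 33 bits needed to identify a single person.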