
Fair warning: anonymization is a hard problem. It is never easy, and you'd be surprised how many bits can leak out of what you thought was properly anonymized data.

If you are using data for test purposes, please use generated data, not anonymized data. This has the additional advantage that there is no potential path for live data to end up on a developer's machine.
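If it helps, here is the kind of thing I mean: a minimal sketch using only the standard library, with a made-up user schema (name/email/signup) that you'd obviously swap for your own tables.

  # Minimal sketch of generating purely synthetic test records (no real data involved).
  # The fields here are hypothetical; adapt them to your own schema.
  import random
  import string
  from datetime import date, timedelta

  FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve"]
  LAST_NAMES = ["Smith", "Jones", "Garcia", "Chen", "Okafor"]

  def fake_user(rng: random.Random) -> dict:
      first = rng.choice(FIRST_NAMES)
      last = rng.choice(LAST_NAMES)
      handle = "".join(rng.choices(string.ascii_lowercase, k=8))
      return {
          "name": f"{first} {last}",
          "email": f"{handle}@example.com",  # reserved example domain, never a real address
          "signup": str(date(2020, 1, 1) + timedelta(days=rng.randrange(365))),
      }

  rng = random.Random(42)  # fixed seed so test fixtures are reproducible
  users = [fake_user(rng) for _ in range(10)]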

added in edit: Also realize that using a service such as this (or a similar one) actually increases the chances that you are leaking sensitive data; uploading it somewhere is the very best way to ensure that at some point there is a breach. Don't take the 'made easy' line for granted: if possible, ask the company for their audit reports and for the measures they have in place to ensure that your data doesn't end up elsewhere. A company providing such a service, or a chunk of software that does this, is, of course, a massive target. The only way to stay away from that extra risk is to run this on your own premises, on a machine that is not connected directly or indirectly to the outside world.



Earlier this year we saw articles published about the discovery of an Earth-sized rogue planet in our galaxy. The way it was discovered is remarkable. A telescope was looking at a distant star; the star then seemingly brightened over a period of 42 minutes. And that was basically it.

When a massive object passes between a distant star and an Earth-based observer, the light coming from the star gets deflected and focused by the gravity of the massive object, and the star seemingly brightens as a result. This is called gravitational microlensing. The duration and shape of the brightening allowed scientists to determine that the lens was likely a planet somewhere between Mars and Earth in mass, and that it likely had no host star within 8 astronomical units. It is most likely a rogue planet of roughly Earth size.
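For the curious, the standard microlensing relations give a feel for how a 42-minute brightening pins down the mass. This is a back-of-the-envelope sketch, not the authors' actual analysis:

  \theta_E = \sqrt{\frac{4 G M}{c^2} \cdot \frac{D_S - D_L}{D_L D_S}}, \qquad t_E \approx \frac{\theta_E}{\mu_{\mathrm{rel}}}

Here M is the lens mass, D_L and D_S are the distances to the lens and the source, and \mu_{\mathrm{rel}} is the relative lens-source proper motion. Since the event timescale t_E scales as the square root of M, an event lasting tens of minutes rather than the days to weeks typical of stellar lenses points to a lens thousands of times lighter than a star, i.e. in planetary territory.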

Think about how little actual information these astronomers had. Yet they were able to make a very credible inference about what happened. You see this all over science, particularly in physics, where the truth is coaxed out of very little direct data. This makes me think that similar things can probably be done with data about people, which would mean that effectively anonymizing people's data is very hard or maybe even impossible.


Indeed, the ability to infer a lot, very accurately, from a very small amount of data is something I've long thought of as the "Sherlock Holmes" problem. If there are just a few people with the ability to deduce lots of things, they can be amusing and offer some limited utility, like Sherlock Holmes.

Computers are now powerful enough, and people clever enough, that everyone can have a Sherlock Holmes in their pocket. And large corporations can have smarts far beyond Sherlock Holmes. So what was once an "eh, there's no rule against it; if you have the information, the smart guy should be able to use it" is now a case of "oh dang, hmmm, maybe civilization is built on the idea that not everyone is Sherlock Holmes."

Maybe this should have been my own blog post somewhere, but anyway, it's one of the many new conundrums of our time.


I think it's the same problem as surveillance. When tailing someone required a substantial investment of an actual person's time, and a wiretap literally required physical handling of the physical lines to that person's phone, it made sense to give the police wide latitude in who they put under a lens, because the effort involved was intrinsically limited and unable to scale to the population level.

Enter today, when nearly everyone can be, and is, surveilled in detail that cops of previous generations could only dream of. We're still dealing with those same laws, but we've removed the unwritten premise they were based on: that surveillance couldn't scale.


I, for one, would be very interested in reading a more in-depth exploration of this idea, and would encourage you to write and publish that blog post.


Seconded


33 bits... that's all you need.
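(For anyone wondering where the number comes from: 2^33 is about 8.6 billion, i.e. more than the world's population, so roughly 33 bits of information are enough to single out any individual. Back-of-the-envelope check:)

  import math

  world_population = 7.8e9               # rough 2020 figure
  print(math.log2(world_population))     # ~32.9 bits uniquely identify one person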


This.

Alternatively, look for an open-licensed dataset if one exists in your domain (e.g. using https://fairsharing.org or, shameless plug, https://biokeanos.com). With generated data you always bake in some assumptions; with more 'wild' data you have a chance to discover edge cases earlier.


Self plug: I wrote about some aspects of why this is a hard problem a couple of years ago: https://goteleport.com/blog/hashing-for-anonymization/
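The core of the argument, in toy form: if the space of inputs is small enough to enumerate, an unsalted, unkeyed hash is trivially reversible by brute force. The phone number and the search range below are made up for the demo:

  import hashlib

  def anonymize(phone: str) -> str:
      # Naive "anonymization": unsalted, unkeyed hash of the identifier.
      return hashlib.sha256(phone.encode()).hexdigest()

  leaked = anonymize("+15551234567")

  # An attacker simply hashes every plausible phone number until one matches.
  for n in range(1234000, 1235000):          # tiny range to keep the demo fast
      candidate = f"+1555{n:07d}"
      if anonymize(candidate) == leaked:
          print("re-identified:", candidate)
          break

A shared salt shipped alongside the dataset doesn't change the picture much; it only forces the attacker to redo the enumeration per dataset.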


First sentence in [1]:

Amnesia is an application written in java and JavaScript and should be used locally for anonymizing a dataset.

[1] https://amnesia.openaire.eu/about-documentation.html


Agree with your point about anonymization, but here is a caveat about generated data too: if the objective is to build a model, you might end up losing information, or worse, your model might end up modelling the assumptions baked into the generation process.


Yes, that's absolutely valid. This too is hard. But then again, if it were easy everybody would be doing it, so consider it a problem that, when solved properly, becomes part of your moat, and something you could easily mention in a sales process to ensure a level playing field with other parties pitching.



What about building machine learning models that make predictions on said data? Can't just test on fake data.


I have made another comment on this page, and this is indeed a problem, barring specific cases where you have a "simulator" and you want to learn something applicable to some "aggregate of simulations". For example, learning a Reinforcement Learning policy for tic-tac-toe is fine, because you can build a tic-tac-toe simulator and you want a universal policy.

The problem goes beyond testing. Your training might (1) be deprived of information your original data had: e.g., among 1000 features, the key to the classification of interest might be a single feature; how does your generator know not to distort that one, potentially at the cost of distorting the other 999 features? Or it might (2) latch onto the assumptions made by the data generator: e.g., for a continuous-valued feature, your model might bias itself towards the distribution moments the generator assumed.

I am not sure generating good fake data is a different problem from good density estimation. And in some cases you might need to specify which parts of the original distribution you don't want the generator to mess with: consider an NLP dataset where your model must rely on sentence structure. Generating the right bag-of-words features might not help here: sequence matters. The same holds if you wanted to use contextual embeddings; sequence matters there too.

Even if you did manage to generate a "distributionally compatible" version of the data, you could run into problems wherever you perform some kind of data enrichment at a later stage. For example: if the original data has zipcodes that you wanted to mask, and your data generator substitutes them with arbitrary strings, then at a later point you cannot introduce a feature that measures the proximity of two locations.
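To make that zipcode example concrete (the values and the lookup table below are made up; a real pipeline would call out to an actual geocoding step):

  # Hypothetical illustration: once zipcodes are replaced by arbitrary tokens,
  # any later enrichment that depends on their meaning silently breaks.
  ZIP_TO_LATLON = {"10001": (40.750, -73.997), "07030": (40.744, -74.032)}  # toy lookup

  original = {"zip_a": "10001", "zip_b": "07030"}
  generated = {"zip_a": "tok_93af", "zip_b": "tok_1c2d"}  # generator's opaque substitutes

  def proximity_feature(row):
      a = ZIP_TO_LATLON.get(row["zip_a"])
      b = ZIP_TO_LATLON.get(row["zip_b"])
      if a is None or b is None:
          return None  # feature is unrecoverable on the generated data
      return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5  # crude planar distance

  print(proximity_feature(original))   # works: the two zipcodes are ~0.036 degrees apart
  print(proximity_feature(generated))  # None: the enrichment can no longer be computed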


This isn't my area of expertise, but I've spoken to computer vision researchers who apparently use generated data for training models for self-driving vehicle autonomy. Maybe they only use generated data for the train set and then do cross-validation on real data? I'd like to hear them chime in on this thread if any are reading here.

Theoretically speaking, if the generated data has the same distribution and parameters as the real data [1], and encodes similar nonparametric features like seasonality and user activity, I think generated data might be fine. [2]

_________

1. Admittedly tricky if you have limited data and no insight into the underlying population distribution/features, just those of the sample. But then you have a worse problem for modeling diagnostics anyway.

2. In the sense that anything is "fine", which is a spectrum that requires some critical skepticism in statistics. There are always caveats but it may still be robust and useful.


You could kickstart training on simulators and then do a transfer, i.e. make adjustments to your final model on real-world data. But to learn only on generated data, the problem boils down to the nonparametric features you will use to claim that the generated data is similar to the real data. What is a complex enough feature to say that two sets of images are equivalent? They might be statistically equivalent according to your features, but are they really? I think this is a very hard problem, because if we did have a good answer to this question then Tesla & co. would already be training their models on perfect simulators, and we wouldn't see the glitches currently found in autonomous driving applications.
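What that two-phase loop could look like in toy form; a sketch with a tiny logistic-regression model and fabricated data, nothing like an actual driving stack:

  # Toy sketch of "pretrain on generated data, then fine-tune on real data".
  import numpy as np

  rng = np.random.default_rng(0)

  def make_data(n, true_w, noise):
      X = rng.normal(size=(n, 2))
      logits = X @ true_w + rng.normal(scale=noise, size=n)
      return X, (logits > 0).astype(float)

  def train(w, X, y, lr, steps):
      for _ in range(steps):
          p = 1.0 / (1.0 + np.exp(-(X @ w)))       # sigmoid
          w -= lr * X.T @ (p - y) / len(y)          # gradient of the log-loss
      return w

  # The simulator is close to, but not exactly, the real world.
  X_sim, y_sim = make_data(10_000, np.array([1.0, -0.8]), noise=0.1)
  X_real, y_real = make_data(200, np.array([1.2, -1.0]), noise=0.3)

  w = np.zeros(2)
  w = train(w, X_sim, y_sim, lr=0.5, steps=500)     # cheap pretraining on generated data
  w = train(w, X_real, y_real, lr=0.05, steps=100)  # short, gentle fine-tune on real data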


That's what I figured, re: transfer modeling. Thanks for chiming in.


ARX (see other comment in this thread) also supports data anonymization for privacy-preserving machine learning.


There is a German company that specializes in this: https://www.statice.ai/, and that too is a path you should only walk if you fully understand the subject matter.

So you could instead test on fake data that has the same (or as much as possible the same) statistical properties as the data that you would like to use.
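As a minimal sketch of that idea for a single numeric column: fit a couple of summary statistics of the real column, then sample fresh values. (Real datasets need far more care about joint distributions and correlations; this only matches the marginal shape.)

  import numpy as np

  rng = np.random.default_rng(1)
  real = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)   # stand-in for a real column

  # Fit simple summary statistics of the real column, then sample fresh values.
  mu, sigma = np.log(real).mean(), np.log(real).std()
  fake = rng.lognormal(mean=mu, sigma=sigma, size=5_000)

  print(real.mean(), fake.mean())   # close, but no individual real value is reused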


Hah! Came here to echo what you wrote. I’ve been in (too) many data anonymizing/sanitizing efforts. It’s anything but easy. I would strongly consider investing in test data generation.


How are you going to do COVID-19 research with generated test data?


jacquesm was talking about test purposes, which is a different problem.


sure, but I don't think that is the use case for this tool



