
Fair warning: anonymization is a hard problem. It is never easy, and you'd be surprised how many bits can leak out of what you thought was properly anonymized data.

If you are using data for test purposes, please use generated data, not anonymized data. This has the additional advantage that there is no potential path for live data to end up on a developer's machine.
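If it helps, here is the kind of thing I mean: a minimal sketch using only the standard library, with a made-up user schema (name/email/signup) that you'd obviously swap for your own tables.

  # Minimal sketch of generating purely synthetic test records (no real data involved).
  # The fields here are hypothetical; adapt them to your own schema.
  import random
  import string
  from datetime import date, timedelta

  FIRST_NAMES = ["Alice", "Bob", "Carol", "Dave", "Eve"]
  LAST_NAMES = ["Smith", "Jones", "Garcia", "Chen", "Okafor"]

  def fake_user(rng: random.Random) -> dict:
      first = rng.choice(FIRST_NAMES)
      last = rng.choice(LAST_NAMES)
      handle = "".join(rng.choices(string.ascii_lowercase, k=8))
      return {
          "name": f"{first} {last}",
          "email": f"{handle}@example.com",  # reserved example domain, never a real address
          "signup": str(date(2020, 1, 1) + timedelta(days=rng.randrange(365))),
      }

  rng = random.Random(42)  # fixed seed so test fixtures are reproducible
  users = [fake_user(rng) for _ in range(10)]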

added in edit: Also realize that using a service such as this (or a similar one) actually increases the chances that you are leaking sensitive data; uploading it somewhere is the very best way to ensure that at some point there is a breach. Don't take the 'made easy' line for granted: if possible, ask the company for their audit reports and for the measures they have in place to ensure that your data doesn't end up elsewhere. A company providing such a service, or a chunk of software that does this, is, of course, a massive target. The only way to stay away from that extra risk is to run this on your own premises, on a machine that is not connected directly or indirectly to the outside world.



Earlier this year we saw articles published about the discovery of an Earth-sized rogue planet in our galaxy. The way it was discovered is remarkable. A telescope was looking at a distant star; the star then seemingly brightened over a period of 42 minutes. And that was basically it.

When a massive object passes between a distant star and an Earth-based observer, the light coming from the star gets deflected and focused by the gravity of the massive object, and the star seemingly brightens as a result. This is called gravitational microlensing. The duration and shape of the brightening allowed scientists to determine that the lens was likely a planet somewhere between Mars and Earth in mass, and that it likely had no host star within 8 astronomical units. It is most likely a rogue planet of roughly Earth size.
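For the curious, the standard microlensing relations give a feel for how a 42-minute brightening pins down the mass. This is a back-of-the-envelope sketch, not the authors' actual analysis:

  \theta_E = \sqrt{\frac{4 G M}{c^2} \cdot \frac{D_S - D_L}{D_L D_S}}, \qquad t_E \approx \frac{\theta_E}{\mu_{\mathrm{rel}}}

Here M is the lens mass, D_L and D_S are the distances to the lens and the source, and \mu_{\mathrm{rel}} is the relative lens-source proper motion. Since the event timescale t_E scales as the square root of M, an event lasting tens of minutes rather than the days to weeks typical of stellar lenses points to a lens thousands of times lighter than a star, i.e. in planetary territory.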

Think about how little actual information these astronomers had. Yet they were able to make a very credible inference about what happened. You see this all over science, particularly in physics, where the truth is coaxed out of very little direct data. This makes me think that similar things can probably be done with data about people, which would mean that effectively anonymizing people's data is very hard or maybe even impossible.


Indeed, the ability to infer a lot, very accurately, from a very small amount of data is something I've long thought of as the "Sherlock Holmes" problem. If there are just a few people with the ability to deduce lots of things, they can be amusing and offer some limited utility, like Sherlock Holmes.

Computers are now powerful enough, and people clever enough, that everyone can have a Sherlock Holmes in their pocket. And large corporations can have smarts far beyond Sherlock Holmes. So what was once an "eh, there's no rule against it; if you have the information, the smart guy should be able to use it" is now a case of "oh dang, hmmm, maybe civilization is built on the idea that not everyone is Sherlock Holmes."

Maybe this should have been my own blog post somewhere, but anyway, it's one of the many new conundrums of our time.


I think it's the same problem as surveillance. When tailing someone required a substantial investment of an actual person's time, and a wiretap literally required physical handling of the physical lines to that person's phone, it made sense to give the police wide latitude in who they put under a lens, because the effort involved was intrinsically limited and unable to scale to the population level.

Enter today, when nearly everyone can be, and is, surveilled in detail that cops of previous generations could only dream of. We're still dealing with those same laws, but we've removed the unwritten premise they were based on: that surveillance couldn't scale.


I, for one, would be very interested in reading a more in-depth exploration of this idea, and would encourage you to write and publish that blog post.


Seconded


33 bits... that's all you need.
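(For anyone wondering where the number comes from: 2^33 is about 8.6 billion, i.e. more than the world's population, so roughly 33 bits of information are enough to single out any individual. Back-of-the-envelope check:)

  import math

  world_population = 7.8e9               # rough 2020 figure
  print(math.log2(world_population))     # ~32.9 bits uniquely identify one person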


This.

Alternatively, look for an open-licensed dataset if one exists in your domain (e.g. using https://fairsharing.org or, shameless plug, https://biokeanos.com). With generated data you always bake in some assumptions; with more 'wild' data you have a chance to discover edge cases earlier.


Self plug: I wrote about some aspects of why this is a hard problem a couple of years ago: https://goteleport.com/blog/hashing-for-anonymization/
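The core of the argument, in toy form: if the space of inputs is small enough to enumerate, an unsalted, unkeyed hash is trivially reversible by brute force. The phone number and the search range below are made up for the demo:

  import hashlib

  def anonymize(phone: str) -> str:
      # Naive "anonymization": unsalted, unkeyed hash of the identifier.
      return hashlib.sha256(phone.encode()).hexdigest()

  leaked = anonymize("+15551234567")

  # An attacker simply hashes every plausible phone number until one matches.
  for n in range(1234000, 1235000):          # tiny range to keep the demo fast
      candidate = f"+1555{n:07d}"
      if anonymize(candidate) == leaked:
          print("re-identified:", candidate)
          break

A shared salt shipped alongside the dataset doesn't change the picture much; it only forces the attacker to redo the enumeration per dataset.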


First sentence in [1]:

Amnesia is an application written in java and JavaScript and should be used locally for anonymizing a dataset.

[1] https://amnesia.openaire.eu/about-documentation.html


Agree with your point about anonymization, but here is a caveat about generated data too: if the objective is to build a model, you might end up losing information, or worse, your model might end up modelling the assumptions baked into the generation process.


Yes, that's absolutely valid. This too is hard. But then again, if it were easy everybody would be doing it, so consider it a problem that, when solved properly, becomes part of your moat, and something you could easily mention in a sales process to ensure a level playing field with other parties pitching.



What about building machine learning models that make predictions on said data? Can't just test on fake data.


I have made another comment on this page, and this is indeed a problem, barring specific cases where you have a "simulator" and you want to learn something applicable to some "aggregate of simulations". For example, learning a Reinforcement Learning policy for tic-tac-toe is fine, because you can build a tic-tac-toe simulator and you want a universal policy.

The problem goes beyond testing. Your training might (1) be deprived of information your original data had: e.g., among 1000 features, the key to the classification of interest might be a single feature; how does your generator know not to distort that one, potentially at the cost of distorting the other 999 features? Or it might (2) latch onto the assumptions made by the data generator: e.g., for a continuous-valued feature, your model might bias itself towards the distribution moments the generator assumed.

I am not sure generating good fake data is a different problem from good density estimation. And in some cases you might need to specify which parts of the original distribution you don't want the generator to mess with: consider an NLP dataset where your model must rely on sentence structure. Generating the right bag-of-words features might not help here: sequence matters. The same holds if you wanted to use contextual embeddings; sequence matters there too.

Even if you did manage to generate a "distributionally compatible" version of the data, you could run into problems wherever you perform some kind of data enrichment at a later stage. For example: if the original data has zipcodes that you wanted to mask, and your data generator substitutes them with arbitrary strings, then at a later point you cannot introduce a feature that measures the proximity of two locations.
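To make that zipcode example concrete (the values and the lookup table below are made up; a real pipeline would call out to an actual geocoding step):

  # Hypothetical illustration: once zipcodes are replaced by arbitrary tokens,
  # any later enrichment that depends on their meaning silently breaks.
  ZIP_TO_LATLON = {"10001": (40.750, -73.997), "07030": (40.744, -74.032)}  # toy lookup

  original = {"zip_a": "10001", "zip_b": "07030"}
  generated = {"zip_a": "tok_93af", "zip_b": "tok_1c2d"}  # generator's opaque substitutes

  def proximity_feature(row):
      a = ZIP_TO_LATLON.get(row["zip_a"])
      b = ZIP_TO_LATLON.get(row["zip_b"])
      if a is None or b is None:
          return None  # feature is unrecoverable on the generated data
      return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5  # crude planar distance

  print(proximity_feature(original))   # works: the two zipcodes are ~0.036 degrees apart
  print(proximity_feature(generated))  # None: the enrichment can no longer be computed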


This isn't my area of expertise, but I've spoken to computer vision researchers who apparently use generated data for training models for self-driving vehicle autonomy. Maybe they only use generated data for the train set and then do cross-validation on real data? I'd like to hear them chime in on this thread if any are reading here.

Theoretically speaking, if the generated data has the same distribution and parameters as the real data [1], and encodes similar nonparametric features like seasonality and user activity, I think generated data might be fine. [2]

_________

1. Admittedly tricky if you have limited data and no insight into the underlying population distribution/features, just those of the sample. But then you have a worse problem for modeling diagnostics anyway.

2. In the sense that anything is "fine", which is a spectrum that requires some critical skepticism in statistics. There are always caveats but it may still be robust and useful.


You could kickstart training on simulators and then do a transfer, i.e. make adjustments to your final model on real-world data. But to learn only on generated data, the problem boils down to the nonparametric features you will use to claim that the generated data is similar to the real data. What is a complex enough feature to say that two sets of images are equivalent? They might be statistically equivalent according to your features, but are they really? I think this is a very hard problem, because if we did have a good answer to this question then Tesla & co. would already be training their models on perfect simulators, and we wouldn't see the glitches currently found in autonomous driving applications.
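What that two-phase loop could look like in toy form; a sketch with a tiny logistic-regression model and fabricated data, nothing like an actual driving stack:

  # Toy sketch of "pretrain on generated data, then fine-tune on real data".
  import numpy as np

  rng = np.random.default_rng(0)

  def make_data(n, true_w, noise):
      X = rng.normal(size=(n, 2))
      logits = X @ true_w + rng.normal(scale=noise, size=n)
      return X, (logits > 0).astype(float)

  def train(w, X, y, lr, steps):
      for _ in range(steps):
          p = 1.0 / (1.0 + np.exp(-(X @ w)))       # sigmoid
          w -= lr * X.T @ (p - y) / len(y)          # gradient of the log-loss
      return w

  # The simulator is close to, but not exactly, the real world.
  X_sim, y_sim = make_data(10_000, np.array([1.0, -0.8]), noise=0.1)
  X_real, y_real = make_data(200, np.array([1.2, -1.0]), noise=0.3)

  w = np.zeros(2)
  w = train(w, X_sim, y_sim, lr=0.5, steps=500)     # cheap pretraining on generated data
  w = train(w, X_real, y_real, lr=0.05, steps=100)  # short, gentle fine-tune on real data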


That's what I figured, re: transfer modeling. Thanks for chiming in.


ARX (see other comment in this thread) also supports data anonymization for privacy-preserving machine learning.


There is a German company that specializes in this: https://www.statice.ai/, and that too is a path you should only walk if you fully understand the subject matter.

So you could instead test on fake data that has the same (or as much as possible the same) statistical properties as the data that you would like to use.
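As a minimal sketch of that idea for a single numeric column: fit a couple of summary statistics of the real column, then sample fresh values. (Real datasets need far more care about joint distributions and correlations; this only matches the marginal shape.)

  import numpy as np

  rng = np.random.default_rng(1)
  real = rng.lognormal(mean=3.0, sigma=0.5, size=5_000)   # stand-in for a real column

  # Fit simple summary statistics of the real column, then sample fresh values.
  mu, sigma = np.log(real).mean(), np.log(real).std()
  fake = rng.lognormal(mean=mu, sigma=sigma, size=5_000)

  print(real.mean(), fake.mean())   # close, but no individual real value is reused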


Hah! Came here to echo what you wrote. I’ve been in (too) many data anonymizing/sanitizing efforts. It’s anything but easy. I would strongly consider investing in test data generation.


How are you going to do COVID-19 research with generated test data?


jacquesm was talking about test purposes, which is a different problem.


sure, but I don't think that is the use case for this tool



