Hacker News
Privacy Loss in Apple's Implementation of Differential Privacy on MacOS 10.12 (arxiv.org)
180 points by sohkamyung on Sept 12, 2017 | 39 comments



While I like this work, the title may be misleading to people who think of "privacy loss" as something distinct from "differential privacy".

What they show is that as you use Apple's implementation, the differential privacy parameter grows (providing weaker guarantees as time passes). They don't show that they can bypass the mechanism and its guarantees, just that Apple has rigged the implementation to decay the guarantees as you continue to use it (note: the decay stops if you stop using Apple stuff).


This is a known weakness with differential privacy. Depending on how much noise you inject into your data, your analytics either approach meaninglessness or the privacy decays.

The hype machine surrounding Apple did not want to hear this and was just caught up in the idea that "Apple could data mine you while maintaining your privacy".


Funny that almost everyone in this thread seems to "get" differential privacy and thinks of it as a good tool. But when it was discussed for Mozilla Firefox, everybody was appalled and enraged. (Thread at https://news.ycombinator.com/item?id=15071492)


I think the big difference there is that Mozilla was proposing opt-out differentially private data collection, whereas I believe Apple has historically been opt-in (with the default being no data collection).


Do you believe an opt-in is necessary with a (properly implemented) differentially private collection mechanism? Just curious about your take on this.


I think that is the ethical thing to do, yes.

Differential privacy does statistically disclose information about you, and whether the quantitative bound used is up to your standards is a decision you should make before it happens. Given that user-level understanding of DP is so low, I don't think defaulting people into levels chosen by others is a good idea. I 100% guarantee Mozilla doesn't have the background to make this choice responsibly, and doesn't have near the DP expertise the RAPPOR team has / had (who I also wouldn't trust to choose things for me).

Ideally, the use of differential privacy would make you more willing to opt in (when you have a choice) rather than being a smokescreen for organizations that simply want to harvest more data (which was Mozilla's stated motivation).

Edit: fwiw, there are some cool "recent" versions of differential privacy that let the users control the amount of privacy loss on a user-by-user basis. So, you could start at 0 and dial it up as you feel more comfortable with the tech. This incentivizes organizations to be more transparent with what they do, as it (in principle) increases turnout.

Edit2: For context, Apple's "default" values appear to be (from this paper) epsilon = 16 * days. That means that each day you are active, the posterior odds someone has about any fact about you can increase by a factor of exp(16) ~= 8.9 million. So, numbers matter and I am (i) glad Apple made it opt-in, (ii) super disappointed they aren't at all transparent about how it works, and (iii) thankful that the paper authors are doing this work.
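
A back-of-the-envelope sketch of that arithmetic (assuming the per-day epsilon of 16 from the paper and the standard additive composition of daily budgets; the numbers here are illustrative only):

    import math

    eps_per_day = 16                       # per-day value reported in the paper
    for days in (1, 7, 30):
        eps = eps_per_day * days           # standard sequential composition: daily budgets add up
        print(days, eps, math.exp(eps))    # e^epsilon bounds how much posterior odds can grow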


And it seems that the "default" values will be increasing further in iOS 11 -- to 43 * days.


So now Apple's privacy system is only stupidly more secure than everyone else's instead of absurdly more secure.

So 16 per day sounds like a lot more than 1 or 2 per day, but what do these numbers mean? Presumably 16 per day is a theoretical maximum if you were to generate every kind of privacy-related data every day. But is 16 really a lot? How high would that have to go cumulatively in order to be useful for extracting reliable info on an individual? Wouldn't the info collected on an individual still have to be associated with them? Frankly, I'm not really able to determine any of that from the paper.


Could someone please ELI5 how an "intimate" provider (such as Apple, Google or Microsoft) can collect any data on an ongoing basis without eventual loss of privacy?


Let’s say you wanted to count how many of your online friends were dogs, while respecting the maxim that, on the Internet, nobody should know you’re a dog. To do this, you could ask each friend to answer the question “Are you a dog?” in the following way. Each friend should flip a coin in secret, and answer the question truthfully if the coin came up heads; but, if the coin came up tails, that friend should always say “Yes” regardless. Then you could get a good estimate of the true count from the greater-than-half fraction of your friends that answered “Yes”. However, you still wouldn’t know which of your friends was a dog: each answer “Yes” would most likely be due to that friend’s coin flip coming up tails.

Source: Google's RAPPOR project

I pointed to some open source repos on my blog post from 2015 https://www.quantisan.com/a-magical-promise-of-releasing-you...
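
A minimal sketch of the estimate described above (hypothetical code, not from RAPPOR itself): with a fair coin, half of all answers are forced to "Yes" on average, so E[yes fraction] = 0.5 + 0.5 * (dog fraction), and inverting that is a one-liner.

    import random

    def answer(is_dog):
        # Heads (probability 0.5): answer truthfully.  Tails: always say "Yes".
        return is_dog if random.random() < 0.5 else True

    friends = [random.random() < 0.2 for _ in range(100_000)]     # suppose 20% of friends are dogs
    yes_fraction = sum(answer(d) for d in friends) / len(friends)
    dog_fraction_estimate = 2 * yes_fraction - 1                  # invert E[yes] = 0.5 + 0.5 * p
    print(round(dog_fraction_estimate, 3))                        # ~0.2, but no single "Yes" is incriminating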


The proper way is: you flip a coin, if it comes up heads, you say the truth, otherwise you say whatever you want.

Then you infer an estimate using Bayes' theorem.

Otherwise it is not private, as a reply has pointed out.


" The proper way is: you flip a coin, if it comes up heads, you say the truth, otherwise you say whatever you want."

Saying "whatever you want" will incur a very large sampling error, especially as the population of those saying whatever increases.

What is needed is a notion of scalable privacy, where, as the population of those saying "whatever" increases, the privacy strength also increases while the absolute error remains at worst constant.

https://arxiv.org/abs/1708.01884


Mmmhhh... I was trying to ELI5. I understand the sampling error may be large, but I cannot see the inherent problem. Could you explain, please? (I mean, what do I have to do if I get tails? Or do we change the coin?)

Honest question, just too lazy to read the manuscript you linked...


Copying the relevant sections verbatim:

" the estimation error quickly increases with the population size due to the underlying truthful distribution distortion. For example, say we are interested in how many vehicles are at a popular stretch of the highway. Say we configure flip1 = 0.85 and flip2 = 0.3. We query 10,000 vehicles asking for their current location and only 100 vehicles are at the particular area we are interested in (i.e., 1% of the population truthfully responds “Yes"). The standard deviation due to the privacy noise will be 21 which is slightly tolerable. However, a query over one million vehicles (now only 0.01% of the population truthfully responds “Yes") will incur a standard deviation of 212. The estimate of the ground truth (100) will incur a large absolute error when the aggregated privatized responses are two or even three standard deviations (i.e., 95% or 99% of the time) away from the expected value, as the mechanism subtracts only the expected value of the noise."

"In this paper, our goal is to achieve the notion of scalable privacy. That is, as the population increases the privacy should strengthen. Additionally, the absolute error should remain at worst constant. For example, suppose we are interested in understanding a link between eating red meat and heart disease. We start by querying a small population of say 100 and ask “Do you eat red meat and have heart disease?". Suppose 85 truthfully respond “Yes". If we know that someone participated in this particular study, we can reasonably infer they eat red meat and have heart disease regardless the answer. Thus, it is difficult to maintain privacy when the majority of the population truthfully responds “Yes".

Querying a larger and diverse population would protect the data owners that eat red meat and have heart disease. Let’s say we query a population of 100,000 and it turns out that 99.9% of the population is vegetarian. In this case, the vegetarians blend with and provide privacy protection of the red meat eaters. However, we must be careful when performing estimation of a minority population to ensure the sampling error does not destroy the underlying estimate."
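
A rough check of how that noise grows, assuming the mechanism as described in the quote (report the truth with probability flip1, otherwise say "Yes" with probability flip2; the paper's exact accounting may differ slightly):

    flip1, flip2 = 0.85, 0.3
    p_forced_yes = (1 - flip1) * flip2            # chance a "No" respondent reports "Yes" anyway
    for n in (10_000, 1_000_000):
        noise_sd = (n * p_forced_yes * (1 - p_forced_yes)) ** 0.5
        print(n, round(noise_sd))                 # roughly the quoted 21 and ~210: sqrt(n) growth vs. a fixed true count of 100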


OK, so the idea is to keep the estimate useful without harming the privacy. Got it.

Thanks.


In that case though, you would know that the "No" friends are definitely not dogs, and the "Yes" friends are possibly dogs, so it seems like the dogs would still not be completely anonymous. Wouldn't the dogs be better off not partaking in the survey, rather than being narrowed down into a group of possible dogs?


The implementation of this should give a random answer when not being truthful.

If the coin comes up heads, you answer truthfully. If it comes up tails, you flip the coin again and answer yes if it comes up heads and no if it comes up tails. You can then no longer know whether anybody is (or is not) a dog.

The probabilities can be adjusted to provide more or less privacy (while making the data less or more useful). For example, if you only answer truthfully 0.1% of the time it would be hard to know anything about anyone, at the cost of knowing the total number of dogs less precisely.
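
A rough sketch of that tradeoff (made-up parameters, not Apple's or anyone's real ones): as the probability of answering truthfully shrinks, an individual answer reveals almost nothing, but the population estimate gets much noisier.

    import random

    def answer(truth, p_truth):
        # With probability p_truth answer honestly; otherwise answer with a fresh coin flip.
        if random.random() < p_truth:
            return truth
        return random.random() < 0.5

    population = [i < 1000 for i in range(100_000)]           # exactly 1,000 dogs out of 100,000
    for p_truth in (0.5, 0.1, 0.001):
        yes = sum(answer(t, p_truth) for t in population)
        # Invert E[yes fraction] = p_truth * p + (1 - p_truth) / 2 to recover the dog fraction p.
        p_hat = (yes / len(population) - (1 - p_truth) / 2) / p_truth
        print(p_truth, round(p_hat * len(population)))        # estimated dog count, noisier as p_truth shrinks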


This kind of inference often helps with tracking people's opinions, indeed. "Here's a neat trick: If you want to work out whether your favorite celebrity is a Republican, just Google their name and see if they talk about politics. If the answer is no, then they're a Republican."


It doesn't really protect privacy unless your first coin is heavily biased toward the random answer. You can still infer that there is a higher likelihood that this individual is a dog. A few 60-70%-reliability inferences on various dog-related characteristics and you can identify a dog with 95% confidence.
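
A quick sketch of how a few weak inferences compound, assuming the signals are independent (my own illustration):

    # Each 65%-reliable "dog-like" observation multiplies the odds by 0.65 / 0.35.
    odds = 1.0                              # start from a 50/50 prior
    for reliability in [0.65] * 5:          # five such observations
        odds *= reliability / (1 - reliability)
    print(round(odds / (1 + odds), 3))      # ~0.957: about 95% confidence after five signals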


This would be a flawed implementation. The method still works, you just have to be extremely careful to avoid exposing users through correlations when collecting anything more than a single value just once. The aforementioned RAPPOR paper[1] covers this in Section 6.

[1] https://arxiv.org/abs/1407.6981


If I understand correctly, they just call it a limitation of the approach and invite you to sample the user base rather than collect data systematically (plausible deniability is a bit of a moot point: once you have been exposed as a dog with 95% confidence, pleading that there is still a small chance you might not be one doesn't really help).

I am not sure how any of that helps in a mass collection of data like OS telemetry.


I suspect you did not understand correctly? I just read their section 6, and they neither "just call it a limitation" (the section is three pages long, with several recommendations) nor invite anyone to sample the user base (this generally just focuses the privacy loss on the sampled people).

Which text were you reading that led you to this conclusion?


On sampling:

> It is likely that some attackers will aim to target specific users by isolating and analyzing reports from that user, or a small group of users that includes them. Even so, some randomly-chosen users need not fear such attacks at all...

For the limitation, the whole of section 6.1 explains that this only protects a single question. If you collect more than a single question, you must rely on other techniques to protect privacy.


Yes, I think you've misunderstood.

The text you've quoted is about how a random subset of the population is already immune to the issue of repeated queries, not that subsampling the population helps in any way. If you don't interrupt the quotation mid-sentence, it reads:

> Even so, some randomly-chosen users need not fear such attacks at all: with probability (1/2 f)^h, clients will generate a Permanent randomized response B with all 0s at the positions of set Bloom filter bits. Since these clients are not contributing any useful information to the collection process, targeting them individually by an attacker is counter-productive.

The whole of section 6.1 is not about how it only protects a single question, it is about how one ensures that the single-question protections generalize to larger surveys, concluding that

> This issue, however, can be mostly handled with careful collection design.


Sampling and taking a random subset of the population are synonymous.

But my point is precisely that this technique helps with a single question. As soon as you are doing continuous mass collection you don't really get any privacy protection from this technique, and you have to rely on other techniques (encryption, etc).


There are two kinds of sampling involved here: selecting whom to ask a question, and individuals randomizing their responses. The random subset of the population is determined by their own (randomized) responses, which lead them to never say anything useful. An attacker has no influence on this, so if their target is within that group, the attack can't succeed.


OK, thanks :)

But ... [please see reply to omarforgotpwd].


Is there a name for this algorithm?


The entire approach is called differential privacy, as in the headline of the thing we're commenting upon ;-)

But, as someone mentioned, if the coin comes up tails you should answer with another coin flip, not "yes".


The coin flipping approach is called "Randomized Response" and dates back to the 60s.

https://en.wikipedia.org/wiki/Randomized_response


Anything collected might break privacy, because the information itself may be sensitive. Differential privacy relies on obfuscating the reported data in a statistically meaningful way.

Accordingly, you need to analyse only aggregate statistics, add random noise, apply data binning, and anonymise reports. And you need to calibrate the noise; the latter is what the paper seems to be mainly focused on.

> We call for Apple to make its implementation of privacy-preserving algorithms public and to make the rate of privacy loss fully transparent and tunable by the user.

"Calibrating Noise to Sensitivity in Private Data Analysis" is about the matter.

- https://www.microsoft.com/en-us/research/publication/calibra...

- https://link.springer.com/content/pdf/10.1007%2F11681878_14....
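
The core idea of that paper, sketched for the simplest case of a counting query (hypothetical code, not the paper's own):

    import random

    def noisy_count(true_count, epsilon):
        # Laplace mechanism for a counting query: one person changes the count by at
        # most 1 (sensitivity 1), so Laplace noise with scale 1/epsilon gives
        # epsilon-differential privacy.  The difference of two exponentials with rate
        # epsilon is exactly a Laplace(0, 1/epsilon) draw.
        noise = random.expovariate(epsilon) - random.expovariate(epsilon)
        return true_count + noise

    print(noisy_count(1000, epsilon=0.1))   # typical error on the order of 1/epsilon = 10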


Let's say I'm collecting a simple yes/no piece of data such as... did this user open Google Chrome today? Every day my analytics engine sends the data back up: yes, no, no, yes, yes, yes, no. Someone could look at this data and know whether you used Chrome on a given day, but when Apple sends the data up it randomly flips the answer for a certain number of data points, in such a way that the flipping effects cancel each other out at scale and the overall stats stay more or less the same. Now, even if I could look at all the data your computer sent up about your Chrome usage, I can't actually be sure which days you used Chrome, because I know some of the answers are flipped. The more changes get made, the more privacy the user has.
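
A toy sketch of that flipping and the debiasing at scale (the flip probability here is made up; Apple's real mechanism and parameters are more elaborate):

    import random

    FLIP_PROB = 0.25      # hypothetical flip probability for illustration only

    def report(used_chrome_today):
        # Flip the true bit with probability FLIP_PROB before it leaves the device.
        return used_chrome_today ^ (random.random() < FLIP_PROB)

    truths = [random.random() < 0.6 for _ in range(100_000)]      # suppose 60% of users opened Chrome today
    observed = sum(report(t) for t in truths) / len(truths)
    # Debias: E[observed] = true * (1 - q) + (1 - true) * q, so true = (observed - q) / (1 - 2q).
    print((observed - FLIP_PROB) / (1 - 2 * FLIP_PROB))           # ~0.6, even though individual reports are unreliable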


> The more changes get made, the more privacy the user has.

This is a minor, pedantic point, but what you really mean to say is "the closer the changes get to 50%, the more privacy the user has". If you change all results, then it is easy to flip them all back.

This distinction trips up quite a few folks: the research world initially believed (and some in official stats still believe[1]) that part of privacy is literally never publishing the true answer (e.g. above: literally flip every output).

What you actually want is

    Pr[output | input] ~= Pr[output | input']
which may mean that you should leave things alone, which feels weird but is important.

[1]: Noise addition is a common way to obscure real-valued data, and some official stats bureaus have the ridiculous rule that "you always add noise, and you never add less than X in absolute value", leading to releases where you can be 100% confident that the true value is not in a range around the published number.
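
To put numbers on "closer to 50%": for a single bit flipped with probability q, the two conditional probabilities in that expression differ by a factor of at most (1-q)/q (or its inverse), so epsilon = |ln((1-q)/q)|. A quick sketch (my own illustration, not any vendor's parameters):

    import math

    def epsilon_for_flip_prob(q):
        # For a single bit flipped with probability q:
        #   Pr[report = b | truth = b] = 1 - q,   Pr[report = b | truth != b] = q
        # epsilon is the log of the worst-case ratio of these two probabilities.
        if q <= 0 or q >= 1:
            return float("inf")       # deterministic output: the report pins down the input
        return abs(math.log((1 - q) / q))

    for q in (0.5, 0.25, 0.1, 0.01, 0.99):
        print(q, round(epsilon_for_flip_prob(q), 2))   # 0.5 gives epsilon 0; 0.01 and 0.99 are equally revealing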


OK, thanks :)

But still, there are some questions that you'd arguably never want to say "yes" to, such as: did you visit some verboten site (terrorist, child porn, etc.) today?

So how can an algorithm "know" which questions it's safe to use differential privacy with, and which it isn't?

Or would you argue that it's safe enough to use differential privacy with even such questions?


You're not really saying "yes", you're saying "I either visited this site or my coin came up tails." Of course this only works if the other person believes that you flipped a coin beforehand, but in this case it's part of the data-collection procedure, so not much trust is necessary.


Presumably almost nobody visits the verboten sites, and so if you answer "yes", it's far more likely that you answered "yes" because your coin came up tails than because you actually visited it.

Or, in other words, the question is not "Did you visit this site," the question is "Did your coin come up tails or did you visit this site," which is a perfectly safe question to say "yes" to.
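
A quick posterior calculation backing this up (numbers made up for illustration): with the heads-truth / tails-"Yes" scheme and a site only 1 in 10,000 people actually visit, a "Yes" answer barely moves the needle.

    base_rate = 1 / 10_000                    # hypothetical fraction who actually visited
    p_yes_given_visited = 1.0                 # truthful "Yes" on heads, forced "Yes" on tails
    p_yes_given_not_visited = 0.5             # forced "Yes" on tails only
    p_yes = base_rate * p_yes_given_visited + (1 - base_rate) * p_yes_given_not_visited
    posterior = base_rate * p_yes_given_visited / p_yes
    print(round(posterior, 4))                # ~0.0002: a "Yes" almost certainly just means "tails"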


I don't think this approach applies in such situations, and wouldn't be a good idea if it did.

It may help with categories of sites, like "did you visit a news site today" or "did you watch porn".


Systems that don't deal well with edge cases aren't so great.

For people who don't really need to care about privacy, I guess that differential privacy is good enough. But there's a gotcha there for people who ought to care but are clueless. I'm reminded of that ex-cop in Philadelphia who believed Freenet's claims about plausible deniability.

Edit: A prudent standard for calling something "[foo] privacy" is arguably PGP.


On an ongoing basis? You probably can't. Differential privacy may work OK for one-off measurements, but it's known to leak when doing ongoing measurements of data that's somewhat correlated over time, such as emoji use. It's not clear there's any way around this.
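
The extreme case of that correlation, as a toy sketch with made-up numbers: if the underlying value never changes and each report re-randomizes it freshly, the reports average out and reveal it. That is the problem RAPPOR's "permanent" randomized response (quoted upthread) is designed to mitigate.

    import random

    def daily_report(truth, p_flip=0.25):
        # Fresh randomization every day: flip the (unchanging) true bit with probability p_flip.
        return truth ^ (random.random() < p_flip)

    secret = True
    reports = [daily_report(secret) for _ in range(365)]      # a year of daily reports
    # Each report is individually deniable, but the average converges to 0.75 if the
    # secret is True and to 0.25 if it is False, so a year of reports gives it away.
    print(sum(reports) / len(reports))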



