The proper way is: you flip a coin, if it comes up heads, you say the truth, oth...

dpriv123 · on Sept 12, 2017

" The proper way is: you flip a coin, if it comes up heads, you say the truth, otherwise you say whatever you want."

Saying "whatever you want" will incur a very large sampling error, especially as the population of those saying whatever increases.

What is needed is a notion of scalable privacy, where as the population of those saying "whatever" increases the privacy strength also increases yet the absolute error remains at worst constant.

https://arxiv.org/abs/1708.01884

pfortuny · on Sept 12, 2017

Mmmhhh... I was trying to ELI5. I understand the sampling error may be large but I cannot see the inherent problem. Could you explain please? (I mean, what do I have to do if I get tails? or do we change the coin).

Honest question, just too lazy to read the manuscript you link...

dpriv123 · on Sept 12, 2017

copying verbatim relevant sections

" the estimation error quickly increases with the population size due to the underlying truthful distribution distortion. For example, say we are interested in how many vehicles are at a popular stretch of the highway. Say we configure flip1 = 0.85 and flip2 = 0.3. We query 10,000 vehicles asking for their current location and only 100 vehicles are at the particular area we are interested in (i.e., 1% of the population truthfully responds “Yes"). The standard deviation due to the privacy noise will be 21 which is slightly tolerable. However, a query over one million vehicles (now only 0.01% of the population truthfully responds “Yes") will incur a standard deviation of 212. The estimate of the ground truth (100) will incur a large absolute error when the aggregated privatized responses are two or even three standard deviations (i.e., 95% or 99% of the time) away from the expected value, as the mechanism subtracts only the expected value of the noise."

"In this paper, our goal is to achieve the notion of scalable privacy. That is, as the population increases the privacy should strengthen. Additionally, the absolute error should remain at worst constant. For example, suppose we are interested in understanding a link between eating red meat and heart disease. We start by querying a small population of say 100 and ask “Do you eat red meat and have heart disease?". Suppose 85 truthfully respond “Yes". If we know that someone participated in this particular study, we can reasonably infer they eat red meat and have heart disease regardless the answer. Thus, it is difficult to maintain privacy when the majority of the population truthfully responds “Yes".

Querying a larger and diverse population would protect the data owners that eat red meat and have heart disease. Let’s say we query a population of 100,000 and it turns out that 99.9% of the population is vegetarian. In this case, the vegetarians blend with and provide privacy protection of the red meat eaters. However, we must be careful when performing estimation of a minority population to ensure the sampling error does not destroy the underlying estimate."

pfortuny · on Sept 12, 2017

OK, so the idea is to keep the estimate useful without harming the privacy. Got it.

Thanks.