Labelia (ex Substra Foundation)


The nuts and bolts of Differential Privacy (Part 1/2)

Does the term Differential Privacy ring a bell? If it doesn’t, then you’re in for a treat! The first part of this article provides a quick introduction to the notion of Differential Privacy, a robust mathematical approach to formulating the privacy guarantees of data-related tasks. We will cover some of its use cases, untangle its mechanisms and key properties, and see how it works in practice.

This article has been written by Heytem Boumaza, intern at Substra Foundation.


The issue of privacy has been at the center of debate since the dawn of the digital age. With the exponential growth of data-driven services, hardly a day goes by without some privacy breach scandal hitting the headlines. Consequently, we have seen a major shift towards privacy-preserving technologies.

One definition of privacy has risen to prominence in the past several years: Differential Privacy. It is an increasingly hot topic among privacy experts in academic circles as well as in the tech community in general, and it is being adopted in a growing number of real-world applications. Tech giants like Google and Apple are incorporating Differential Privacy into their own products and services whenever possible:

Google was ahead of the curve as usual:

  • They deployed their first Differentially Private technology called RAPPOR inside the Chrome browser in 2014, while at the same time working on expanding the use of this technology across other products like Google Maps, the Assistant and Google Play.

  • In 2020, as the world was struggling against the COVID-19 pandemic, Google published the COVID-19 Community Mobility Reports to help gain insights into the spread of the virus and facilitate decision making for public health officials. The data in these reports was collected from users’ location history, then aggregated and anonymized using Differential Privacy.

Apple, on the other hand, employs a privacy-preserving system to improve user experience while taking up the challenge of protecting the privacy of its individual users. Not to mention the many open-source software projects for Differential Privacy in machine learning and data analysis (more on that later).

Although the question of how to analyse data while preserving privacy can be considered an old question in the literature, the term "Differential Privacy" was first coined in 2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith in a landmark paper. They introduced a mathematical definition of Differential Privacy, along with a general framework for building mechanisms that satisfy this definition. Fast forward to 2014: Cynthia Dwork co-wrote "The Algorithmic Foundations of Differential Privacy", the definitive work on the subject.


Privacy attacks on machine learning: a bomb waiting to go off?

The continuous rise of interest in privacy-preserving technologies such as Differential Privacy can be traced back to growing concerns over the sometimes excessive use of sensitive data by machine learning algorithms, often without proper consideration for the underlying privacy risks.

Machine learning has had a prominent impact on shaping the technology that powers various industries; yet, just like any other piece of software, it has vulnerabilities.

In the past several years, there have been malicious attempts to mount cyberattacks against machine learning algorithms. These attacks exploit the unique characteristics of machine learning models, which can be summed up in two key properties:

  • ML models are ‘data hungry’, i.e., they require large volumes of data to achieve a satisfying level of accuracy

  • Complex ML models are often treated as ‘black box models’, which means that most of the time, we don’t know for sure why they function the way they do or why they fail.

The most relevant type of machine learning attack in the context of Differential Privacy is the Membership Inference Attack, which aims at ‘determining whether a given record was used as part of a given model’s training set’. The technical details of this attack are beyond the scope of this already overly long introduction, so feel free to check out this paper later to learn more.

By now you might be wondering: just what the heck is Differential Privacy and how does it work? Before we get to the scary math part, it's better to start off with a rough idea of the subject to help you build up an intuition. According to Cynthia Dwork, Differential Privacy in plain English is:

 "The outcome of any analysis is essentially equally likely; independent of whether any individual joins, or refrains from joining the dataset."

This might sound complicated at first, but trust me, it’s really not. Let me unpack this definition for you with a simple example, the ‘hello, world!’ of Differential Privacy: randomized response.

Have you ever cheated in an exam?

Imagine that we wish to conduct a survey to estimate the percentage of people who cheated in at least one exam at some point during their studies. Naturally, some participants who did cheat will find this question a little embarrassing: they will either refuse to answer or lie. In fact, admitting to cheating might get you in trouble, as it’s considered a criminal offense in some countries. To cope with this issue, we’re going to use a famous mechanism borrowed from the social sciences called randomized response. It’s used to study the prevalence of certain behaviors in a population while preserving the confidentiality of the individual participants. It works like this:

Instead of answering the question directly, the interviewer asks the participant to answer the question according to the outcome of some random process, for example a coin flip. The interviewer cannot see the result of the coin flip; this is essential! The participant flips the coin and answers accordingly:

  • If the coin lands on tails, the participant must answer the question truthfully (yes or no)

  • If the coin lands on heads, the participant flips it a second time and answers according to this second flip:

        -   This time, if the coin lands on heads they answer yes, and if it lands on tails they answer no (regardless of what the true answer is)

So what’s the point of introducing the coin flip? The trick here is that even if a participant answers ‘yes’ to the question, and ‘yes’ here is the embarrassing answer, they can always claim that they said so because they got heads on the first flip followed by heads on the second flip; as a result, their privacy is protected.

Of course, there’s a cost to this layer of protection: wrong answers add noise to your survey results. But if the number of participants is large enough, this noise will cancel itself out and the interviewer will be able to obtain a useful estimate of the true percentage of cheaters.

This process (randomized response) is actually Differentially Private, but why, and how do we prove it? Based on the above definition, the outcome of our survey question (the percentage of people who did cheat) should not change too much if any participant answered ‘yes’ instead of ‘no’ and vice versa. Formally speaking:

the ratio between the probability of a given response if your true answer is ‘yes’ and the probability of the same response if your true answer is ‘no’ should be bounded by some value. Let’s do the math:

  • The probability that the response would be a ‘yes’ given that the true answer is ‘yes’: Pr[Response=Yes|True answer=Yes] = 3/4

(50% chance of getting tails on the first flip and answering truthfully, plus 25% chance of getting heads on both flips)

  • The probability that the response would be a ‘yes’ given that the true answer is ‘no’: Pr[Response=Yes|True answer=No] = 1/4

(25% chance of getting heads on both flips)

The ratio is 3/4 ÷ 1/4 = 3, so we have proven that the randomized response mechanism in this example is ln(3)-Differentially Private. This log-ratio is denoted ε (the symbol ε was chosen for mathematical convenience, as it is usually used to describe small numerical quantities).

We would actually have gotten the same probability ratio if the response were a ‘no’ instead of a ‘yes’; it works both ways. I’ll leave this part as an exercise for you. (Hint: compute the ratio between Pr[Response=No|True answer=No] and Pr[Response=No|True answer=Yes].)

Congratulations! This is your first ever Differentially Private mechanism. Go ahead and try it: all you need is a coin and an embarrassing question.
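If you'd rather see the noise cancel itself out on your screen than recruit real participants, here is a minimal simulation sketch in Python. The function name, the 30% "true" cheating rate and the use of numpy are illustrative assumptions of mine, not part of the original survey setup:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def randomized_response(true_answer: bool) -> bool:
    """Two-coin randomized response: tails -> answer truthfully, heads -> let a second flip decide."""
    if rng.random() < 0.5:       # first flip lands on tails: truthful answer
        return true_answer
    return rng.random() < 0.5    # first flip lands on heads: second flip decides (heads = yes, tails = no)

# Simulate a population where 30% of participants really did cheat.
n = 100_000
true_answers = rng.random(n) < 0.30
reported = np.array([randomized_response(a) for a in true_answers])

# A reported 'yes' occurs with probability 1/4 + (true rate)/2,
# so the interviewer can invert that relation to estimate the true rate.
estimated_rate = 2 * (reported.mean() - 0.25)
print(f"observed 'yes' rate:     {reported.mean():.3f}")
print(f"estimated cheating rate: {estimated_rate:.3f}")  # close to 0.30
```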

Now we’ve seen how the randomized response mechanism translates nicely into the language of Differential Privacy. However, there’s got to be more to Differential Privacy than just survey questions, don’t you think? What about other mechanisms used for data analysis, say statistical queries such as aggregation functions? Here is another example:

Imagine we own a medical database, and we want to publish how many people suffer from a certain medical condition using a counting query. We consider this statistical quantity to be sensitive information, so we can't just publish it. Why? Because that would compromise the privacy of the individuals in the database. How? If we were to remove an individual from the database (or change, replace them, etc., you get the idea), the output of the query would be perturbed to the extent that the presence of the removed individual in the database could be inferred. So what do we do? We make this query Differentially Private by adding some random noise to its output.

To further consolidate this example, put yourself in the shoes of an attacker. Let's say you're a "data creeper" trying to figure out whether your "target" is in the database or not, and you're in possession of the records of every individual in the database except your "target". To do that, you decide to query the database before the removal of your target and after its removal. Next, you compare the distributions of the two outputs and measure how similar these two distributions are. If the query is Differentially Private enough, then they will look almost the same.

You might be wondering why we are using distributions, after all, the result of our query is just a number (the number of people with a given medical condition), right? Indeed, but notice the careful use of the word "essentially" in the definition. The process used to compute this statistical quantity (in our case it was a simple counting query) is usually randomized, so it might output different results when running it multiple times; hence, we used probability distributions.

But what type of noise should we add? How much and based on what?

Don’t worry, we will be answering these questions as we go along, but since I promised you a scary math part, well, here it is:

Differential Privacy definition: A mechanism f satisfies (ε, δ)-Differential Privacy, for two non-negative numbers ε and δ, if for all pairs of neighboring databases D and D′ with d(D, D′) = 1, and all subsets S of f’s range, as long as the following probabilities are well-defined, there holds:

 P( f(D) ∈ S) ≤ δ + exp(ε) P( f(D′) ∈ S).

The above definition was taken from this paper.

  • The mechanism f can be anything: A query like the one we used in our previous example, an anonymization technique, a survey question, a machine learning model, etc.

  • D and D’ are neighboring databases (they’re also called adjacent databases), meaning that they ‘differ on at most one entry’

  • d(D,D’) represents the distance between D and D’, that is the number of changes required on one of them to make it identical to the other one

  • P[f(D) ∈ S]: the probability that when you run the mechanism f on the database D, the output belongs to the subset S

  • The original definition of Differential Privacy doesn't include the term δ; it was added later to account for the fact that using Gaussian noise instead of Laplacian noise introduces a risk of an "information leak" with probability δ. As a result, Differential Privacy comes in two flavors:

  1.  (ε, δ)-Differential Privacy: this is the one introduced in the definition above; it has both a multiplicative difference exp(ε) and an additive difference δ, therefore it’s strictly weaker, but it allows us to do more stuff

  2.   (ε)-Differential Privacy: this one has only a multiplicative difference exp(ε) and it provides stronger privacy protection

  • Intuitively, you want both ε and δ to be small, the smaller they are, the closer the two probabilities are, resulting in a stronger privacy protection

  • In the Differential Privacy literature, the variable ε is called Privacy Budget

In addition to the above definition, Differential Privacy has some desirable properties that make it stand out from other anonymisation techniques:

  • Differentially Private algorithms are "unlikely to become obsolete": meaning they’re resilient to post-processing attacks and they work no matter what the attacker knows or will know about our data. This is because it’s not the output that is Differentially Private, but the mechanism that produced it is. No matter what kind of computation the attacker conducts on the output of a Differentially Private mechanism, the privacy guarantee is maintained, forever.

  • We can quantify the privacy loss or to put it differently, we can quantify the maximal information gain by the attacker. 

  • Composition: this is such a powerful property of Differential Privacy; privacy is maintained over multiple runs of the mechanism. In other words, if multiple attackers conspire with each other and compare their unique databases, the resulting data is still protected by Differential Privacy, except that the protection is now weaker; nevertheless, we’re still able to quantify how much information they gained. To put it more formally:

If a mechanism f is (ε, δ)-Differentially Private, then a combination of K such mechanisms operating on the same database is (K·ε, K·δ)-Differentially Private. This basically means that we can split our privacy budget among the K mechanisms. However, if the K mechanisms operate on disjoint partitions of the same database, then their combination is (ε, δ)-Differentially Private. Awesome!
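As a quick sketch of this bookkeeping (the helper functions below are hypothetical, not taken from any particular library), sequential composition adds the budgets up, while parallel composition over disjoint partitions keeps only the worst single budget:

```python
def sequential_composition(budgets):
    """K mechanisms run on the *same* database: epsilons and deltas add up."""
    return sum(e for e, d in budgets), sum(d for e, d in budgets)

def parallel_composition(budgets):
    """K mechanisms run on *disjoint* partitions: the worst single budget applies."""
    return max(e for e, d in budgets), max(d for e, d in budgets)

queries = [(0.1, 1e-6)] * 5  # five queries, each (0.1, 1e-6)-Differentially Private
print(sequential_composition(queries))  # ~ (0.5, 5e-06)
print(parallel_composition(queries))    # (0.1, 1e-06)
```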

We’re still missing a crucial piece of the puzzle: the noise adding part. We stated in our previous example that to make the process Differentially Private we need to inject some random noise into the output of the query, but we didn’t specify how much and from which distribution this noise should be sampled. Before we get to that there’s one last notion that we need to explore: the notion of function sensitivity.

Definition 

The sensitivity of a function f, denoted ∇f, describes how much its output can change between two runs over all possible pairs of neighboring databases, taking the maximum value that this change can have:

∇f = max ||f(D′) − f(D)||, where the maximum is taken over all pairs of neighboring databases D and D′ (i.e. d(D, D′) = 1)

For example, for a simple counting query, removing an entry from the database changes the output by at most 1, so its sensitivity is 1.
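To make that concrete, here is a tiny sketch (with toy records invented purely for illustration) showing that removing one entry changes the count by at most 1:

```python
def count_condition(db):
    """Counting query: how many records have the condition."""
    return sum(1 for record in db if record["has_condition"])

database = [
    {"id": 1, "has_condition": True},
    {"id": 2, "has_condition": False},
    {"id": 3, "has_condition": True},
]
neighbor = database[:-1]  # neighboring database: one record removed

# The output changes by at most 1, so the sensitivity of this query is 1.
print(abs(count_condition(database) - count_condition(neighbor)))  # 1
```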

Finally, there exist multiple mechanisms you can use to generate your noise; we’ll be looking at two of them.

            

Laplacian Mechanism 


This mechanism adds Laplacian noise, that is, noise sampled from the Laplace distribution, whose scale b is calibrated by two parameters (the sensitivity of f and ε):

Lap(b), where b = ∇f / ε

                             

The Laplacian mechanism uses the L1 sensitivity, based on the Manhattan distance.
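Here is a minimal sketch of the Laplacian mechanism applied to the counting query from our medical-database example, using numpy's Laplace sampler (the ε value and the function name are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def laplace_mechanism(true_value: float, sensitivity: float, epsilon: float) -> float:
    """Release true_value plus Laplace noise of scale b = sensitivity / epsilon."""
    b = sensitivity / epsilon
    return true_value + rng.laplace(loc=0.0, scale=b)

true_count = 42  # e.g. the true number of patients with the condition
noisy_count = laplace_mechanism(true_count, sensitivity=1.0, epsilon=0.5)
print(f"released count: {noisy_count:.1f}")
```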


Gaussian Mechanism

This mechanism adds Gaussian noise sampled from the Gaussian distribution with mean 0 and a standard deviation calibrated by the sensitivity of f, ε and δ. A common calibration (valid for 0 < ε < 1) is σ = ∇f · √(2 ln(1.25/δ)) / ε, where ∇f is here the L2 sensitivity.

The Gaussian mechanism uses the L2 sensitivity, based on the Euclidean distance
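And a matching sketch of the Gaussian mechanism using the calibration mentioned above (the specific ε and δ values are illustrative assumptions):

```python
import math
import numpy as np

rng = np.random.default_rng(seed=0)

def gaussian_mechanism(true_value: float, l2_sensitivity: float,
                       epsilon: float, delta: float) -> float:
    """Release true_value plus Gaussian noise with
    sigma = l2_sensitivity * sqrt(2 * ln(1.25 / delta)) / epsilon."""
    sigma = l2_sensitivity * math.sqrt(2 * math.log(1.25 / delta)) / epsilon
    return true_value + rng.normal(loc=0.0, scale=sigma)

noisy_count = gaussian_mechanism(42, l2_sensitivity=1.0, epsilon=0.5, delta=1e-5)
print(f"released count: {noisy_count:.1f}")
```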

Which one of these two mechanisms should you choose? It depends:

  • Can you afford a risk of an information leak (δ  > 0)? If the answer is no then you can’t use the Gaussian mechanism

  • If any single sample in the database can impact at most one statistic, use the Laplacian mechanism; otherwise, the Gaussian mechanism is better. As the impact of a single sample grows (say the sample appears more than once in the database, or multiple samples are related to each other, like family members), the amount of noise needed to protect privacy grows:

  • The Laplacian noise needs to be scaled by a factor of N, where N is the maximum number of unique statistics that a single sample can impact

  • The Gaussian noise, on the other hand, needs to be scaled by a factor of √N, meaning it scales much more slowly and is therefore less noisy (this is because it uses a different type of sensitivity)

The second point is the reason why the Gaussian mechanism is preferred over its Laplacian counterpart in Differentially Private machine learning. Think about it: a single sample can influence many if not all of the parameters of the model, so the amount of noise generated by the Gaussian mechanism will be much lower compared to the Laplacian mechanism. Another advantage of the Gaussian mechanism is related to the properties of the Gaussian distribution: it makes it harder for the attacker to tell which noise comes from the mechanism and which noise comes from the dataset, since the error noise and the generated noise both come from the same distribution.
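A two-line check of that scaling argument, under the illustrative worst-case assumption that one person can shift each of N statistics by at most 1:

```python
import numpy as np

N = 100
per_statistic_change = np.ones(N)  # one person shifts each of N statistics by at most 1

l1_sensitivity = np.sum(np.abs(per_statistic_change))      # N = 100 -> scales the Laplacian noise
l2_sensitivity = np.sqrt(np.sum(per_statistic_change**2))  # sqrt(N) = 10 -> scales the Gaussian noise
print(l1_sensitivity, l2_sensitivity)  # 100.0 10.0
```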

         

Differential Privacy meets machine learning: finding a middle ground

Machine learning algorithms are designed to extract insights from data by capturing useful information. Privacy, on the other hand, is all about protecting information. Thus, it seems difficult to reconcile these two conflicting objectives, but is it really? The goal of machine learning is to produce models that are able to generalize to previously unseen data, without relying too heavily on any single sample. In other words, we'd like our models to capture the rules that describe the distribution of the training dataset as a whole without getting accustomed to individual samples. Sounds familiar? This is exactly the goal of Differential Privacy. So, after all, these two objectives go hand in hand.

Since we agreed that Private Learning is possible, how do we do it? Obviously there has to be noise addition somewhere, but where exactly? Researchers have identified and experimented with three possible ways:

  •  Inject the noise into the input of the algorithm

  •  Inject the noise into the objective function

  •  Inject the noise into the output of the algorithm

There’s no clear-cut answer as to which strategy to use; it all depends on the type of learning task at hand, be it a regression analysis, a classification problem, a clustering scheme, or any other supervised or unsupervised algorithm.


One popular approach to private Deep Learning is Differentially Private Stochastic Gradient Descent (DP-SGD), which modifies the original Stochastic Gradient Descent algorithm to satisfy the constraints imposed by Differential Privacy. More precisely, two additional steps are required before the model’s weights get updated:

  • Gradient Clipping: this operation is often used to remedy the problem of exploding gradients; however, in the context of Differential Privacy, the motivation is a bit more subtle: we want to limit the sensitivity of the gradients to individual samples

  • Noise Adding:  here we add random noise to the clipped gradients
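Here is a minimal numpy sketch of one such update showing only these two extra steps. The clipping norm, noise multiplier and toy gradients are illustrative assumptions of mine; real implementations (such as the libraries covered in part 2) also take care of the privacy accounting for you:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

def dp_sgd_update(weights, per_example_grads, lr=0.1,
                  clip_norm=1.0, noise_multiplier=1.1):
    """One DP-SGD step: clip each example's gradient, average, add Gaussian noise, then step."""
    clipped = []
    for g in per_example_grads:
        norm = max(np.linalg.norm(g), 1e-12)
        clipped.append(g * min(1.0, clip_norm / norm))  # 1. gradient clipping (bounds the sensitivity)
    mean_grad = np.mean(clipped, axis=0)

    sigma = noise_multiplier * clip_norm / len(per_example_grads)
    noisy_grad = mean_grad + rng.normal(0.0, sigma, size=mean_grad.shape)  # 2. noise adding

    return weights - lr * noisy_grad

# Toy usage: four per-example gradients for a three-parameter model.
grads = [rng.normal(size=3) for _ in range(4)]
weights = np.zeros(3)
weights = dp_sgd_update(weights, grads)
print(weights)
```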

Differential Privacy limitations

OK! Before you start enjoying your fair share of Differential Privacy goodness by adding noise left and right, you must be aware of some of its limitations. These limitations can be examined in two different contexts:

  • Statistical queries: Differential Privacy works best for queries with low sensitivity, such as count queries. For other types of queries, like Sum and Max, where the sensitivity can at times be very high, you will be forced to add large amounts of noise, and as a result you end up destroying the utility of your query. There is another issue, related to the allocated privacy budget: so far we have made the case for a user issuing at most two queries, but what if they ask a series of queries? In this case, after a certain number of answered queries, our privacy budget will be consumed. You’re left with two options: either limit the number of queries a single user can ask or, again, add more noise.

To avoid this pitfall always keep the fundamental law of information recovery at the back of your mind: “overly accurate answers to too many questions will destroy privacy in a spectacular way”.

  • Machine learning:

Firstly, consider the privacy/fairness tradeoff: gradient clipping and noise adding can impact the model’s accuracy on underrepresented classes or subgroups of the input domain, which can be detrimental to the fairness of the resulting model. Differential Privacy guarantees privacy, not fairness; what’s even worse is that it can aggravate the problem if the non-private model is already unfair in its predictions. For more details on this issue, make sure to check out this paper.

Secondly, introducing Differential Privacy into your model may impact the performance significantly due to the additional computational steps, so that’s another tradeoff for you to consider.

I hope that the notion of Differential Privacy has become clearer and less noisy to you (pun intended), and that you’re eager to get your hands on it in practice. You want tools, and tools you shall have.

In the second part of this article, we will present to you three different libraries to implement your favorite Differential Privacy mechanisms, make sure to check it out!


References

When presenting the mathematics behind Differential Privacy, direct quotes and definitions have been taken from these papers (links to some of them have already been included in the article):