Labelia (ex Substra Foundation)


Going Differentially Private: techniques and tools of the trade (Part 2/2)


In the first part of this article, you had your first encounter with Differential Privacy and learned why it’s so awesome. In this second part, we’ll present three Python libraries for implementing Differential Privacy: Diffprivlib, TensorFlow-Privacy and Opacus.

The rapid adoption of Differential Privacy by Machine Learning practitioners, coupled with the high demand for privacy-preserving Machine Learning models, has led to the emergence of new software tools and frameworks that aim to ease the design and implementation of such models.

Whether you're a machine learning veteran or just a curious data enthusiast, starting from scratch is a daunting prospect. Moreover, you might get lost scouring GitHub repositories looking for a solution to start from.

Luckily, we’ve done the heavy lifting for you and selected three libraries for working with Machine Learning under Differential Privacy. We will present the features and shortcomings of each of them. For an effective comparison, we’ll focus on the following aspects:

Community support and ongoing contributions/updates: this is particularly important if you're planning on adopting a long-term solution in your projects. Besides, the sight of a dead GitHub repo is sad.

Documentation: accessibility is important

Implementation of different Differential Privacy mechanisms 

Compatibility and integration with mainstream ML libraries such as PyTorch, Keras and scikit-learn

Model variety: it’s essential that these libraries offer support for several types of machine learning models

 Before we proceed to our comparative analysis, you should keep a few remarks in mind:

  • At the time of writing, these are the 3 libraries that we deemed worth experimenting with in the context of Differentially Private Machine Learning. This list is by no means exhaustive: the field is not fully mature yet and is evolving at a fast pace, so you should keep an eye out for new tools and frameworks in the near future

  • We solely focus on software solutions written in the Python programming language, as it's still the best fit for Machine Learning and AI-based projects

  • A quick disclaimer: for each library, we’ve included a table that explains the privacy parameters and their usage. However, the official documentation remains the go-to reference: libraries get updated over time, and the information in these tables will eventually become obsolete (some variables will be renamed, refactored or dropped, etc.)

1- Diffprivlib:

● Also known as the IBM Differential Privacy Library, this is a general-purpose, open source Python library for writing Differential Privacy applications. 

● The library offers a handful of tutorials that cover most of what it has to offer in terms of functionality 

● The library offers implementations of several supervised and unsupervised machine learning algorithms (for example naive Bayes, logistic regression, linear regression, k-means clustering and PCA)


● The library implements an extensive collection of Differential Privacy mechanisms (e.g. the Laplace, Gaussian, geometric and exponential mechanisms)

● Implementation of a Privacy Budget Accountant that enables tracking the evolution of the privacy budget, plus the possibility to visualize the trade-off between accuracy and privacy
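For illustration, here is a minimal, hedged sketch of tracking a budget with the BudgetAccountant (the model choice, epsilon values, feature bounds and toy data are arbitrary; check the documentation of your diffprivlib version for the exact API):

import numpy as np
from diffprivlib.accountant import BudgetAccountant
from diffprivlib.models import GaussianNB

X = np.random.rand(100, 4)                   # toy features in [0, 1]
y = np.random.randint(0, 2, 100)             # toy binary labels

acc = BudgetAccountant(epsilon=10, delta=0)  # overall privacy budget cap
clf = GaussianNB(epsilon=1, bounds=(0, 1), accountant=acc)
clf.fit(X, y)

print(acc.total())      # budget spent so far
print(acc.remaining())  # budget still available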

     

● The library is integrated with scikit-learn: minimal code changes are needed to run Differentially Private scikit-learn models compared to their vanilla counterparts. In fact, most of them can be instantiated with a single line of code. 

Here is an example of how to initialize a Differentially Private KMeans model using Diffprivlib:

from diffprivlib.models import KMeans

# DP k-means with 3 clusters and a privacy budget of epsilon = 6
dp_model = KMeans(epsilon=6, n_clusters=3)

It doesn’t get any easier than that, does it?
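And a quick usage sketch on toy data (continuing from the snippet above; the bounds argument and the data are illustrative assumptions, added here to avoid diffprivlib's warning about inferring data bounds):

import numpy as np

X = np.random.rand(200, 2)                                 # toy 2-D data in [0, 1]
dp_model = KMeans(epsilon=6, n_clusters=3, bounds=(0, 1))
dp_model.fit(X)                                            # noisy cluster centers are computed here
labels = dp_model.predict(X)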

● The library offers support for Scikit-Learn pipelines with Differential Privacy
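For example, a pipeline mixing a vanilla scikit-learn preprocessing step with a Differentially Private classifier might look like the hedged sketch below (the epsilon and data_norm values, as well as the toy data, are purely illustrative):

import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler
from diffprivlib.models import LogisticRegression

X = np.random.rand(200, 5)
y = np.random.randint(0, 2, 200)

dp_pipeline = Pipeline([
    ('scaler', MinMaxScaler()),                               # vanilla scikit-learn step
    ('clf', LogisticRegression(epsilon=2.0, data_norm=5.0)),  # Differentially Private step
])
dp_pipeline.fit(X, y)
print(dp_pipeline.score(X, y))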

● In addition to doing Differentially Private machine learning, it’s possible to do Differentially Private data exploration using statistical queries and histograms
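A hedged sketch of such queries, using diffprivlib.tools (which mirrors the corresponding NumPy functions; parameter names may differ slightly between versions, and the bounds and epsilon values are illustrative):

import numpy as np
from diffprivlib import tools as dp_tools

ages = np.random.randint(18, 90, size=1000)                  # toy data

dp_mean = dp_tools.mean(ages, epsilon=0.5, bounds=(18, 90))  # DP mean
dp_hist, bin_edges = dp_tools.histogram(ages, epsilon=0.5, bins=10, range=(18, 90))  # DP histogram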

2- TensorFlow-Privacy:

  • This is an open source Python library built on top of TensorFlow that enables the training of neural networks under the constraints of Differential Privacy. 

  • The library is active and undergoing continuous development

  • A comprehensive tutorial is provided to guide newcomers

  • The library implements Differential Privacy by wrapping the vanilla TensorFlow optimizers into their Differentially Private counterparts. Any optimizer that is a child of tf.compat.v1.train.Optimizer can be made into a Differentially Private one (DPAdagradGaussianOptimizer, DPAdamGaussianOptimizer, DPGradientDescentGaussianOptimizer, DPRMSPropGaussianOptimizer); see the sketch below

  • The library offers the possibility to track the privacy budget spent during training

  • The library offers an implementation for the membership inference attack that allows running tests on the DP-models to measure their privacy protection as well as expose possible vulnerabilities

Privacy parameters for the DP-SGD algorithm implementation:
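These parameters appear directly as arguments of the wrapped optimizers. Below is a hedged construction sketch using DPGradientDescentGaussianOptimizer (module paths and parameter values are indicative and may differ between tensorflow-privacy releases):

from tensorflow_privacy.privacy.optimizers import dp_optimizer

optimizer = dp_optimizer.DPGradientDescentGaussianOptimizer(
    l2_norm_clip=1.0,        # clipping norm applied to per-microbatch gradients
    noise_multiplier=1.1,    # ratio of the noise std to the clipping norm
    num_microbatches=256,    # the batch is split into microbatches for clipping
    learning_rate=0.15)

# Note: the loss must be computed per example (no reduction) so that the
# optimizer can clip and noise gradients at the microbatch level.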

Privacy parameters to compute the privacy guarantee using the privacy accountant:
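One way to query it is the compute_dp_sgd_privacy helper shipped with the library; the hedged sketch below assumes a 60,000-example training set, and the module path may vary across releases:

from tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy import compute_dp_sgd_privacy

# n: training set size, batch_size: examples per step, delta: rule of thumb <= 1/n
eps, opt_order = compute_dp_sgd_privacy(
    n=60000, batch_size=256, noise_multiplier=1.1, epochs=15, delta=1e-5)
print(f"epsilon = {eps:.2f} for delta = 1e-5")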

3- Opacus:

  • Supported by Facebook, this is an open source Python library built on top of PyTorch that makes it possible to easily train models with Differential Privacy

  • The library is under continuous development and backed by an active community of contributors

  • The library offers a handful of tutorials and scripts to get you up and running quickly; it also comes with an accompanying website 

  • The library is only compatible with PyTorch models

  • Opacus implements the Differentially Private Stochastic Gradient Descent algorithm for the private training. Privacy Accounting is done under the framework of Rényi Differential Privacy. The implementation of the accounting procedure is based on that of TensorFlow-privacy. 

The core component of Opacus is the Privacy Engine; by attaching it to your optimizer, you can train PyTorch models with Differential Privacy

To save you the hassle of going through the source code to figure out what each parameter means and how to use it, the points below provide a detailed explanation:

  • Before passing the model to the Privacy Engine, we must verify whether it’s valid using the inspector functionality; the inspector checks whether all the layers of the model are compatible with the Privacy Engine:

from opacus.dp_model_inspector import DPModelInspector

inspector = DPModelInspector()  # instantiate the model inspector
inspector.validate(model)       # check the validity of the model

   

    Compatible nn.Modules must meet the following requirements:

  • The module must be supported by autograd_grad_sample

  • The module must be in train mode

  • The number of groups in ``nn.Conv2d`` layer must be valid

  • The module must not contain BatchNorm layers

  • If the module contains an ``InstanceNorm`` layer then ``track_running_stats`` must be set to False

  • The module must not be LSTM
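To make these requirements concrete, here is a minimal, hedged sketch (the layer sizes are arbitrary): BatchNorm is swapped for GroupNorm, which the inspector accepts, and the model is then validated as shown above.

import torch.nn as nn
from opacus.dp_model_inspector import DPModelInspector

model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3),
    nn.GroupNorm(num_groups=4, num_channels=16),   # instead of nn.BatchNorm2d(16)
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(16, 10),
)

inspector = DPModelInspector()
print(inspector.validate(model))   # True if every layer is compatible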


  • noise_multiplier: used to compute the standard deviation of the Gaussian distribution from which the noise is sampled. It is the ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function the noise is added to: the mean of the distribution is zero, and its standard deviation equals noise_multiplier * L2-sensitivity (the L2-sensitivity being represented by max_grad_norm)

  • batch_size, sample_size and sample_rate are used in the privacy accounting procedure. If sample_rate is provided, the other two parameters are ignored; otherwise, they are mandatory

  • If noise_multiplier is not provided, then you must provide target_epsilon, target_delta and epochs. Based on these three parameters, the Privacy Engine will compute the level of noise to add so that the (target_epsilon, target_delta) privacy guarantee is met by the end of training. This is a big plus for Opacus compared to TF-privacy: if you’re not certain what value to choose for noise_multiplier, you can simply specify your target privacy budget and Opacus will automatically compute the corresponding noise_multiplier value

  • How to choose target_delta: A rule of thumb is to choose a value no bigger than the inverse of the size of the dataset (target_delta <= 1/sample_size)

  • max_grad_norm: per-sample gradients whose L2 norm exceeds this threshold will be clipped. This parameter should be chosen based on the sample that is most at risk of a privacy breach, for example an outlier or a particularly sensitive sample. The intuition is that we should be able to guarantee the privacy of the largest possible per-sample gradient
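To make these rules of thumb concrete, here is a quick numeric sketch (the values are purely illustrative):

max_grad_norm = 1.0                            # clipping threshold = L2-sensitivity
noise_multiplier = 0.6
noise_std = noise_multiplier * max_grad_norm   # std of the Gaussian noise = 0.6

sample_size = 60000                            # e.g. an MNIST-sized training set
target_delta = 1 / sample_size                 # rule of thumb: delta <= 1/N, here ~1.7e-5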

  • How to initialize the Privacy Engine: 

# basic initialization of the privacy engine
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine(
    model,  # the pytorch model
    noise_multiplier=0.6,
    max_grad_norm=1.0,
    sample_rate=batch_size / len(training_dataset),
)
privacy_engine.attach(optimizer)  # attach the engine to the optimizer

# second initialization, without using noise_multiplier
from opacus import PrivacyEngine

privacy_engine = PrivacyEngine(
    model,  # the pytorch model
    max_grad_norm=1.0,
    sample_rate=batch_size / len(training_dataset),
    target_delta=1e-6,
    target_epsilon=6,
    epochs=10,  # number of training epochs, needed to derive the noise level
)
privacy_engine.attach(optimizer)  # attach the engine to the optimizer

  • To cope with the memory footprint induced by per-sample gradient computation, Opacus uses an approach based on the idea of ‘virtual steps’ or ‘virtual batches’, as opposed to the microbatch approach used by TF-privacy

  • You can get the privacy budget spent at each step during training by simply calling the function get_privacy_spent and passing the target_delta as a parameter, as in the sketch below
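Putting it together, here is a hedged sketch of a training loop with the engine attached (it follows the older Opacus API used in the snippets above; model, optimizer, criterion, train_loader and target_delta are assumed to be defined):

for data, target in train_loader:
    optimizer.zero_grad()
    loss = criterion(model(data), target)
    loss.backward()
    optimizer.step()   # with virtual batches, intermediate mini-batches would call
                       # optimizer.virtual_step() instead of optimizer.step()

    epsilon, best_alpha = privacy_engine.get_privacy_spent(target_delta)
    print(f"privacy budget spent so far: epsilon = {epsilon:.2f} (delta = {target_delta})")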

How to choose and interpret your privacy budget Epsilon?

Although Differential Privacy has a strong theoretical background, making it work in practice is a different story. In fact, there is no golden rule for choosing the value of Epsilon; it is almost always assumed to be given, or chosen arbitrarily.

However, there are a few guidelines that you can follow when choosing the value of Epsilon:

  • At a high level, you should think about balancing the trade-off between utility and privacy, as well as between performance and privacy. However, keep in mind that in many cases privacy can be achieved almost for free, with little to no impact on the model’s utility

  • The maximum privacy protection needed should be determined by the most vulnerable entry in the dataset. If this approach results in a large amount of added noise that hurts the accuracy of your model, then you should consider removing these entries or sampling from a different distribution

  • Consider running membership inference attack simulations to test your model’s privacy guarantee

For those in a hurry, here is a summary of our comparison: