Going Differentially Private: techniques and tools of the trade (Part 2/2)
In the first part of this article, you had your first encounter with Differential Privacy and learned why it’s so awesome. In this second part, we’ll present three Python libraries for implementing Differential Privacy: Diffprivlib, TensorFlow Privacy and Opacus.
The rapid adoption of Differential Privacy by Machine Learning practitioners, coupled with the high demand for privacy-preserving Machine Learning models, has led to the emergence of new software tools and frameworks that aim to ease the design and implementation of such models.
Whether you're a machine learning veteran or just a curious data enthusiast, starting from scratch sounds like a daunting prospect. Moreover, you might get lost scouring GitHub repositories looking for a solution to start from.
Luckily, we’ve done the heavy lifting for you and selected three libraries for doing Machine Learning under Differential Privacy. We will present the features and shortcomings of each of them. For an effective comparison, we’ll focus on the following aspects:
● Community support and ongoing contributions/updates: this is particularly important if you're planning on adopting a long-term solution for your projects. Besides, the sight of a dead GitHub repo is sad.
● Documentation: accessibility is important
● Implementation of different Differential Privacy mechanisms
● Compatibility and integration with mainstream ML libraries such as PyTorch, Keras and scikit-learn
● Model variety: it’s essential that these libraries offer support for several types of machine learning models
Before we proceed to our comparative analysis, you should keep a few remarks in mind:
At the time of writing, these are the three libraries we deemed worth experimenting with in the context of Differentially Private Machine Learning. This list is by no means exhaustive: the field is not fully mature yet and is evolving at a fast pace, so keep an eye out for new tools and frameworks in the near future
We focus solely on software solutions written in the Python programming language, as it's still the best fit for Machine Learning and AI-based projects
A quick disclaimer: for each library, we’ve included a table that explains the privacy parameters and their usage. However, the official documentation remains the go-to reference: libraries get updated, and the information in these tables will eventually become obsolete (some variables will be renamed, refactored or dropped, etc.)
1- Diffprivlib
● This is the IBM Differential Privacy Library: a general-purpose, open-source Python library for writing Differential Privacy applications.
● The library offers a handful of tutorials that cover most of what it has to offer in terms of functionality
● The library offers implementations of several supervised and unsupervised machine learning algorithms, including Gaussian naive Bayes, logistic regression, linear regression, k-means clustering and PCA
● The library implements an extensive collection of Differential Privacy mechanisms, including the Laplace, Gaussian, Exponential and Geometric mechanisms
● Implementation of a Privacy Budget Accountant that tracks how the privacy budget is spent and makes it possible to visualize the accuracy/privacy trade-off
● The library is integrated with Scikit-Learn: minimal code changes are needed to run Differentially Private scikit-learn models compared to their vanilla counterparts. In fact, most of them can be run with a single line of code.
Here is an example of how to initialize a Differentially Private KMeans model using Diffprivlib:
from diffprivlib.models import KMeans
dp_model = KMeans(epsilon=6, n_clusters=3)  # epsilon is the privacy budget allocated to this model
It doesn’t get any easier than that, does it?
● The library offers support for Scikit-Learn pipelines with Differential Privacy
● In addition to doing Differentially Private machine learning, it’s possible to do Differentially Private data exploration using statistical queries and histograms
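To make both of those last points concrete, here is a hedged sketch; the model choices, epsilon values, data_norm bound, toy data and X_train/y_train variables are our own, purely for illustration:
from sklearn.pipeline import Pipeline
from diffprivlib.models import PCA, LogisticRegression

# a scikit-learn pipeline whose steps are Differentially Private
dp_pipeline = Pipeline([
    ('pca', PCA(n_components=2, epsilon=1.0, data_norm=5.0)),
    ('clf', LogisticRegression(epsilon=1.0, data_norm=5.0))
])
dp_pipeline.fit(X_train, y_train)  # X_train, y_train: your training data
For data exploration, the diffprivlib.tools module mirrors the familiar NumPy-style queries:
import numpy as np
from diffprivlib import tools as dp_tools

ages = np.random.randint(18, 90, size=1000)  # toy data

dp_mean = dp_tools.mean(ages, epsilon=1.0, bounds=(18, 90))  # Differentially Private mean
dp_hist, bin_edges = dp_tools.histogram(ages, epsilon=1.0, bins=10, range=(18, 90))  # Differentially Private histogram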
2- TensorFlow Privacy:
This is an open-source Python library built on top of TensorFlow that enables training neural networks under the constraints of Differential Privacy.
The library is active and undergoing continuous development
A comprehensive tutorial is provided to guide newcomers
The library implements the Differentially Private Stochastic Gradient Descent Algorithm (DP-SGD)
The library implements Differential Privacy by wrapping the vanilla TensorFlow optimizers into their Differentially Private counterparts. Any optimizer that is a child of tf.compat.v1.train.Optimizer can be made Differentially Private (DPAdagradGaussianOptimizer, DPAdamGaussianOptimizer, DPGradientDescentGaussianOptimizer, DPRMSPropGaussianOptimizer)
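To give an idea of what this looks like in practice, here is a hedged sketch based on the pattern used in the official tutorial; the model architecture and the hyperparameter values below are purely illustrative, not recommendations:
import tensorflow as tf
from tensorflow_privacy.privacy.optimizers.dp_optimizer import DPGradientDescentGaussianOptimizer

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])  # any Keras model

optimizer = DPGradientDescentGaussianOptimizer(
    l2_norm_clip=1.0,        # clipping norm for per-example gradients
    noise_multiplier=1.1,    # ratio of the noise standard deviation to the clipping norm
    num_microbatches=256,    # must evenly divide the batch size
    learning_rate=0.15)

# the loss must be computed per example (no reduction) so that gradients
# can be clipped per example before they are averaged and noised
loss = tf.keras.losses.CategoricalCrossentropy(
    from_logits=True, reduction=tf.keras.losses.Reduction.NONE)

model.compile(optimizer=optimizer, loss=loss, metrics=['accuracy'])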
The library offers a new vectorized version of DP-SGD optimizers that brings about better performance
The library offers the possibility to track the privacy budget spent during training
The library offers an implementation of the membership inference attack that allows running tests on DP models to measure their privacy protection as well as expose possible vulnerabilities
Privacy parameters for the DP-SGD algorithm implementation:
Privacy parameters to compute the privacy guarantee using the privacy accountant:
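As a hedged example of how these parameters come together, using the library's compute_dp_sgd_privacy helper (the values below are illustrative: 60,000 training examples, batch size 256, noise multiplier 1.1, 15 epochs, delta = 1e-5):
from tensorflow_privacy.privacy.analysis.compute_dp_sgd_privacy import compute_dp_sgd_privacy

# returns the epsilon spent and the optimal Renyi order used by the accountant
eps, opt_order = compute_dp_sgd_privacy(
    n=60000, batch_size=256, noise_multiplier=1.1, epochs=15, delta=1e-5)
print(f"DP-SGD achieves epsilon = {eps:.2f} at RDP order {opt_order}")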
3- Opacus:
Supported by Facebook, this is an open-source Python library built on top of PyTorch that makes it possible to easily train models with Differential Privacy
The library is under continuous development and backed by an active community of contributors
The library offers a handful of tutorials and scripts to get you up and running quickly; it also comes with an accompanying website
The library is only compatible with PyTorch models
Opacus implements the Differentially Private Stochastic Gradient Descent algorithm for the private training. Privacy Accounting is done under the framework of Rényi Differential Privacy. The implementation of the accounting procedure is based on that of TensorFlow-privacy.
The core component of Opacus is the Privacy Engine: by attaching it to your optimizer, you can train PyTorch models with Differential Privacy
To save you the hassle of going through the source code in order to figure out what each parameter means and how to use it, the table below contains a detailed explanation:
Before passing the model to the Privacy Engine, we must verify that it’s valid using the inspector functionality. The inspector checks whether all the layers of the model are compatible with the Privacy Engine:
from opacus.dp_model_inspector import DPModelInspector
inspector = DPModelInspector() # instantiate the model inspector
inspector.validate(model) # check the validity of the model
Compatible nn.Modules must meet the following requirements:
The module must be supported by autograd_grad_sample
The module must be in train mode
The number of groups in nn.Conv2d layers must be valid
The module must not contain BatchNorm layers
If the module contains an InstanceNorm layer, then track_running_stats must be set to False
The module must not be an LSTM
noise_multiplier: used to compute the standard deviation of the Gaussian distribution from which the noise is sampled. It is the ratio of the standard deviation of the Gaussian noise to the L2-sensitivity of the function to which the noise is added: the mean of the distribution is zero and its standard deviation equals noise_multiplier * L2-sensitivity (the L2-sensitivity being represented by max_grad_norm)
batch_size, sample_size and sample_rate are used in the privacy accounting procedure. If sample_rate is provided, the other two parameters are ignored; otherwise, they are mandatory
If noise_multiplier is not provided, then you must provide target_epsilon, target_delta and epochs. Based on these three parameters, the Privacy Engine will compute the level of noise to add so that, after the given number of epochs, the (target_epsilon, target_delta) privacy guarantee is met. This is a big plus for Opacus compared to TF-Privacy: if you’re not certain what value to choose for noise_multiplier, you can simply specify your target privacy budget epsilon and Opacus will automatically compute the corresponding noise_multiplier value
How to choose target_delta: A rule of thumb is to choose a value no bigger than the inverse of the size of the dataset (target_delta <= 1/sample_size)
max_grad_norm: The gradients that exceed this threshold will be clipped. This parameter should be chosen based on the sample that is at most risk of a privacy breach; take as an example an outlier or a sensitive sample. The intuition behind this is that we should be able to guarantee the privacy of the largest possible per-sample gradient
How to initialize the Privacy Engine:
# basic initialization of the privacy engine
from opacus import PrivacyEngine
privacy_engine = PrivacyEngine(
model, # the pytorch model
noise_multiplier=0.6,
max_grad_norm=1.0,
sample_rate=batch_size/len(training_dataset)
)
privacy_engine.attach(optimizer) # attach the engine to the optimizer
# second initialization without using noise_multiplier
from opacus import PrivacyEngine
privacy_engine = PrivacyEngine(
model, # the pytorch model
max_grad_norm=1.0,
sample_rate=batch_size/len(training_dataset),
target_delta=1e-6,
target_epsilon=6,
epochs=10  # required together with target_epsilon and target_delta (value is illustrative)
)
privacy_engine.attach(optimizer) # attach the engine to the optimizer
To cope with the memory footprint induced by per-sample gradient computation, Opacus uses an approach based on the idea of ‘virtual steps’ or ‘virtual batches’, as opposed to the microbatch approach used by TF-Privacy
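A rough sketch of a training loop with virtual steps, assuming the engine is already attached to the optimizer; the n_accumulation_steps value, the data loader and the criterion are our own placeholders:
n_accumulation_steps = 4  # effective (virtual) batch size = 4 * actual batch size

for i, (features, labels) in enumerate(train_loader):
    loss = criterion(model(features), labels)
    loss.backward()  # per-sample gradients are captured here via hooks

    if (i + 1) % n_accumulation_steps == 0:
        optimizer.step()       # real step: noise is added and the weights are updated
        optimizer.zero_grad()
    else:
        optimizer.virtual_step()  # clip and accumulate per-sample gradients, no weight update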
You can get the privacy budget spent at each step during training by simply calling the get_privacy_spent function and passing the target_delta as a parameter
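For example (assuming the privacy_engine from the snippets above; the delta value is ours):
epsilon, best_alpha = privacy_engine.get_privacy_spent(1e-6)  # pass your target_delta
print(f"Privacy budget spent so far: epsilon = {epsilon:.2f} at RDP order alpha = {best_alpha}")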
How to choose and interpret your privacy budget Epsilon?
Although Differential Privacy has a strong theoretical background, making it work in practice is a different story. In fact, there’s no golden rule for choosing the value of Epsilon; it is almost always assumed to be given, or chosen arbitrarily.
However, there are a few guidelines that you can follow when choosing the value of Epsilon:
On a high level, you should be thinking about balancing the trade-off between utility and privacy, as well as between performance and privacy. Keep in mind, however, that in many cases privacy can be achieved almost for free, with little to no impact on the model’s utility
The maximum privacy protection you need should be determined by the most vulnerable entry in the dataset. If this approach results in so much added noise that it hurts the accuracy of your model, then you should consider removing these entries or sampling from a different distribution
Consider running membership inference attack simulations to test your model’s privacy guarantee
For the people in a hurry, here is a summary of our comparison: