Labelia (ex Substra Foundation)

[Part 2/2] Using Distributed Learning for Deepfake Detection

[Part 2/2] Implementation of a Deepfake Detector with the Substra Framework

by Fabien Gelus

In the first part of this article, we introduced a secure, traceable and distributed ML approach for a deepfake detection benchmark using the Substra framework. In this second part, we present the technical details of the Substra framework and walk through the whole process of implementing an example on Substra, so that you can submit your own algorithms or add your dataset to the Substra network.

Substra Framework

Basic schema of the deepfake detector benchmark process using the Substra framework, as described in Part 1.

The Substra framework is built to manage assets in a multi-partner approach. These assets are implemented by users with the framework-agnostic library substratools (available on PyPI).

With a simple Python script, you can easily upload your detection algorithm and execute training and testing tasks remotely on our public instance “TestNet” thanks to the Substra SDK and CLI.

How does it work?

The principles driving the development of Substra are:

  • Data locality: Datasets remain in their owner’s data stores and are never transferred! AI models travel from one dataset to another to be trained according to the compute plan design.

  • Decentralized trust: All operations are orchestrated by the distributed ledger technology Hyperledger Fabric. There is no need for a single trusted actor or third party: security arises from the network. The only data shared across the network is non-critical metadata about the compute plan: which data was used with which model, for what performance. It is built by consensus and cannot be corrupted.

  • Traceability: An immutable audit trail (the ledger) registers all the operations realized on the platform, simplifying the certification of models and helping trace a model’s genealogy.

  • Modularity: The Substra framework is highly flexible: various permission regimes and workflow structures can be enforced to match each specific use case.


Substra is decentralized: it runs on, and connects to, a set of machines in a private network. It is made of three parts: distributed nodes, a metadata network and a model network.

All of the concepts below are assets (basically, sets of files), each associated with a unique identifier (a hash) on the platform. For more details about each asset, see the documentation.

Class diagram of all the assets and the links between them. test_only is a boolean indicating whether data is dedicated to testing; the testtuple associated with such data is then certified.

If you only want to quickly train/test a detection algorithm:

In our public TestNet, all the assets are already implemented and available open source, except for the Algo part. You can download them from the example’s repository.

In order to submit your detection algorithm, you just need to implement your model as well as the train()/predict() and save_model()/load_model() methods in a Python script following the algo.py template (a minimal sketch follows the tree below). Don’t forget to add your script’s dependencies in a Dockerfile, along with a description of your algorithm.

└── assets/

    └── my_algo/

        ├── Dockerfile

        ├── algo.py

        └── description.md
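
As an illustration, here is a minimal sketch of what algo.py could look like, following the substratools Algo interface used throughout the Substra examples (the model itself and its serialization are placeholders to fill in):

import substratools as tools

class MyAlgo(tools.algo.Algo):
    def train(self, X, y, models, rank):
        # X, y: features and labels provided by the Opener (here, video
        # file paths and deepfake labels); models: input models when tasks
        # are chained; rank: the task's rank within a compute plan.
        model = ...  # build and fit your detection model here
        return model

    def predict(self, X, model):
        # Return one deepfake probability per input video.
        return model.predict(X)

    def load_model(self, path):
        ...  # deserialize the model written by save_model

    def save_model(self, model, path):
        ...  # serialize the model so Substra can pass it between tasks

if __name__ == '__main__':
    tools.algo.execute(MyAlgo())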

You can now test your Algo locally using substratools commands, and run a local test in a Docker environment with the CLI command substra run-local.

If your local test succeeds, you can then upload your Algo to the public node and train/test your model using the Substra SDK (a sketch of the upload script follows the tree below).

└── scripts/

    ├── add_algo.py

    ├── train_algo.py

    └── test_algo.py
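
For reference, add_algo.py boils down to a few SDK calls. The sketch below uses the dict-based asset specs of the Substra SDK; the node URL, archive path and permissions are illustrative, and field names may differ between SDK versions:

import substra

# Connect to the public node (hypothetical URL; credentials are
# configured beforehand, e.g. through the Substra CLI).
client = substra.Client(url='http://testnet.example.org')

# The Algo is uploaded as an archive containing algo.py and its Dockerfile.
algo_key = client.add_algo({
    'name': 'my deepfake detector',
    'file': 'assets/my_algo/algo.tar.gz',
    'description': 'assets/my_algo/description.md',
    'permissions': {'public': True, 'authorized_ids': []},
})
# Keep the returned key: traintuples and testtuples reference the Algo by key.
print('Algo key:', algo_key)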

During this process, you can follow the progress of your training or testing task using the Substra CLI.

You can see a Substra example of a deepfake detection Algo on our repository.

If you want to add your dataset to the TestNet:

  • If you want to host the data on your machine, you will need to create a local Substra node in order to join the network and allow users to train and/or test their models on your data without ever letting them access it. If you’re interested, please check our installation tutorial and do not hesitate to reach out to us on Slack!

  • Otherwise, you can directly upload your dataset on our public node “TestNet” using the Substra CLI.

In both cases, you will need to create your assets following the substratools template. 

Here are the steps for the creation of a “Substra Example”:

  • initialise your example’s repository: create the following folders and the .gitignore/README files.

└── my_example/

    ├── assets/

    ├── data/

    ├── scripts/

    ├── .gitignore

    └── README.md

  • add your dataset(s) to your example:

└── data/

    └── my_dataset/

        ├── video1.mp4

        ├── video2.mp4

        ├── ...

        └── labels.json

The features and labels of your data will be managed by the generate_data_samples script. If your dataset stores its labels elsewhere, you will have to adapt the script accordingly.

  • implement a script to generate data samples for each dataset. This script generates the data to be registered in Substra (a sketch is given below, after the generated layout).

└── scripts/

    └── generate_data_samples.py

This should create two sub-folders in the assets folder: train_data_samples and test_data_samples. Each contains one sub-folder per train/test data sample, holding the features and labels for every data point of that sample.

└── assets/

    ├── train_data_samples/

    │   ├── data_sample_0/

    │   │   ├── x_data_sample_0.npy

    │   │   └── y_data_sample_0.npy

    │   └── ...

    └── test_data_samples/

Having multiple train data samples means users will later be able to finely select their training set.
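
A minimal sketch of such a generation script, assuming labels.json maps each video filename to a 0/1 label (the paths and the number of samples are illustrative):

import json
import math
import os
import numpy as np

N_TRAIN_SAMPLES = 8  # granularity with which the training set can be selected

with open('data/my_dataset/labels.json') as f:
    labels = json.load(f)
videos = sorted(labels)

chunk = math.ceil(len(videos) / N_TRAIN_SAMPLES)
for i in range(N_TRAIN_SAMPLES):
    subset = videos[i * chunk:(i + 1) * chunk]
    folder = 'assets/train_data_samples/data_sample_%d' % i
    os.makedirs(folder, exist_ok=True)
    # Features are the video paths; labels are the corresponding classes.
    x = np.array(['data/my_dataset/' + v for v in subset])
    y = np.array([labels[v] for v in subset], dtype=float)
    np.save(os.path.join(folder, 'x_data_sample_%d.npy' % i), x)
    np.save(os.path.join(folder, 'y_data_sample_%d.npy' % i), y)
# test_data_samples are generated the same way from a held-out split.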

  • implement the Opener to read features and labels (return features as paths for huge objects such as videos; a sketch follows the tree below). Provide a granular description of your data in a description file, so end users can implement an Algo without ever accessing your data.

└── assets/

    └── dataset/

        ├── opener.py

        └── description.md
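
Here is a sketch of an Opener returning video paths as features, based on the substratools Opener interface (the .npy layout is the one generated above; the fake_X/fake_y methods feed dry runs such as substra run-local, and the _find helper is purely illustrative):

import json
import os
import numpy as np
import substratools as tools

def _find(folder, prefix):
    # Locate the x_/y_ .npy file inside a data sample folder.
    name = [f for f in os.listdir(folder) if f.startswith(prefix)][0]
    return os.path.join(folder, name)

class VideoOpener(tools.Opener):
    def get_X(self, folders):
        # Return features as paths: the videos themselves are never loaded here.
        return list(np.concatenate([np.load(_find(f, 'x_')) for f in folders]))

    def get_y(self, folders):
        return np.concatenate([np.load(_find(f, 'y_')) for f in folders])

    def save_predictions(self, y_pred, path):
        with open(path, 'w') as f:
            json.dump([float(p) for p in y_pred], f)

    def get_predictions(self, path):
        with open(path) as f:
            return np.array(json.load(f))

    def fake_X(self):
        return ['fake_video_%d.mp4' % i for i in range(4)]

    def fake_y(self):
        return np.array([0.0, 1.0, 0.0, 1.0])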

  • implement the Algo (see the previous section, or use the basic Algo given in our repository in order to test your assets).

└── assets/

    └── algo_test/

        ├── Dockerfile

        ├── algo.py

        └── description.md

  • the Objective is common to all datasets: detect deepfakes or, more precisely, predict the probability that a video is a deepfake. You can find the objective as well as the implemented metric (log-loss, sketched after the tree below) in our repository.

└── assets/

    └── objective/

        ├── Dockerfile

        ├── metrics.py

        └── description.md
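
The metric itself is short. Here is a sketch using the substratools Metrics interface and scikit-learn’s log-loss (illustrative; scikit-learn must then be listed as a dependency in the Dockerfile):

import substratools as tools
from sklearn.metrics import log_loss

class DeepfakeMetrics(tools.Metrics):
    def score(self, y_true, y_pred):
        # Log-loss heavily penalizes confident wrong predictions,
        # which suits probabilistic deepfake detection.
        return log_loss(y_true, y_pred, labels=[0, 1])

if __name__ == '__main__':
    tools.metrics.execute(DeepfakeMetrics())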

  • test each of your assets locally using substratools commands, and run a local test in a Docker environment with the CLI command substra run-local.

  • implement a script to add your dataset to the Substra public node using the Substra SDK (a sketch follows below).

└── scripts/

    └── add_dataset.py
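
A sketch of add_dataset.py, again with the dict-based SDK specs (the node URL and values are illustrative, and field names may vary across SDK versions):

import os
import substra

client = substra.Client(url='http://testnet.example.org')  # hypothetical node URL

# Register the dataset, i.e. the opener and its description.
dataset_key = client.add_dataset({
    'name': 'my deepfake dataset',
    'type': 'videos',
    'data_opener': 'assets/dataset/opener.py',
    'description': 'assets/dataset/description.md',
    'permissions': {'public': True, 'authorized_ids': []},
})

# Register every train data sample folder generated earlier.
root = 'assets/train_data_samples'
client.add_data_samples({
    'paths': [os.path.join(root, d) for d in sorted(os.listdir(root))],
    'data_manager_keys': [dataset_key],
    'test_only': False,
})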

  • follow the progress and check the results using the CLI.

Feel free to check our Examples Repository for step-by-step tutorials on Titanic data, MNIST, MNIST-DP and the Deepfake Detection example used on our TestNet.

My deepfake detection example

This example is a Substra implementation of a deepfake detector. It is based on the DFDC challenge from Kaggle and uses samples from the DFDC dataset. I implemented a first Algo based on the inference demo Kaggle notebook from humananalog, chosen for its lightweight but efficient ResNet model.

The structure of this example is inspired by Substra’s Titanic example. As of now, we do not have the right to distribute the DFDC dataset publicly, so you will need to download the data samples directly from Kaggle if you want to test it.

However, the Algo should work on any dataset, the only condition being that its input must be video file paths. I made this choice to limit the memory allocated to the algorithm when it loads several videos per batch: loading and preprocessing are then managed by the Algo only. Accordingly, the script that generates data samples registers videos as features, and the Opener only gives the Algo the paths to these videos along with the corresponding labels.

Once I had tested all my assets in my local Python environment with the substratools commands, I could use substra run-local to run them in a Docker container. It ran a training task on 320 videos and a prediction task on 80 videos, and returned the score after at least one hour (running on 4 CPUs).

I could then add the objective, opener and data samples to a Substra public node (more info in the next section) with a single script, using the Substra Python SDK. Its main job is to create assets, retrieve their keys and use those keys in the creation of the other assets.

With another script, I pushed my Algo to Substra and then used the assets_keys.json file generated above to train it against the dataset and objective we previously set up. The script updates assets_keys.json with the newly created asset keys (Algo, traintuple and testtuple).
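
In SDK terms, the training and testing steps amount to something like the sketch below (dict-based specs again; treat the exact field names, in particular how the testtuple references the objective, as assumptions that depend on the Substra version):

import json
import substra

client = substra.Client(url='http://testnet.example.org')  # hypothetical

with open('assets_keys.json') as f:
    keys = json.load(f)

# Train the uploaded Algo on the registered train data samples.
traintuple_key = client.add_traintuple({
    'algo_key': keys['algo_key'],
    'data_manager_key': keys['dataset_key'],
    'train_data_sample_keys': keys['train_data_sample_keys'],
})

# Evaluate the resulting model on the objective's certified test data.
testtuple_key = client.add_testtuple({
    'traintuple_key': traintuple_key,
    'objective_key': keys['objective_key'],
})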

The script ends by printing a couple of commands you can use to track the progress of the traintuple and testtuple tasks, as well as the associated scores. Alternatively, you can browse the frontend to look up progress and scores.

A screenshot of the Substra Frontend, accessible with your browser

To watch them “live”, you can re-run those commands periodically (for instance with watch).

Note: this example has been fully tested on a first TestNet running Substra 0.6, with a score of 0.16 (log-loss metric) after a 2-hour compute plan on 4 CPUs (and no GPU).

TestNet presentation

If you want to create your own Substra node to host data, you will need a dedicated server to deploy the Substra backend. This server must meet the following requirements:

  • Hardware requirements

    • CPU: Minimum: 2-core, Recommended: 4-core

    • GPU: Minimum: None, Recommended: 8GB Memory (NVIDIA GTX 1080 or equivalent)

    • RAM: Minimum: 8 GB for Kubernetes, Recommended: 12 GB

    • Hard drive: Minimum: 50 GB of free space for Docker and Kubernetes images, plus the size of your dataset

  • Software configuration

Please refer to the general documentation and the setup section.

For the first launch of our public TestNet, we used a VM with the minimum hardware requirements. If you plan to train/test your algo, upload a dataset or add a Substra node to our TestNet, contact us so we can configure your credentials.

Openness

One strength of this project is that it relies on open-source software. This means the code remains auditable, usable and remixable: you can inspect it, use it, fork it, contribute to it, and reach out to its contributors and community with any question!

Another great point is the collaborative approach embodied by the Substra Foundation, which goes beyond purely technical considerations. We at Substra Foundation seek to gather a broad community of actors interested in responsible data science, privacy-preserving machine learning and impactful multi-partner achievements!

You can help us improve deepfake detection and make the internet a better place.

  • By submitting your dataset, you can improve the detection models and gain recognition for your data when models trained on it achieve great performance. You will also improve our benchmark by making the test set more heterogeneous and representative of the diversity of deepfakes.

  • By submitting your algorithm, you can improve your model by training it on several datasets and be recognized for its performance on a private, challenging test set. Your score can be part of our public leaderboard and your model’s traceability will be certified.

  • By contributing to the open-source Substra project, you can help us build a more secure and better-documented framework and be recognized as an official contributor to this growing project.

Keep in touch

Need some help or additional information? Come chat on Slack!