This example includes instructions on running FedAvg with TensorBoard metrics streamed to the server during training, as well as with homomorphic encryption. It uses provisioning and the admin API to submit jobs, similar to how one would set up experiments in a real-world deployment. For more information on real-world FL, see here.
For instructions on how to run CIFAR-10 with the FL simulator to compare different FL algorithms, see the example on "Simulated Federated Learning with CIFAR-10".
Install required packages for training
pip install --upgrade pip
pip install -r ./requirements.txt
NOTE: We recommend using either a containerized deployment or a virtual environment; please refer to getting started.
Set PYTHONPATH to include the custom files of this example:
export PYTHONPATH=${PWD}/..
To speed up the following experiments, first download the CIFAR-10 dataset:
./prepare_data.sh
NOTE: This is important for running multi-task experiments or running multiple clients on the same machine. Otherwise, each job will try to download the dataset to the same location, which might cause file corruption.
The next scripts will start the FL server and 8 clients automatically to run FL experiments on localhost. In this example, we run all 8 clients on one GPU, requiring at least 8 GB of memory per job.
The project file for creating the secure workspace used in this example is shown at ./workspaces/secure_project.yml.
To create the secure workspace, use the following commands to build a package and copy it to secure_workspace for later experimentation.
cd ./workspaces
nvflare provision -p ./secure_project.yml
cp -r ./workspace/secure_project/prod_00 ./secure_workspace
cd ..
For more information about secure provisioning see the documentation.
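As a quick sanity check, you can list the newly created workspace; it should contain one startup kit per participant defined in secure_project.yml (the exact folder names depend on that project file):
ls ./workspaces/secure_workspace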
In this example, we assume N_GPU local GPUs, each with at least 8 GB of memory, are available on the host system. To find the number of available GPUs, run
export N_GPU=$(nvidia-smi --list-gpus | wc -l)
echo "There are ${N_GPU} GPUs available."
We can change the clients' local GPUResourceManager configurations to show the available N_GPU GPUs at each client.
Each client needs about 1 GB of GPU memory to run an FL experiment with the CIFAR-10 dataset. Therefore, each client requests 1 GB of memory so that 8 clients can run in parallel on the same GPU.
To request the GPU memory, set the "mem_per_gpu_in_GiB" value in the job's meta.json file.
To update the clients' available resources, we copy resources.json.default to resources.json and modify it as follows:
n_clients=8
for id in $(eval echo "{1..$n_clients}")
do
client_local_dir=workspaces/secure_workspace/site-${id}/local
cp ${client_local_dir}/resources.json.default ${client_local_dir}/resources.json
sed -i "s|\"num_of_gpus\": 0|\"num_of_gpus\": ${N_GPU}|g" ${client_local_dir}/resources.json
sed -i "s|\"mem_per_gpu_in_GiB\": 0|\"mem_per_gpu_in_GiB\": 1|g" ${client_local_dir}/resources.json
done
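To confirm the changes took effect, you can inspect one of the modified files (site-1 is used here just as an example):
grep -E '"num_of_gpus"|"mem_per_gpu_in_GiB"' workspaces/secure_workspace/site-1/local/resources.json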
In the meta.json of each job, we can request 1 GB of memory for each client. Hence, the FL system will schedule at most N_GPU jobs to run in parallel.
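To see how this request is declared, you can grep the meta.json of one of the jobs; the folder name cifar10_fedavg is used here only as an example of a job folder under jobs, and each site entry in the job's resource specification is assumed to request 1 GiB per GPU:
# cifar10_fedavg is used as an example job folder; adjust to the job you want to inspect
grep -B 2 -A 2 '"mem_per_gpu_in_GiB"' jobs/cifar10_fedavg/meta.json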
To start the FL system with 8 clients in the secure workspace, run
./start_fl_secure.sh 8
To run FL experiments in POC mode, create your local FL workspace with the command below. In the following experiments, we will be using 8 clients. Press y and Enter when prompted.
nvflare poc prepare -n 8
By default, POC will create startup kits at /tmp/nvflare/poc.
NOTE: POC stands for "proof of concept" and is used for quick experimentation with different numbers of clients. It doesn't need any advanced configuration when provisioning the startup kits for the server and clients.
The secure workspace, on the other hand, is needed to run experiments that require encryption keys, such as the homomorphic encryption (HE) experiment shown below. These startup kits allow secure deployment of FL in real-world scenarios using SSL-certified communication channels.
We can apply the same resource management settings in POC mode as in the secure mode above. Note that POC already provides resources.json, so copying the default file is not necessary. Either set the available N_GPU manually or use the command above.
n_clients=8
for id in $(eval echo "{1..$n_clients}")
do
client_local_dir=/tmp/nvflare/poc/site-${id}/local
sed -i "s|\"num_of_gpus\": 0|\"num_of_gpus\": ${N_GPU}|g" ${client_local_dir}/resources.json
sed -i "s|\"mem_per_gpu_in_GiB\": 0|\"mem_per_gpu_in_GiB\": 1|g" ${client_local_dir}/resources.json
done
Then, start the FL system with 8 clients by running
export PYTHONPATH=${PWD}/..
nvflare poc start
For details about resource management and consumption, please refer to the documentation.
Next, we will submit jobs to start FL training automatically.
The submit_job.sh script follows this pattern:
./submit_job.sh [job] [alpha]
If you want to use the POC workspace, append --poc to this command, e.g.:
./submit_job.sh [job] [alpha] --poc
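For instance, submitting the plain FedAvg job with alpha set to 1.0 (the same configuration that appears in the results table below) would look like:
./submit_job.sh cifar10_fedavg 1.0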
In this simulation, the server will split the CIFAR-10 dataset to simulate each client having different data distributions.
The job argument controls which experiment job to submit. The respective folder under jobs will be submitted for scheduling using the admin API with submit_job.py. The admin API script (submit_job.py) also overwrites the alpha value inside the job configuration file, depending on the provided command-line argument.
Jobs will be executed automatically depending on the available resources at each client (see "Multi-tasking" section).
For details about jobs, please refer to the documentation.
We use an implementation to generate heterogeneous data splits from CIFAR-10 based on a Dirichlet sampling strategy from FedMA (https://github.com/IBM/FedMA), where alpha controls the amount of heterogeneity; see Wang et al. For more information on how alpha impacts FL model training, see the example here.
In a real-world scenario, the researcher won't have access to the TensorBoard events of the individual clients. In order to visualize the training performance in a central place, AnalyticsSender and ConvertToFedEvent on the client, and TBAnalyticsReceiver on the server can be used.
For an example using FedAvg and metric streaming during training, run:
./submit_job.sh cifar10_fedavg_stream_tb 1.0
Using this configuration, a tb_events folder will be created under the [JOB_ID] folder of the server that includes all the TensorBoard event values of the different clients.
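Assuming TensorBoard is installed locally and the job result has been downloaded with download_job (as described below), one way to inspect these events is to point TensorBoard at the downloaded folder; the path below is illustrative and depends on where the result was downloaded:
# illustrative path; adjust to where download_job placed the result
tensorboard --logdir [DOWNLOAD_DIR]/[JOB_ID]/workspace/tb_events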
NOTE: You can always use the admin console to manually abort a running job using
abort_job [JOB_ID]
For a complete list of admin commands, see here.
To log into the POC workspace admin console, no username is required (use "admin" for commands requiring confirmation with a username).
For the secure workspace admin console, use the username "[email protected]".
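Once logged in, a typical check before aborting might look like the following at the admin console prompt (list_jobs and check_status are standard admin console commands; [JOB_ID] is a placeholder):
check_status server
list_jobs
abort_job [JOB_ID]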
After training, each client's best model will be used for cross-site validation. The results can be downloaded and shown with the admin console using
download_job [JOB_ID]
where [JOB_ID] is the ID assigned by the system when submitting the job.
The result will be downloaded to your admin workspace (the exact download path will be displayed when running the command). You should see the cross-site validation results at
[DOWNLOAD_DIR]/[JOB_ID]/workspace/cross_site_val/cross_val_results.json
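To quickly view the downloaded results, you could pretty-print the JSON file, e.g. with Python's built-in json.tool (the placeholders are the same as above):
python -m json.tool [DOWNLOAD_DIR]/[JOB_ID]/workspace/cross_site_val/cross_val_results.json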
Next, we run FedAvg using homomorphic encryption (HE) for secure aggregation on the server in the non-heterogeneous setting (alpha=1).
NOTE: For HE, we need to use the securely provisioned workspace. Runs will also take longer due to the additional encryption, decryption, encrypted aggregation, and increased encrypted message sizes involved.
FedAvg with HE:
./submit_job.sh cifar10_fedavg_he 1.0
NOTE: Currently, FedOpt is not supported with HE as it would involve running the optimizer on encrypted values.
You can use ./run_experiments.sh to submit all above-mentioned experiments at once if preferred. This script uses the secure workspace so that it also supports the HE experiment.
Let's summarize the results of the experiments run above. First, we will compare the final validation scores of the global models for different settings. In this example, all clients compute their validation scores using the same CIFAR-10 test set. The plotting script used for the graphs below is ./figs/plot_tensorboard_events.py.
To use it, download all job results using the download_job admin command and specify the download_dir in ./figs/plot_tensorboard_events.py.
NOTE: You need to install the packages listed in ./plot-requirements.txt for plotting.
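A minimal plotting sequence could look like this, assuming download_dir has already been set inside the script:
pip install -r ./plot-requirements.txt
# set download_dir inside the script before running
python ./figs/plot_tensorboard_events.py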
With a data split using alpha=1.0, i.e. a non-heterogeneous split, we achieve the following final validation scores. One can see that FedAvg can achieve performance similar to central training, and that HE does not significantly impact the accuracy of FedAvg while adding security to the aggregation step with relatively minor computational overhead.
| Config | Alpha | Val score | Runtime |
|---|---|---|---|
| cifar10_fedavg | 1.0 | 0.8854 | 41m |
| cifar10_fedavg_he | 1.0 | 0.8897 | 1h 10m |