This example includes instructions on running FedAvg, FedProx, FedOpt, and SCAFFOLD algorithms using NVFlare's FL simulator.
For instructions of how to run CIFAR-10 in real-world deployment settings, see the example on "Real-world Federated Learning with CIFAR-10".
Install required packages for training:

```shell
pip install --upgrade pip
pip install -r ./requirements.txt
```
NOTE: We recommend using either a containerized deployment or a virtual environment; please refer to the getting started documentation.
Set `PYTHONPATH` to include the custom files of this example:

```shell
export PYTHONPATH=${PWD}/..
```
To speed up the following experiments, first download the CIFAR-10 dataset:

```shell
./prepare_data.sh
```
NOTE: This is important when running multi-task experiments or multiple clients on the same machine. Otherwise, each job will try to download the dataset to the same location, which might cause file corruption.
We are using NVFlare's FL simulator to run the following experiments.
The output root where results are saved is set in `./run_simulator.sh` as `RESULT_ROOT=/tmp/nvflare/sim_cifar10`.
We use an implementation to generate heterogeneous data splits from CIFAR-10 based on a Dirichlet sampling strategy from FedMA (https://github.com/IBM/FedMA), where `alpha` controls the amount of heterogeneity; see Wang et al.
We use `set_alpha.sh` to change the `alpha` value inside the job configurations.
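The sampling strategy can be sketched as follows. This is a simplified, self-contained illustration of Dirichlet label partitioning (function name and seed are illustrative), not the exact implementation used by this example:

```python
import numpy as np

def dirichlet_split(labels, num_clients, alpha, seed=0):
    """Partition sample indices across clients with a Dirichlet prior.

    Lower alpha -> more skewed (heterogeneous) label distributions per client.
    """
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_idx = [[] for _ in range(num_clients)]
    for cls in np.unique(labels):
        # Shuffle the indices belonging to this class
        cls_idx = rng.permutation(np.where(labels == cls)[0])
        # Draw per-client proportions for this class from Dirichlet(alpha)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(cls_idx)).astype(int)
        for i, part in enumerate(np.split(cls_idx, cuts)):
            client_idx[i].extend(part.tolist())
    return client_idx

# Toy example: 10 classes with 100 samples each, split across 8 clients
labels = np.repeat(np.arange(10), 100)
splits = dirichlet_split(labels, num_clients=8, alpha=0.1)
```

With small `alpha`, most clients end up holding samples from only a few classes, mimicking non-IID client data.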
To simulate a centralized training baseline, we run FL with 1 client for 25 local epochs but only for one round. It takes about 6 minutes on an NVIDIA TitanX GPU.

```shell
./run_simulator.sh cifar10_central 0.0 1 1
```
Note that `alpha=0.0` here means that no heterogeneous data splits are generated. You can visualize the training progress by running:

```shell
tensorboard --logdir=${RESULT_ROOT}
```
FedAvg (8 clients). Here we run for 50 rounds with 4 local epochs, corresponding roughly to the same number of iterations across clients as in the central baseline above (50 rounds * 4 epochs / 8 clients = 25). Each job will take about 40 minutes, depending on your system. You can copy the whole block into the terminal, and it will execute each experiment one after the other.

```shell
./run_simulator.sh cifar10_fedavg 1.0 8 8
./run_simulator.sh cifar10_fedavg 0.5 8 8
./run_simulator.sh cifar10_fedavg 0.3 8 8
./run_simulator.sh cifar10_fedavg 0.1 8 8
```
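Conceptually, each FedAvg round the server replaces the global model with a sample-count-weighted average of the client models. A minimal sketch of that aggregation step (illustrative only; in this example the aggregation is handled by NVFlare's workflow, not by user code like this):

```python
import numpy as np

def fed_avg(client_weights, client_sizes):
    """Aggregate per-client weight dicts by a sample-count-weighted average."""
    total = sum(client_sizes)
    agg = {}
    for name in client_weights[0]:
        # Weight each client's contribution by its share of the total data
        agg[name] = sum(
            w[name] * (n / total) for w, n in zip(client_weights, client_sizes)
        )
    return agg

# Toy example: two clients, one parameter tensor, 100 vs. 300 samples
clients = [{"w": np.array([1.0, 2.0])}, {"w": np.array([3.0, 4.0])}]
global_w = fed_avg(clients, client_sizes=[100, 300])
# 0.25 * [1, 2] + 0.75 * [3, 4] = [2.5, 3.5]
```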
Next, let's try some different FL algorithms on a more heterogeneous split:
FedProx adds a regularizer to the loss used in `CIFAR10ModelLearner` (`fedproxloss_mu`):

```shell
./run_simulator.sh cifar10_fedprox 0.1 8 8
```
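The regularizer penalizes how far the local weights drift from the current global model during local training. A minimal numeric sketch of the penalty term (`fedprox_penalty` is an illustrative name; the example applies this inside the learner, controlled by `fedproxloss_mu`):

```python
import numpy as np

def fedprox_penalty(local_params, global_params, mu):
    """FedProx proximal term: (mu / 2) * ||w_local - w_global||^2,
    added to the usual task loss during local training."""
    return 0.5 * mu * sum(
        float(np.sum((w - g) ** 2))
        for w, g in zip(local_params, global_params)
    )

# Toy example: one 2-element parameter, drifted 1.0 away in each dimension
penalty = fedprox_penalty(
    [np.array([1.0, 1.0])], [np.array([0.0, 0.0])], mu=0.1
)
# 0.5 * 0.1 * (1 + 1) = 0.1
```

Larger `mu` pulls local models more strongly toward the global model, which can stabilize training on heterogeneous data at the cost of slower local adaptation.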
FedOpt uses a new ShareableGenerator to update the global model on the server using a PyTorch optimizer, here SGD with momentum and cosine learning rate decay:

```shell
./run_simulator.sh cifar10_fedopt 0.1 8 8
```
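The idea is to treat the averaged client update as a pseudo-gradient and feed it to a server-side optimizer instead of overwriting the global model directly. A minimal sketch using plain SGD with momentum (function name and `lr`/`beta` defaults are illustrative, not this example's configuration, which also adds cosine learning rate decay):

```python
def fedopt_server_update(global_w, client_avg_w, momentum_buf, lr=1.0, beta=0.9):
    """One server-side SGD-with-momentum step on the pseudo-gradient
    (global weights minus the averaged client weights)."""
    new_w, new_buf = {}, {}
    for name in global_w:
        pseudo_grad = global_w[name] - client_avg_w[name]
        new_buf[name] = beta * momentum_buf.get(name, 0.0) + pseudo_grad
        new_w[name] = global_w[name] - lr * new_buf[name]
    return new_w, new_buf

# Toy example: with lr=1.0 and an empty momentum buffer, one step
# recovers plain FedAvg (the global weight moves to the client average)
w, buf = fedopt_server_update({"w": 1.0}, {"w": 0.0}, {})
```

With momentum accumulated across rounds, the server can keep moving in a consistent direction even when individual rounds produce noisy updates.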
SCAFFOLD uses a slightly modified version of the CIFAR-10 Learner implementation, namely `CIFAR10ScaffoldLearner`, which adds a correction term during local training, following the implementation described in Li et al.:

```shell
./run_simulator.sh cifar10_scaffold 0.1 8 8
```
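A minimal sketch of SCAFFOLD's corrected local update and control-variate bookkeeping, using illustrative scalar weights (function names are illustrative; the example implements this inside `CIFAR10ScaffoldLearner`):

```python
def scaffold_local_step(w, grad, lr, c_global, c_local):
    """Corrected local SGD step: w <- w - lr * (grad - c_local + c_global).
    The control variates counteract client drift under heterogeneous data."""
    return w - lr * (grad - c_local + c_global)

def scaffold_update_control(c_local, c_global, w_global, w_local, num_steps, lr):
    """Client control-variate update after local training:
    c_i <- c_i - c + (w_global - w_local) / (num_steps * lr)."""
    return c_local - c_global + (w_global - w_local) / (num_steps * lr)

# Toy example: one corrected step starting from w = 1.0
w_new = scaffold_local_step(w=1.0, grad=0.5, lr=0.1, c_global=0.1, c_local=0.2)
# 1.0 - 0.1 * (0.5 - 0.2 + 0.1) = 0.96
```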
If you have several GPUs available in your system, you can run simulations in parallel by adjusting `CUDA_VISIBLE_DEVICES`. For example, you can run the following commands in two separate terminals.

```shell
export CUDA_VISIBLE_DEVICES=0
./run_simulator.sh cifar10_fedavg 0.1 8 8
```

```shell
export CUDA_VISIBLE_DEVICES=1
./run_simulator.sh cifar10_scaffold 0.1 8 8
```
NOTE: You can run all experiments mentioned in Section 3 using the `run_experiments.sh` script.
Let's summarize the results of the experiments run above. First, we compare the final validation scores of the global models for the different settings. In this example, all clients compute their validation scores using the same CIFAR-10 test set. The plotting script used for the graphs below is `./figs/plot_tensorboard_events.py`.

NOTE: To generate the plots, you need to install the packages listed in `./plot-requirements.txt`.
With a data split using `alpha=1.0`, i.e. a non-heterogeneous split, we achieve the following final validation scores. One can see that FedAvg achieves performance similar to central training.
| Config | Alpha | Val score |
|---|---|---|
| cifar10_central | 1.0 | 0.8798 |
| cifar10_fedavg | 1.0 | 0.8854 |
We also tried different `alpha` values, where lower values cause higher heterogeneity. This can be observed in the resulting performance of the FedAvg algorithm.
| Config | Alpha | Val score |
|---|---|---|
| cifar10_fedavg | 1.0 | 0.8854 |
| cifar10_fedavg | 0.5 | 0.8633 |
| cifar10_fedavg | 0.3 | 0.8350 |
| cifar10_fedavg | 0.1 | 0.7733 |
Finally, we compare an `alpha` setting of 0.1, causing high client data heterogeneity, and its impact on the more advanced FL algorithms, namely FedProx, FedOpt, and SCAFFOLD. FedOpt and SCAFFOLD achieve better performance compared to FedAvg and FedProx with the same `alpha` setting, and show markedly better convergence rates. SCAFFOLD achieves this by adding a correction term when updating the client models, while FedOpt utilizes SGD with momentum to update the global model on the server. Both achieve better performance with the same number of training steps as FedAvg/FedProx.
| Config | Alpha | Val score |
|---|---|---|
| cifar10_fedavg | 0.1 | 0.7733 |
| cifar10_fedprox | 0.1 | 0.7615 |
| cifar10_fedopt | 0.1 | 0.8013 |
| cifar10_scaffold | 0.1 | 0.8222 |