Vertical Federated XGBoost

This example shows how to use vertical federated learning with NVIDIA FLARE on tabular data. Here we use the optimized gradient boosting library XGBoost and leverage its federated learning support.

Before starting, make sure you have set up a virtual environment and installed the additional requirements:

python3 -m pip install -r requirements.txt

Preparing HIGGS Data

In this example we showcase a binary classification task based on the HIGGS dataset, which contains 11 million instances, each with 28 features and 1 class label.

Download and Store Dataset

First download the dataset from the HIGGS link above, which is a single zipped .csv file. By default, we assume the dataset is downloaded, uncompressed, and stored in DATASET_ROOT/HIGGS.csv.

Vertical Data Splits

In vertical federated learning, sites share overlapping data samples (rows) but hold different features (columns). To achieve this, we split the HIGGS dataset both horizontally and vertically. As a result, each site has an overlapping subset of the rows and a subset of the 29 columns. Since the first column of HIGGS is the class label, we give site-1 the label column for simplicity's sake.

[Figure: vertical FL diagram]

Run the following command to prepare the data splits:

./prepare_data.sh DATASET_ROOT

NOTE: make sure to put the correct path for DATASET_ROOT.
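To make the split concrete, here is a minimal pure-Python sketch of the idea on a toy table. It is not the logic of prepare_data.sh; the helper names (split_columns, split_rows, make_site_tables) and the contiguous column assignment are assumptions chosen for illustration. Every site receives the same leading block of rows plus a private slice, and the first site receives column 0, the label.

```python
def split_columns(num_columns, num_sites):
    """Assign contiguous column ranges to sites; the first site gets column 0 (the label)."""
    base, extra = divmod(num_columns, num_sites)
    ranges, start = [], 0
    for site in range(num_sites):
        width = base + (1 if site < extra else 0)
        ranges.append(range(start, start + width))
        start += width
    return ranges


def split_rows(num_rows, num_sites, overlap):
    """Give every site the same `overlap` leading rows plus a private slice of the rest."""
    shared = list(range(overlap))
    private = (num_rows - overlap) // num_sites
    site_rows = []
    for site in range(num_sites):
        start = overlap + site * private
        site_rows.append(shared + list(range(start, start + private)))
    return site_rows


def make_site_tables(rows, num_sites, overlap):
    """Build one table per site: a temporary uid column followed by that site's columns."""
    col_ranges = split_columns(len(rows[0]), num_sites)
    row_sets = split_rows(len(rows), num_sites, overlap)
    tables = []
    for cols, idxs in zip(col_ranges, row_sets):
        tables.append([[f"uid_{i}"] + [rows[i][c] for c in cols] for i in idxs])
    return tables


# Toy stand-in for HIGGS: 10 rows, 1 label column followed by 4 feature columns.
rows = [[i % 2] + [float(i * 10 + j) for j in range(1, 5)] for i in range(10)]
tables = make_site_tables(rows, num_sites=2, overlap=4)
```

In this toy run, both sites hold rows uid_0 through uid_3 in common, while the remaining rows and all feature columns are disjoint between them.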

Private Set Intersection (PSI)

Since not every site will have the same set of data samples (rows), we can use PSI to compare encrypted versions of the sites' datasets in order to jointly compute the intersection based on common IDs. In this example, the HIGGS dataset does not contain unique identifiers, so we add a temporary uid_{idx} to each instance and give each site a portion of the HIGGS dataset that includes a common overlap. Afterwards, the identifiers are dropped since they are only used for matching, and training is then done on the intersected data. To learn more about our PSI protocol implementation, see our psi example.

NOTE: The uid can be a composition of multiple variables with a transformation; in this example we use indices for simplicity. PSI can also be used for computing the intersection of overlapping features, but here we give each site unique features.
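The sketch below illustrates only the outcome that the PSI step produces, not the protocol itself: it intersects the uid values in plaintext, whereas real PSI works on encrypted values so that sites do not reveal their non-matching IDs. The helper names (intersect_ids, align_and_drop_uid) are hypothetical and the tables are toy data.

```python
def intersect_ids(site_ids):
    """Return the IDs present at every site, sorted so all sites use the same order."""
    common = set(site_ids[0])
    for ids in site_ids[1:]:
        common &= set(ids)
    return sorted(common)


def align_and_drop_uid(table, common_ids):
    """Keep rows whose uid is in the intersection, in a shared order, then drop the uid."""
    by_uid = {row[0]: row[1:] for row in table}
    return [by_uid[uid] for uid in common_ids]


# Toy tables: the first column is the temporary uid, the rest are that site's columns.
site_1 = [["uid_0", 1, 0.5], ["uid_1", 0, 0.7], ["uid_2", 1, 0.9]]
site_2 = [["uid_2", 3.3], ["uid_3", 4.4], ["uid_0", 1.1]]

common = intersect_ids([[r[0] for r in site_1], [r[0] for r in site_2]])
aligned_1 = align_and_drop_uid(site_1, common)
aligned_2 = align_and_drop_uid(site_2, common)
```

After alignment, row k at every site refers to the same sample, which is what vertical training needs once the identifiers are dropped.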

Create the psi job using the predefined psi_csv template:

nvflare job create -j ./jobs/vertical_xgb_psi -w psi_csv -sd ./code/psi -force

Run the psi job to calculate the intersection of the clients' datasets. The result is written to psi/intersection.txt inside the psi workspace:

nvflare simulator ./jobs/vertical_xgb_psi -w /tmp/nvflare/vertical_xgb_psi -n 2 -t 2

Vertical XGBoost Federated Learning with FLARE

This Vertical XGBoost example leverages the recently added vertical federated learning support in the XGBoost open-source library. This allows for the distributed XGBoost algorithm to operate in a federated manner on vertically split data.

For integrating with FLARE, we can use the predefined XGBFedController to run the federated server and control the workflow.

Next, we can use FedXGBHistogramExecutor and set XGBoost training parameters in config_fed_client.json, or define new training logic by overriding the xgb_train() method.

Lastly, we must subclass XGBDataLoader and implement the load_data() method. For vertical federated learning, it is important when creating the xgb.DMatrix to set data_split_mode=1 for column mode, and to indicate the label column by appending ?format=csv&label_column=0 to the csv file path. To support PSI, the dataloader can also read in the dataset based on the calculated intersection, and split the data into training and validation sets.
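As a rough sketch of the pieces a load_data() implementation might assemble, the helpers below build the csv URI and split rows into training and validation sets. The names dmatrix_uri and split_train_val are hypothetical, and the choice to add label_column=0 only on the site that owns the label is an assumption based on site-1 holding the label column. The xgb.DMatrix call is left as a comment because it needs xgboost and a running federated communicator.

```python
COLUMN_SPLIT_MODE = 1  # data_split_mode=1 selects column (vertical) mode


def dmatrix_uri(csv_path, owns_label):
    """Build the csv URI for xgb.DMatrix; only the label owner names a label column (assumption)."""
    label = "&label_column=0" if owns_label else ""
    return f"{csv_path}?format=csv{label}"


def split_train_val(rows, train_proportion):
    """Split rows in their existing order so every site cuts at the same position."""
    cut = int(len(rows) * train_proportion)
    return rows[:cut], rows[cut:]


train_rows, val_rows = split_train_val(list(range(10)), train_proportion=0.8)
train_uri = dmatrix_uri("train.csv", owns_label=True)

# With xgboost installed and the federated communicator set up, the matrix would be built as:
#   dtrain = xgb.DMatrix(train_uri, data_split_mode=COLUMN_SPLIT_MODE)
```

Splitting by position rather than at random keeps the training and validation rows aligned across sites, since each site only sees its own columns.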

NOTE: For secure mode, make sure to provide the required certificates for the federated communicator.

Run the Example

Create the vertical xgboost job using the predefined vertical_xgb template:

nvflare job create -j ./jobs/vertical_xgb -w vertical_xgb -sd ./code/vertical_xgb -force

Run the vertical xgboost job:

nvflare simulator ./jobs/vertical_xgb -w /tmp/nvflare/vertical_xgb -n 2 -t 2

The model will be saved to test.model.json.

(Feel free to modify the scripts and jobs as desired to change arguments such as number of clients, dataset sizes, training params, etc.)

GPU Support

By default, CPU based training is used.

To enable GPU accelerated training, first ensure that your machine has CUDA installed and has at least one GPU. In config_fed_client.json, set "use_gpus": true and "tree_method": "hist" in xgb_params. Then, in FedXGBHistogramExecutor, we can use the device parameter to map each rank to a GPU device ordinal in xgb_params. If using multiple GPUs, we can map each rank to a different GPU device; if using a single GPU, every rank can be mapped to the same device.
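One possible rank-to-device mapping is sketched below. The helper gpu_params and the round-robin rule are assumptions for illustration, not the executor's actual logic; the "cuda:<ordinal>" value format is the one accepted by XGBoost's device parameter in version 2.0 and later.

```python
def gpu_params(base_params, rank, num_gpus):
    """Return a copy of the training params with a GPU assigned round-robin by rank."""
    params = dict(base_params)
    params["tree_method"] = "hist"
    # With a single GPU every rank maps to cuda:0; with several, ranks are spread across them.
    params["device"] = f"cuda:{rank % num_gpus}"
    return params


base = {"max_depth": 8, "eta": 0.1}
single_gpu = [gpu_params(base, rank, num_gpus=1)["device"] for rank in range(2)]
multi_gpu = [gpu_params(base, rank, num_gpus=2)["device"] for rank in range(2)]
```

Copying the base dictionary keeps the shared parameters untouched, so the same base can be reused for each rank.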

We can create a GPU enabled job using the job CLI:

nvflare job create -j ./jobs/vertical_xgb_gpu -w vertical_xgb \
-f config_fed_client.conf \
-f config_fed_server.conf use_gpus=true \
-sd ./code/vertical_xgb \
-force

Run the GPU enabled job:

nvflare simulator ./jobs/vertical_xgb_gpu -w /tmp/nvflare/vertical_xgb_gpu -n 2 -t 2

Results

Model accuracy can be visualized in tensorboard:

tensorboard --logdir /tmp/nvflare/vertical_xgb/server/simulate_job/tb_events

An example training (pink) and validation (orange) AUC graph from running vertical XGBoost on HIGGS. (This run used an intersection of 50000 samples across 5 clients, each with different features, and ran for ~50 rounds due to early stopping.)

[Figure: Vertical XGBoost graph]