Skip to content

Commit

Permalink
adding task runner api example (#1199)
Browse files Browse the repository at this point in the history
* adding task runner example

* corrected typo error

* adding readme

* adding task runner example

* corrected typo error

* adding readme

* update README.md

* removed unwanted part

* requested changes done

* moved to torch_cnn_mnist

* changes done

* added changes

* fixing code format

* fixing issue

* modified changes requested

* update plan.yaml

* updated plan.yaml
  • Loading branch information
rajithkrishnegowda authored Dec 11, 2024
1 parent 7e742dd commit f622efa
Show file tree
Hide file tree
Showing 12 changed files with 501 additions and 161 deletions.
7 changes: 4 additions & 3 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -40,12 +40,13 @@ OpenFL supports two APIs to set up a Federated Learning experiment:
- [Task Runner API](https://openfl.readthedocs.io/en/latest/about/features_index/taskrunner.html):
Define an experiment and distribute it manually. All participants can verify model code and [FL plan](https://openfl.readthedocs.io/en/latest/about/features_index/taskrunner.html#federated-learning-plan-fl-plan-settings) prior to execution. The federation is terminated when the experiment is finished. This API is meant for enterprise-grade FL experiments, including support for mTLS-based communication channels and TEE-ready nodes (based on Intel® SGX).

The quickest way to start testing the [TaskRunner API](https://openfl.readthedocs.io/en/latest/about/features_index/taskrunner.html) for managing and automating your tasks efficiently is to follow the steps outlined in the documentation. The TaskRunner API provides a simple and flexible interface to define, execute, and monitor tasks, making it an ideal choice for users looking to quickly integrate task automation into their projects.
<br/>
Read the [GitHub README File](https://github.com/securefederatedai/openfl/tree/develop/openfl-workspace/torch_cnn_mnist/README.md) explaining steps to train a model with OpenFL. <br/>

- [Workflow API](https://openfl.readthedocs.io/en/latest/about/features_index/workflowinterface.html) ([*experimental*](https://openfl.readthedocs.io/en/latest/developer_guide/experimental_features.html)):
Create complex experiments that extend beyond traditional horizontal federated learning. This API enables an experiment to be simulated locally, then seamlessly scaled to a federated setting. See the [experimental tutorials](https://github.com/securefederatedai/openfl/blob/develop/openfl-tutorials/experimental/workflow/) to learn how to coordinate [aggregator validation after collaborator model training](https://github.com/securefederatedai/openfl/tree/develop/openfl-tutorials/experimental/workflow/102_Aggregator_Validation.ipynb), [perform global differentially private federated learning](https://github.com/psfoley/openfl/tree/experimental-workflow-interface/openfl-tutorials/experimental/workflow/Global_DP), measure the amount of private information embedded in a model after collaborator training with [privacy meter](https://github.com/securefederatedai/openfl/blob/develop/openfl-tutorials/experimental/workflow/Privacy_Meter/readme.md), or [add a watermark to a federated model](https://github.com/securefederatedai/openfl/blob/develop/openfl-tutorials/experimental/workflow/301_MNIST_Watermarking.ipynb).

The quickest way to test OpenFL is to follow the [online documentation](https://openfl.readthedocs.io/en/latest/index.html) to launch your first federation.<br/>
Read the [blog post](https://medium.com/openfl/from-centralized-machine-learning-to-federated-learning-with-openfl-b3e61da52432) explaining steps to train a model with OpenFL. <br/>

## Requirements

OpenFL supports popular NumPy-based ML frameworks like TensorFlow, PyTorch and Jax which should be installed separately.<br/>
Expand Down
2 changes: 0 additions & 2 deletions openfl-workspace/torch_cnn_mnist/.workspace

This file was deleted.

212 changes: 212 additions & 0 deletions openfl-workspace/torch_cnn_mnist/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,212 @@
## Instantiating a Workspace from Torch Template
To instantiate a workspace from the torch_cnn_mnist template, you can use the fx workspace create command. This allows you to quickly set up a new workspace based on a predefined configuration and template.

1. Ensure the necessary dependencies are installed.
```
pip install virtualenv
mkdir ~/openfl-quickstart
virtualenv ~/openfl-quickstart/venv
source ~/openfl-quickstart/venv/bin/activate
pip install openfl
```
2. Creating the Workspace Folder

```
cd ~/openfl-quickstart
fx workspace create --template torch_template --prefix fl_workspace
cd ~/openfl-quickstart/fl_workspace
```

## Directory Structure
The taskrunner workspace has the following file structure:
```
taskrunner
├── requirements.txt # defines the required software packages
└── plan
├── plan.yaml # the Federated Learning plan declaration
├── cols.yaml # holds the list of authorized collaborators
├── data.yaml # holds the collaborator data set path
├── defaults # path to the default values for the FL plan
├── src
├── __init__.py # treat src as a Python package
└── cnn_model.py # centralized CNN model, ready for use in federated learning
├── dataloader.py # data loader module
└── taskrunner.py # task runner module
```

## Directory Breakdown:
* requirements.txt: Lists all the Python dependencies required to run the TaskRunner API and its components. Ensure you install these dependencies by running pip install -r requirements.txt.
* plan: Contains configuration files for federated learning:
- plan.yaml: The main Federated Learning plan declaration, defining the structure of the federated learning workflow.
- cols.yaml: A list of authorized collaborators for the federated learning task.
- data.yaml: Specifies the path to the data set for each collaborator.
- defaults: Path to the default configuration values for the federated learning plan.
* src: Contains the Python modules used for federated learning:
- init.py: Marks the src directory as a Python package, allowing you to import modules within the directory.
- cnn_model.py: Defines the Convolutional Neural Network (CNN) model for federated learning.
- dataloader.py: A module responsible for loading and processing datasets for the federated learning task.
- taskrunner.py: The core task runner module that manages the execution of federated learning tasks.

## Defining the Data Loader
The data loader in OpenFL is responsible for batching and iterating through the dataset that will be used for local training and validation on each collaborator node. The PyTorchMNISTInMemory class is responsible for batching and iterating through the MNIST data set, additionally sharded "on the fly".

To customize the PyTorchMNISTInMemory class, you need to implement the load_mnist_shard() function to process the dataset available at data_path on the local file system. The data_path parameter represents the data shard number used by the collaborator. This setup allows each collaborator to work with a specific subset of the data, facilitating distributed training.

The load_mnist_shard() function is responsible for loading the MNIST dataset, dividing it into training and validation sets, and applying necessary transformations. The data is then batched and made ready for the training process.

# Modify the dataloader to support "Bring Your Own Data"
You can either try to implement the placeholders by yourself, or get the solution from [dataloader.py](https://github.com/securefederatedai/openfl-contrib/blob/main/openfl_contrib_tutorials/ml_to_fl/federated/src/dataloader.py)
Also, update the data loader class name in plan.yaml accordingly.

```
import numpy as np
from typing import Iterator, Tuple
from openfl.federated import PyTorchTaskRunner
from openfl.utilities import Metric
import torch.optim as optim
import torch.nn.functional as F
from src.cnn_model import DigitRecognizerCNN, train_epoch, validate
class MNISTShardDataLoader(PyTorchDataLoader):
def __init__(self, data_path, batch_size, **kwargs):
super().__init__(batch_size, **kwargs)
# Load the dataset using the provided data_path and any additional kwargs.
X_train, y_train, X_valid, y_valid = load_dataset(data_path, **kwargs)
# Assign the loaded data to instance variables.
self.X_train = X_train
self.y_train = y_train
self.X_valid = X_valid
self.y_valid = y_valid
def load_dataset(data_path, train_split_ratio=0.8, **kwargs):
dataset = MNISTDataset(
root=data_path,
transform=Compose([Grayscale(num_output_channels=1), ToTensor()])
)
n_train = int(train_split_ratio * len(dataset))
n_valid = len(dataset) - n_train
ds_train, ds_val = random_split(
dataset, lengths=[n_train, n_valid], generator=manual_seed(0))
X_train, y_train = list(zip(*ds_train))
X_train, y_train = np.stack(X_train), np.array(y_train)
X_valid, y_valid = list(zip(*ds_val))
X_valid, y_valid = np.stack(X_valid), np.array(y_valid)
return X_train, y_train, X_valid, y_valid
class MNISTDataset(ImageFolder):
"""Encapsulates the MNIST dataset"""
FOLDER_NAME = "mnist_images"
DEFAULT_PATH = path.join(path.expanduser('~'), '.openfl', 'data')
def __init__(self, root: str = DEFAULT_PATH, **kwargs) -> None:
"""Initialize."""
makedirs(root, exist_ok=True)
super(MNISTDataset, self).__init__(
path.join(root, MNISTDataset.FOLDER_NAME), **kwargs)
def __getitem__(self, index):
"""Allow getting items by slice index."""
if isinstance(index, Iterable):
return [super(MNISTDataset, self).__getitem__(i) for i in index]
else:
return super(MNISTDataset, self).__getitem__(index)
```

## Defining the Task Runner
The Task Runner class defines the actual computational tasks of the FL experiment (such as local training and validation). We can implement the placeholders of the TemplateTaskRunner class (src/taskrunner.py) by importing the DigitRecognizerCNN model, as well as the train_epoch() and validate() helper functions from the centralized ML script. The template also provides placeholders for providing custom optimizer and loss function objects.

## How to run this tutorial (local simulation):
The fx plan initialize command bootstraps the workspace by first setting the initial weights of the aggregate model. It then parses the plan, updates the aggregator address if necessary, and produces a hash of the initialized plan for integrity and auditing purposes.

To help OpenFL calculate the initial model weights, we need to provide the shape of the input tensor as an additional parameter. For the MNIST data set of grayscale (single-channel) 28x28 pixel images, the input tensor shape is [1,28,28]. We will also use a locally deployed aggregator (localhost). Thus, the workspace initialization command for our local federation becomes:

```
mkdir save
fx plan initialize --input_shape [1,28,28] --aggregator_address localhost
```

We can now perform a test run with the following commands for creating a local PKI setup and starting the aggregator and the collaborators on the same machine:

```
cd ~/openfl/openfl-tutorials/taskrunner/
# This will create a local certificate authority (CA), so the participants communicate over a secure TLS Channel
fx workspace certify
#################################################################
# Step 1: Setup the Aggregator #
#################################################################
# Generate a Certificate Signing Request (CSR) for the Aggregator
fx aggregator generate-cert-request --fqdn localhost
# The CA signs the aggregator's request, which is now available in the workspace
fx aggregator certify --fqdn localhost --silent
################################
# Step 2: Setup Collaborator 1 #
################################
# Create a collaborator named "collaborator1" that will use data path "data/1"
# This command adds the collaborator1,data/1 entry in data.yaml
fx collaborator create -n collaborator1 -d 1
# Generate a CSR for collaborator1
fx collaborator generate-cert-request -n collaborator1
# The CA signs collaborator1's certificate, adding an entry to the authorized cols.yaml
fx collaborator certify -n collaborator1 --silent
################################
# Step 3: Setup Collaborator 2 #
################################
# Create a collaborator named "collaborator2" that will use data path "data/2"
# This command adds the collaborator2,data/2 entry in data.yaml
fx collaborator create -n collaborator2 -d 2
# Generate a CSR for collaborator2
fx collaborator generate-cert-request -n collaborator2
# The CA signs collaborator2's certificate, adding an entry to the authorized cols.yaml
fx collaborator certify -n collaborator2 --silent
##############################
# Step 4. Run the Federation #
##############################
fx aggregator start & fx collaborator start -n collaborator1 & fx collaborator start -n collaborator2
```

A successful local simulation of the FL workspace involves the aggregator and collaborators completing a round of training, saving the best-performing model under save/best.pbuf, and exiting with a unanimous “End of Federation reached…”:

## Sample output
```
INFO Round: 1, Collaborators that have completed all tasks: ['collaborator2', 'collaborator1']
METRIC {'metric_origin': 'aggregator', 'task_name': 'aggregated_model_validation', 'metric_name': 'accuracy', 'metric_value':
0.8915090382660382, 'round': 1}
METRIC Round 1: saved the best model with score 0.891509
METRIC {'metric_origin': 'aggregator', 'task_name': 'train', 'metric_name': 'training loss', 'metric_value': 0.2952194180338876,
'round': 1}
METRIC {'metric_origin': 'aggregator', 'task_name': 'locally_tuned_model_validation', 'metric_name': 'accuracy', 'metric_value':
0.9181734901767464, 'round': 1}
INFO Saving round 1 model...
INFO Experiment Completed. Cleaning up...
INFO Waiting for tasks...
INFO Sending signal to collaborator collaborator1 to shutdown...
INFO End of Federation reached. Exiting...
INFO Waiting for tasks...
INFO Sending signal to collaborator collaborator2 to shutdown...
INFO End of Federation reached. Exiting...
```
5 changes: 3 additions & 2 deletions openfl-workspace/torch_cnn_mnist/plan/cols.yaml
Original file line number Diff line number Diff line change
@@ -1,5 +1,6 @@
# Copyright (C) 2020-2021 Intel Corporation
# Copyright (C) 2020-2024 Intel Corporation
# Licensed subject to the terms of the separately executed evaluation license agreement between Intel Corporation and you.

collaborators:

- collaborator1
- collaborator2
11 changes: 4 additions & 7 deletions openfl-workspace/torch_cnn_mnist/plan/data.yaml
Original file line number Diff line number Diff line change
@@ -1,9 +1,6 @@
## Copyright (C) 2020-2021 Intel Corporation
# Copyright (C) 2020-2024 Intel Corporation
# Licensed subject to the terms of the separately executed evaluation license agreement between Intel Corporation and you.

# all keys under 'collaborators' corresponds to a specific colaborator name the corresponding dictionary has data_name, data_path pairs.
# Note that in the mnist case we do not store the data locally, and the data_path is used to pass an integer that helps the data object
# construct the shard of the mnist dataset to be use for this collaborator.

# collaborator_name ,data_directory_path
one,1
# collaborator_name,data_directory_path
collaborator1,1
collaborator2,2
3 changes: 1 addition & 2 deletions openfl-workspace/torch_cnn_mnist/plan/defaults
Original file line number Diff line number Diff line change
@@ -1,2 +1 @@
../../workspace/plan/defaults

../../workspace/plan/defaults
110 changes: 67 additions & 43 deletions openfl-workspace/torch_cnn_mnist/plan/plan.yaml
Original file line number Diff line number Diff line change
@@ -1,45 +1,69 @@
# Copyright (C) 2020-2021 Intel Corporation
# Copyright (C) 2020-2024 Intel Corporation
# Licensed subject to the terms of the separately executed evaluation license agreement between Intel Corporation and you.

aggregator :
defaults : plan/defaults/aggregator.yaml
template : openfl.component.Aggregator
settings :
init_state_path : save/torch_cnn_mnist_init.pbuf
best_state_path : save/torch_cnn_mnist_best.pbuf
last_state_path : save/torch_cnn_mnist_last.pbuf
rounds_to_train : 10
log_metric_callback :
template : src.utils.write_metric


collaborator :
defaults : plan/defaults/collaborator.yaml
template : openfl.component.Collaborator
settings :
delta_updates : false
opt_treatment : RESET

data_loader :
defaults : plan/defaults/data_loader.yaml
template : src.dataloader.PyTorchMNISTInMemory
settings :
collaborator_count : 2
data_group_name : mnist
batch_size : 256

task_runner :
defaults : plan/defaults/task_runner.yaml
template : src.taskrunner.PyTorchCNN

network :
defaults : plan/defaults/network.yaml

assigner :
defaults : plan/defaults/assigner.yaml

tasks :
defaults : plan/defaults/tasks_torch.yaml

compression_pipeline :
defaults : plan/defaults/compression_pipeline.yaml
aggregator:
settings:
best_state_path: save/best.pbuf
db_store_rounds: 2
init_state_path: save/init.pbuf
last_state_path: save/last.pbuf
rounds_to_train: 2
write_logs: false
template: openfl.component.aggregator.Aggregator
assigner:
settings:
task_groups:
- name: train_and_validate
percentage: 1.0
tasks:
- aggregated_model_validation
- train
- locally_tuned_model_validation
template: openfl.component.RandomGroupedAssigner
collaborator:
settings:
db_store_rounds: 1
delta_updates: false
opt_treatment: RESET
template: openfl.component.collaborator.Collaborator
compression_pipeline:
settings: {}
template: openfl.pipelines.NoCompressionPipeline
data_loader:
settings:
batch_size: 64
collaborator_count: 2
template: src.dataloader.PyTorchMNISTInMemory
network:
settings:
agg_addr: localhost
agg_port: 59583
cert_folder: cert
client_reconnect_interval: 5
require_client_auth: true
hash_salt: auto
use_tls: true
template: openfl.federation.Network
task_runner:
settings: {}
template: src.taskrunner.TemplateTaskRunner
tasks:
aggregated_model_validation:
function: validate_task
kwargs:
apply: global
metrics:
- acc
locally_tuned_model_validation:
function: validate_task
kwargs:
apply: local
metrics:
- acc
settings: {}
train:
function: train_task
kwargs:
epochs: 1
metrics:
- loss
2 changes: 0 additions & 2 deletions openfl-workspace/torch_cnn_mnist/requirements.txt
Original file line number Diff line number Diff line change
@@ -1,4 +1,2 @@
tensorboard
torch==2.3.1
torchvision==0.18.1
wheel>=0.38.0 # not directly required, pinned by Snyk to avoid a vulnerability
4 changes: 2 additions & 2 deletions openfl-workspace/torch_cnn_mnist/src/__init__.py
Original file line number Diff line number Diff line change
@@ -1,3 +1,3 @@
# Copyright (C) 2020-2021 Intel Corporation
# Copyright (C) 2020-2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
"""You may copy this file as the starting point of your own model."""
"""You may copy this file as the starting point of your own model."""
Loading

0 comments on commit f622efa

Please sign in to comment.