Clean up deployment
JMGaljaard committed Sep 4, 2022
1 parent 1d19c29 commit 969dd5c
Showing 5 changed files with 60 additions and 24 deletions.
57 changes: 48 additions & 9 deletions README.md
@@ -1,8 +1,8 @@
# Kubernetes - Federation Learning Toolkit ((K)FLTK)
[![License](https://img.shields.io/badge/license-BSD-blue.svg)](LICENSE)
[![Python 3.6](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-370/)
[![Python 3.6](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)
[![Python 3.6](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)
[![Python 3.7](https://img.shields.io/badge/python-3.7-blue.svg)](https://www.python.org/downloads/release/python-370/)
[![Python 3.8](https://img.shields.io/badge/python-3.8-blue.svg)](https://www.python.org/downloads/release/python-380/)
[![Python 3.9](https://img.shields.io/badge/python-3.9-blue.svg)](https://www.python.org/downloads/release/python-390/)

This toolkit can be used to run Distributed and Federated experiments. This project makes use of
Pytorch Distributed (Data Parallel) ([docs](https://pytorch.org/tutorials/beginner/dist_overview.html))
@@ -14,7 +14,7 @@ This project builds on the work by Bart Cox, on the Federated Learning toolkit d
Docker Compose ([repo](https://github.com/bacox/fltk))


This project is tested with Ubuntu 20.04 and Arch Linux and Python {3.7, 3.8, 3.9}.
This project is tested with Ubuntu 20.04, Arch Linux, and macOS, using Python {3.7, 3.8, 3.9}. Python 3.9 is recommended.

## Global idea
Pytorch Distributed works based on a `world_size` and `rank`s. The ranks should be between `0` and `world_size-1`.
@@ -32,7 +32,6 @@ extension of the project is planned to implement a `FederatedClient` that allows
2. Clients prepare needed data and model and synchronize using PyTorch Distributed.
1. `WORLD_SIZE = 1`: Client performs training locally.
2. `WORLD_SIZE > 1`: Clients run epochs with DistributedDataParallel together.
3. (FUTURE: ) Your federated learning experiment.
3. Client logs/reports progress during and after training.
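
As a reading aid for the steps above, here is a minimal Python sketch of the rank and `WORLD_SIZE` rules. The helper name `training_mode` is hypothetical and not part of FLTK; it only mirrors the stated constraints (ranks lie between `0` and `world_size-1`, and a world size of 1 means local training).

```python
def training_mode(rank: int, world_size: int) -> str:
    """Return how a client would train for a given rank and world size.

    Hypothetical helper for illustration only; not taken from the FLTK code base.
    """
    if world_size < 1:
        raise ValueError("world_size must be at least 1")
    # Ranks should be between 0 and world_size - 1 (inclusive).
    if not 0 <= rank <= world_size - 1:
        raise ValueError(f"rank {rank} is outside [0, {world_size - 1}]")
    # WORLD_SIZE = 1: the client trains locally.
    # WORLD_SIZE > 1: clients run epochs together with DistributedDataParallel.
    return "local" if world_size == 1 else "distributed"
```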

**Important notes:**
@@ -82,6 +81,8 @@ Structure with important folders and files explained:

```
project
├── terraform # Contains terraform charts for deployment on GKE
├── jupyter # Contains jupyter notebook files for setup and loading tensorboard files
├── charts # Templates for deploying projects with Helm
│ ├── extractor - Template for 'extractor' for centralized logging (using NFS)
│ └── orchestrator - Template for 'orchestrator' for launching distributed experiments
@@ -102,7 +103,7 @@ project
```

## Execution modes
Federatd Learning experiments can be set up in various ways (Simulation, Emulation, or fully distributed). Not all have the same requirements and thus some setup are more suited then others depending on the experiment.
Federated Learning experiments can be set up in various ways (Simulation, Emulation, or fully distributed). Not all have the same requirements, and thus some setups are more suited than others depending on the experiment.

### Simulation
With this method, a single machine is used to execute all the different nodes in the system.
@@ -143,14 +144,15 @@ The following tools need to be set up in your development environment before wor
* Docker ([docs](https://www.docker.com/get-started)) (with support for BuildKit [docs](https://docs.docker.com/develop/develop-images/build_enhancements/))
* Kubectl ([docs](https://kubernetes.io/docs/setup/))
* Helm ([docs](https://helm.sh/docs/chart_template_guide/getting_started/))
* Kustomize (3.2.0) ([docs](https://kubectl.docs.kubernetes.io/installation/kustomize/))
* (Terraform installation) Terraform
* (Manual installation) Kustomize (3.2.0) ([docs](https://kubectl.docs.kubernetes.io/installation/kustomize/))
* Local execution (single machine):
* MiniKube ([docs](https://minikube.sigs.k8s.io/docs/start/))
* It must be noted that certain functionality might require additional steps to work on MiniKube. This is currently untested.
* Google Cloud Environment (GKE) execution:
* GCloud SDK ([docs](https://cloud.google.com/sdk/docs/quickstart))
* Your own cluster provider:
* A Kubernetes cluster supporting Kubernetes 1.16+.
* A Kubernetes cluster supporting Kubernetes `>1.15,<=1.22`.
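
One way to read the `>1.15,<=1.22` constraint is sketched below in Python. This is an assumption-laden illustration: the helper names are hypothetical, and the sketch compares only the major and minor components, so any `1.22.x` patch release counts as supported. The README does not say how patch versions are treated.

```python
from typing import Tuple


def parse_major_minor(version: str) -> Tuple[int, int]:
    """Extract (major, minor) from strings such as 'v1.21.5' or '1.22.3-gke.100'."""
    parts = version.lstrip("v").split(".")
    return int(parts[0]), int(parts[1])


def is_supported(version: str) -> bool:
    """Check `>1.15,<=1.22` on (major, minor) only (assumption: patch is ignored)."""
    return (1, 15) < parse_major_minor(version) <= (1, 22)
```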

## Getting started

@@ -165,7 +167,44 @@ To download the models, execute the following command from the [project root](.)
python3 -m fltk extractor ./configs/example_cloud_experiment.json
```

## Deployment
## Deployment (Terraform)
To set up the test-bed using Terraform, follow the steps described in
[`jupyter/terraform_notebook.ipynb`](jupyter/terraform_notebook.ipynb).

### Prerequisites

Before starting the jupyter notebook server locally, make sure to have the following dependencies installed.
We will create a virtual environment capable of running a jupyter notebook server with a `bash_kernel`.

For Windows users, make sure to run the following commands in a `bash` capable terminal, e.g. using
Windows Subsystem for Linux (WSL).


```bash
python3 -m venv venv-jupyter
source venv-jupyter/bin/activate

# Install python dependencies for running the notebook
pip3 install jupyter ipython bash_kernel
# Install bash kernel to use for the notebook
python3 -m bash_kernel.install
```

When running the notebook (through an IDE or browser), make sure to set the kernel to the freshly installed
`bash_kernel`. Otherwise, the cells will be run as Python code.

### Running the notebook

To start working in the notebook, run the following command in a bash shell, and follow the steps in the notebook.

```bash
cd jupyter
jupyter notebook
```

Click the link displayed in the output (by default `localhost:8888`) and open the Terraform notebook.

## Deployment (Manual)

This deployment guide will provide the general process of deploying an example deployment on
the created cluster. It is assumed that you have already set up a cluster (or emulation tool like MiniKube to execute the
1 change: 1 addition & 0 deletions charts/fltk-values.yaml
@@ -1,4 +1,5 @@
fltk:
outputDir: output
configDir: config
workDir: /opt/federation-lab
provider:
2 changes: 1 addition & 1 deletion charts/orchestrator/templates/fl-server-pod.yaml
@@ -30,7 +30,7 @@ spec:
memory: {{ (.Values.orchestrator.memory | int) }}
volumeMounts:
- name: fl-server-log-volume
mountPath: {{ .Values.fltk.workDir }}/output
mountPath: {{ .Values.fltk.workDir }}/{{ .Values.fltk.outputDir }}
readOnly: true
- name: fltk-orchestrator-config-volume
mountPath: {{ .Values.fltk.workDir }}/{{ .Values.fltk.configDir }}
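
The effect of replacing the hard-coded `output` segment with `.Values.fltk.outputDir` can be sketched in Python. This is only an illustration of the string substitution, not of Helm itself: `mount_path` is a hypothetical helper, and the nesting of the keys under `fltk` is assumed from the template expressions, since indentation is lost in this view.

```python
# Values as they appear in charts/fltk-values.yaml after this commit
# (nesting under "fltk" is assumed from `.Values.fltk.*` in the template).
values = {
    "fltk": {
        "outputDir": "output",
        "configDir": "config",
        "workDir": "/opt/federation-lab",
    }
}


def mount_path(values: dict, dir_key: str) -> str:
    """Mimic `{{ .Values.fltk.workDir }}/{{ .Values.fltk.<dir_key> }}`."""
    fltk = values["fltk"]
    return f'{fltk["workDir"]}/{fltk[dir_key]}'
```

With the default values, `mount_path(values, "outputDir")` gives the same path as the previously hard-coded `/opt/federation-lab/output`, so the change only matters when `outputDir` is overridden.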
11 changes: 4 additions & 7 deletions jupyter/terraform_notebook.ipynb
@@ -10,6 +10,8 @@
"\n",
"Make sure to install a recent version of each of the dependencies.\n",
"\n",
"\n",
" * (Windows only) It is strongly recommended to install every dependency in a Windows Subsystem for Linux shell. For installation refer to [here](https://docs.microsoft.com/en-us/windows/wsl/install).\n",
" * GCloud SDK\n",
" - Follow the installation instructions [here](https://cloud.google.com/sdk/docs/install)\n",
" - Initialize the SDK with `gcloud init`\n",
@@ -20,15 +22,10 @@
" * Kubectl\n",
" * Helm\n",
" * Terraform\n",
" * (Windows o\n",
" * Python3.9\n",
" * Jupyter\n",
"```bash\n",
"pip3 install jupyter\n",
"```\n",
" * bash_kernel\n",
" * Jupyter, ipython, bash_kernel\n",
"```bash\n",
"pip3 install bash_kernel\n",
"pip3 install jupyter ipython bash_kernel\n",
"python3 -m bash_kernel.install\n",
"```\n",
"\n",
13 changes: 6 additions & 7 deletions terraform/terraform-gke/main.tf
@@ -5,7 +5,7 @@ module "gke" {
source = "terraform-google-modules/kubernetes-engine/google"
project_id = var.project_id
name = var.cluster_name
# Create a ZONAL cluster, dissallowing the cluster to span multiple regions in a zone.
# Create a ZONAL cluster, disallowing the cluster to span multiple zones in a region.
# Alternatively, for scheduling cross-regions, utilize `zone` and `regions` instead of `regional` and `region`
regional = false
region = var.project_region
@@ -23,7 +23,7 @@ module "gke" {
kubernetes_version = var.kubernetes_version


node_pools = [
node_pools = [
{
name = "default-node-pool"
machine_type = "e2-medium"
@@ -94,14 +94,13 @@ module "gke" {

node_pools_taints = {
all = []

default-node-pool = []
default-node-pool = [] # Default nodepool that will contain all the other pods

medium-fltk-pool-1 = [
{
key = "medium-fltk-pool-1"
value = true
effect = "PREFER_NO_SCHEDULE"
key = "fltk.node" # Taint is used in fltk pods
value = "medium-e2" # In case more explicit matching is required
effect = "PREFER_NO_SCHEDULE" # Other Pods are preferably not scheduled on this pool
},
]
}
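
The new taint (`fltk.node` = `medium-e2`) only helps if the FLTK pods carry a matching toleration. Those tolerations are not part of this diff, so the Python sketch below uses hypothetical toleration dictionaries and simply restates the standard Kubernetes matching rule. It assumes the GKE effect `PREFER_NO_SCHEDULE` appears on the node as `PreferNoSchedule`.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a toleration matches a taint under Kubernetes matching rules."""
    operator = toleration.get("operator", "Equal")  # "Equal" is the default operator
    key = toleration.get("key", "")
    effect = toleration.get("effect", "")
    # An empty effect matches every effect; otherwise the effects must be equal.
    if effect and effect != taint["effect"]:
        return False
    # An empty key is only meaningful with "Exists" and then matches every key.
    if not key:
        return operator == "Exists"
    if key != taint["key"]:
        return False
    if operator == "Exists":
        return True
    # "Equal": the values must be identical.
    return toleration.get("value", "") == taint.get("value", "")


# The taint from this commit, written in Kubernetes node terms (assumed mapping).
fltk_taint = {"key": "fltk.node", "value": "medium-e2", "effect": "PreferNoSchedule"}
```

Because the effect is a preference rather than a hard rule, pods without a matching toleration may still land on this pool when no other node fits.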
