Add second tutorial session material

JMGaljaard · Sep 18, 2022 · 99ad6d0
1 parent c279b03
commit 99ad6d0
Showing 2 changed files with 284 additions and 5 deletions.
diff --git a/jupyter/experiment_notebook.ipynb b/jupyter/experiment_notebook.ipynb
@@ -0,0 +1,279 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Getting started\n",
+    "\n",
+    "First, we enable the cluster to scale up. Note that if you run an auto-scaling cluster,\n",
+    "Google will suspend your nodes. Make sure to have the experiment prepared before running the commands.\n",
+    "\n",
+    "The following is assumed ready:\n",
+    "* GKE/Kubernetes cluster (see also `terraform/terraform_notebook.ipynb`)\n",
+    "    * 2 node pools (a default pool for system & dependencies, and an experiment pool)\n",
+    "* Docker image (including the dataset, to speed up starting experiments).\n",
+    "    * First run the extractor (locally) `python3 -m extractor configs/example_cloud_experiment.json`\n",
+    "        *  This downloads datasets to be included in the docker image.\n",
+    "    * Build the container `DOCKER_BUILDKIT=1 docker build --platform linux/amd64 . --tag gcr.io/$PROJECT_ID/fltk`\n",
+    "    * Push to your gcr.io repository `docker push gcr.io/$PROJECT_ID/fltk`\n",
+    "\n",
+    "\n",
+    "With that setup, first set some variables used throughout the experiment.\n"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "PROJECT_ID=\"test-bed-fltk\"\n",
+    "CLUSTER_NAME=\"fltk-testbed-cluster\"\n",
+    "DEFAULT_POOL=\"default-node-pool\"\n",
+    "EXPERIMENT_POOL=\"medium-fltk-pool-1\"\n",
+    "REGION=\"us-central1-c\"\n",
+    "\n",
+    "# In case we do not yet have the credentials/kubeconfig\n",
+    "gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "Scale up the default and experiment node pools."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# These commands might take a while to complete.\n",
+    "gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \\\n",
+    "     --num-nodes 2 --region $REGION --quiet\n",
+    "\n",
+    "gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \\\n",
+    "    --num-nodes 3 --region $REGION --quiet"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Preparation\n",
+    "If you have already tested something or run another experiment, the existing deployment of the Orchestrator has to be removed first. This will not delete any experiment data, as it persists on one of the ReadWriteMany PVCs.\n",
+    "\n",
+    "\n",
+    "Currently, the Orchestrator is deployed using a `Deployment` definition; a future version will replace this with a `Job` definition, making this step unnecessary. For experiments this means the following:\n",
+    "\n",
+    "1. Only a single deployment can exist at a time in a single namespace. This includes 'completed' experiments.\n",
+    "2. For running batches of experiments, a BatchOrchestrator is provided.\n",
+    "\n",
+    "\n",
+    "ℹ️ This will not remove any data, but if your orchestrator is still/already running experiments, this will stop the deployment. Running training jobs will not be stopped; use `kubectl` to stop them. ConfigMaps created by the Orchestrator (to provide experiment configurations) will not be removed either. See the commented code in the cell below."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# If you want to delete all pytorch trainjobs, uncomment the command below.\n",
+    "# kubectl delete pytorchjobs.kubeflow.org --all --namespace test\n",
+    "\n",
+    "# If you want to delete all existing configuration map objects in a namespace, uncomment the command below.\n",
+    "# kubectl delete configmaps --all --namespace test\n",
+    "\n",
+    "helm uninstall -n test flearner"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Define experiment configuration files\n",
+    "\n",
+    "Experiments are currently deployed through a Helm Deployment. A future release (™️) will rework this into a Job definition, as this makes the template easier to reuse.\n",
+    "\n",
+    "The `EXPERIMENT_FILE` will contain the description of the experiments.\n",
+    "The `CLUSTER_CONFIG` will contain shared configuration for logging, the Orchestrator, and replication information."
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "EXPERIMENT_FILE=\"configs/federated_tasks/example_arrival_config.json\"\n",
+    "CLUSTER_CONFIG=\"configs/example_cloud_experiment.json\""
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "## Setup experiment variables\n",
+    "Next, we will deploy the experiments.\n",
+    "\n",
+    "\n",
+    "We provide a configuration file, `charts/fltk-values.yaml`. In this file, change the values under the `provider` block: set `projectName` to your Google Cloud Project ID.\n",
+    "\n",
+    "```yaml\n",
+    "provider:\n",
+    "    domain: gcr.io\n",
+    "    projectName: CHANGE_ME!\n",
+    "    imageName: fltk:latest\n",
+    "```\n",
+    "\n",
+    "We use the `--set-file` flag for `helm`, as Helm currently does not support using files outside of the chart root directory (in this case `charts/orchestrator`). Using `--set-file`, we can provide these files dynamically. See also [this Helm issue](https://github.com/helm/helm/issues/3276).\n"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "helm uninstall experiment-orchestrator -n test\n",
+    "helm install experiment-orchestrator charts/orchestrator --namespace test -f charts/fltk-values.yaml \\\n",
+    "  --set-file orchestrator.experiment=$EXPERIMENT_FILE,orchestrator.configuration=$CLUSTER_CONFIG\n"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# To get logs from the orchestrator\n",
+    "kubectl logs -n test fl-learner"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# To get logs from learners (example)\n",
+    "kubectl logs -n test trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-master-0\n",
+    "\n",
+    "# To get logs from learners (federated learning)\n",
+    "kubectl logs -n test trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-worker-0"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
+  },
+  {
+   "cell_type": "markdown",
+   "source": [
+    "# Wrapping up\n",
+    "\n",
+    "To scale down the cluster node pools, run the cell below.\n"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%% md\n"
+    }
+   }
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \\\n",
+    "     --num-nodes 0 --region $REGION --quiet\n",
+    "\n",
+    "gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \\\n",
+    "    --num-nodes 0 --region $REGION --quiet"
+   ],
+   "metadata": {
+    "collapsed": false,
+    "pycharm": {
+     "name": "#%%\n"
+    }
+   }
+  }
+ ],
+ "metadata": {
+  "title": "Experiment deployment",
+  "kernelspec": {
+   "display_name": "Bash",
+   "language": "bash",
+   "name": "bash"
+  },
+  "language_info": {
+   "codemirror_mode": "shell",
+   "file_extension": ".sh",
+   "mimetype": "text/x-sh",
+   "name": "bash"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 1
+}
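
Note on `jupyter/experiment_notebook.ipynb`: the deployment cell uninstalls the existing release before installing the new one, so it may help to check first that both files passed through `--set-file` exist. Below is a minimal sketch of such a pre-flight check, assuming the Bash kernel declared in the notebook metadata. `require_files` and `resize_cmd` are hypothetical helper names and are not part of this commit. `resize_cmd` only prints the `gcloud` command used in the notebook and does not run it.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of the repository): return non-zero when any
# of the given paths is missing or empty, e.g. because a variable was not set.
require_files() {
  local f
  for f in "$@"; do
    if [ ! -s "$f" ]; then
      echo "missing or empty file: '$f'" >&2
      return 1
    fi
  done
}

# Hypothetical helper: print the resize command for a node pool so it can be
# reviewed before it is executed. Arguments: cluster, pool, node count, location.
resize_cmd() {
  local cluster="$1" pool="$2" nodes="$3" location="$4"
  printf 'gcloud container clusters resize %s --node-pool %s --num-nodes %s --region %s --quiet\n' \
    "$cluster" "$pool" "$nodes" "$location"
}
```

In the deployment cell, this could be used as `require_files "$EXPERIMENT_FILE" "$CLUSTER_CONFIG"` before the `helm uninstall` line, so that a missing configuration file is caught while the previous release is still installed.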
diff --git a/terraform/terraform-gke/main.tf b/terraform/terraform-gke/main.tf
@@ -29,8 +29,8 @@ module "gke" {
       machine_type       = "e2-medium"
       node_locations     = "us-central1-c"
       auto_scaling       = false
+      node_count         = 3
       min_count          = 0
-      max_count          = 1
       local_ssd_count    = 0
       spot               = false
       disk_size_gb       = 64
@@ -42,15 +42,15 @@ module "gke" {
       auto_upgrade       = true
       service_account    = local.terraform_service_account
       preemptible        = false
-      initial_node_count = 1
+      initial_node_count = 0
     },
     {
       name               = "medium-fltk-pool-1"
-      machine_type       = "e2-medium"
+      machine_type       = "e2-highcpu-8"
       node_locations     = "us-central1-c"
-      auto_scaling       = false
+      auto_scaling       = false              # Make sure to set min/max count if you change this
+      node_count         = 4
       min_count          = 0
-      max_count          = 1
       local_ssd_count    = 0
       spot               = false
       disk_size_gb       = 64