Add second tutorial session material
JMGaljaard committed Sep 18, 2022
1 parent c279b03 commit 99ad6d0
Showing 2 changed files with 284 additions and 5 deletions.
279 changes: 279 additions & 0 deletions jupyter/experiment_notebook.ipynb
@@ -0,0 +1,279 @@
{
"cells": [
{
"cell_type": "markdown",
"source": [
"# Getting started\n",
"\n",
"First, we enable the cluster to scale up. Note that if you run an auto-scaling cluster,\n",
"Google will suspend your nodes. Make sure to have the experiment prepared before running the commands.\n",
"\n",
"The following is assumed ready:\n",
"* GKE/Kubernetes cluster (see also `terraform/terraform_notebook.ipynb`)\n",
"    * 2 node pools (default for system & dependencies, experiment pool)\n",
"* Docker image (including dataset, to speed up starting experiments).\n",
"    * First run the extractor (locally) `python3 -m extractor configs/example_cloud_experiment.json`\n",
"        * This downloads datasets to be included in the docker image.\n",
"    * Build the container `DOCKER_BUILDKIT=1 docker build --platform linux/amd64 . --tag gcr.io/$PROJECT_ID/fltk`\n",
"    * Push to your gcr.io repository `docker push gcr.io/$PROJECT_ID/fltk`\n",
"\n",
"\n",
"With that setup, first set some variables used throughout the experiment.\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"PROJECT_ID=\"test-bed-fltk\"\n",
"CLUSTER_NAME=\"fltk-testbed-cluster\"\n",
"DEFAULT_POOL=\"default-node-pool\"\n",
"EXPERIMENT_POOL=\"medium-fltk-pool-1\"\n",
"REGION=\"us-central1-c\"\n",
"\n",
"# In case we do not yet have the credentials/kubeconfig\n",
"gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"Scale up the default node pool and the experiment node pool."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# These commands might take a while to complete.\n",
"gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \\\n",
"    --num-nodes 2 --region $REGION --quiet\n",
"\n",
"gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \\\n",
"    --num-nodes 3 --region $REGION --quiet"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Preparation\n",
"If you have already tested something or run another experiment, the existing Orchestrator deployment has to be removed first. This will not delete any experiment data, as that data persists on one of the ReadWriteMany PVCs.\n",
"\n",
"\n",
"Currently, the Orchestrator is deployed using a `Deployment` definition; a future version will replace this with a `Job` definition, to make this step unnecessary. For experiments this means the following:\n",
"\n",
"1. Only a single deployment can exist at a time in a single namespace. This includes 'completed' experiments.\n",
"2. For running batches of experiments, a BatchOrchestrator is provided.\n",
"\n",
"\n",
"ℹ️ This will not remove any data, but if your orchestrator is still/already running experiments, this will stop the deployment. Running training jobs will not be stopped; to stop them, use `kubectl`. ConfigMaps created by the Orchestrator (to provide experiment configurations) will not be removed either. See the commented code in the cell below."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# If you want to delete all PyTorch training jobs, uncomment the command below.\n",
"# kubectl delete pytorchjobs.kubeflow.org --all --namespace test\n",
"\n",
"# If you want to delete all existing ConfigMap objects in a namespace, uncomment the command below.\n",
"# kubectl delete configmaps --all --namespace test\n",
"\n",
"helm uninstall -n test flearner"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Define experiment configuration files\n",
"\n",
"Deployment of experiments is currently done through a Helm Deployment. A future release (™️) will rework this to a Job definition, as this makes it easier to reuse the template.\n",
"\n",
"* `EXPERIMENT_FILE` contains the description of the experiments.\n",
"* `CLUSTER_CONFIG` contains shared configurations for logging, Orchestrator configuration and replication information."
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"EXPERIMENT_FILE=\"configs/federated_tasks/example_arrival_config.json\"\n",
"CLUSTER_CONFIG=\"configs/example_cloud_experiment.json\""
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"## Set up experiment variables\n",
"Next, we will deploy the experiments.\n",
"\n",
"\n",
"We provide a configuration file, `charts/fltk-values.yaml`. In this file, change the values under the `provider` block: set `projectName` to your Google Cloud Project ID.\n",
"\n",
"```yaml\n",
"provider:\n",
"  domain: gcr.io\n",
"  projectName: CHANGE_ME!\n",
"  imageName: fltk:latest\n",
"```\n",
"\n",
"We use the `--set-file` flag for `helm` because Helm currently does not support using files outside of the chart root directory (in this case `charts/orchestrator`). Using `--set-file`, we can provide these files dynamically. See also [this issue](https://github.com/helm/helm/issues/3276).\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# Remove a previous release first; this reports an error if no such release exists.\n",
"helm uninstall experiment-orchestrator -n test\n",
"helm install experiment-orchestrator charts/orchestrator --namespace test -f charts/fltk-values.yaml \\\n",
"    --set-file orchestrator.experiment=$EXPERIMENT_FILE,orchestrator.configuration=$CLUSTER_CONFIG\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# To get logs from the orchestrator\n",
"kubectl logs -n test fl-learner"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"# To get logs from a learner's master pod (the job ID below is an example; use your own)\n",
"kubectl logs -n test trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-master-0\n",
"\n",
"# To get logs from a worker pod (federated learning)\n",
"kubectl logs -n test trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-worker-0"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
},
{
"cell_type": "markdown",
"source": [
"# Wrapping up\n",
"\n",
"To scale down the cluster node pools, run the cell below.\n"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%% md\n"
}
}
},
{
"cell_type": "code",
"execution_count": null,
"outputs": [],
"source": [
"gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \\\n",
"    --num-nodes 0 --region $REGION --quiet\n",
"\n",
"gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \\\n",
"    --num-nodes 0 --region $REGION --quiet"
],
"metadata": {
"collapsed": false,
"pycharm": {
"name": "#%%\n"
}
}
}
],
"metadata": {
"title": "Experiment deployment",
"kernelspec": {
"display_name": "Bash",
"language": "bash",
"name": "bash"
},
"language_info": {
"codemirror_mode": "shell",
"file_extension": ".sh",
"mimetype": "text/x-sh",
"name": "bash"
}
},
"nbformat": 4,
"nbformat_minor": 1
}
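
The scale-up and scale-down cells in the notebook above repeat the same `gcloud container clusters resize` call with different node counts. A minimal sketch of a wrapper follows, assuming `CLUSTER_NAME` and `REGION` are set as in the notebook's first code cell. The function name `resize_pool` and the `DRY_RUN` switch are hypothetical and are not part of the repository.

```shell
# Hypothetical helper (not part of the repository): resize one node pool.
# Usage: resize_pool <pool-name> <node-count>
# Assumes CLUSTER_NAME and REGION are set as in the notebook's first code cell.
resize_pool() {
  if [ "$#" -ne 2 ]; then
    echo "usage: resize_pool <pool-name> <node-count>" >&2
    return 1
  fi
  # Fail early instead of calling gcloud with an empty argument.
  if [ -z "${CLUSTER_NAME:-}" ] || [ -z "${REGION:-}" ]; then
    echo "CLUSTER_NAME and REGION must be set" >&2
    return 1
  fi
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Only print the command, to check the values before touching the cluster.
    echo "gcloud container clusters resize $CLUSTER_NAME --node-pool $1 --num-nodes $2 --region $REGION --quiet"
  else
    gcloud container clusters resize "$CLUSTER_NAME" --node-pool "$1" \
      --num-nodes "$2" --region "$REGION" --quiet
  fi
}
```

With this sketch, `DRY_RUN=1 resize_pool "$EXPERIMENT_POOL" 3` prints the command that the scale-up cell runs for the experiment pool, and `resize_pool "$EXPERIMENT_POOL" 0` corresponds to the scale-down cell.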
10 changes: 5 additions & 5 deletions terraform/terraform-gke/main.tf
@@ -29,8 +29,8 @@ module "gke" {
machine_type = "e2-medium"
node_locations = "us-central1-c"
auto_scaling = false
node_count = 3
min_count = 0
max_count = 1
local_ssd_count = 0
spot = false
disk_size_gb = 64
@@ -42,15 +42,15 @@ module "gke" {
auto_upgrade = true
service_account = local.terraform_service_account
preemptible = false
- initial_node_count = 1
+ initial_node_count = 0
},
{
name = "medium-fltk-pool-1"
- machine_type = "e2-medium"
+ machine_type = "e2-highcpu-8"
node_locations = "us-central1-c"
- auto_scaling = false
+ auto_scaling = false # Make sure to set min/max count if you change this
node_count = 4
min_count = 0
max_count = 1
local_ssd_count = 0
spot = false
disk_size_gb = 64