diff --git a/jupyter/experiment_notebook.ipynb b/jupyter/experiment_notebook.ipynb index 98af269b..d0520883 100644 --- a/jupyter/experiment_notebook.ipynb +++ b/jupyter/experiment_notebook.ipynb @@ -2,6 +2,11 @@ "cells": [ { "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, "source": [ "# Getting started\n", "\n", @@ -12,24 +17,25 @@ "* GKE/Kubernetes cluster (see also `terraform/terraform_notebook.ipynb`)\n", " * 2 nodes pools (default for system & dependencies, experiment pool)\n", "* Docker image (including dataset, to speed-up starting experiments).\n", + " * Within a BASH shell\n", + " * Make sure the packages in `requirements-cpu.txt` (or `requirements-gpu.txt`) are installed in a virtual environment (venv/conda). You can run `pip3 install -r requirements-cpu.txt`.\n", " * First run the extractor (locally) `python3 -m extractor configs/example_cloud_experiment.json`\n", " * This downloads datasets to be included in the docker image.\n", - " * Build the container `DOCKER_BUILDKIT=1 docker build --platform linux/amd64 . --tag gcr.io/$PROJECT_ID/fltk`\n", + " * Build the container `DOCKER_BUILDKIT=1 docker build --platform linux/amd64 . --tag gcr.io/\\$PROJECT_ID/fltk`\n", " * Push to your gcr.io repository `docker push gcr.io/$PROJECT_ID/fltk`\n", "\n", "\n", "With that setup, first set some variables used throughout the experiment.\n" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } + ] }, { "cell_type": "code", "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, "outputs": [], "source": [ "PROJECT_ID=\"test-bed-fltk\"\n", @@ -40,29 +46,27 @@ "\n", "# In case we do not yet have the credentials/kubeconfig\n", "gcloud container clusters get-credentials $CLUSTER_NAME --region $REGION --project $PROJECT_ID" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } + ] }, { "cell_type": "markdown", - "source": [ - "Scale the default-node-pool up."
- ], "metadata": { - "collapsed": false, "pycharm": { "name": "#%% md\n" } - } + }, + "source": [ + "Scale the default-node-pool up." + ] }, { "cell_type": "code", "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, "outputs": [], "source": [ "# These commands might take a while to complete.\n", @@ -71,16 +75,15 @@ "\n", "gcloud container clusters resize $CLUSTER_NAME --node-pool $EXPERIMENT_POOL \\\n", " --num-nodes 3 --region us-central1-c --quiet" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } + ] }, { "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, "source": [ "## Preparation\n", "In case you have already tested something or ran another experiment, we have to remove the deployment of the Orchestrator. This will not delete any experiment data, as this persists on one of the ReadWriteMany PVCs.\n", @@ -93,17 +96,16 @@ "\n", "\n", "ℹ️ This will not remove any data, but if your orchestrator is still/already running experiments, this will stop the deployment. Running training jobs will not be stopped, for this you can use `kubectl`. ConfigMaps created by the Orchestrator (to provide experiment configurations), will not be removed. See the commented code in the cell below." 
- ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } + ] }, { "cell_type": "code", "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, "outputs": [], "source": [ "# If you want to delete all pytorch trainjobs, uncomment the command below.\n", @@ -113,30 +115,39 @@ "# kubectl delete configmaps --all --namespace test\n", "\n", "helm uninstall -n test flearner" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } + ] }, { "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, "source": [ "## Define experiment configuration files\n", "\n", "Deployment of experiments is currently done through a Helm Deployment. A future release (™️) will rework this to a Job definition, as this allows to re-use the template more easily.\n", "\n", - "The `EXPERIMENT_FILE` will contain the description of the experiments\n", - "The `CLUSTER_CONFIG` will contain shared configurations for logging, Orchestrator configuration and replication information." - ], + "\n", + "> The `EXPERIMENT_FILE` will contain the description of the experiments\n", + "> The `CLUSTER_CONFIG` will contain shared configurations for logging, Orchestrator configuration and replication information." + ] + }, + { + "cell_type": "code", + "execution_count": null, "metadata": { - "collapsed": false, "pycharm": { - "name": "#%% md\n" + "name": "#%%\n" } - } + }, + "outputs": [], + "source": [ + "# Change the directory to a level above, i.e. 
content root (the git root directory).\n", + "cd ../\n", + "echo $PWD" + ] }, { "cell_type": "code", @@ -144,7 +155,7 @@ "outputs": [], "source": [ "EXPERIMENT_FILE=\"configs/federated_tasks/example_arrival_config.json\"\n", - "CLUSTER_CONFIG=\"configs/example_cloud_experiment.json\"" + "CLUSTER_CONFIG=\"configs/example_arrival_config\"" ], "metadata": { "collapsed": false, @@ -155,6 +166,11 @@ }, { "cell_type": "markdown", + "metadata": { + "pycharm": { + "name": "#%% md\n" + } + }, "source": [ "## Setup experiment variables\n", "Next, we will deploy the experiments.\n", @@ -170,48 +186,45 @@ "```\n", "\n", "We use the `--set-file` flag for `helm`, as currently, Helm does not support using files outside of the chart root directory (in this case `charts/orchestrator`). Using `--set-file` we can dynamically provide these files. See also issue [here](https://github.com/helm/helm/issues/3276)\n" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%% md\n" - } - } + ] }, { "cell_type": "code", "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, "outputs": [], "source": [ "helm uninstall experiment-orchestrator -n test\n", "helm install experiment-orchestrator charts/orchestrator --namespace test -f charts/fltk-values.yaml\\\n", " --set-file orchestrator.experiment=$EXPERIMENT_FILE,orchestrator.configuration=$CLUSTER_CONFIG\n" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } + ] }, { "cell_type": "code", "execution_count": null, - "outputs": [], - "source": [ - "# To get logs from the orchestrator\n", - "kubectl logs -n test fl-learner" - ], "metadata": { - "collapsed": false, "pycharm": { "name": "#%%\n" } - } + }, + "outputs": [], + "source": [ + "# To get logs from the orchestrator\n", + "kubectl logs -n test fl-learner" + ] }, { "cell_type": "code", "execution_count": null, + "metadata": { + "pycharm": { + "name": "#%%\n" + } + }, "outputs": [], "source": [ "# To get logs from 
learners (example)\n", @@ -219,20 +232,17 @@ "\n", "# To get logs from learners (federated learning)\n", "kubectl logs -n test trainjob-eb056010-7c33-4c46-9559-b197afc7cb84-worker-0" - ], - "metadata": { - "collapsed": false, - "pycharm": { - "name": "#%%\n" - } - } + ] }, { "cell_type": "markdown", "source": [ "# Wrapping up\n", "\n", - "To scale down the cluster nodepools, run the cell below.\n" + "To scale down the cluster nodepools, run the cell below. This scales the node pools down and removes all experiments deployed on the cluster. As a consequence:\n", + "\n", + "1. Experiments cannot be restarted.\n", + "2. Experiment logs will not survive the deletion.\n" ], "metadata": { "collapsed": false, @@ -246,6 +256,9 @@ "execution_count": null, "outputs": [], "source": [ + "# This will remove all information and logs as well.\n", + "kubectl delete pytorchjobs.kubeflow.org --all-namespaces --all\n", + "\n", "gcloud container clusters resize $CLUSTER_NAME --node-pool $DEFAULT_POOL \\\n", " --num-nodes 0 --region us-central1-c --quiet\n", "\n", @@ -261,7 +274,6 @@ } ], "metadata": { - "title": "Experiment deployment", "kernelspec": { "display_name": "Bash", "language": "bash", @@ -272,7 +284,8 @@ "file_extension": ".sh", "mimetype": "text/x-sh", "name": "bash" - } + }, + "title": "Experiment deployment" }, "nbformat": 4, "nbformat_minor": 1 diff --git a/jupyter/terraform_notebook.ipynb b/jupyter/terraform_notebook.ipynb index 31c6592c..7e9731b0 100644 --- a/jupyter/terraform_notebook.ipynb +++ b/jupyter/terraform_notebook.ipynb @@ -260,7 +260,7 @@ "}\n", "\n", "# Create service-account\n", - "# gcloud iam service-accounts create $ACCOUNT_ID --display-name=\"Terraform service account\" --project ${PROJECT_ID}\n", + "gcloud iam service-accounts create $ACCOUNT_ID --display-name=\"Terraform service account\" --project ${PROJECT_ID}\n", "\n", "# Allow the service account to use the the set of roles below.\n", "enable_gcp_role \"compute.viewer\" # Allow the service account to see
active resources\n", @@ -350,8 +350,7 @@ "execution_count": null, "outputs": [], "source": [ - "cd ../terraform/terraform-gke\n", - "echo $PWD" + "cd ../terraform/terraform-gke" ], "metadata": { "collapsed": false, @@ -475,6 +474,39 @@ } } }, + { + "cell_type": "markdown", + "source": [ + "⚠️ By default, the node pools of the cluster do not contain any nodes, as the `initial_node_count` is set to 0.\n", + "\n", + "Lastly, we need to scale up the cluster, as by default we create a cluster with nodepools of size 0." + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%% md\n" + } + }, + "outputs": [] + }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "gcloud container clusters resize $CLUSTER_NAME --node-pool \"default-node-pool\" \\\n", + " --num-nodes 2 --region us-central1-c --quiet\n", + "\n", + "gcloud container clusters resize $CLUSTER_NAME --node-pool \"medium-fltk-pool-1\" \\\n", + " --num-nodes 2 --region us-central1-c --quiet\n" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + }, { "cell_type": "markdown", "source": [ @@ -647,9 +679,28 @@ }, "outputs": [], "source": [ - "kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml\n" + "# This cell is optional, but the next cell should show that a PyTorch train job is created.\n", + "kubectl create -f https://raw.githubusercontent.com/kubeflow/training-operator/master/examples/pytorch/simple.yaml" ] }, + { + "cell_type": "code", + "execution_count": null, + "outputs": [], + "source": [ + "# List all Kubeflow PyTorchJob resources (CRD) in all namespaces.\n", + "kubectl get pytorchjobs.kubeflow.org --all-namespaces\n", + "\n", + "# Alternatively, we can remove all jobs, this will remove all information and logs as well.\n", + "kubectl delete pytorchjobs.kubeflow.org --all-namespaces --all" + ], + "metadata": { + "collapsed": false, + "pycharm": { + "name": "#%%\n" + } + } + },
{ "cell_type": "markdown", "metadata": { @@ -668,7 +719,9 @@ " * Running it in `terraform-dependencies` WILL REMOVE the Kubeflow Training-Operator from your cluster.\n", " * Running it in `terraform-gke` WILL REMOVE YOU ENTIRE CLUSTER.\n", "\n", - "You can uncomment the commands below to remove the cluster, or run the command in a terminal in the [`.../terraform/terraform-gke`](../terraform/terraform-gke) directory.\n" + "You can uncomment the commands below to remove the cluster, or run the command in a terminal in the [`.../terraform/terraform-gke`](../terraform/terraform-gke) directory.\n", + "\n", + "> ⚠️ It is recommended to scale down the cluster/nodepools rather than destroying the cluster; refer to the last code block." ] }, { @@ -683,7 +736,8 @@ "source": [ "cd ../terraform-gke\n", "\n", - "terraform destroy -auto-approve" + "# THIS WILL REMOVE/TEARDOWN YOUR CLUSTER, ONLY RECOMMENDED FOR TESTING THE DEPLOYMENT\n", + "# terraform destroy -auto-approve" ] }, { @@ -691,8 +745,9 @@ "execution_count": null, "outputs": [], "source": [ - "# Change nodepools\n", + "# Scale node pools down to prevent idle resource utilization.\n", "\n", + "# THIS IS THE PREFERRED WAY TO SCALE DOWN\n", "\n", "gcloud container clusters resize $CLUSTER_NAME --node-pool \"default-node-pool\" \\\n", " --num-nodes 0 --region us-central1-c --quiet\n",