Create a notebook for mnist E2E on GCP (kubeflow#723)
* A notebook to run the mnist E2E example on GCP.

This fixes a number of issues with the example
* Use ISTIO instead of Ambassador to add reverse proxy routes
* The training job needs to be updated to run in a profile-created namespace in order to have the required service accounts
     * See kubeflow#713
     * Running the notebook on Kubeflow should ensure the user is
       working inside an appropriately set up namespace
* With ISTIO the default RBAC rules prevent the web UI from sending requests to the model server
     * A short-term fix was to not include the ISTIO sidecar
     * In the future we can add an appropriate ISTIO RBAC policy

* Using a notebook allows us to eliminate the use of kustomize
  * This resolves kubeflow#713, which required people to use
    an old version of kustomize

  * Rather than using kustomize we can use Python f-strings to
    write the YAML specs and then easily substitute in user-specific
    values (see the sketch below)

  * This should be more informative; it avoids introducing kustomize, and
    users can see the resource specs
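
As a sketch of the pattern (the names and values below are placeholders, not the notebook's exact spec):

```python
# Hypothetical user-specific values set earlier in the notebook.
train_name = "mnist-train"
namespace = "kubeflow-user"
image = "gcr.io/my-project/mnist-model:latest"

# The spec is plain YAML inside an f-string; user values are substituted
# directly instead of being patched in with kustomize overlays.
train_spec = f"""apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: {train_name}
  namespace: {namespace}
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: {image}
"""
```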

* I've opted to make the notebook GCP specific. I think it's less confusing
  to users to have separate notebooks focused on specific platforms rather
  than having one notebook with a lot of caveats about what to do under
  different conditions

* I've deleted the kustomize overlays for GCS since we don't want users to
  use them anymore

* I used fairing and kaniko to eliminate the use of docker to build the images
  so that everything can run from a notebook running inside the cluster.
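
A rough sketch of the in-cluster build, assuming fairing's ClusterBuilder with a GCS context source; module paths and arguments vary across fairing releases, so treat this as an outline rather than the notebook's exact code:

```python
from kubeflow.fairing.builders import cluster
from kubeflow.fairing.preprocessors import base as base_preprocessor

DOCKER_REGISTRY = "gcr.io/my-project"  # hypothetical registry

# No preprocessing; the build context is the existing Dockerfile + sources.
preprocessor = base_preprocessor.BasePreProcessor(
    command=["python"], path_prefix="/app", output_map={})

# ClusterBuilder stages the build context in GCS and runs kaniko in a pod,
# so no local docker daemon is needed.
builder = cluster.cluster.ClusterBuilder(
    registry=DOCKER_REGISTRY,
    base_image="",  # the base image comes from the Dockerfile itself
    preprocessor=preprocessor,
    image_name="mnist",
    dockerfile_path="Dockerfile",
    context_source=cluster.gcs_context.GCSContextSource())
builder.build()
image = builder.image_tag  # attribute name may differ by fairing version
```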

* k8s_util.py has some reusable functions that hide low-level details
  (e.g. calls to the K8s APIs) from users
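
A hedged sketch of the kind of helper this enables; `apply_k8s_specs` is a hypothetical name, and `create_from_dict` ships only with recent versions of the official kubernetes Python client:

```python
import yaml
from kubernetes import client, config, utils

def apply_k8s_specs(spec_strings):
    """Bulk-create K8s objects from a list of YAML spec strings."""
    config.load_incluster_config()  # running inside a Kubeflow notebook pod
    api_client = client.ApiClient()
    for spec in spec_strings:
        for doc in yaml.safe_load_all(spec):
            try:
                utils.create_from_dict(api_client, doc)
            except utils.FailToCreateError as e:
                # Treat 409 Conflict (object already exists) as success.
                if all(x.status == 409 for x in e.api_exceptions):
                    continue
                raise
```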

* Change the mnist test to just run the notebook
  * Copy the notebook test infra for xgboost_synthetic to py/kubeflow/examples/notebook_test to make it more reusable

* Fix lint.

* Update for lint.

* A notebook to run the mnist E2E example.

Related to: kubeflow/website#1553

* 1. Use fairing to build the model. 2. Construct the YAML spec directly in the notebook. 3. Use the TFJob Python SDK (sketched below).
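
A hedged sketch of step 3, assuming the kubeflow-tfjob SDK's `TFJobClient`; `train_spec` and `namespace` are the hypothetical values from the f-string sketch above:

```python
import yaml
from kubeflow.tfjob import TFJobClient

tfjob_client = TFJobClient()
# Submit the TFJob built earlier as an f-string YAML spec.
tfjob_client.create(yaml.safe_load(train_spec), namespace=namespace)
# Block until the job finishes, streaming status updates.
tfjob_client.wait_for_job("mnist-train", namespace=namespace, watch=True)
```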

* Fix the ISTIO rule.

* Fix UI and serving; need to update TF serving to match version trained on.

* Get the IAP endpoint.

* Start writing some helper python functions for K8s.

* Commit before switching from replace to delete.

* Create a library to bulk create objects.

* Cleanup.

* Add back k8s_util.py

* Delete train.yaml; this shouldn't have been added.

* Update the notebook image.

* Refactor code into k8s_util; print out links.

* Clean up the notebook. Should be working E2E.

* Added section to get logs from stackdriver.

* Add comment about profile.

* Latest.

* Override mnist_gcp.ipynb with mnist.ipynb

I accidentally put my latest changes in mnist.ipynb even though that file
was deleted.

* More fixes.

* Resolve some conflicts from the rebase; override with changes on remote branch.
jlewi authored Feb 17, 2020
1 parent b9a7719 commit cc93a80
Showing 28 changed files with 2,570 additions and 1,558 deletions.
5 changes: 4 additions & 1 deletion .pylintrc
@@ -56,7 +56,10 @@ confidence=
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use"--disable=all --enable=classes
# --disable=W"
disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use
#
# Kubeflow disables logging-fstring-interpolation because we are starting
# to use f-strings
disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use,logging-fstring-interpolation


[REPORTS]
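
For reference, the newly disabled `logging-fstring-interpolation` check would otherwise flag lines like the following (an illustrative snippet, not code from this commit):

```python
import logging

model_dir = "gs://my-bucket/my-model"  # hypothetical value
# pylint flags f-strings passed to logging because they are interpolated
# eagerly rather than via logging's lazy %-style formatting.
logging.info(f"Saving model to {model_dir}")
```
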
3 changes: 2 additions & 1 deletion mnist/Dockerfile.model
@@ -1,5 +1,6 @@
#This container contains your model and any helper scripts specific to your model.
FROM tensorflow/tensorflow:1.7.0
# When building the image inside mnist.ipynb the base docker image will be overwritten
FROM tensorflow/tensorflow:1.15.2-py3

ADD model.py /opt/model.py
RUN chmod +x /opt/model.py
2 changes: 2 additions & 0 deletions mnist/Makefile
@@ -19,6 +19,8 @@
# To override variables do
# make ${TARGET} ${VAR}=${VALUE}
#
#
# TODO(jlewi): We should probably switch to Skaffold and Tekton

# IMG is the base path for images.
# Individual images will be
229 changes: 42 additions & 187 deletions mnist/README.md
@@ -3,6 +3,8 @@
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [MNIST on Kubeflow](#mnist-on-kubeflow)
- [MNIST on Kubeflow on GCP](#mnist-on-kubeflow-on-gcp)
- [MNIST on other platforms](#mnist-on-other-platforms)
- [Prerequisites](#prerequisites)
- [Deploy Kubeflow](#deploy-kubeflow)
- [Local Setup](#local-setup)
@@ -13,21 +15,17 @@
- [Preparing your Kubernetes Cluster](#preparing-your-kubernetes-cluster)
- [Training your model](#training-your-model)
- [Local storage](#local-storage)
- [Using GCS](#using-gcs)
- [Using S3](#using-s3)
- [Monitoring](#monitoring)
- [Tensorboard](#tensorboard)
- [Local storage](#local-storage-1)
- [Using GCS](#using-gcs-1)
- [Using S3](#using-s3-1)
- [Deploying TensorBoard](#deploying-tensorboard)
- [Serving the model](#serving-the-model)
- [GCS](#gcs)
- [S3](#s3)
- [Local storage](#local-storage-2)
- [Web Front End](#web-front-end)
- [Connecting via port forwarding](#connecting-via-port-forwarding)
- [Using IAP on GCP](#using-iap-on-gcp)
- [Conclusion and Next Steps](#conclusion-and-next-steps)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -37,6 +35,45 @@

This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model.

Follow the version of the guide that matches how you have deployed Kubeflow:

1. [MNIST on Kubeflow on GCP](#gcp)
1. [MNIST on other platforms](#other)

<a id=gcp></a>
# MNIST on Kubeflow on GCP

Follow these instructions to run the MNIST tutorial on GCP:

1. Follow the [GCP instructions](https://www.kubeflow.org/docs/gke/deploy/) to deploy Kubeflow with IAP

1. Launch a Jupyter notebook

   * The tutorial has been tested using the Jupyter TensorFlow 1.15 image

1. Launch a terminal in Jupyter and clone the kubeflow examples repo

```
git clone https://github.com/kubeflow/examples.git git_kubeflow-examples
```

   * **Tip** When you start a terminal in Jupyter, run the command `bash` to start
     a bash terminal, which is much friendlier than the default shell

   * **Tip** You can change the URL from '/tree' to '/lab' to switch to using JupyterLab

1. Open the notebook `mnist/mnist_gcp.ipynb`

1. Follow the notebook to train and deploy MNIST on Kubeflow

<a id=other></a>
# MNIST on other platforms

The tutorial is currently not up to date for Kubeflow 1.0. Please check these issues:

* [kubeflow/examples#724](https://github.com/kubeflow/examples/issues/724) for AWS
* [kubeflow/examples#725](https://github.com/kubeflow/examples/issues/725) for other platforms

## Prerequisites

Before we get started there are a few requirements.
@@ -166,100 +203,6 @@ And to check the logs
```
kubectl logs mnist-train-local-chief-0
```


#### Using GCS

In this section we describe how to save the model to Google Cloud Storage (GCS).

Storing the model in GCS has these advantages:

* The model is readily available after the job finishes
* We can run distributed training

* Distributed training requires a storage system accessible to all the machines

From the `mnist` application directory, enter the `training/GCS` directory.

```
cd training/GCS
```

Set an environment variable that points to your GCP project ID
```
PROJECT=<your project id>
```

Create a bucket on GCS to store our model. The name must be unique across all GCS buckets
```
BUCKET=distributed-$(date +%s)
gsutil mb gs://$BUCKET/
```

Give the job a different name (to distinguish it from your job which didn't use GCS)

```
kustomize edit add configmap mnist-map-training --from-literal=name=mnist-train-dist
```

Optionally, if you want to use your custom training image, configure it as below.

```
kustomize edit set image training-image=$DOCKER_URL
```

Next we configure the job to run distributed by setting the number of parameter servers and workers to use: `numPs` is the number of parameter servers and `numWorkers` is the number of workers.

```
../base/definition.sh --numPs 1 --numWorkers 2
```

Set the training parameters, such as training steps, batch size and learning rate.

```
kustomize edit add configmap mnist-map-training --from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training --from-literal=batchSize=100
kustomize edit add configmap mnist-map-training --from-literal=learningRate=0.01
```

Now we need to configure the parameters that tell the code to save the model to GCS.

```
MODEL_PATH=my-model
kustomize edit add configmap mnist-map-training --from-literal=modelDir=gs://${BUCKET}/${MODEL_PATH}
kustomize edit add configmap mnist-map-training --from-literal=exportDir=gs://${BUCKET}/${MODEL_PATH}/export
```

Build a YAML file for the `TFJob` specification based on your kustomize config:

```
kustomize build . > mnist-training.yaml
```

Then, in `mnist-training.yaml`, search for this line: `namespace: kubeflow`.
Edit it to **replace `kubeflow` with the name of your user profile namespace**,
which will probably have the form `kubeflow-<username>`. (If you're not sure what this
namespace is called, you can find it in the top menubar of the Kubeflow Central
Dashboard.)
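
If you prefer to script this edit, here is a minimal sketch using PyYAML (the namespace value is a placeholder, and kustomize output contains multiple YAML documents, hence `safe_load_all`):

```python
import yaml

target_namespace = "kubeflow-<username>"  # your profile namespace

with open("mnist-training.yaml") as f:
    docs = list(yaml.safe_load_all(f))

# Point every object at the profile namespace instead of 'kubeflow'.
for doc in docs:
    if doc and "metadata" in doc:
        doc["metadata"]["namespace"] = target_namespace

# Note: safe_dump_all rewrites formatting and key order.
with open("mnist-training.yaml", "w") as f:
    yaml.safe_dump_all(docs, f)
```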

After you've updated the namespace, apply the `TFJob` specification to the
Kubeflow cluster:

```
kubectl apply -f mnist-training.yaml
```

You can then check the job status:

```
kubectl get tfjobs -n <your-user-namespace> -o yaml mnist-train-dist
```

And to check the logs:

```
kubectl logs -n <your-user-namespace> -f mnist-train-dist-chief-0
```

#### Using S3

To use S3 we need to configure TensorFlow to use S3 credentials and variables. These credentials will be provided as kubernetes secrets and the variables will be passed in as environment variables. Modify the below values to suit your environment.
@@ -426,27 +369,6 @@ kustomize edit add configmap mnist-map-monitoring --from-literal=pvcMountPath=/m
```
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=/mnt
```
#### Using GCS
From the `mnist` application directory, enter the `monitoring/GCS` directory.
```
cd monitoring/GCS
```
Configure TensorBoard to point to your model location
```
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=${LOGDIR}
```
Assuming you followed the directions above, if you used GCS you can use the following value
```
LOGDIR=gs://${BUCKET}/${MODEL_PATH}
```
#### Using S3
From the `mnist` application directory, enter the `monitoring/S3` directory.
@@ -551,64 +473,6 @@ The model code will export the model in saved model format which is suitable for
To serve the model follow the instructions below. The instructions vary slightly based on where you are storing your model (e.g. GCS, S3, PVC). Depending on the storage system we provide different kustomizations as a convenience for setting relevant environment variables.
### GCS
Here we show how to serve the model when it is stored on GCS. This assumes that when you trained the model you set `exportDir` to a GCS URI; if not you can always copy it to GCS using `gsutil`.
Check that a model was exported
```
EXPORT_DIR=gs://${BUCKET}/${MODEL_PATH}/export
gsutil ls -r ${EXPORT_DIR}
```
The output should look something like
```
${EXPORT_DIR}/1547100373/saved_model.pb
${EXPORT_DIR}/1547100373/variables/:
${EXPORT_DIR}/1547100373/variables/
${EXPORT_DIR}/1547100373/variables/variables.data-00000-of-00001
${EXPORT_DIR}/1547100373/variables/variables.index
```
The number `1547100373` is a version number auto-generated by TensorFlow; it will vary on each run but should be monotonically increasing if you save the model to the same location as a previous run.
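
To find the newest exported version programmatically, a small sketch using `tf.io.gfile`, which understands `gs://` paths when TensorFlow's GCS support is available (the path stands in for your `EXPORT_DIR`):

```python
import tensorflow as tf

export_dir = "gs://my-bucket/my-model/export"  # i.e. ${EXPORT_DIR}
# Entries may carry a trailing '/', so strip it before parsing.
versions = [int(v.strip("/"))
            for v in tf.io.gfile.listdir(export_dir)
            if v.strip("/").isdigit()]
print("latest model version:", max(versions))
```
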
From the `mnist` application directory, enter the `serving/GCS` directory.
```
cd serving/GCS
```
Set a different name for the TF serving deployment.
```
kustomize edit add configmap mnist-map-serving --from-literal=name=mnist-gcs-dist
```
Set your model path
```
kustomize edit add configmap mnist-map-serving --from-literal=modelBasePath=${EXPORT_DIR}
```
Deploy it, and run a service to make the deployment accessible to other pods in the cluster
```
kustomize build . | kubectl apply -f -
```
You can check the deployment by running
```
kubectl describe deployments mnist-gcs-dist
```
The service should make the `mnist-gcs-dist` deployment accessible over port 9000
```
kubectl describe service mnist-gcs-dist
```
### S3
We can also serve the model when it is stored on S3. This assumes that when you trained the model you set `exportDir` to an S3 URI.
@@ -799,16 +663,7 @@ POD_NAME=$(kubectl get pods --selector=app=web-ui --template '{{range .items}}{{
```
kubectl port-forward ${POD_NAME} 8080:5000
```
You should now be able to open up the web app at your localhost. [Local Storage](http://localhost:8080) or [GCS](http://localhost:8080/?addr=mnist-gcs-dist) or [S3](http://localhost:8080/?addr=mnist-s3-serving).
### Using IAP on GCP
If you are using GCP and have set up IAP then you can access the web UI at
```
https://${DEPLOYMENT}.endpoints.${PROJECT}.cloud.goog/${NAMESPACE}/mnist/
```
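
From inside the notebook the same endpoint can be assembled with an f-string (all values below are placeholders):

```python
deployment = "my-kubeflow"   # Kubeflow deployment name
project = "my-project"       # GCP project ID
namespace = "kubeflow-user"  # your profile namespace

url = f"https://{deployment}.endpoints.{project}.cloud.goog/{namespace}/mnist/"
print(url)
```
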
You should now be able to open up the web app at your localhost. [Local Storage](http://localhost:8080) or [S3](http://localhost:8080/?addr=mnist-s3-serving).
## Conclusion and Next Steps
