Create a notebook for mnist E2E on GCP (kubeflow#723)
* A notebook to run the mnist E2E example on GCP.

This fixes a number of issues with the example
* Use ISTIO instead of Ambassador to add reverse proxy routes
* The training job needs to be updated to run in a profile-created namespace in order to have the required service accounts
     * See kubeflow#713
     * Running the notebook on Kubeflow should ensure the user is
       working inside an appropriately set up namespace
* With ISTIO the default RBAC rules prevent the web UI from sending requests to the model server
     * A short-term fix was to not include the ISTIO sidecar
     * In the future we can add an appropriate ISTIO RBAC policy

* Using a notebook allows us to eliminate the use of kustomize
  * This resolves kubeflow#713, which required people to use
    an old version of kustomize

  * Rather than using kustomize we can use Python f-strings to
    write the YAML specs and then easily substitute in user-specific
    values (see the sketch below)

  * This should be more informative; it avoids introducing kustomize, and
    users can see the resource specs
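
As a sketch of the pattern (the names and values below are placeholders, not the notebook's exact spec):

```python
# Hypothetical user-specific values set earlier in the notebook.
train_name = "mnist-train"
namespace = "kubeflow-user"
image = "gcr.io/my-project/mnist-model:latest"

# The spec is plain YAML inside an f-string; user values are substituted
# directly instead of being patched in with kustomize overlays.
train_spec = f"""apiVersion: kubeflow.org/v1
kind: TFJob
metadata:
  name: {train_name}
  namespace: {namespace}
spec:
  tfReplicaSpecs:
    Chief:
      replicas: 1
      template:
        spec:
          containers:
          - name: tensorflow
            image: {image}
"""
```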

* I've opted to make the notebook GCP specific. I think it's less confusing
  to users to have separate notebooks focused on specific platforms rather
  than having one notebook with a lot of caveats about what to do under
  different conditions

* I've deleted the kustomize overlays for GCS since we don't want users to
  use them anymore

* I used fairing and kaniko to eliminate the use of docker to build the images
  so that everything can run from a notebook running inside the cluster.
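
A rough sketch of the in-cluster build, assuming fairing's ClusterBuilder with a GCS context source; module paths and arguments vary across fairing releases, so treat this as an outline rather than the notebook's exact code:

```python
from kubeflow.fairing.builders import cluster
from kubeflow.fairing.preprocessors import base as base_preprocessor

DOCKER_REGISTRY = "gcr.io/my-project"  # hypothetical registry

# No preprocessing; the build context is the existing Dockerfile + sources.
preprocessor = base_preprocessor.BasePreProcessor(
    command=["python"], path_prefix="/app", output_map={})

# ClusterBuilder stages the build context in GCS and runs kaniko in a pod,
# so no local docker daemon is needed.
builder = cluster.cluster.ClusterBuilder(
    registry=DOCKER_REGISTRY,
    base_image="",  # the base image comes from the Dockerfile itself
    preprocessor=preprocessor,
    image_name="mnist",
    dockerfile_path="Dockerfile",
    context_source=cluster.gcs_context.GCSContextSource())
builder.build()
image = builder.image_tag  # attribute name may differ by fairing version
```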

* k8s_util.py has some reusable functions that hide low-level details
  (e.g. calls to the K8s APIs) from users
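
A hedged sketch of the kind of helper this enables; `apply_k8s_specs` is a hypothetical name, and `create_from_dict` ships only with recent versions of the official kubernetes Python client:

```python
import yaml
from kubernetes import client, config, utils

def apply_k8s_specs(spec_strings):
    """Bulk-create K8s objects from a list of YAML spec strings."""
    config.load_incluster_config()  # running inside a Kubeflow notebook pod
    api_client = client.ApiClient()
    for spec in spec_strings:
        for doc in yaml.safe_load_all(spec):
            try:
                utils.create_from_dict(api_client, doc)
            except utils.FailToCreateError as e:
                # Treat 409 Conflict (object already exists) as success.
                if all(x.status == 409 for x in e.api_exceptions):
                    continue
                raise
```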

* Change the mnist test to just run the notebook
  * Copy the notebook test infra for xgboost_synthetic to py/kubeflow/examples/notebook_test to make it more reusable

* Fix lint.

* Update for lint.

* A notebook to run the mnist E2E example.

Related to: kubeflow/website#1553

* 1. Use fairing to build the model. 2. Construct the YAML spec directly in the notebook. 3. Use the TFJob Python SDK (sketched below).
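
A hedged sketch of step 3, assuming the kubeflow-tfjob SDK's `TFJobClient`; `train_spec` and `namespace` are the hypothetical values from the f-string sketch above:

```python
import yaml
from kubeflow.tfjob import TFJobClient

tfjob_client = TFJobClient()
# Submit the TFJob built earlier as an f-string YAML spec.
tfjob_client.create(yaml.safe_load(train_spec), namespace=namespace)
# Block until the job finishes, streaming status updates.
tfjob_client.wait_for_job("mnist-train", namespace=namespace, watch=True)
```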

* Fix the ISTIO rule.

* Fix UI and serving; need to update TF serving to match version trained on.

* Get the IAP endpoint.

* Start writing some helper python functions for K8s.

* Commit before switching from replace to delete.

* Create a library to bulk create objects.

* Cleanup.

* Add back k8s_util.py

* Delete train.yaml; this shouldn't have been added.

* Update the notebook image.

* Refactor code into k8s_util; print out links.

* Clean up the notebook. Should be working E2E.

* Added section to get logs from stackdriver.

* Add comment about profile.

* Latest.

* Override mnist_gcp.ipynb with mnist.ipynb

I accidentally put my latest changes in mnist.ipynb even though that file
was deleted.

* More fixes.

* Resolve some conflicts from the rebase; override with changes on remote branch.
jlewi authored Feb 17, 2020
1 parent b9a7719 commit cc93a80
Showing 28 changed files with 2,570 additions and 1,558 deletions.
5 changes: 4 additions & 1 deletion .pylintrc
@@ -56,7 +56,10 @@ confidence=
# --enable=similarities". If you want to run only the classes checker, but have
# no Warning level messages displayed, use"--disable=all --enable=classes
# --disable=W"
disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use
#
# Kubeflow disables logging-fstring-interpolation because we are starting
# to use f-strings
disable=import-star-module-level,old-octal-literal,oct-method,print-statement,unpacking-in-except,parameter-unpacking,backtick,old-raise-syntax,old-ne-operator,long-suffix,dict-view-method,dict-iter-method,metaclass-assignment,next-method-called,raising-string,indexing-exception,raw_input-builtin,long-builtin,file-builtin,execfile-builtin,coerce-builtin,cmp-builtin,buffer-builtin,basestring-builtin,apply-builtin,filter-builtin-not-iterating,using-cmp-argument,useless-suppression,range-builtin-not-iterating,suppressed-message,missing-docstring,no-absolute-import,old-division,cmp-method,reload-builtin,zip-builtin-not-iterating,intern-builtin,unichr-builtin,reduce-builtin,standarderror-builtin,unicode-builtin,xrange-builtin,coerce-method,delslice-method,getslice-method,setslice-method,input-builtin,round-builtin,hex-method,nonzero-method,map-builtin-not-iterating,relative-import,invalid-name,bad-continuation,no-member,locally-disabled,fixme,import-error,too-many-locals,no-name-in-module,too-many-instance-attributes,no-self-use,logging-fstring-interpolation


[REPORTS]
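
For reference, the newly disabled `logging-fstring-interpolation` check would otherwise flag lines like the following (an illustrative snippet, not code from this commit):

```python
import logging

model_dir = "gs://my-bucket/my-model"  # hypothetical value
# pylint flags f-strings passed to logging because they are interpolated
# eagerly rather than via logging's lazy %-style formatting.
logging.info(f"Saving model to {model_dir}")
```
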
3 changes: 2 additions & 1 deletion mnist/Dockerfile.model
@@ -1,5 +1,6 @@
#This container contains your model and any helper scripts specific to your model.
FROM tensorflow/tensorflow:1.7.0
# When building the image inside mnist.ipynb the base docker image will be overwritten
FROM tensorflow/tensorflow:1.15.2-py3

ADD model.py /opt/model.py
RUN chmod +x /opt/model.py
2 changes: 2 additions & 0 deletions mnist/Makefile
@@ -19,6 +19,8 @@
# To override variables do
# make ${TARGET} ${VAR}=${VALUE}
#
#
# TODO(jlewi): We should probably switch to Skaffold and Tekton

# IMG is the base path for images.
# Individual images will be
229 changes: 42 additions & 187 deletions mnist/README.md
@@ -3,6 +3,8 @@
**Table of Contents** *generated with [DocToc](https://github.com/thlorenz/doctoc)*

- [MNIST on Kubeflow](#mnist-on-kubeflow)
- [MNIST on Kubeflow on GCP](#mnist-on-kubeflow-on-gcp)
- [MNIST on other platforms](#mnist-on-other-platforms)
- [Prerequisites](#prerequisites)
- [Deploy Kubeflow](#deploy-kubeflow)
- [Local Setup](#local-setup)
@@ -13,21 +15,17 @@
- [Preparing your Kubernetes Cluster](#preparing-your-kubernetes-cluster)
- [Training your model](#training-your-model)
- [Local storage](#local-storage)
- [Using GCS](#using-gcs)
- [Using S3](#using-s3)
- [Monitoring](#monitoring)
- [Tensorboard](#tensorboard)
- [Local storage](#local-storage-1)
- [Using GCS](#using-gcs-1)
- [Using S3](#using-s3-1)
- [Deploying TensorBoard](#deploying-tensorboard)
- [Serving the model](#serving-the-model)
- [GCS](#gcs)
- [S3](#s3)
- [Local storage](#local-storage-2)
- [Web Front End](#web-front-end)
- [Connecting via port forwarding](#connecting-via-port-forwarding)
- [Using IAP on GCP](#using-iap-on-gcp)
- [Conclusion and Next Steps](#conclusion-and-next-steps)

<!-- END doctoc generated TOC please keep comment here to allow auto update -->
@@ -37,6 +35,45 @@

This example guides you through the process of taking an example model, modifying it to run better within Kubeflow, and serving the resulting trained model.

Follow the version of the guide that matches how you have deployed Kubeflow:

1. [MNIST on Kubeflow on GCP](#gcp)
1. [MNIST on other platforms](#other)

<a id=gcp></a>
# MNIST on Kubeflow on GCP

Follow these instructions to run the MNIST tutorial on GCP:

1. Follow the [GCP instructions](https://www.kubeflow.org/docs/gke/deploy/) to deploy Kubeflow with IAP

1. Launch a Jupyter notebook

   * The tutorial has been tested using the Jupyter TensorFlow 1.15 image

1. Launch a terminal in Jupyter and clone the kubeflow examples repo

```
git clone https://github.com/kubeflow/examples.git git_kubeflow-examples
```

   * **Tip** When you start a terminal in Jupyter, run the command `bash` to start
     a bash terminal, which is much friendlier than the default shell

   * **Tip** You can change the URL from '/tree' to '/lab' to switch to using JupyterLab

1. Open the notebook `mnist/mnist_gcp.ipynb`

1. Follow the notebook to train and deploy MNIST on Kubeflow

<a id=other></a>
# MNIST on other platforms

The tutorial is currently not up to date for Kubeflow 1.0. Please check these issues:

* [kubeflow/examples#724](https://github.com/kubeflow/examples/issues/724) for AWS
* [kubeflow/examples#725](https://github.com/kubeflow/examples/issues/725) for other platforms

## Prerequisites

Before we get started there are a few requirements.
@@ -166,100 +203,6 @@ And to check the logs
```
kubectl logs mnist-train-local-chief-0
```


#### Using GCS

In this section we describe how to save the model to Google Cloud Storage (GCS).

Storing the model in GCS has these advantages:

* The model is readily available after the job finishes
* We can run distributed training

* Distributed training requires a storage system accessible to all the machines

From the `mnist` application directory, enter the `training/GCS` directory.

```
cd training/GCS
```

Set an environment variable that points to your GCP project ID
```
PROJECT=<your project id>
```

Create a bucket on GCS to store our model. The name must be unique across all GCS buckets
```
BUCKET=distributed-$(date +%s)
gsutil mb gs://$BUCKET/
```

Give the job a different name (to distinguish it from your job which didn't use GCS)

```
kustomize edit add configmap mnist-map-training --from-literal=name=mnist-train-dist
```

Optionally, if you want to use your custom training image, configure it as below.

```
kustomize edit set image training-image=$DOCKER_URL
```

Next we configure the job to run distributed by setting the number of parameter servers and workers to use: `numPs` is the number of parameter servers and `numWorkers` is the number of workers.

```
../base/definition.sh --numPs 1 --numWorkers 2
```

Set the training parameters, such as training steps, batch size and learning rate.

```
kustomize edit add configmap mnist-map-training --from-literal=trainSteps=200
kustomize edit add configmap mnist-map-training --from-literal=batchSize=100
kustomize edit add configmap mnist-map-training --from-literal=learningRate=0.01
```

Now we need to configure the parameters that tell the code to save the model to GCS.

```
MODEL_PATH=my-model
kustomize edit add configmap mnist-map-training --from-literal=modelDir=gs://${BUCKET}/${MODEL_PATH}
kustomize edit add configmap mnist-map-training --from-literal=exportDir=gs://${BUCKET}/${MODEL_PATH}/export
```

Build a YAML file for the `TFJob` specification based on your kustomize config:

```
kustomize build . > mnist-training.yaml
```

Then, in `mnist-training.yaml`, search for this line: `namespace: kubeflow`.
Edit it to **replace `kubeflow` with the name of your user profile namespace**,
which will probably have the form `kubeflow-<username>`. (If you're not sure what this
namespace is called, you can find it in the top menubar of the Kubeflow Central
Dashboard.)
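
If you prefer to script this edit, here is a minimal sketch using PyYAML (the namespace value is a placeholder, and kustomize output contains multiple YAML documents, hence `safe_load_all`):

```python
import yaml

target_namespace = "kubeflow-<username>"  # your profile namespace

with open("mnist-training.yaml") as f:
    docs = list(yaml.safe_load_all(f))

# Point every object at the profile namespace instead of 'kubeflow'.
for doc in docs:
    if doc and "metadata" in doc:
        doc["metadata"]["namespace"] = target_namespace

# Note: safe_dump_all rewrites formatting and key order.
with open("mnist-training.yaml", "w") as f:
    yaml.safe_dump_all(docs, f)
```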

After you've updated the namespace, apply the `TFJob` specification to the
Kubeflow cluster:

```
kubectl apply -f mnist-training.yaml
```

You can then check the job status:

```
kubectl get tfjobs -n <your-user-namespace> -o yaml mnist-train-dist
```

And to check the logs:

```
kubectl logs -n <your-user-namespace> -f mnist-train-dist-chief-0
```

#### Using S3

To use S3 we need to configure TensorFlow to use S3 credentials and variables. These credentials will be provided as kubernetes secrets and the variables will be passed in as environment variables. Modify the below values to suit your environment.
@@ -426,27 +369,6 @@ kustomize edit add configmap mnist-map-monitoring --from-literal=pvcMountPath=/m
```
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=/mnt
```
#### Using GCS
From the `mnist` application directory, enter the `monitoring/GCS` directory.
```
cd monitoring/GCS
```
Configure TensorBoard to point to your model location
```
kustomize edit add configmap mnist-map-monitoring --from-literal=logDir=${LOGDIR}
```
Assuming you followed the directions above, if you used GCS you can use the following value
```
LOGDIR=gs://${BUCKET}/${MODEL_PATH}
```
#### Using S3
From the `mnist` application directory, enter the `monitoring/S3` directory.
@@ -551,64 +473,6 @@ The model code will export the model in saved model format which is suitable for
To serve the model follow the instructions below. The instructions vary slightly based on where you are storing your model (e.g. GCS, S3, PVC). Depending on the storage system we provide different kustomizations as a convenience for setting relevant environment variables.
### GCS
Here we show how to serve the model when it is stored on GCS. This assumes that when you trained the model you set `exportDir` to a GCS URI; if not you can always copy it to GCS using `gsutil`.
Check that a model was exported
```
EXPORT_DIR=gs://${BUCKET}/${MODEL_PATH}/export
gsutil ls -r ${EXPORT_DIR}
```
The output should look something like
```
${EXPORT_DIR}/1547100373/saved_model.pb
${EXPORT_DIR}/1547100373/variables/:
${EXPORT_DIR}/1547100373/variables/
${EXPORT_DIR}/1547100373/variables/variables.data-00000-of-00001
${EXPORT_DIR}/1547100373/variables/variables.index
```
The number `1547100373` is a version number auto-generated by TensorFlow; it will vary on each run but should be monotonically increasing if you save the model to the same location as a previous run.
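
To find the newest exported version programmatically, a small sketch using `tf.io.gfile`, which understands `gs://` paths when TensorFlow's GCS support is available (the path stands in for your `EXPORT_DIR`):

```python
import tensorflow as tf

export_dir = "gs://my-bucket/my-model/export"  # i.e. ${EXPORT_DIR}
# Entries may carry a trailing '/', so strip it before parsing.
versions = [int(v.strip("/"))
            for v in tf.io.gfile.listdir(export_dir)
            if v.strip("/").isdigit()]
print("latest model version:", max(versions))
```
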
From the `mnist` application directory, enter the `serving/GCS` directory.
```
cd serving/GCS
```
Set a different name for the TF serving deployment.
```
kustomize edit add configmap mnist-map-serving --from-literal=name=mnist-gcs-dist
```
Set your model path
```
kustomize edit add configmap mnist-map-serving --from-literal=modelBasePath=${EXPORT_DIR}
```
Deploy it, and run a service to make the deployment accessible to other pods in the cluster
```
kustomize build . | kubectl apply -f -
```
You can check the deployment by running
```
kubectl describe deployments mnist-gcs-dist
```
The service should make the `mnist-gcs-dist` deployment accessible over port 9000
```
kubectl describe service mnist-gcs-dist
```
### S3
We can also serve the model when it is stored on S3. This assumes that when you trained the model you set `exportDir` to an S3 URI.
@@ -799,16 +663,7 @@ POD_NAME=$(kubectl get pods --selector=app=web-ui --template '{{range .items}}{{
```
kubectl port-forward ${POD_NAME} 8080:5000
```
You should now be able to open up the web app at your localhost. [Local Storage](http://localhost:8080) or [GCS](http://localhost:8080/?addr=mnist-gcs-dist) or [S3](http://localhost:8080/?addr=mnist-s3-serving).
### Using IAP on GCP
If you are using GCP and have set up IAP then you can access the web UI at
```
https://${DEPLOYMENT}.endpoints.${PROJECT}.cloud.goog/${NAMESPACE}/mnist/
```
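
From inside the notebook the same endpoint can be assembled with an f-string (all values below are placeholders):

```python
deployment = "my-kubeflow"   # Kubeflow deployment name
project = "my-project"       # GCP project ID
namespace = "kubeflow-user"  # your profile namespace

url = f"https://{deployment}.endpoints.{project}.cloud.goog/{namespace}/mnist/"
print(url)
```
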
You should now be able to open up the web app at your localhost. [Local Storage](http://localhost:8080) or [S3](http://localhost:8080/?addr=mnist-s3-serving).
## Conclusion and Next Steps
