
Spark DL Inference on Dataproc

Setup

Note: fields in <brackets> require user input.

Set up the gcloud CLI

  1. Install the latest gcloud CLI and initialize it with gcloud init.

  2. Configure the following settings:

    export PROJECT=<your_project>
    export DATAPROC_REGION=<your_dataproc_region>
    export COMPUTE_REGION=<your_compute_region>
    export COMPUTE_ZONE=<your_compute_zone>
    
    gcloud config set project ${PROJECT}
    gcloud config set dataproc/region ${DATAPROC_REGION}
    gcloud config set compute/region ${COMPUTE_REGION}
    gcloud config set compute/zone ${COMPUTE_ZONE}
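
    Optionally, confirm that the settings took effect by listing the active configuration; the output should show the project, Dataproc region, and compute region/zone set above:

    # Verify the active gcloud configuration.
    gcloud config list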

Copy files to GCS

  1. Create a GCS bucket if you don't already have one:

    export GCS_BUCKET=<your_gcs_bucket_name>
    
    gcloud storage buckets create gs://${GCS_BUCKET}

  2. Specify the local path to the notebook(s) and copy them to the GCS bucket. For example, for a torch notebook:

    export SPARK_DL_HOME=${GCS_BUCKET}/spark-dl
    
    gcloud storage cp </path/to/notebook_name_torch.ipynb> gs://${SPARK_DL_HOME}/notebooks/

    Repeat this step for any notebooks you wish to run, or copy them all at once as sketched after this list. All notebooks under gs://${SPARK_DL_HOME}/notebooks/ will be copied to the master node during initialization.

  3. Copy the utils file to the GCS bucket.

    gcloud storage cp </path/to/pytriton_utils.py> gs://${SPARK_DL_HOME}/
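
    As an optional shortcut, all notebooks can be copied in one command and the uploads verified afterwards. This is a minimal sketch that assumes the notebooks sit in a local notebooks/ directory; adjust the path to match your checkout:

    # Copy every notebook at once (adjust the local path as needed).
    gcloud storage cp ./notebooks/*.ipynb gs://${SPARK_DL_HOME}/notebooks/

    # Confirm that the notebooks and pytriton_utils.py are in place.
    gcloud storage ls --recursive gs://${SPARK_DL_HOME}/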

Start cluster and run

  1. Specify the framework to use (torch or tf), which determines which libraries are installed on the cluster. For example:

    export FRAMEWORK=torch

    Run the cluster startup script. The script will also retrieve and use the spark-rapids initialization script to set up GPU resources.

    cd setup
    chmod +x start_cluster.sh
    ./start_cluster.sh

    By default, the script creates a 4-node GPU cluster named ${USER}-spark-dl-inference-${FRAMEWORK}.

  2. Browse to the Jupyter web UI:

    • Go to Dataproc > Clusters > (Cluster Name) > Web Interfaces > Jupyter/Lab

    Or, get the link by running this command and looking under httpPorts > Jupyter/Lab in the output (see also the sketch after this list):

    gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${COMPUTE_REGION}

  3. Open and run the notebook interactively with the Python 3 kernel.
    The notebooks can be found under Local Disk/spark-dl-notebooks on the master node (folder icon on the top left > Local Disk).
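
    As a convenience, the JupyterLab link from step 2 can also be printed directly. This is a sketch that assumes the default cluster name from the startup script and that the Component Gateway port key is named JupyterLab; if the command prints nothing, inspect the httpPorts section of the full describe output for the exact key:

    # Default cluster name used by start_cluster.sh.
    export CLUSTER_NAME=${USER}-spark-dl-inference-${FRAMEWORK}

    # Print the JupyterLab URL (key name may vary, e.g. Jupyter vs. JupyterLab).
    gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${COMPUTE_REGION} \
        --format="value(config.endpointConfig.httpPorts.JupyterLab)"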