Note: fields in `<brackets>` require user input.
- Install the latest gcloud-cli and initialize with `gcloud init`.

- Configure the following settings:

  ```shell
  export PROJECT=<your_project>
  export DATAPROC_REGION=<your_dataproc_region>
  export COMPUTE_REGION=<your_compute_region>
  export COMPUTE_ZONE=<your_compute_zone>

  gcloud config set project ${PROJECT}
  gcloud config set dataproc/region ${DATAPROC_REGION}
  gcloud config set compute/region ${COMPUTE_REGION}
  gcloud config set compute/zone ${COMPUTE_ZONE}
  ```
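  As an optional sanity check, `gcloud config list` prints the active configuration so you can confirm the values above took effect:

  ```shell
  # Show the active gcloud configuration, including project, dataproc/region, compute/region, and compute/zone.
  gcloud config list
  ```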
- Create a GCS bucket if you don't already have one:

  ```shell
  export GCS_BUCKET=<your_gcs_bucket_name>

  gcloud storage buckets create gs://${GCS_BUCKET}
  ```
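  If you prefer the bucket to live in a specific region (for example, co-located with your compute region), `gcloud storage buckets create` also accepts a `--location` flag; the variant below assumes you want the bucket in `${COMPUTE_REGION}`:

  ```shell
  # Optional variant: create the bucket in a specific region instead of the default location.
  gcloud storage buckets create gs://${GCS_BUCKET} --location=${COMPUTE_REGION}
  ```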
- Specify the local path to the notebook(s) and copy them to the GCS bucket. For example, for a torch notebook:

  ```shell
  export SPARK_DL_HOME=${GCS_BUCKET}/spark-dl

  gcloud storage cp </path/to/notebook_name_torch.ipynb> gs://${SPARK_DL_HOME}/notebooks/
  ```

  Repeat this step for any notebooks you wish to run. All notebooks under `gs://${SPARK_DL_HOME}/notebooks/` will be copied to the master node during initialization.
- Copy the utils file to the GCS bucket:

  ```shell
  gcloud storage cp </path/to/pytriton_utils.py> gs://${SPARK_DL_HOME}/
  ```
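  To confirm the uploads, you can list the bucket paths used above:

  ```shell
  # Verify that the utils file and the notebooks landed in the expected GCS paths.
  gcloud storage ls gs://${SPARK_DL_HOME}/
  gcloud storage ls gs://${SPARK_DL_HOME}/notebooks/
  ```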
- Specify the framework to use (torch or tf), which determines which libraries to install on the cluster. For example:

  ```shell
  export FRAMEWORK=torch
  ```

  Run the cluster startup script. The script will also retrieve and use the spark-rapids initialization script to set up GPU resources:

  ```shell
  cd setup
  chmod +x start_cluster.sh
  ./start_cluster.sh
  ```

  By default, the script creates a 4-node GPU cluster named `${USER}-spark-dl-inference-${FRAMEWORK}`.
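  The `gcloud dataproc clusters describe` command in the next step references `${CLUSTER_NAME}`; if it isn't already set in your shell, you can export it to match the default name above (adjust it if you customized the cluster name):

  ```shell
  # Assumes the default naming used by start_cluster.sh, as described above.
  export CLUSTER_NAME=${USER}-spark-dl-inference-${FRAMEWORK}
  ```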
- Browse to the Jupyter web UI:
  - Go to `Dataproc` > `Clusters` > `(Cluster Name)` > `Web Interfaces` > `Jupyter/Lab`

  Or, get the link by running this command (listed under httpPorts > Jupyter/Lab):

  ```shell
  gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${COMPUTE_REGION}
  ```
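  To print only the component gateway URLs instead of the full cluster description, you can add a format filter. The `config.endpointConfig.httpPorts` field path below is an assumption based on the Dataproc cluster resource; check it against the full `describe` output if it comes back empty:

  ```shell
  # Print only the web interface URLs (Jupyter/Lab among them), assuming they live under config.endpointConfig.httpPorts.
  gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${COMPUTE_REGION} \
      --format="yaml(config.endpointConfig.httpPorts)"
  ```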
- Open and run the notebook interactively with the Python 3 kernel.
  The notebooks can be found under `Local Disk/spark-dl-notebooks` on the master node (folder icon on the top left > Local Disk).