Note: fields in <brackets> require user inputs.
Install the latest gcloud-cli and initialize it with:
gcloud init
Configure the following settings:
export PROJECT=<your_project>
export DATAPROC_REGION=<your_dataproc_region>
export COMPUTE_REGION=<your_compute_region>
export COMPUTE_ZONE=<your_compute_zone>
gcloud config set project ${PROJECT}
gcloud config set dataproc/region ${DATAPROC_REGION}
gcloud config set compute/region ${COMPUTE_REGION}
gcloud config set compute/zone ${COMPUTE_ZONE}
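Because every later step depends on these variables, it may help to confirm they hold real values before running the gcloud config commands. The following is a minimal sketch, not part of the setup scripts: `check_vars` is a hypothetical helper name, and the check for a leftover `<placeholder>` value is an assumption about how the exports might be mistyped.

```shell
# Hypothetical helper (not part of the setup scripts): returns non-zero if any
# named variable is unset, empty, or still holds a <placeholder> value.
check_vars() {
    status=0
    for name in "$@"; do
        # Look up the value of the variable whose name is stored in $name.
        eval "value=\${${name}:-}"
        case "${value}" in
            ""|"<"*">")
                echo "ERROR: ${name} is not set to a real value" >&2
                status=1
                ;;
        esac
    done
    return "${status}"
}

# Example usage, after the exports:
# check_vars PROJECT DATAPROC_REGION COMPUTE_REGION COMPUTE_ZONE || exit 1
```

The function reports every variable that fails the check rather than stopping at the first one, so all missing settings can be fixed in one pass.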
Create a GCS bucket if you don't already have one:
export GCS_BUCKET=<your_gcs_bucket_name>
gcloud storage buckets create gs://${GCS_BUCKET}
Specify the local path to the notebook(s) and copy to the GCS bucket. As an example for a torch notebook:
export SPARK_DL_HOME=${GCS_BUCKET}/spark-dl
gcloud storage cp </path/to/notebook_name_torch.ipynb> gs://${SPARK_DL_HOME}/notebooks/
Repeat this step for any notebooks you wish to run. All notebooks under
will be copied to the master node during initialization.
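When several notebooks need to be uploaded, the copy step can be wrapped in a small loop. This is a sketch under stated assumptions: `upload_notebooks` is a hypothetical helper, and `DRY_RUN` is an illustrative switch defined here (not a gcloud feature) that prints the commands instead of running them. It assumes `SPARK_DL_HOME` has been exported as in the copy step above.

```shell
# Hypothetical helper: copy each notebook path given as an argument to the
# notebooks folder in the bucket. With DRY_RUN=1 (an illustrative switch, not
# a gcloud feature) the commands are printed instead of executed.
upload_notebooks() {
    dest="gs://${SPARK_DL_HOME}/notebooks/"
    for nb in "$@"; do
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "gcloud storage cp ${nb} ${dest}"
        else
            # Stop at the first failed copy so the error is not overlooked.
            gcloud storage cp "${nb}" "${dest}" || return 1
        fi
    done
}

# Example usage (substitute real notebook paths):
# DRY_RUN=1 upload_notebooks first_notebook.ipynb second_notebook.ipynb
```

Running once with `DRY_RUN=1` is a cheap way to confirm the destination path before anything is written to the bucket.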
Copy the utils file to the GCS bucket.
gcloud storage cp </path/to/> gs://${SPARK_DL_HOME}/
Specify the framework to use (torch or tf), which will determine what libraries to install on the cluster. For example:
export FRAMEWORK=torch
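Since only two framework values are described, a quick check before starting the cluster can catch a mistyped value early. The snippet below is a minimal sketch: `validate_framework` is a hypothetical helper, and the startup script may already perform its own validation.

```shell
# Hypothetical check: accept only the two framework values named above.
validate_framework() {
    case "${1:-}" in
        torch|tf)
            return 0
            ;;
        *)
            echo "ERROR: FRAMEWORK must be 'torch' or 'tf', got '${1:-}'" >&2
            return 1
            ;;
    esac
}

# Example usage:
# validate_framework "${FRAMEWORK}" || exit 1
```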
Run the cluster startup script. The script also retrieves and uses the spark-rapids initialization script to set up GPU resources.
cd setup
chmod +x ./
By default, the script creates a 4-node GPU cluster named
.
Browse to the Jupyter web UI:
- Go to (Cluster Name) > Web Interfaces
Or, get the link by running this command (under httpPorts > Jupyter/Lab):
gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${COMPUTE_REGION}
Open and run the notebook interactively with the Python 3 kernel.
The notebooks can be found under Local Disk/spark-dl-notebooks on the master node (folder icon on the top left > Local Disk).