
Spark DL Inference on Dataproc

Setup

Note: fields in <brackets> require user input.

Set up the gcloud CLI

  1. Install the latest gcloud CLI and initialize it with gcloud init.

  2. Configure the following settings:

    export PROJECT=<your_project>
    export DATAPROC_REGION=<your_dataproc_region>
    export COMPUTE_REGION=<your_compute_region>
    export COMPUTE_ZONE=<your_compute_zone>
    
    gcloud config set project ${PROJECT}
    gcloud config set dataproc/region ${DATAPROC_REGION}
    gcloud config set compute/region ${COMPUTE_REGION}
    gcloud config set compute/zone ${COMPUTE_ZONE}
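
    Optionally, confirm that the settings took effect by listing the active configuration; the output should show the project, Dataproc region, and compute region/zone set above:

    # Verify the active gcloud configuration.
    gcloud config list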

Copy files to GCS

  1. Create a GCS bucket if you don't already have one:

    export GCS_BUCKET=<your_gcs_bucket_name>
    
    gcloud storage buckets create gs://${GCS_BUCKET}

  2. Specify the local path to the notebook(s) and copy them to the GCS bucket. For example, for a torch notebook:

    export SPARK_DL_HOME=${GCS_BUCKET}/spark-dl
    
    gcloud storage cp </path/to/notebook_name_torch.ipynb> gs://${SPARK_DL_HOME}/notebooks/

    Repeat this step for any notebooks you wish to run, or copy them all at once as sketched after this list. All notebooks under gs://${SPARK_DL_HOME}/notebooks/ will be copied to the master node during initialization.

  3. Copy the utils file to the GCS bucket.

    gcloud storage cp </path/to/pytriton_utils.py> gs://${SPARK_DL_HOME}/
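
    As an optional shortcut, all notebooks can be copied in one command and the uploads verified afterwards. This is a minimal sketch that assumes the notebooks sit in a local notebooks/ directory; adjust the path to match your checkout:

    # Copy every notebook at once (adjust the local path as needed).
    gcloud storage cp ./notebooks/*.ipynb gs://${SPARK_DL_HOME}/notebooks/

    # Confirm that the notebooks and pytriton_utils.py are in place.
    gcloud storage ls --recursive gs://${SPARK_DL_HOME}/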

Start cluster and run

  1. Specify the framework to use (torch or tf), which determines which libraries are installed on the cluster. For example:

    export FRAMEWORK=torch

    Run the cluster startup script. The script will also retrieve and use the spark-rapids initialization script to set up GPU resources.

    cd setup
    chmod +x start_cluster.sh
    ./start_cluster.sh

    By default, the script creates a 4-node GPU cluster named ${USER}-spark-dl-inference-${FRAMEWORK}.

  2. Browse to the Jupyter web UI:

    • Go to Dataproc > Clusters > (Cluster Name) > Web Interfaces > Jupyter/Lab

    Or, get the link by running this command and looking under httpPorts > Jupyter/Lab in the output (see also the sketch after this list):

    gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${COMPUTE_REGION}

  3. Open and run the notebook interactively with the Python 3 kernel.
    The notebooks can be found under Local Disk/spark-dl-notebooks on the master node (folder icon on the top left > Local Disk).
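
    As a convenience, the JupyterLab link from step 2 can also be printed directly. This is a sketch that assumes the default cluster name from the startup script and that the Component Gateway port key is named JupyterLab; if the command prints nothing, inspect the httpPorts section of the full describe output for the exact key:

    # Default cluster name used by start_cluster.sh.
    export CLUSTER_NAME=${USER}-spark-dl-inference-${FRAMEWORK}

    # Print the JupyterLab URL (key name may vary, e.g. Jupyter vs. JupyterLab).
    gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${COMPUTE_REGION} \
        --format="value(config.endpointConfig.httpPorts.JupyterLab)"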