
Commit ae292de

Support for Spark DL notebooks with PyTriton on Databricks/Dataproc (#483)

### Support for running DL Inference notebooks on CSP environments

- Refactored Triton sections to use PyTriton, a Python API for the Triton inference server which avoids Docker. Once this PR is merged, Triton sections no longer need to be skipped in the CI pipeline @YanxuanLiu.
- Updated notebooks with instructions to run on Databricks/Dataproc.
- Updated Torch notebooks with best practices for ahead-of-time TensorRT compilation.
- Cleaned up README, removing instructions to start Jupyter with PySpark (we need a cell to attach to standalone for CI/CD anyway, so hoping to reduce confusion for users).

Notebook outputs are saved from running locally, but all notebooks were tested on Databricks/Dataproc.

Signed-off-by: Rishi Chandra <[email protected]>

1 parent fa7f490 commit ae292de

44 files changed: +11,045 -11,311 lines changed
@@ -1,11 +1,17 @@
-# Spark DL Inference Using External Frameworks
+# Deep Learning Inference on Spark
 
-Example notebooks for the [predict_batch_udf](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.ml.functions.predict_batch_udf.html#pyspark.ml.functions.predict_batch_udf) function introduced in Spark 3.4.
+Example notebooks demonstrating **distributed deep learning inference** using the [predict_batch_udf](https://developer.nvidia.com/blog/distributed-deep-learning-made-easy-with-spark-3-4/) introduced in Spark 3.4.0.
+These notebooks also demonstrate integration with [Triton Inference Server](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html), an open-source, GPU-accelerated serving solution for DL.
 
-## Overview
+## Contents:
+- [Overview](#overview)
+- [Running Locally](#running-locally)
+- [Running on Cloud](#running-on-cloud-environments)
+- [Integration with Triton Inference Server](#inference-with-triton)
 
-This directory contains notebooks for each DL framework (based on their own published examples). The goal is to demonstrate how models trained and saved on single-node machines can be easily used for parallel inferencing on Spark clusters.
+## Overview
 
+These notebooks demonstrate how models from external frameworks (Torch, Huggingface, Tensorflow) trained on single-worker machines can be used for large-scale distributed inference on Spark clusters.
 For example, a basic model trained in TensorFlow and saved on disk as "mnist_model" can be used in Spark as follows:
 ```
 import numpy as np
@@ -28,35 +34,33 @@ df = spark.read.parquet("mnist_data")
 predictions = df.withColumn("preds", mnist("data")).collect()
 ```
 
-In this simple case, the `predict_batch_fn` will use TensorFlow APIs to load the model and return a simple `predict` function which operates on numpy arrays. The `predict_batch_udf` will automatically convert the Spark DataFrame columns to the expected numpy inputs.
+In this simple case, the `predict_batch_fn` will use TensorFlow APIs to load the model and return a simple `predict` function. The `predict_batch_udf` will handle the data conversion from Spark DataFrame columns into batched numpy inputs.
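Since the middle of this example is folded away as unchanged context in the diff above, here is a complete version for reference — reconstructed from the `predict_batch_udf` documentation, so the README's exact body may differ slightly:

```python
import numpy as np
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import ArrayType, FloatType

def predict_batch_fn():
    import tensorflow as tf
    # Load the saved model once per Python worker; the returned closure
    # is then reused across batches.
    model = tf.keras.models.load_model("mnist_model")

    def predict(inputs: np.ndarray) -> np.ndarray:
        return model.predict(inputs)

    return predict

mnist = predict_batch_udf(predict_batch_fn,
                          return_type=ArrayType(FloatType()),
                          batch_size=1024,
                          input_tensor_shapes=[[784]])

df = spark.read.parquet("mnist_data")
predictions = df.withColumn("preds", mnist("data")).collect()
```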
+
 
-All notebooks have been saved with sample outputs for quick browsing.
-Here is a full list of the notebooks with their published example links:
+#### Notebook List
 
-| | Category | Notebook Name | Description | Link
+Below is a full list of the notebooks with links to the examples they are based on. All notebooks have been saved with sample outputs for quick browsing.
+
+| | Framework | Notebook Name | Description | Link
 | ------------- | ------------- | ------------- | ------------- | -------------
 | 1 | PyTorch | Image Classification | Training a model to predict clothing categories in FashionMNIST, including accelerated inference with Torch-TensorRT. | [Link](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
-| 2 | PyTorch | Regression | Training a model to predict housing prices in the California Housing Dataset, including accelerated inference with Torch-TensorRT. | [Link](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-create-a-neural-network-for-regression-with-pytorch.md)
+| 2 | PyTorch | Housing Regression | Training a model to predict housing prices in the California Housing Dataset, including accelerated inference with Torch-TensorRT. | [Link](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-create-a-neural-network-for-regression-with-pytorch.md)
 | 3 | Tensorflow | Image Classification | Training a model to predict hand-written digits in MNIST. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb)
-| 4 | Tensorflow | Feature Columns | Training a model with preprocessing layers to predict likelihood of pet adoption in the PetFinder mini dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/structured_data/preprocessing_layers.ipynb)
-| 5 | Tensorflow | Keras Metadata | Training ResNet-50 to perform flower recognition on Databricks. | [Link](https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/keras-metadata.html)
+| 4 | Tensorflow | Keras Preprocessing | Training a model with preprocessing layers to predict likelihood of pet adoption in the PetFinder mini dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/structured_data/preprocessing_layers.ipynb)
+| 5 | Tensorflow | Keras Resnet50 | Training ResNet-50 to perform flower recognition from flower images. | [Link](https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/keras-metadata.html)
 | 6 | Tensorflow | Text Classification | Training a model to perform sentiment analysis on the IMDB dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification.ipynb)
-| 7+8 | HuggingFace | Conditional Generation | Sentence translation using the T5 text-to-text transformer, with notebooks demoing both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/model_doc/t5#t5)
-| 9+10 | HuggingFace | Pipelines | Sentiment analysis using Huggingface pipelines, with notebooks demoing both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/quicktour#pipeline-usage)
-| 11 | HuggingFace | Sentence Transformers | Sentence embeddings using the SentenceTransformers framework in Torch. | [Link](https://huggingface.co/sentence-transformers)
+| 7+8 | HuggingFace | Conditional Generation | Sentence translation using the T5 text-to-text transformer for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/model_doc/t5#t5)
+| 9+10 | HuggingFace | Pipelines | Sentiment analysis using Huggingface pipelines for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/quicktour#pipeline-usage)
+| 11 | HuggingFace | Sentence Transformers | Sentence embeddings using SentenceTransformers in Torch. | [Link](https://huggingface.co/sentence-transformers)
 
-## Running the Notebooks
+## Running Locally
 
-If you want to run the notebooks yourself, please follow these instructions.
-
-**Notes**:
-- The notebooks require a GPU environment for the executors.
-- Please create separate environments for PyTorch and Tensorflow examples as specified below. This will avoid conflicts between the CUDA libraries bundled with their respective versions. The Huggingface examples will have a `_torch` or `_tf` suffix to specify the environment used.
-- The PyTorch notebooks include model compilation and accelerated inference with TensorRT. While not included in the notebooks, Tensorflow also supports [integration with TensorRT](https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html), but may require downgrading the TF version.
-- For demonstration purposes, these examples just use a local Spark Standalone cluster with a single executor, but you should be able to run them on any distributed Spark cluster.
+To run the notebooks locally, please follow these instructions:
 
 #### Create environment
 
+Each notebook has a suffix `_torch` or `_tf` specifying the environment used.
+
 **For PyTorch:**
 ```
 conda create -n spark-dl-torch python=3.11
@@ -70,36 +74,57 @@ conda activate spark-dl-tf
 pip install -r tf_requirements.txt
 ```
 
-#### Launch Jupyter + Spark
+#### Start Cluster
+
+For demonstration, these instructions just use a local Standalone cluster with a single executor, but they can be run on any distributed Spark cluster. For cloud environments, see [below](#running-on-cloud-environments).
 
+```shell
+# Replace with your Spark installation path
+export SPARK_HOME=</path/to/spark>
 ```
-# setup environment variables
-export SPARK_HOME=/path/to/spark
+
+```shell
+# Configure and start cluster
 export MASTER=spark://$(hostname):7077
 export SPARK_WORKER_INSTANCES=1
 export CORES_PER_WORKER=8
-export PYSPARK_DRIVER_PYTHON=jupyter
-export PYSPARK_DRIVER_PYTHON_OPTS='lab'
-
-# start spark standalone cluster
+export SPARK_WORKER_OPTS="-Dspark.worker.resource.gpu.amount=1 -Dspark.worker.resource.gpu.discoveryScript=$SPARK_HOME/examples/src/main/scripts/getGpusResources.sh"
 ${SPARK_HOME}/sbin/start-master.sh; ${SPARK_HOME}/sbin/start-worker.sh -c ${CORES_PER_WORKER} -m 16G ${MASTER}
+```
 
-# start jupyter with pyspark
-${SPARK_HOME}/bin/pyspark --master ${MASTER} \
---driver-memory 8G \
---executor-memory 8G \
---conf spark.python.worker.reuse=True
+The notebooks are ready to run! Each notebook has a cell to connect to the standalone cluster and create a SparkSession.
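That connection cell isn't shown in this diff. A minimal sketch of what such a cell might contain — assuming the `MASTER` variable from the block above; the actual notebooks may differ — is:

```python
import os
from pyspark.sql import SparkSession

# Attach to the standalone cluster started above and create a session.
spark = (
    SparkSession.builder
    .master(os.environ.get("MASTER", "spark://localhost:7077"))
    .appName("spark-dl-inference")
    .config("spark.python.worker.reuse", "true")
    .getOrCreate()
)
```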
 
-# BROWSE to localhost:8888 to view/run notebooks
+**Notes**:
+- Please create separate environments for PyTorch and Tensorflow notebooks as specified above. This will avoid conflicts between the CUDA libraries bundled with their respective versions.
+- The requirements files install `pyspark>=3.4.0`. Make sure the installed PySpark version is compatible with your system's Spark installation.
+- The notebooks require a GPU environment for the executors.
+- The PyTorch notebooks include model compilation and accelerated inference with TensorRT. While not included in the notebooks, Tensorflow also supports [integration with TensorRT](https://docs.nvidia.com/deeplearning/frameworks/tf-trt-user-guide/index.html), though as of this writing it is not supported for TF==2.17.0.
 
-# stop spark standalone cluster
-${SPARK_HOME}/sbin/stop-worker.sh; ${SPARK_HOME}/sbin/stop-master.sh
+**Troubleshooting:**
+If you encounter issues starting the Triton server, you may need to link your libstdc++ file to the conda environment, e.g.:
+```shell
+ln -sf /usr/lib/x86_64-linux-gnu/libstdc++.so.6 ${CONDA_PREFIX}/lib/libstdc++.so.6
 ```
 
-## Triton Inference Server
+## Running on Cloud Environments
+
+We also provide instructions to run the notebooks on CSP Spark environments.
+See the instructions for [Databricks](databricks/README.md) and [GCP Dataproc](dataproc/README.md).
+
+## Inference with Triton
+
+The notebooks also demonstrate integration with the [Triton Inference Server](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html), an open-source serving platform for deep learning models, which includes many [features and performance optimizations](https://docs.nvidia.com/deeplearning/triton-inference-server/user-guide/docs/index.html#triton-major-features) to streamline inference.
+The notebooks use [PyTriton](https://github.com/triton-inference-server/pytriton), a Flask-like Python framework that handles communication with the Triton server.
+
+<img src="images/spark-pytriton.png" alt="drawing" width="1000"/>
 
-The example notebooks also demonstrate integration with [Triton Inference Server](https://developer.nvidia.com/nvidia-triton-inference-server), an open-source, GPU-accelerated serving solution for DL.
+The diagram above shows how Spark distributes inference tasks to run on the Triton Inference Server, with PyTriton handling request/response communication with the server.
 
-**Note**: Some examples may require special configuration of server as highlighted in the notebooks.
+The process looks like this (a rough sketch follows this file's diff):
+- Distribute a PyTriton task across the Spark cluster, instructing each worker to launch a Triton server process.
+- Use stage-level scheduling to ensure there is a 1:1 mapping between worker nodes and servers.
+- Define a Triton inference function, which contains a client that binds to the local server on a given worker and sends inference requests.
+- Wrap the Triton inference function in a predict_batch_udf to launch parallel inference requests using Spark.
+- Finally, distribute a shutdown signal to terminate the Triton server processes on each worker.
 
-**Note**: for demonstration purposes, the Triton Inference Server integrations just launch the server in a docker container on the local host, so you will need to [install docker](https://docs.docker.com/engine/install/) on your local host. Most real-world deployments will likely be hosted on remote machines.
+For more information on how PyTriton works, see the [PyTriton docs](https://triton-inference-server.github.io/pytriton/latest/high_level_design/).
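As a rough sketch of the server/client pattern those bullets describe — loosely based on the public PyTriton API, reusing the earlier `mnist_model` for illustration, and omitting the startup/shutdown orchestration and stage-level scheduling that the repo's `pytriton_utils.py` handles:

```python
import numpy as np
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import ArrayType, FloatType

def serve_model():
    # Runs on each worker: bind a model to a worker-local Triton server.
    import tensorflow as tf
    from pytriton.decorators import batch
    from pytriton.model_config import ModelConfig, Tensor
    from pytriton.triton import Triton

    model = tf.keras.models.load_model("mnist_model")

    @batch
    def infer_fn(data):
        # PyTriton delivers batched numpy arrays keyed by input name.
        return {"preds": model.predict(data)}

    triton = Triton()
    triton.bind(
        model_name="mnist",
        infer_func=infer_fn,
        inputs=[Tensor(name="data", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="preds", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=64),
    )
    triton.run()  # non-blocking; call triton.stop() on the shutdown signal

def triton_fn():
    # The Triton inference function: a client bound to the local server.
    from pytriton.client import ModelClient
    client = ModelClient("localhost", "mnist")

    def predict(inputs: np.ndarray) -> np.ndarray:
        return client.infer_batch(inputs)["preds"]

    return predict

mnist = predict_batch_udf(triton_fn,
                          return_type=ArrayType(FloatType()),
                          batch_size=1024,
                          input_tensor_shapes=[[784]])
# df.withColumn("preds", mnist("data")) now routes batches through Triton.
```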
@@ -0,0 +1,55 @@
+# Spark DL Inference on Databricks
+
+**Note**: fields in \<brackets\> require user inputs.
+
+## Setup
+
+1. Install the latest [databricks-cli](https://docs.databricks.com/en/dev-tools/cli/tutorial.html) and configure it for your workspace.
+
+2. Specify the path to your Databricks workspace:
+```shell
+export WS_PATH=</Users/[email protected]>
+
+export NOTEBOOK_DEST=${WS_PATH}/spark-dl/notebook_torch.ipynb
+export UTILS_DEST=${WS_PATH}/spark-dl/pytriton_utils.py
+export INIT_DEST=${WS_PATH}/spark-dl/init_spark_dl.sh
+```
+3. Specify the local paths to the notebook you wish to run, the utils file, and the init script.
+As an example for a PyTorch notebook:
+```shell
+export NOTEBOOK_SRC=</path/to/notebook_torch.ipynb>
+export UTILS_SRC=</path/to/pytriton_utils.py>
+export INIT_SRC=$(pwd)/setup/init_spark_dl.sh
+```
+4. Set the framework to `torch` or `tf`, corresponding to the notebook you wish to run. Continuing with the PyTorch example:
+```shell
+export FRAMEWORK=torch
+```
+This will tell the init script which libraries to install on the cluster.
+
+5. Copy the files to the Databricks workspace:
+```shell
+databricks workspace import $NOTEBOOK_DEST --format JUPYTER --file $NOTEBOOK_SRC
+databricks workspace import $UTILS_DEST --format AUTO --file $UTILS_SRC
+databricks workspace import $INIT_DEST --format AUTO --file $INIT_SRC
+```
+
+6. Launch the cluster with the provided script (note that the script specifies **Azure instances** by default; change as needed):
+```shell
+cd setup
+chmod +x start_cluster.sh
+./start_cluster.sh
+```
+
+OR, start the cluster from the Databricks UI:
+
+- Go to `Compute > Create compute` and set the desired cluster settings.
+- Integration with Triton inference server uses stage-level scheduling (Spark>=3.4.0). Make sure to:
+    - use a cluster with GPU resources,
+    - set a value for `spark.executor.cores`, and
+    - ensure that `spark.executor.resource.gpu.amount` = 1 (see the sketch after this file for an illustration).
+- Under `Advanced Options > Init Scripts`, upload the init script from your workspace.
+- Under environment variables, set `FRAMEWORK=torch` or `FRAMEWORK=tf` based on the notebook used.
+- For Tensorflow notebooks, we recommend setting the environment variable `TF_GPU_ALLOCATOR=cuda_malloc_async` (especially for Huggingface LLM models), which enables the CUDA driver to implicitly release unused memory from the pool.
+
+7. Navigate to the notebook in your workspace and attach it to the cluster. The default cluster name is `spark-dl-inference-$FRAMEWORK`.
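The stage-level scheduling requirement above (one Triton server per worker) can be sketched as follows. This is a hedged illustration only: `start_triton` is a hypothetical server-startup helper, and the amounts mirror the `start_cluster.sh` config below rather than code shown in this commit.

```python
from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

# Request all executor cores and the whole GPU for each startup task, so
# exactly one task (and hence one Triton server) lands on each worker.
task_reqs = TaskResourceRequests().cpus(8).resource("gpu", 1.0)
profile = ResourceProfileBuilder().require(task_reqs).build

num_workers = 4  # match the cluster size
(sc.parallelize(range(num_workers), num_workers)
   .withResources(profile)
   .mapPartitions(start_triton)  # hypothetical: starts a server, yields a status
   .collect())
```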
@@ -0,0 +1,36 @@
+#!/bin/bash
+# Copyright (c) 2025, NVIDIA CORPORATION.
+
+set -euxo pipefail
+
+# install requirements
+sudo /databricks/python3/bin/pip3 install --upgrade pip
+
+if [[ "${FRAMEWORK}" == "torch" ]]; then
+    cat <<EOF > temp_requirements.txt
+datasets==3.*
+transformers
+urllib3<2
+nvidia-pytriton
+torch
+torchvision --extra-index-url https://download.pytorch.org/whl/cu121
+torch-tensorrt
+tensorrt --extra-index-url https://download.pytorch.org/whl/cu121
+sentence_transformers
+sentencepiece
+nvidia-modelopt[all] --extra-index-url https://pypi.nvidia.com
+EOF
+elif [[ "${FRAMEWORK}" == "tf" ]]; then
+    cat <<EOF > temp_requirements.txt
+datasets==3.*
+transformers
+urllib3<2
+nvidia-pytriton
+EOF
+else
+    echo "Please export FRAMEWORK as torch or tf per README"
+    exit 1
+fi
+
+sudo /databricks/python3/bin/pip3 install --upgrade --force-reinstall -r temp_requirements.txt
+rm temp_requirements.txt
@@ -0,0 +1,49 @@
+#!/bin/bash
+# Copyright (c) 2025, NVIDIA CORPORATION.
+
+set -eo pipefail
+
+# configure arguments
+if [[ -z ${INIT_DEST} ]]; then
+    echo "Please make sure INIT_DEST is exported per README.md"
+    exit 1
+fi
+
+if [[ -z ${FRAMEWORK} ]]; then
+    echo "Please make sure FRAMEWORK is exported to torch or tf per README.md"
+    exit 1
+fi
+
+json_config=$(cat <<EOF
+{
+    "cluster_name": "spark-dl-inference-${FRAMEWORK}",
+    "spark_version": "15.4.x-gpu-ml-scala2.12",
+    "spark_conf": {
+        "spark.executor.resource.gpu.amount": "1",
+        "spark.python.worker.reuse": "true",
+        "spark.task.resource.gpu.amount": "0.125",
+        "spark.sql.execution.arrow.pyspark.enabled": "true",
+        "spark.executor.cores": "8"
+    },
+    "node_type_id": "Standard_NC8as_T4_v3",
+    "driver_node_type_id": "Standard_NC8as_T4_v3",
+    "spark_env_vars": {
+        "TF_GPU_ALLOCATOR": "cuda_malloc_async",
+        "FRAMEWORK": "${FRAMEWORK}"
+    },
+    "autotermination_minutes": 60,
+    "enable_elastic_disk": true,
+    "init_scripts": [
+        {
+            "workspace": {
+                "destination": "${INIT_DEST}"
+            }
+        }
+    ],
+    "runtime_engine": "STANDARD",
+    "num_workers": 4
+}
EOF
+)
+
+databricks clusters create --json "$json_config"
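For reference, a typical invocation of this script — assuming `INIT_DEST` and `FRAMEWORK` were exported as in the README steps above, since the script exits otherwise — would be:

```shell
export INIT_DEST=${WS_PATH}/spark-dl/init_spark_dl.sh
export FRAMEWORK=torch   # or tf
./start_cluster.sh
```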
@@ -0,0 +1,70 @@
+# Spark DL Inference on Dataproc
+
+## Setup
+
+**Note**: fields in \<brackets\> require user inputs.
+
+#### Setup GCloud CLI
+
+1. Install the latest [gcloud-cli](https://cloud.google.com/sdk/docs/install) and initialize with `gcloud init`.
+
+2. Configure the following settings:
+```shell
+export PROJECT=<your_project>
+export DATAPROC_REGION=<your_dataproc_region>
+export COMPUTE_REGION=<your_compute_region>
+export COMPUTE_ZONE=<your_compute_zone>
+
+gcloud config set project ${PROJECT}
+gcloud config set dataproc/region ${DATAPROC_REGION}
+gcloud config set compute/region ${COMPUTE_REGION}
+gcloud config set compute/zone ${COMPUTE_ZONE}
+```
+
+#### Copy files to GCS
+
+3. Create a GCS bucket if you don't already have one:
+```shell
+export GCS_BUCKET=<your_gcs_bucket_name>
+
+gcloud storage buckets create gs://${GCS_BUCKET}
+```
+
+4. Specify the local path to the notebook(s) and copy them to the GCS bucket.
+As an example for a torch notebook:
+```shell
+export SPARK_DL_HOME=${GCS_BUCKET}/spark-dl
+
+gcloud storage cp </path/to/notebook_name_torch.ipynb> gs://${SPARK_DL_HOME}/notebooks/
+```
+Repeat this step for any notebooks you wish to run. All notebooks under `gs://${SPARK_DL_HOME}/notebooks/` will be copied to the master node during initialization.
+
+5. Copy the utils file to the GCS bucket.
+```shell
+gcloud storage cp </path/to/pytriton_utils.py> gs://${SPARK_DL_HOME}/
+```
+
+#### Start cluster and run
+
+6. Specify the framework to use (torch or tf), which will determine what libraries to install on the cluster. For example:
+```shell
+export FRAMEWORK=torch
+```
+Run the cluster startup script. The script will also retrieve and use the [spark-rapids initialization script](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh) to set up GPU resources (a hedged sketch of the kind of command this script wraps follows this file).
+```shell
+cd setup
+chmod +x start_cluster.sh
+./start_cluster.sh
+```
+By default, the script creates a 4-node GPU cluster named `${USER}-spark-dl-inference-${FRAMEWORK}`.
+
+7. Browse to the Jupyter web UI:
+- Go to `Dataproc` > `Clusters` > `(Cluster Name)` > `Web Interfaces` > `Jupyter/Lab`
+
+Or, get the link by running this command (under httpPorts > Jupyter/Lab):
+```shell
+gcloud dataproc clusters describe ${CLUSTER_NAME} --region=${COMPUTE_REGION}
+```
+
+8. Open and run the notebook interactively with the **Python 3 kernel**.
+The notebooks can be found under `Local Disk/spark-dl-notebooks` on the master node (folder icon on the top left > Local Disk).
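The Dataproc `start_cluster.sh` itself is not shown in this commit view. A hedged sketch of the kind of command it may wrap — every flag and value here is illustrative, except the spark-rapids initialization action, which the README names explicitly — might look like:

```shell
# Illustrative only: the actual script in setup/start_cluster.sh may differ.
gcloud dataproc clusters create ${USER}-spark-dl-inference-${FRAMEWORK} \
    --region=${DATAPROC_REGION} \
    --num-workers=4 \
    --worker-accelerator=type=nvidia-tesla-t4,count=1 \
    --initialization-actions=gs://goog-dataproc-initialization-actions-${DATAPROC_REGION}/spark-rapids/spark-rapids.sh \
    --optional-components=JUPYTER \
    --enable-component-gateway \
    --bucket=${GCS_BUCKET}
```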
