Commit 1bc43fb
Add LLM batch inference examples (#493)
Add deepseek-r1 and gemma-7b LLM batch inference notebooks. Updated CSP instructions since these notebooks require >20GB GPU RAM (A10/L4).

Signed-off-by: Rishi Chandra <[email protected]>
1 parent e522404 commit 1bc43fb

9 files changed: +1811 −61 lines changed

examples/ML+DL-Examples/Spark-DL/dl_inference/README.md (+14 −12)

````diff
@@ -43,15 +43,18 @@ Below is a full list of the notebooks with links to the examples they are based
 | | Framework | Notebook Name | Description | Link
 | ------------- | ------------- | ------------- | ------------- | -------------
-| 1 | PyTorch | Image Classification | Training a model to predict clothing categories in FashionMNIST, including accelerated inference with Torch-TensorRT. | [Link](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
-| 2 | PyTorch | Housing Regression | Training a model to predict housing prices in the California Housing Dataset, including accelerated inference with Torch-TensorRT. | [Link](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-create-a-neural-network-for-regression-with-pytorch.md)
-| 3 | Tensorflow | Image Classification | Training a model to predict hand-written digits in MNIST. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb)
-| 4 | Tensorflow | Keras Preprocessing | Training a model with preprocessing layers to predict likelihood of pet adoption in the PetFinder mini dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/structured_data/preprocessing_layers.ipynb)
-| 5 | Tensorflow | Keras Resnet50 | Training ResNet-50 to perform flower recognition from flower images. | [Link](https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/keras-metadata.html)
-| 6 | Tensorflow | Text Classification | Training a model to perform sentiment analysis on the IMDB dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification.ipynb)
-| 7+8 | HuggingFace | Conditional Generation | Sentence translation using the T5 text-to-text transformer for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/model_doc/t5#t5)
-| 9+10 | HuggingFace | Pipelines | Sentiment analysis using Huggingface pipelines for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/quicktour#pipeline-usage)
-| 11 | HuggingFace | Sentence Transformers | Sentence embeddings using SentenceTransformers in Torch. | [Link](https://huggingface.co/sentence-transformers)
+| 1 | HuggingFace | DeepSeek-R1 | LLM batch inference using the DeepSeek-R1-Distill-Llama reasoning model. | [Link](https://huggingface.co/deepseek-ai/DeepSeek-R1)
+| 2 | HuggingFace | Gemma-7b | LLM batch inference using the lightweight Google Gemma-7b model. | [Link](https://huggingface.co/google/gemma-7b-it)
+| 3 | HuggingFace | Sentence Transformers | Sentence embeddings using SentenceTransformers in Torch. | [Link](https://huggingface.co/sentence-transformers)
+| 4+5 | HuggingFace | Conditional Generation | Sentence translation using the T5 text-to-text transformer for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/model_doc/t5#t5)
+| 6+7 | HuggingFace | Pipelines | Sentiment analysis using Huggingface pipelines for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/quicktour#pipeline-usage)
+| 8 | PyTorch | Image Classification | Training a model to predict clothing categories in FashionMNIST, and deploying with Torch-TensorRT accelerated inference. | [Link](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
+| 9 | PyTorch | Housing Regression | Training a model to predict housing prices in the California Housing Dataset, and deploying with Torch-TensorRT accelerated inference. | [Link](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-create-a-neural-network-for-regression-with-pytorch.md)
+| 10 | Tensorflow | Image Classification | Training and deploying a model to predict hand-written digits in MNIST. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb)
+| 11 | Tensorflow | Keras Preprocessing | Training and deploying a model with preprocessing layers to predict likelihood of pet adoption in the PetFinder mini dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/structured_data/preprocessing_layers.ipynb)
+| 12 | Tensorflow | Keras Resnet50 | Deploying ResNet-50 to perform flower recognition from flower images. | [Link](https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/keras-metadata.html)
+| 13 | Tensorflow | Text Classification | Training and deploying a model to perform sentiment analysis on the IMDB dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification.ipynb)
 
 ## Running Locally
@@ -130,9 +133,8 @@ The notebooks use [PyTriton](https://github.com/triton-inference-server/pytriton)
 The diagram above shows how Spark distributes inference tasks to run on the Triton Inference Server, with PyTriton handling request/response communication with the server.
 
 The process looks like this:
-- Distribute a PyTriton task across the Spark cluster, instructing each worker to launch a Triton server process.
-- Use stage-level scheduling to ensure there is a 1:1 mapping between worker nodes and servers.
-- Define a Triton inference function, which contains a client that binds to the local server on a given worker and sends inference requests.
+- Prior to inference, launch a Triton server process on each node.
+- Define a Triton predict function, which creates a client that binds to the local server and sends/receives inference requests.
 - Wrap the Triton inference function in a predict_batch_udf to launch parallel inference requests using Spark.
 - Finally, distribute a shutdown signal to terminate the Triton server processes on each worker.
````
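To make the steps above concrete, here is a minimal, hypothetical sketch of the predict-function pattern (not the notebooks' exact code). It assumes a Triton server is already running locally on each worker, serving a model named `my_model` with a single output named `output__0`; both names are illustrative:

```python
# Hypothetical sketch of the Triton predict-function pattern above;
# the model name "my_model" and output name "output__0" are assumptions.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([0.0] * 16,), ([1.0] * 16,)], "features: array<float>")

def make_triton_fn():
    # Runs once per Python worker: bind a PyTriton client to the local server.
    from pytriton.client import ModelClient
    client = ModelClient("http://localhost:8000", "my_model")

    def predict(inputs: np.ndarray) -> np.ndarray:
        # Send one batch to the local Triton server and return its output tensor.
        result = client.infer_batch(inputs.astype(np.float32))
        return result["output__0"]

    return predict

# predict_batch_udf batches rows into numpy arrays and issues requests in parallel.
triton_udf = predict_batch_udf(
    make_triton_fn,
    return_type=ArrayType(FloatType()),
    batch_size=32,
)
preds = df.withColumn("preds", triton_udf("features"))
```

The shutdown step then mirrors this pattern: a final Spark job signals each node's server process to exit.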

examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/README.md (+8 −5)

````diff
@@ -34,22 +34,25 @@
    databricks workspace import $INIT_DEST --format AUTO --file $INIT_SRC
    ```
 
-6. Launch the cluster with the provided script (note that the script specifies **Azure instances** by default; change as needed):
+6. Launch the cluster with the provided script. By default, the script creates a cluster with 4 A10 worker nodes and 1 A10 driver node. (Note that the script uses **Azure instances** by default; change as needed.)
    ```shell
    cd setup
    chmod +x start_cluster.sh
    ./start_cluster.sh
    ```
-
    OR, start the cluster from the Databricks UI:
 
    - Go to `Compute > Create compute` and set the desired cluster settings.
    - Integration with Triton inference server uses stage-level scheduling (Spark>=3.4.0). Make sure to:
-       - use a cluster with GPU resources
+       - use a cluster with GPU resources (for LLM examples, make sure the selected GPUs have sufficient RAM)
       - set a value for `spark.executor.cores`
       - ensure that `spark.executor.resource.gpu.amount` = 1
    - Under `Advanced Options > Init Scripts`, upload the init script from your workspace.
-   - Under environment variables, set `FRAMEWORK=torch` or `FRAMEWORK=tf` based on the notebook used.
-   - For Tensorflow notebooks, we recommend setting the environment variable `TF_GPU_ALLOCATOR=cuda_malloc_async` (especially for Huggingface LLM models), which enables the CUDA driver to implicitly release unused memory from the pool.
+   - Under environment variables, set:
+       - `FRAMEWORK=torch` or `FRAMEWORK=tf` based on the notebook used.
+       - `HF_HOME=/dbfs/FileStore/hf_home` to cache Huggingface models in DBFS.
+       - `TF_GPU_ALLOCATOR=cuda_malloc_async` to implicitly release unused GPU memory in Tensorflow notebooks.
 
 7. Navigate to the notebook in your workspace and attach it to the cluster. The default cluster name is `spark-dl-inference-$FRAMEWORK`.
````
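On the stage-level scheduling requirement above: the notebooks use it to run exactly one server-startup task per GPU node. A hedged sketch of the idea (Spark >= 3.4); `start_triton_server` is a placeholder, and the core/GPU amounts mirror the cluster config in the script below:

```python
# Hedged sketch: stage-level scheduling forces a 1:1 mapping between
# startup tasks and GPU worker nodes, so each node launches one Triton
# server. start_triton_server is a placeholder for the real startup logic.
from pyspark.sql import SparkSession
from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def start_triton_server(_):
    # Placeholder: launch and health-check the local Triton server here.
    yield True

num_workers = 4  # matches the default cluster size above
# Each task claims the executor's full GPU (and all 6 of its cores), so
# only one of these tasks can be scheduled per node.
task_reqs = TaskResourceRequests().cpus(6).resource("gpu", 1.0)
profile = ResourceProfileBuilder().require(task_reqs).build

sc.parallelize(range(num_workers), num_workers) \
    .withResources(profile) \
    .mapPartitions(start_triton_server) \
    .collect()
```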

examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/setup/start_cluster.sh (+7 −4)

````diff
@@ -14,19 +14,22 @@ if [[ -z ${FRAMEWORK} ]]; then
     exit 1
 fi
 
+# Modify the node_type_id and driver_node_type_id below if you don't have this specific instance type.
+# Modify executor.cores=(cores per node) and task.resource.gpu.amount=(1/executor cores) accordingly.
+# We recommend selecting A10/L4+ instances for these examples.
 json_config=$(cat <<EOF
 {
     "cluster_name": "spark-dl-inference-${FRAMEWORK}",
     "spark_version": "15.4.x-gpu-ml-scala2.12",
     "spark_conf": {
         "spark.executor.resource.gpu.amount": "1",
         "spark.python.worker.reuse": "true",
-        "spark.task.resource.gpu.amount": "0.125",
         "spark.sql.execution.arrow.pyspark.enabled": "true",
-        "spark.executor.cores": "8"
+        "spark.task.resource.gpu.amount": "0.16667",
+        "spark.executor.cores": "6"
     },
-    "node_type_id": "Standard_NC8as_T4_v3",
-    "driver_node_type_id": "Standard_NC8as_T4_v3",
+    "node_type_id": "Standard_NV12ads_A10_v5",
+    "driver_node_type_id": "Standard_NV12ads_A10_v5",
     "spark_env_vars": {
         "TF_GPU_ALLOCATOR": "cuda_malloc_async",
         "FRAMEWORK": "${FRAMEWORK}"
````

examples/ML+DL-Examples/Spark-DL/dl_inference/dataproc/README.md (+1 −2)

````diff
@@ -50,13 +50,12 @@
    ```shell
    export FRAMEWORK=torch
    ```
-   Run the cluster startup script. The script will also retrieve and use the [spark-rapids initialization script](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh) to set up GPU resources.
+   Run the cluster startup script. The script will also retrieve and use the [spark-rapids initialization script](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh) to set up GPU resources. By default, the script creates 4 L4 worker nodes and 1 L4 driver node, named `${USER}-spark-dl-inference-${FRAMEWORK}`.
    ```shell
    cd setup
    chmod +x start_cluster.sh
    ./start_cluster.sh
    ```
-   By default, the script creates a 4 node GPU cluster named `${USER}-spark-dl-inference-${FRAMEWORK}`.
 
 7. Browse to the Jupyter web UI:
    - Go to `Dataproc` > `Clusters` > `(Cluster Name)` > `Web Interfaces` > `Jupyter/Lab`
````

examples/ML+DL-Examples/Spark-DL/dl_inference/dataproc/setup/start_cluster.sh (+23 −23)

````diff
@@ -77,29 +77,29 @@ else
     exit 1
 fi
 
-# start cluster if not already running
 if gcloud dataproc clusters list | grep -q "${cluster_name}"; then
     echo "Cluster ${cluster_name} already exists."
-else
-    gcloud dataproc clusters create ${cluster_name} \
-        --image-version=2.2-ubuntu \
-        --region ${COMPUTE_REGION} \
-        --master-machine-type n1-standard-16 \
-        --num-workers 4 \
-        --worker-min-cpu-platform="Intel Skylake" \
-        --worker-machine-type n1-standard-16 \
-        --master-accelerator type=nvidia-tesla-t4,count=1 \
-        --worker-accelerator type=nvidia-tesla-t4,count=1 \
-        --initialization-actions gs://${SPARK_DL_HOME}/init/spark-rapids.sh,${INIT_PATH} \
-        --metadata gpu-driver-provider="NVIDIA" \
-        --metadata gcs-bucket=${GCS_BUCKET} \
-        --metadata spark-dl-home=${SPARK_DL_HOME} \
-        --metadata requirements="${requirements}" \
-        --worker-local-ssd-interface=NVME \
-        --optional-components=JUPYTER \
-        --bucket ${GCS_BUCKET} \
-        --enable-component-gateway \
-        --max-idle "60m" \
-        --subnet=default \
-        --no-shielded-secure-boot
+    exit 0
 fi
+
+CLUSTER_PARAMS=(
+    --image-version=2.2-ubuntu
+    --region ${COMPUTE_REGION}
+    --num-workers 4
+    --master-machine-type g2-standard-8
+    --worker-machine-type g2-standard-8
+    --initialization-actions gs://${SPARK_DL_HOME}/init/spark-rapids.sh,${INIT_PATH}
+    --metadata gpu-driver-provider="NVIDIA"
+    --metadata gcs-bucket=${GCS_BUCKET}
+    --metadata spark-dl-home=${SPARK_DL_HOME}
+    --metadata requirements="${requirements}"
+    --worker-local-ssd-interface=NVME
+    --optional-components=JUPYTER
+    --bucket ${GCS_BUCKET}
+    --enable-component-gateway
+    --max-idle "60m"
+    --subnet=default
+    --no-shielded-secure-boot
+)
+
+gcloud dataproc clusters create ${cluster_name} "${CLUSTER_PARAMS[@]}"
````
