Commit 1bc43fb
Add LLM batch inference examples (#493)
Add deepseek-r1 and gemma-7b LLM batch inference notebooks. Updated CSP instructions since these notebooks require >20GB GPU RAM (A10/L4).

Signed-off-by: Rishi Chandra <[email protected]>
1 parent e522404 commit 1bc43fb

9 files changed: +1811 −61 lines changed

examples/ML+DL-Examples/Spark-DL/dl_inference/README.md (+14 −12)

````diff
@@ -43,15 +43,18 @@ Below is a full list of the notebooks with links to the examples they are based
 | | Framework | Notebook Name | Description | Link
 | ------------- | ------------- | ------------- | ------------- | -------------
-| 1 | PyTorch | Image Classification | Training a model to predict clothing categories in FashionMNIST, including accelerated inference with Torch-TensorRT. | [Link](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
-| 2 | PyTorch | Housing Regression | Training a model to predict housing prices in the California Housing Dataset, including accelerated inference with Torch-TensorRT. | [Link](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-create-a-neural-network-for-regression-with-pytorch.md)
-| 3 | Tensorflow | Image Classification | Training a model to predict hand-written digits in MNIST. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb)
-| 4 | Tensorflow | Keras Preprocessing | Training a model with preprocessing layers to predict likelihood of pet adoption in the PetFinder mini dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/structured_data/preprocessing_layers.ipynb)
-| 5 | Tensorflow | Keras Resnet50 | Training ResNet-50 to perform flower recognition from flower images. | [Link](https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/keras-metadata.html)
-| 6 | Tensorflow | Text Classification | Training a model to perform sentiment analysis on the IMDB dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification.ipynb)
-| 7+8 | HuggingFace | Conditional Generation | Sentence translation using the T5 text-to-text transformer for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/model_doc/t5#t5)
-| 9+10 | HuggingFace | Pipelines | Sentiment analysis using Huggingface pipelines for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/quicktour#pipeline-usage)
-| 11 | HuggingFace | Sentence Transformers | Sentence embeddings using SentenceTransformers in Torch. | [Link](https://huggingface.co/sentence-transformers)
+| 1 | HuggingFace | DeepSeek-R1 | LLM batch inference using the DeepSeek-R1-Distill-Llama reasoning model. | [Link](https://huggingface.co/deepseek-ai/DeepSeek-R1)
+| 2 | HuggingFace | Gemma-7b | LLM batch inference using the lightweight Google Gemma-7b model. | [Link](https://huggingface.co/google/gemma-7b-it)
+| 3 | HuggingFace | Sentence Transformers | Sentence embeddings using SentenceTransformers in Torch. | [Link](https://huggingface.co/sentence-transformers)
+| 4+5 | HuggingFace | Conditional Generation | Sentence translation using the T5 text-to-text transformer for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/model_doc/t5#t5)
+| 6+7 | HuggingFace | Pipelines | Sentiment analysis using Huggingface pipelines for both Torch and Tensorflow. | [Link](https://huggingface.co/docs/transformers/quicktour#pipeline-usage)
+| 8 | PyTorch | Image Classification | Training a model to predict clothing categories in FashionMNIST, and deploying with Torch-TensorRT accelerated inference. | [Link](https://pytorch.org/tutorials/beginner/basics/quickstart_tutorial.html)
+| 9 | PyTorch | Housing Regression | Training a model to predict housing prices in the California Housing Dataset, and deploying with Torch-TensorRT accelerated inference. | [Link](https://github.com/christianversloot/machine-learning-articles/blob/main/how-to-create-a-neural-network-for-regression-with-pytorch.md)
+| 10 | Tensorflow | Image Classification | Training and deploying a model to predict hand-written digits in MNIST. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/save_and_load.ipynb)
+| 11 | Tensorflow | Keras Preprocessing | Training and deploying a model with preprocessing layers to predict likelihood of pet adoption in the PetFinder mini dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/structured_data/preprocessing_layers.ipynb)
+| 12 | Tensorflow | Keras Resnet50 | Deploying ResNet-50 to perform flower recognition from flower images. | [Link](https://docs.databricks.com/en/_extras/notebooks/source/deep-learning/keras-metadata.html)
+| 13 | Tensorflow | Text Classification | Training and deploying a model to perform sentiment analysis on the IMDB dataset. | [Link](https://github.com/tensorflow/docs/blob/master/site/en/tutorials/keras/text_classification.ipynb)
 
 ## Running Locally
@@ -130,9 +133,8 @@ The notebooks use [PyTriton](https://github.com/triton-inference-server/pytriton)
 The diagram above shows how Spark distributes inference tasks to run on the Triton Inference Server, with PyTriton handling request/response communication with the server.
 
 The process looks like this:
-- Distribute a PyTriton task across the Spark cluster, instructing each worker to launch a Triton server process.
-- Use stage-level scheduling to ensure there is a 1:1 mapping between worker nodes and servers.
-- Define a Triton inference function, which contains a client that binds to the local server on a given worker and sends inference requests.
+- Prior to inference, launch a Triton server process on each node.
+- Define a Triton predict function, which creates a client that binds to the local server and sends/receives inference requests.
 - Wrap the Triton inference function in a predict_batch_udf to launch parallel inference requests using Spark.
 - Finally, distribute a shutdown signal to terminate the Triton server processes on each worker.
````
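To make the steps above concrete, here is a minimal, hypothetical sketch of the predict-function pattern (not the notebooks' exact code). It assumes a Triton server is already running locally on each worker, serving a model named `my_model` with a single output named `output__0`; both names are illustrative:

```python
# Hypothetical sketch of the Triton predict-function pattern above;
# the model name "my_model" and output name "output__0" are assumptions.
import numpy as np
from pyspark.sql import SparkSession
from pyspark.ml.functions import predict_batch_udf
from pyspark.sql.types import ArrayType, FloatType

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([([0.0] * 16,), ([1.0] * 16,)], "features: array<float>")

def make_triton_fn():
    # Runs once per Python worker: bind a PyTriton client to the local server.
    from pytriton.client import ModelClient
    client = ModelClient("http://localhost:8000", "my_model")

    def predict(inputs: np.ndarray) -> np.ndarray:
        # Send one batch to the local Triton server and return its output tensor.
        result = client.infer_batch(inputs.astype(np.float32))
        return result["output__0"]

    return predict

# predict_batch_udf batches rows into numpy arrays and issues requests in parallel.
triton_udf = predict_batch_udf(
    make_triton_fn,
    return_type=ArrayType(FloatType()),
    batch_size=32,
)
preds = df.withColumn("preds", triton_udf("features"))
```

The shutdown step then mirrors this pattern: a final Spark job signals each node's server process to exit.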

examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/README.md (+8 −5)

````diff
@@ -34,22 +34,25 @@
    databricks workspace import $INIT_DEST --format AUTO --file $INIT_SRC
    ```
 
-6. Launch the cluster with the provided script (note that the script specifies **Azure instances** by default; change as needed):
+6. Launch the cluster with the provided script. By default, the script creates a cluster with 4 A10 worker nodes and 1 A10 driver node. (Note that the script uses **Azure instances** by default; change as needed.)
    ```shell
    cd setup
    chmod +x start_cluster.sh
    ./start_cluster.sh
    ```
-
    OR, start the cluster from the Databricks UI:
 
    - Go to `Compute > Create compute` and set the desired cluster settings.
    - Integration with Triton inference server uses stage-level scheduling (Spark>=3.4.0). Make sure to:
-       - use a cluster with GPU resources
+       - use a cluster with GPU resources (for LLM examples, make sure the selected GPUs have sufficient RAM)
       - set a value for `spark.executor.cores`
       - ensure that `spark.executor.resource.gpu.amount` = 1
    - Under `Advanced Options > Init Scripts`, upload the init script from your workspace.
-   - Under environment variables, set `FRAMEWORK=torch` or `FRAMEWORK=tf` based on the notebook used.
-   - For Tensorflow notebooks, we recommend setting the environment variable `TF_GPU_ALLOCATOR=cuda_malloc_async` (especially for Huggingface LLM models), which enables the CUDA driver to implicitly release unused memory from the pool.
+   - Under environment variables, set:
+       - `FRAMEWORK=torch` or `FRAMEWORK=tf` based on the notebook used.
+       - `HF_HOME=/dbfs/FileStore/hf_home` to cache Huggingface models in DBFS.
+       - `TF_GPU_ALLOCATOR=cuda_malloc_async` to implicitly release unused GPU memory in Tensorflow notebooks.
 
 7. Navigate to the notebook in your workspace and attach it to the cluster. The default cluster name is `spark-dl-inference-$FRAMEWORK`.
````
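On the stage-level scheduling requirement above: the notebooks use it to run exactly one server-startup task per GPU node. A hedged sketch of the idea (Spark >= 3.4); `start_triton_server` is a placeholder, and the core/GPU amounts mirror the cluster config in the script below:

```python
# Hedged sketch: stage-level scheduling forces a 1:1 mapping between
# startup tasks and GPU worker nodes, so each node launches one Triton
# server. start_triton_server is a placeholder for the real startup logic.
from pyspark.sql import SparkSession
from pyspark.resource import ResourceProfileBuilder, TaskResourceRequests

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

def start_triton_server(_):
    # Placeholder: launch and health-check the local Triton server here.
    yield True

num_workers = 4  # matches the default cluster size above
# Each task claims the executor's full GPU (and all 6 of its cores), so
# only one of these tasks can be scheduled per node.
task_reqs = TaskResourceRequests().cpus(6).resource("gpu", 1.0)
profile = ResourceProfileBuilder().require(task_reqs).build

sc.parallelize(range(num_workers), num_workers) \
    .withResources(profile) \
    .mapPartitions(start_triton_server) \
    .collect()
```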

examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/setup/start_cluster.sh (+7 −4)

````diff
@@ -14,19 +14,22 @@ if [[ -z ${FRAMEWORK} ]]; then
     exit 1
 fi
 
+# Modify the node_type_id and driver_node_type_id below if you don't have this specific instance type.
+# Modify executor.cores=(cores per node) and task.resource.gpu.amount=(1/executor cores) accordingly.
+# We recommend selecting A10/L4+ instances for these examples.
 json_config=$(cat <<EOF
 {
     "cluster_name": "spark-dl-inference-${FRAMEWORK}",
     "spark_version": "15.4.x-gpu-ml-scala2.12",
     "spark_conf": {
         "spark.executor.resource.gpu.amount": "1",
         "spark.python.worker.reuse": "true",
-        "spark.task.resource.gpu.amount": "0.125",
         "spark.sql.execution.arrow.pyspark.enabled": "true",
-        "spark.executor.cores": "8"
+        "spark.task.resource.gpu.amount": "0.16667",
+        "spark.executor.cores": "6"
     },
-    "node_type_id": "Standard_NC8as_T4_v3",
-    "driver_node_type_id": "Standard_NC8as_T4_v3",
+    "node_type_id": "Standard_NV12ads_A10_v5",
+    "driver_node_type_id": "Standard_NV12ads_A10_v5",
     "spark_env_vars": {
         "TF_GPU_ALLOCATOR": "cuda_malloc_async",
         "FRAMEWORK": "${FRAMEWORK}"
````

examples/ML+DL-Examples/Spark-DL/dl_inference/dataproc/README.md (+1 −2)

````diff
@@ -50,13 +50,12 @@
    ```shell
    export FRAMEWORK=torch
    ```
-   Run the cluster startup script. The script will also retrieve and use the [spark-rapids initialization script](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh) to set up GPU resources.
+   Run the cluster startup script. The script will also retrieve and use the [spark-rapids initialization script](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh) to set up GPU resources. By default, the script creates 4 L4 worker nodes and 1 L4 driver node, named `${USER}-spark-dl-inference-${FRAMEWORK}`.
    ```shell
    cd setup
    chmod +x start_cluster.sh
    ./start_cluster.sh
    ```
-   By default, the script creates a 4 node GPU cluster named `${USER}-spark-dl-inference-${FRAMEWORK}`.
 
 7. Browse to the Jupyter web UI:
    - Go to `Dataproc` > `Clusters` > `(Cluster Name)` > `Web Interfaces` > `Jupyter/Lab`
````

examples/ML+DL-Examples/Spark-DL/dl_inference/dataproc/setup/start_cluster.sh (+23 −23)

````diff
@@ -77,29 +77,29 @@ else
     exit 1
 fi
 
-# start cluster if not already running
 if gcloud dataproc clusters list | grep -q "${cluster_name}"; then
     echo "Cluster ${cluster_name} already exists."
-else
-    gcloud dataproc clusters create ${cluster_name} \
-        --image-version=2.2-ubuntu \
-        --region ${COMPUTE_REGION} \
-        --master-machine-type n1-standard-16 \
-        --num-workers 4 \
-        --worker-min-cpu-platform="Intel Skylake" \
-        --worker-machine-type n1-standard-16 \
-        --master-accelerator type=nvidia-tesla-t4,count=1 \
-        --worker-accelerator type=nvidia-tesla-t4,count=1 \
-        --initialization-actions gs://${SPARK_DL_HOME}/init/spark-rapids.sh,${INIT_PATH} \
-        --metadata gpu-driver-provider="NVIDIA" \
-        --metadata gcs-bucket=${GCS_BUCKET} \
-        --metadata spark-dl-home=${SPARK_DL_HOME} \
-        --metadata requirements="${requirements}" \
-        --worker-local-ssd-interface=NVME \
-        --optional-components=JUPYTER \
-        --bucket ${GCS_BUCKET} \
-        --enable-component-gateway \
-        --max-idle "60m" \
-        --subnet=default \
-        --no-shielded-secure-boot
+    exit 0
 fi
+
+CLUSTER_PARAMS=(
+    --image-version=2.2-ubuntu
+    --region ${COMPUTE_REGION}
+    --num-workers 4
+    --master-machine-type g2-standard-8
+    --worker-machine-type g2-standard-8
+    --initialization-actions gs://${SPARK_DL_HOME}/init/spark-rapids.sh,${INIT_PATH}
+    --metadata gpu-driver-provider="NVIDIA"
+    --metadata gcs-bucket=${GCS_BUCKET}
+    --metadata spark-dl-home=${SPARK_DL_HOME}
+    --metadata requirements="${requirements}"
+    --worker-local-ssd-interface=NVME
+    --optional-components=JUPYTER
+    --bucket ${GCS_BUCKET}
+    --enable-component-gateway
+    --max-idle "60m"
+    --subnet=default
+    --no-shielded-secure-boot
+)
+
+gcloud dataproc clusters create ${cluster_name} "${CLUSTER_PARAMS[@]}"
````
