Add LLM batch inference examples #493

rishic3 · 2025-02-10T23:23:05Z

Add deepseek-r1 and gemma-7b LLM batch inference notebooks.
Updated CSP instructions since these notebooks require >20GB GPU RAM (A10/L4).

Signed-off-by: Rishi Chandra <[email protected]>

rishic3 · 2025-02-11T02:02:10Z

@leewyang If you get the chance, I'd welcome suggestions on these initial examples and on future extensions 🙂

eordentlich

Overall looks great. A few comments, questions.

eordentlich · 2025-02-13T08:30:42Z

examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/README.md

@@ -34,22 +34,26 @@
    databricks workspace import $INIT_DEST --format AUTO --file $INIT_SRC
    ```

-6. Launch the cluster with the provided script (note that the script specifies **Azure instances** by default; change as needed):
+6. Launch the cluster with the provided script.
+**Note:** The LLM examples (e.g. deepseek-r1, gemma-7b) require greater GPU RAM (>18GB). For these notebooks, we recommend modifying the startup script node types to use A10 GPU instances. Note that the script specifies **Azure instances** by default; change as needed. 


We might want to just recommend A10 across the board.

eordentlich · 2025-02-13T08:31:42Z

examples/ML+DL-Examples/Spark-DL/dl_inference/dataproc/README.md

@@ -50,7 +50,12 @@
    ```shell
    export FRAMEWORK=torch
    ```
-    Run the cluster startup script. The script will also retrieve and use the [spark-rapids initialization script](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh) to setup GPU resources.
+    Run the cluster startup script. The script will also retrieve and use the [spark-rapids initialization script](https://github.com/GoogleCloudDataproc/initialization-actions/blob/master/spark-rapids/spark-rapids.sh) to setup GPU resources. 
+    **Note:** The LLM examples (e.g. deepseek-r1, gemma-7b) require greater GPU RAM (>18GB). For these notebooks, setting the following environment variable will tell the startup script to use L4 GPUs instead of the default T4 GPUs.


L4s across the board.

eordentlich · 2025-02-13T17:56:02Z

examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/gemma-7b_torch.ipynb

+    "- Wrap the Triton inference function in a predict_batch_udf to launch parallel inference requests using Spark.\n",
+    "- Finally, distribute a shutdown signal to terminate the Triton server processes on each node.\n",
+    "\n",
+    "<img src=\"../images/spark-pytriton.png\" alt=\"drawing\" width=\"700\"/>"


As I look at this figure more closely, it is a little confusing: only 2 executors run the start-Triton step, but then 4 executors run the prediction tasks. Should the number of executors stay constant, with possibly multiple inference tasks running in parallel within an executor? Also, worker <-> node? And referencing tasks on the driver is a little confusing too.

eordentlich · 2025-02-13T17:57:34Z

examples/ML+DL-Examples/Spark-DL/dl_inference/huggingface/gemma-7b_torch.ipynb

+    "    print(f\"Connecting to Triton model {model_name} at {url}.\")\n",
+    "\n",
+    "    def infer_batch(inputs):\n",
+    "        with ModelClient(url, model_name, inference_timeout_s=500) as client:\n",


This has appeared throughout these PRs, but does this mean a new connection to Triton is created for each batch of data? I guess the overhead isn't that much.

Yep, there is a small overhead: the client sends a request to the server for the model configuration (shapes, types, etc.) with every new connection. Essentially just a local ping for a pbtxt, no compute, so I doubt it's significant.

That said, if we create the client outside the predict function, predict_batch_udf can cache the client on the executor side. We would just need some way to gracefully close it on shutdown.
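To make the trade-off concrete, here is a minimal sketch of the two patterns. It is not the notebook's code: `StubClient` is a hypothetical stand-in for PyTriton's `ModelClient` that only counts how many times a client is constructed, and `make_per_batch_fn` / `make_cached_fn` are illustrative names. The `atexit` hook is just one possible way to close the client, and whether it fires reliably in a Spark Python worker is an assumption that would need checking.

```python
import atexit


class StubClient:
    """Hypothetical stand-in for pytriton's ModelClient; counts connections."""

    connections = 0

    def __init__(self, url, model_name):
        StubClient.connections += 1  # each construction models one new connection
        self.url = url
        self.model_name = model_name
        self.closed = False

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()

    def infer_batch(self, inputs):
        # Placeholder "inference": the real client would call the Triton server.
        return [text.upper() for text in inputs]

    def close(self):
        self.closed = True


def make_per_batch_fn(url, model_name):
    """Pattern under review: a new client (and connection) for every batch."""

    def infer_batch(inputs):
        with StubClient(url, model_name) as client:
            return client.infer_batch(inputs)

    return infer_batch


def make_cached_fn(url, model_name):
    """Alternative: create the client once and reuse it for every batch."""
    client = StubClient(url, model_name)
    atexit.register(client.close)  # best-effort cleanup (assumption, see above)

    def infer_batch(inputs):
        return client.infer_batch(inputs)

    return infer_batch
```

With the per-batch version, the connection count grows with the number of batches; with the cached version it stays at one per call to `make_cached_fn`, at the cost of needing an explicit shutdown path for the client.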

rishic3 · 2025-02-14T00:00:59Z

examples/ML+DL-Examples/Spark-DL/dl_inference/databricks/setup/start_cluster.sh

    },
-    "node_type_id": "Standard_NC8as_T4_v3",
-    "driver_node_type_id": "Standard_NC8as_T4_v3",
+    "node_type_id": "Standard_NV12ads_A10_v5",


I'm not sure if this is a "popular" node type to have on Azure Databricks. It would be good to know which A10 instance is most commonly used, so we could make that the default.

eordentlich

👍

rishic3 added 15 commits February 9, 2025 20:52

Deepseek example

49bab86

Gemma notebook

8556aa1

Fix batch size

0ee79b4

Code comprehension dataset

3d0bff6

Use tokenizer batch decoding

114187b

return_full_text=False

b67baa1

Write full dataset

b87e72c

Update CSP instructions for LLM examples

cedd781

Add note on LLM examples

a046a06

Test LLM notebooks on CSPs

7d6b068

signoff

c3343f8

Signed-off-by: Rishi Chandra <[email protected]>

Update notebook list

f09d2bc

Fix descriptions

3a503cd

Add comments

c51b5e0

Cleanup

28d0e66

rishic3 marked this pull request as ready for review February 11, 2025 01:43

rishic3 requested a review from eordentlich February 11, 2025 01:44

eordentlich reviewed Feb 13, 2025

rishic3 added 2 commits February 13, 2025 14:57

Use A10/L4 by default

4a66b4e

Update diagram

68a3a52

rishic3 commented Feb 14, 2025

eordentlich approved these changes Feb 14, 2025

rishic3 merged commit 1bc43fb into NVIDIA:branch-25.02 Feb 14, 2025
3 checks passed
