
Support for Spark DL notebooks with PyTriton on Databricks/Dataproc #483

Merged
13 commits merged into NVIDIA:branch-25.02 from the dl-pytriton branch on Feb 4, 2025

Conversation

@rishic3 (Collaborator) commented Jan 16, 2025:

Support for running DL Inference notebooks on CSP environments.

  • Refactored the Triton sections to use PyTriton, a Python API for the Triton Inference Server that avoids the need for Docker. Once this PR is merged, the Triton sections no longer need to be skipped in the CI pipeline @YanxuanLiu.
  • Updated notebooks with instructions to run on Databricks/Dataproc
  • Updated Torch notebooks with best practices for ahead-of-time TensorRT compilation.
  • Cleaned up the README, removing the instructions to start Jupyter with PySpark (we need a cell to attach to the standalone cluster for CI/CD anyway, so this should reduce confusion for users).

Notebook outputs are saved from running locally, but all notebooks were tested on Databricks/Dataproc.
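
For reference, here is a minimal sketch (not taken from the notebooks) of the PyTriton pattern the refactor relies on: the inference callable is bound and served in the same Python process, so no Docker container or model repository is needed. The identity function and tensor names below are illustrative assumptions.

```python
import numpy as np
from pytriton.decorators import batch
from pytriton.model_config import ModelConfig, Tensor
from pytriton.triton import Triton

@batch
def identity_fn(INPUT):
    # Placeholder inference function; the notebooks run real model inference here.
    return {"OUTPUT": INPUT}

with Triton() as triton:
    # Bind the Python callable as a Triton model and serve it in-process (no Docker).
    triton.bind(
        model_name="identity",
        infer_func=identity_fn,
        inputs=[Tensor(name="INPUT", dtype=np.float32, shape=(-1,))],
        outputs=[Tensor(name="OUTPUT", dtype=np.float32, shape=(-1,))],
        config=ModelConfig(max_batch_size=64),
    )
    triton.serve()  # blocks until the server is stopped
```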

@rishic3 marked this pull request as ready for review on January 17, 2025 00:36
@eordentlich (Collaborator) left a comment:

Looks good overall. A few comments.

In a future optimization we can look at something like https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_cudashm_client.py, or the regular shm variant, to reduce data copies (if I'm interpreting these correctly).

sudo /databricks/python3/bin/pip3 install --upgrade --force-reinstall -r temp_requirements.txt
rm temp_requirements.txt

set +x
Collaborator:

Add a carriage return at the end of the last line in all files where this symbol appears.

Collaborator Author:

Deleted, also merged the tf/torch scripts into one for convenience.

"df = spark.read.parquet(\"imdb_test\").limit(100).cache()"
"def _use_stage_level_scheduling(spark, rdd):\n",
"\n",
" if spark.version < \"3.4.0\":\n",
Collaborator:

This check is probably not needed, since predict_batch_udf is also not available in Spark < 3.4.

Collaborator Author:

Done

"metadata": {},
"outputs": [],
"source": [
"df = spark.read.parquet(data_path).limit(256).repartition(8)"
Collaborator:

Are limit and repartition needed? And is this the right order? And why these numbers? A comment might be in order. Propagate any changes to the other notebooks.

Collaborator Author (@rishic3, Jan 27, 2025):

This was intended to test the minimal scenario of 1 batch per task; especially with TensorFlow, too high a row count can be really slow (>1 min). (In previous versions we were limiting to 100 rows: https://github.com/NVIDIA/spark-rapids-examples/blob/branch-23.06/examples/ML%2BDL-Examples/Spark-DL/dl_inference/huggingface/conditional_generation.ipynb?short_path=d3949f8#L1208)
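
As a possible inline comment for that cell (illustrative only; the numbers simply mirror the notebook line quoted above):

```python
# Cap the row count and spread it over 8 tasks so each task sees roughly one small
# batch; larger row counts can be very slow for the TensorFlow notebooks (>1 min).
df = spark.read.parquet(data_path).limit(256).repartition(8)
```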

]
},
{
"cell_type": "code",
"execution_count": 56,
"execution_count": null,
Collaborator:

FYI, spark.stop() below might be bad for Databricks. It puts the cluster in a bad state (at least in older runtimes like 13.3, from what I've seen).

Collaborator Author:

Yup, the issue persists on the latest runtime; addressed.

"def stop_triton(it):\n",
" import docker\n",
" import time\n",
"def stop_triton(pids):\n",
Collaborator:

Can this, along with all the other Triton-related code that is common across the notebooks, be moved to a single Python file, triton_utils.py, that gets shipped via pyfiles with each Spark job and then imported in the notebooks? It would avoid a lot of repetition.

Collaborator Author:

Done
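
For context, a minimal sketch of the shared-module pattern; the helper names follow identifiers visible in this PR's diffs (start_triton, stop_triton), the module name follows the pytriton_utils.py file referenced later in the thread, and the local path and import form are assumptions:

```python
# Ship the shared helper module to the executors once, then import it in each notebook.
sc.addPyFile("pytriton_utils.py")  # or pass it via --py-files at submit time
from pytriton_utils import start_triton, stop_triton
```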

@rishic3 (Collaborator Author) commented Jan 27, 2025:

> Looks good overall. A few comments.
>
> In a future optimization we can look at something like https://github.com/triton-inference-server/client/blob/main/src/python/examples/simple_http_cudashm_client.py, or the regular shm variant, to reduce data copies (if I'm interpreting these correctly).

Good idea, will definitely follow up with this improvement. Note per the PyTriton team: with shm there will still be an additional inter-process data copy (until the Triton 3 release):
shm -> python backend -> (copy input) -> pytriton server -> (copy output) -> python backend -> shm
but per their benchmarks this is a few ms of latency (for ~4MB inputs; with larger inputs it might be more significant, but still likely within the range of noise).
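
For reference, a rough sketch of the CUDA shared-memory client flow from the linked example (simple_http_cudashm_client.py); the region, model, and tensor names here are illustrative and not from these notebooks:

```python
import numpy as np
import tritonclient.http as httpclient
import tritonclient.utils.cuda_shared_memory as cudashm

client = httpclient.InferenceServerClient("localhost:8000")

input_data = np.random.rand(1, 16).astype(np.float32)
byte_size = input_data.nbytes

# Allocate a CUDA shared-memory region on GPU 0, copy the input into it, and
# register it with the server so inference reads directly from device memory.
shm_handle = cudashm.create_shared_memory_region("input_region", byte_size, 0)
cudashm.set_shared_memory_region(shm_handle, [input_data])
client.register_cuda_shared_memory(
    "input_region", cudashm.get_raw_handle(shm_handle), 0, byte_size
)

infer_input = httpclient.InferInput("INPUT", list(input_data.shape), "FP32")
infer_input.set_shared_memory("input_region", byte_size)
result = client.infer(model_name="identity", inputs=[infer_input])

# Clean up the region once done.
client.unregister_cuda_shared_memory("input_region")
cudashm.destroy_shared_memory_region(shm_handle)
```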

@eordentlich (Collaborator) left a comment:

A few more comments.

fi

sudo /databricks/python3/bin/pip3 install --upgrade --force-reinstall -r temp_requirements.txt
rm temp_requirements.txt
Collaborator:

Looks like carriage returns are still needed at the end of the last lines in some files.

Collaborator Author:

Done

" return [True]\n",
"\n",
"nodeRDD.barrier().mapPartitions(stop_triton).collect()"
"shutdownRDD = sc.parallelize(list(range(num_nodes)), num_nodes)\n",
Collaborator:

Same as above: is there a benefit to leaving shutdownRDD ... out of the stop_triton utility fn?

Collaborator Author:

Will address in a separate PR.

"%%time\n",
"preds = df1.withColumn(\"preds\", generate(\"input\"))\n",
"results = preds.collect()"
"pids = nodeRDD.barrier().mapPartitions(lambda _: start_triton(triton_server_fn=triton_server,\n",
Collaborator:

Any benefit to not having nodeRDD.barrier... as part of the start_triton utility?

Collaborator Author (@rishic3, Feb 1, 2025):

Good point. I've implemented this along with organizing the utils into a "ServerManager" class (since we are passing around a bunch of redundant parameters), but I think it warrants a separate PR. Will follow up with it.
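
For context, a sketch of the barrier-RDD pattern being discussed, stitched together from the snippets quoted in this thread; num_nodes, model_name, and the exact helper signatures are assumptions:

```python
# Start one Triton server per node by running a barrier stage with one task per node.
nodeRDD = sc.parallelize(list(range(num_nodes)), num_nodes)
pids = nodeRDD.barrier().mapPartitions(
    lambda _: start_triton(triton_server_fn=triton_server, model_name=model_name)
).collect()  # collect the server process id started on each node

# Shutdown follows the same shape (see the shutdownRDD snippet above).
shutdownRDD = sc.parallelize(list(range(num_nodes)), num_nodes)
shutdownRDD.barrier().mapPartitions(lambda _: stop_triton(pids)).collect()
```

Wrapping both calls in a small ServerManager-style helper, as described above, would keep the barrier plumbing out of the notebooks entirely.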

"rm -rf models\n",
"mkdir -p models\n",
"cp -r models_config/hf_generation_tf models\n",
"sc.addPyFile(\"https://raw.githubusercontent.com/NVIDIA/spark-rapids-examples/branch-25.02/examples/ML%2BDL-Examples/Spark-DL/dl_inference/pytriton_utils.py\")\n",
Collaborator:

Probably better to upload this file from the checked-out repo in the setup instructions vs. hard-coding a version/link here.

Collaborator:

Also, does it work to just have the file in the notebooks directory? It's only needed on the driver.
That should work on Databricks and Dataproc, but not on EMR (since the latter runs the driver in cluster mode for notebooks).

@eordentlich (Collaborator) left a comment:

just a few more.

return [True]
time.sleep(5)

return [False]
Collaborator:

carriage return

Collaborator Author:

Ah, didn't know git shows missing returns in the file diff; will check there next time.

ipykernel
urllib3<2
nvidia-pytriton
Collaborator:

one more

"metadata": {},
"outputs": [],
"source": [
"if on_standalone:\n",
Collaborator:

To avoid this special case, is it possible to add a symbolic link in each notebook directory to the pytriton_utils.py file? These can be checked into git.

Collaborator Author:

Cool, didn't know you could do that; done.

task_gpus = 1.0
treqs = TaskResourceRequests().cpus(task_cores).resource("gpu", task_gpus)
rp = ResourceProfileBuilder().require(treqs).build
print(f"Reqesting stage-level resources: (cores={task_cores}, gpu={task_gpus})")
Collaborator:

Reqesting -> Requesting throughout several places.

@eordentlich (Collaborator) left a comment:

👍

@rishic3 merged commit ae292de into NVIDIA:branch-25.02 on Feb 4, 2025
3 checks passed
@rishic3 deleted the dl-pytriton branch on February 4, 2025 at 17:20