feat: vLLM DistributedRuntime Monolith and Disagg Workers Example #113

Merged
21 commits merged into main on Feb 7, 2025

Conversation

@ptarasiewiczNV (Contributor) commented Feb 5, 2025

What does the PR do?

Add an example of using the Rust runtime to serve vLLM in monolith and disaggregated settings.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the corresponding label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@ptarasiewiczNV (Contributor, PR author) commented:
@nnshah1 readme and docker compose are just temp, will rebase after #116

@ptarasiewiczNV force-pushed the ptarasiewicz/vllm-example-rust-runtime branch from ff267ec to 6ab1302 on February 6, 2025 12:15
@ptarasiewiczNV added and then removed the documentation label on Feb 6, 2025

**Terminal 1 - Server:**
```bash
python3 -m monolith.worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 100 \
    --enforce-eager
```

Collaborator: is this in the container or outside the container, or not related to a container?

@ptarasiewiczNV (PR author): I assumed no container, meaning just running in the user's environment, as we now have wheels.
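
For reference, the wheel-based, no-container flow amounts to building the Python wheel into a local venv. This is only a sketch; it mirrors the steps @rmccorm4 walks through later in this thread:

```bash
# From the repo root: build and install the runtime wheel into a fresh venv
cd runtime/rust/python-wheel/
uv venv
source .venv/bin/activate
uv pip install maturin
maturin develop --uv        # builds and installs the Rust extension into the venv
uv pip install vllm==0.7.2  # vLLM version used in this thread

# The example modules are then run from examples/python_rs/llm/vllm
```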


**Terminal 2 - Client:**
```bash
python3 -m common.client \
    --prompt "what is the capital of france?" \
    --max-tokens 10 \
    --temperature 0.5
```

Collaborator: same here - is this from the same container or a separate one?

```python
from .protocol import Request


@triton_worker()
```

Collaborator: to do - this does seem like it would be better as @triton_distributed_component

@nnshah1 (Collaborator) left a comment:

LGTM - added a few questions / comments for future


This example demonstrates how to use Triton Distributed to serve large language models with the vLLM engine, enabling efficient model serving with both monolithic and disaggregated deployment options.

## Prerequisites

@rmccorm4 (Contributor) commented Feb 7, 2025:

@ptarasiewiczNV Is there an assumed container as the base environment for these steps too? Such as the:

```bash
./container/build.sh --framework vllm
./container/run.sh --framework vllm -it
```

flow beforehand?

All the rust/runtime examples that don't use vllm or GPUs work fine on my host, so I tried that here too since the README doesn't mention any containers:

```bash
cd triton_distributed/
git checkout ptarasiewicz/vllm-example-rust-runtime
cd runtime/rust/python-wheel/
uv venv
source .venv/bin/activate
uv pip install maturin
maturin develop --uv
uv pip install vllm==0.7.2
```

But when using vllm on host (no container) I got this error from following the steps to launch monolith worker (Note I don't have cuda toolkit/runtime installed on my host globally, only the driver -- but the venv installs the cuda toolkit during vllm/pytorch install):

```bash
$ python3 -m monolith.worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 100 \
    --enforce-eager
...
ImportError: /home/rmccormick/triton/distributed/v0.2.0/triton_distributed/runtime/rust/python-wheel/.venv/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```

Seems to be work-aroundable by doing something like this to add this libnvjitlink to the ld library path:

```bash
rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/examples/python_rs/llm/vllm$ export LD_LIBRARY_PATH=$PWD/../../../../runtime/rust/python-wheel/.venv/lib64/python3.10/site-packages/nvidia/nvjitlink/lib:${LD_LIBRARY_PATH}
```

Similar issue: pytorch/pytorch#111469
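
An alternative workaround, not tried in this thread and only an assumption based on the linked PyTorch issue: the undefined `__nvJitLinkComplete_12_4` symbol usually points at mismatched pip-installed CUDA 12.x components, so refreshing `nvidia-nvjitlink-cu12` inside the venv may also resolve it without touching LD_LIBRARY_PATH:

```bash
# Hypothetical alternative to the LD_LIBRARY_PATH workaround above:
# pull a matching nvJitLink build into the venv
uv pip install --upgrade nvidia-nvjitlink-cu12
```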

@rmccorm4 (Contributor) commented:

After the above WAR to put libnvjitlink on the LD library path, I can run the monolith worker and client:

```bash
rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/runtime/rust/python-wheel$ source .venv/bin/activate

(python-wheel) rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/runtime/rust/python-wheel$ cd ../../../examples/python_rs/llm/vllm

(python-wheel) rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/examples/python_rs/llm/vllm$ python3 -m common.client \
    --prompt "what is the capital of france?" \
    --max-tokens 10 \
    --temperature 0.5
INFO 02-07 11:18:48 __init__.py:190] Automatically detected platform cuda.
...
[7587884607120538396]
Annotated(data=' Well', event=None, comment=[], id=None)
Annotated(data=' Well,', event=None, comment=[], id=None)
Annotated(data=' Well, France', event=None, comment=[], id=None)
Annotated(data=' Well, France is', event=None, comment=[], id=None)
Annotated(data=' Well, France is a', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in Western', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in Western Europe', event=None, comment=[], id=None)
```

@ptarasiewiczNV (PR author) replied:

I did not assume container usage. I was running this both through the NGC PyTorch container and on bare metal on my desktop. I like the idea of not having to run it through the container, as the user should be able to just install wheels.

We can still refer to our container for reproduction, but it would have to be

```bash
./container/build.sh
./container/run.sh -it
```

as the vLLM container has some settings required for the old example. Or do we just remove the old example right now, @nnshah1?

@rmccorm4 (Contributor) commented:

Assuming the same fixes mentioned above, I'm able to run the disaggregated example on a single node with 2 GPUs (heterogeneous GPUs, too!)

**Prefill Worker**

```bash
# Make sure venv is activated, and LD_LIBRARY_PATH has necessary libs from venv
$ CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 100 \
    --gpu-memory-utilization 0.8 \
    --enforce-eager \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'
...
INFO 02-07 11:27:46 model_runner.py:1115] Loading model weights took 14.9888 GB
INFO 02-07 11:27:47 worker.py:267] Memory profiling takes 0.98 seconds
INFO 02-07 11:27:47 worker.py:267] the current vLLM instance can use total_gpu_memory (47.50GiB) x gpu_memory_utilization (0.80) = 38.00GiB
INFO 02-07 11:27:47 worker.py:267] model weights take 14.99GiB; non_torch_memory takes 0.17GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 21.66GiB.
INFO 02-07 11:27:47 executor_base.py:110] # CUDA blocks: 11089, # CPU blocks: 2048
INFO 02-07 11:27:47 executor_base.py:115] Maximum concurrency for 100 tokens per request: 1774.24x
INFO 02-07 11:27:50 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 4.11 seconds
INFO 02-07 11:28:09 prefill_worker.py:41] Received prefill request: prompt='what is the capital of france?' sampling_params={'temperature': 0.5, 'max_tokens': 1} request_id='84bf7368-f5b1-43e2-89da-e46867af5ac8'
INFO 02-07 11:28:09 async_llm_engine.py:211] Added request 84bf7368-f5b1-43e2-89da-e46867af5ac8.
INFO 02-07 11:28:09 metrics.py:455] Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-07 11:28:09 async_llm_engine.py:179] Finished request 84bf7368-f5b1-43e2-89da-e46867af5ac8.
```

**Decode Worker**

```bash
# Make sure venv is activated, and LD_LIBRARY_PATH has necessary libs from venv
$ CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 100 \
    --gpu-memory-utilization 0.8 \
    --enforce-eager \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
...
INFO 02-07 11:27:46 model_runner.py:1115] Loading model weights took 14.9888 GB
INFO 02-07 11:27:47 worker.py:267] Memory profiling takes 1.16 seconds
INFO 02-07 11:27:47 worker.py:267] the current vLLM instance can use total_gpu_memory (23.69GiB) x gpu_memory_utilization (0.80) = 18.95GiB
INFO 02-07 11:27:47 worker.py:267] model weights take 14.99GiB; non_torch_memory takes 0.14GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 2.63GiB.
INFO 02-07 11:27:47 executor_base.py:110] # CUDA blocks: 1347, # CPU blocks: 2048
INFO 02-07 11:27:47 executor_base.py:115] Maximum concurrency for 100 tokens per request: 215.52x
INFO 02-07 11:27:50 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 4.31 seconds
INFO 02-07 11:28:09 decode_worker.py:43] Received request: prompt='what is the capital of france?' sampling_params={'temperature': 0.5, 'max_tokens': 10}
INFO 02-07 11:28:09 async_llm_engine.py:211] Added request 84bf7368-f5b1-43e2-89da-e46867af5ac8.
INFO 02-07 11:28:09 metrics.py:455] Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 02-07 11:28:10 async_llm_engine.py:179] Finished request 84bf7368-f5b1-43e2-89da-e46867af5ac8.
```
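
Side by side, the two --kv-transfer-config values above (copied verbatim from those commands) show the pairing: the prefill side is the KV producer at rank 0, the decode side is the KV consumer at rank 1, and both declare the same kv_parallel_size of 2. The variable names below are just illustrative, not part of the example:

```bash
# KV-transfer configs used above: same connector and kv_parallel_size,
# complementary roles and ranks (producer = prefill, consumer = decode)
PREFILL_KV_CONFIG='{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'
DECODE_KV_CONFIG='{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'

# e.g. passed as: ... --kv-transfer-config "$PREFILL_KV_CONFIG"
```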

**Client**

```bash
# Make sure venv is activated, and LD_LIBRARY_PATH has necessary libs from venv
(python-wheel) rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/examples/python_rs/llm/vllm$ python3 -m common.client \
    --prompt "what is the capital of france?" \
    --max-tokens 10 \
    --temperature 0.5

INFO 02-07 11:28:09 __init__.py:190] Automatically detected platform cuda.
WARNING 02-07 11:28:09 cuda.py:336] Detected different devices in the system:
WARNING 02-07 11:28:09 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-07 11:28:09 cuda.py:336] NVIDIA RTX 5880 Ada Generation
WARNING 02-07 11:28:09 cuda.py:336] Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
[7587884607120538406]
Annotated(data=' Well', event=None, comment=[], id=None)
Annotated(data=' Well,', event=None, comment=[], id=None)
Annotated(data=' Well, France', event=None, comment=[], id=None)
Annotated(data=' Well, France is', event=None, comment=[], id=None)
Annotated(data=' Well, France is a', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in Western', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in Western Europe', event=None, comment=[], id=None)
```
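
Following the cuda.py warning in the client output above, on a mixed-GPU host it is presumably safest to pin the device ordering before assigning GPUs to the two workers. A small sketch; the worker flags are elided here, see the full commands above:

```bash
# Make CUDA device indices follow PCI bus order, as the vLLM warning suggests,
# so CUDA_VISIBLE_DEVICES=0 / 1 picks the intended physical GPUs
export CUDA_DEVICE_ORDER=PCI_BUS_ID

CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker ...  # prefill GPU
CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker ...   # decode GPU
```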

@nnshah1 (Collaborator) replied: let's not remove the old example yet - let's update the instructions now

@nnshah1 merged commit 164d7c6 into main on Feb 7, 2025 (5 checks passed)
@nnshah1 deleted the ptarasiewicz/vllm-example-rust-runtime branch on February 7, 2025 23:14
Labels: enhancement (New feature or request)