feat: vLLM DistributedRuntime Monolith and Disagg Workers Example #113

Merged
21 commits merged into main on Feb 7, 2025

Conversation

@ptarasiewiczNV (Contributor) commented Feb 5, 2025

What does the PR do?

Add an example of using the Rust runtime to serve vLLM in monolith and disaggregated settings.

Checklist

  • PR title reflects the change and is of format <commit_type>: <Title>
  • Changes are described in the pull request.
  • Related issues are referenced.
  • Populated github labels field
  • Added test plan and verified test passes.
  • Verified that the PR passes existing CI.
  • Verified copyright is correct on all changed files.
  • Added succinct git squash message before merging ref.
  • All template sections are filled out.
  • Optional: Additional screenshots for behavior/output changes with before/after.

Commit Type:

Check the conventional commit type box here and add the corresponding label to the GitHub PR.

  • build
  • ci
  • docs
  • feat
  • fix
  • perf
  • refactor
  • revert
  • style
  • test

Related PRs:

Where should the reviewer start?

Test plan:

  • CI Pipeline ID:

Caveats:

Background

Related Issues: (use one of the action keywords Closes / Fixes / Resolves / Relates to)

  • closes GitHub issue: #xxx

@ptarasiewiczNV (Contributor, PR author) commented:
@nnshah1 readme and docker compose are just temp, will rebase after #116

@ptarasiewiczNV force-pushed the ptarasiewicz/vllm-example-rust-runtime branch from ff267ec to 6ab1302 on February 6, 2025 12:15
@ptarasiewiczNV added and then removed the documentation label on Feb 6, 2025

**Terminal 1 - Server:**
```bash
python3 -m monolith.worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 100 \
    --enforce-eager
```

Collaborator: is this in the container or outside the container, or not related to a container?

@ptarasiewiczNV (PR author): I assumed no container, meaning just running in the user's environment, as we now have wheels.
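
For reference, the wheel-based, no-container flow amounts to building the Python wheel into a local venv. This is only a sketch; it mirrors the steps @rmccorm4 walks through later in this thread:

```bash
# From the repo root: build and install the runtime wheel into a fresh venv
cd runtime/rust/python-wheel/
uv venv
source .venv/bin/activate
uv pip install maturin
maturin develop --uv        # builds and installs the Rust extension into the venv
uv pip install vllm==0.7.2  # vLLM version used in this thread

# The example modules are then run from examples/python_rs/llm/vllm
```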


**Terminal 2 - Client:**
```bash
python3 -m common.client \
    --prompt "what is the capital of france?" \
    --max-tokens 10 \
    --temperature 0.5
```

Collaborator: same here - is this from the same container or a separate one?

```python
from .protocol import Request


@triton_worker()
```

Collaborator: to do - this does seem like it would be better as @triton_distributed_component

@nnshah1 (Collaborator) left a comment:

LGTM - added a few questions / comments for future


This example demonstrates how to use Triton Distributed to serve large language models with the vLLM engine, enabling efficient model serving with both monolithic and disaggregated deployment options.

## Prerequisites

@rmccorm4 (Contributor) commented Feb 7, 2025:

@ptarasiewiczNV Is there an assumed container as the base environment for these steps too? Such as the:

```bash
./container/build.sh --framework vllm
./container/run.sh --framework vllm -it
```

flow beforehand?

All the rust/runtime examples that don't use vllm or GPUs work fine on my host, so I tried that here too since the README doesn't mention any containers:

```bash
cd triton_distributed/
git checkout ptarasiewicz/vllm-example-rust-runtime
cd runtime/rust/python-wheel/
uv venv
source .venv/bin/activate
uv pip install maturin
maturin develop --uv
uv pip install vllm==0.7.2
```

But when using vllm on host (no container) I got this error from following the steps to launch monolith worker (Note I don't have cuda toolkit/runtime installed on my host globally, only the driver -- but the venv installs the cuda toolkit during vllm/pytorch install):

```bash
$ python3 -m monolith.worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 100 \
    --enforce-eager
...
ImportError: /home/rmccormick/triton/distributed/v0.2.0/triton_distributed/runtime/rust/python-wheel/.venv/lib/python3.10/site-packages/torch/lib/../../nvidia/cusparse/lib/libcusparse.so.12: undefined symbol: __nvJitLinkComplete_12_4, version libnvJitLink.so.12
```

Seems to be work-aroundable by doing something like this to add this libnvjitlink to the ld library path:

```bash
rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/examples/python_rs/llm/vllm$ export LD_LIBRARY_PATH=$PWD/../../../../runtime/rust/python-wheel/.venv/lib64/python3.10/site-packages/nvidia/nvjitlink/lib:${LD_LIBRARY_PATH}
```

Similar issue: pytorch/pytorch#111469
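
An alternative workaround, not tried in this thread and only an assumption based on the linked PyTorch issue: the undefined `__nvJitLinkComplete_12_4` symbol usually points at mismatched pip-installed CUDA 12.x components, so refreshing `nvidia-nvjitlink-cu12` inside the venv may also resolve it without touching LD_LIBRARY_PATH:

```bash
# Hypothetical alternative to the LD_LIBRARY_PATH workaround above:
# pull a matching nvJitLink build into the venv
uv pip install --upgrade nvidia-nvjitlink-cu12
```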

@rmccorm4 (Contributor) commented:

After the above WAR to put libnvjitlink on the LD library path, I can run the monolith worker and client:

```bash
rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/runtime/rust/python-wheel$ source .venv/bin/activate

(python-wheel) rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/runtime/rust/python-wheel$ cd ../../../examples/python_rs/llm/vllm

(python-wheel) rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/examples/python_rs/llm/vllm$ python3 -m common.client \
    --prompt "what is the capital of france?" \
    --max-tokens 10 \
    --temperature 0.5
INFO 02-07 11:18:48 __init__.py:190] Automatically detected platform cuda.
...
[7587884607120538396]
Annotated(data=' Well', event=None, comment=[], id=None)
Annotated(data=' Well,', event=None, comment=[], id=None)
Annotated(data=' Well, France', event=None, comment=[], id=None)
Annotated(data=' Well, France is', event=None, comment=[], id=None)
Annotated(data=' Well, France is a', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in Western', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in Western Europe', event=None, comment=[], id=None)
```

@ptarasiewiczNV (PR author) replied:

I did not assume container usage. I was running this both through the NGC PyTorch container and on bare metal on my desktop. I like the idea of not having to run it through the container, as the user should be able to just install wheels.

We can still refer to our container for reproduction, but it would have to be

```bash
./container/build.sh
./container/run.sh -it
```

as the vLLM container has some settings required for the old example. Or do we just remove the old example right now, @nnshah1?

@rmccorm4 (Contributor) commented:

Assuming the same fixes mentioned above, I'm able to run the disaggregated example on a single node with 2 GPUs (heterogeneous GPUs, too!)

**Prefill Worker**

```bash
# Make sure venv is activated, and LD_LIBRARY_PATH has necessary libs from venv
$ CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 100 \
    --gpu-memory-utilization 0.8 \
    --enforce-eager \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'
...
INFO 02-07 11:27:46 model_runner.py:1115] Loading model weights took 14.9888 GB
INFO 02-07 11:27:47 worker.py:267] Memory profiling takes 0.98 seconds
INFO 02-07 11:27:47 worker.py:267] the current vLLM instance can use total_gpu_memory (47.50GiB) x gpu_memory_utilization (0.80) = 38.00GiB
INFO 02-07 11:27:47 worker.py:267] model weights take 14.99GiB; non_torch_memory takes 0.17GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 21.66GiB.
INFO 02-07 11:27:47 executor_base.py:110] # CUDA blocks: 11089, # CPU blocks: 2048
INFO 02-07 11:27:47 executor_base.py:115] Maximum concurrency for 100 tokens per request: 1774.24x
INFO 02-07 11:27:50 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 4.11 seconds
INFO 02-07 11:28:09 prefill_worker.py:41] Received prefill request: prompt='what is the capital of france?' sampling_params={'temperature': 0.5, 'max_tokens': 1} request_id='84bf7368-f5b1-43e2-89da-e46867af5ac8'
INFO 02-07 11:28:09 async_llm_engine.py:211] Added request 84bf7368-f5b1-43e2-89da-e46867af5ac8.
INFO 02-07 11:28:09 metrics.py:455] Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 0 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.0%, CPU KV cache usage: 0.0%.
INFO 02-07 11:28:09 async_llm_engine.py:179] Finished request 84bf7368-f5b1-43e2-89da-e46867af5ac8.
```

**Decode Worker**

```bash
# Make sure venv is activated, and LD_LIBRARY_PATH has necessary libs from venv
$ CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker \
    --model deepseek-ai/DeepSeek-R1-Distill-Llama-8B \
    --max-model-len 100 \
    --gpu-memory-utilization 0.8 \
    --enforce-eager \
    --kv-transfer-config \
    '{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'
...
INFO 02-07 11:27:46 model_runner.py:1115] Loading model weights took 14.9888 GB
INFO 02-07 11:27:47 worker.py:267] Memory profiling takes 1.16 seconds
INFO 02-07 11:27:47 worker.py:267] the current vLLM instance can use total_gpu_memory (23.69GiB) x gpu_memory_utilization (0.80) = 18.95GiB
INFO 02-07 11:27:47 worker.py:267] model weights take 14.99GiB; non_torch_memory takes 0.14GiB; PyTorch activation peak memory takes 1.19GiB; the rest of the memory reserved for KV Cache is 2.63GiB.
INFO 02-07 11:27:47 executor_base.py:110] # CUDA blocks: 1347, # CPU blocks: 2048
INFO 02-07 11:27:47 executor_base.py:115] Maximum concurrency for 100 tokens per request: 215.52x
INFO 02-07 11:27:50 llm_engine.py:431] init engine (profile, create kv cache, warmup model) took 4.31 seconds
INFO 02-07 11:28:09 decode_worker.py:43] Received request: prompt='what is the capital of france?' sampling_params={'temperature': 0.5, 'max_tokens': 10}
INFO 02-07 11:28:09 async_llm_engine.py:211] Added request 84bf7368-f5b1-43e2-89da-e46867af5ac8.
INFO 02-07 11:28:09 metrics.py:455] Avg prompt throughput: 0.4 tokens/s, Avg generation throughput: 0.1 tokens/s, Running: 1 reqs, Swapped: 0 reqs, Pending: 0 reqs, GPU KV cache usage: 0.1%, CPU KV cache usage: 0.0%.
INFO 02-07 11:28:10 async_llm_engine.py:179] Finished request 84bf7368-f5b1-43e2-89da-e46867af5ac8.
```
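
Side by side, the two --kv-transfer-config values above (copied verbatim from those commands) show the pairing: the prefill side is the KV producer at rank 0, the decode side is the KV consumer at rank 1, and both declare the same kv_parallel_size of 2. The variable names below are just illustrative, not part of the example:

```bash
# KV-transfer configs used above: same connector and kv_parallel_size,
# complementary roles and ranks (producer = prefill, consumer = decode)
PREFILL_KV_CONFIG='{"kv_connector":"PyNcclConnector","kv_role":"kv_producer","kv_rank":0,"kv_parallel_size":2}'
DECODE_KV_CONFIG='{"kv_connector":"PyNcclConnector","kv_role":"kv_consumer","kv_rank":1,"kv_parallel_size":2}'

# e.g. passed as: ... --kv-transfer-config "$PREFILL_KV_CONFIG"
```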

**Client**

```bash
# Make sure venv is activated, and LD_LIBRARY_PATH has necessary libs from venv
(python-wheel) rmccormick@ced35d0-lcedt:~/triton/distributed/v0.2.0/triton_distributed/examples/python_rs/llm/vllm$ python3 -m common.client \
    --prompt "what is the capital of france?" \
    --max-tokens 10 \
    --temperature 0.5

INFO 02-07 11:28:09 __init__.py:190] Automatically detected platform cuda.
WARNING 02-07 11:28:09 cuda.py:336] Detected different devices in the system:
WARNING 02-07 11:28:09 cuda.py:336] NVIDIA GeForce RTX 3090
WARNING 02-07 11:28:09 cuda.py:336] NVIDIA RTX 5880 Ada Generation
WARNING 02-07 11:28:09 cuda.py:336] Please make sure to set `CUDA_DEVICE_ORDER=PCI_BUS_ID` to avoid unexpected behavior.
[7587884607120538406]
Annotated(data=' Well', event=None, comment=[], id=None)
Annotated(data=' Well,', event=None, comment=[], id=None)
Annotated(data=' Well, France', event=None, comment=[], id=None)
Annotated(data=' Well, France is', event=None, comment=[], id=None)
Annotated(data=' Well, France is a', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in Western', event=None, comment=[], id=None)
Annotated(data=' Well, France is a country located in Western Europe', event=None, comment=[], id=None)
```
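
Following the cuda.py warning in the client output above, on a mixed-GPU host it is presumably safest to pin the device ordering before assigning GPUs to the two workers. A small sketch; the worker flags are elided here, see the full commands above:

```bash
# Make CUDA device indices follow PCI bus order, as the vLLM warning suggests,
# so CUDA_VISIBLE_DEVICES=0 / 1 picks the intended physical GPUs
export CUDA_DEVICE_ORDER=PCI_BUS_ID

CUDA_VISIBLE_DEVICES=0 python3 -m disaggregated.prefill_worker ...  # prefill GPU
CUDA_VISIBLE_DEVICES=1 python3 -m disaggregated.decode_worker ...   # decode GPU
```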

@nnshah1 (Collaborator) replied: let's not remove the old example yet - let's update the instructions now

@nnshah1 merged commit 164d7c6 into main on Feb 7, 2025 (5 checks passed)
@nnshah1 deleted the ptarasiewicz/vllm-example-rust-runtime branch on February 7, 2025 23:14
Labels: enhancement (New feature or request)