📝 vLLM integration doc #3358
base: main
Conversation
@@ -1,14 +1,72 @@
# vLLM Integration
Section under construction. Feel free to contribute!
Online methods such as GRPO or Online DPO require the model to generate completions. Because these completions are used to compute the reward signal, they need to be generated at regular intervals during training. This is typically done every `gradient_accumulation_steps * num_iterations` steps, where `num_iterations` is the number of iterations between two gradient updates. The problem is that generating completions is a time-consuming process, especially when using large models. By default, generation is done using the [(unwrapped) model's `generate` method](https://github.com/huggingface/trl/blob/f3e8c2304428ef16e9ae5de9e5741ed84d533b7b/trl/trainer/grpo_trainer.py#L965C39-L965C66) (that is, when `use_vllm` is set to `False`). This `unwrapped_model.generate` method is technically a synchronous function, meaning that it will block the execution of the program until the generation is complete. This can lead to inefficiencies, especially when generating large batches of completions or when using large models. The generation process can therefore become a bottleneck during training, leading to longer training times and reduced efficiency. This is why we need a better approach, which is to use vLLM for faster generation.
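As a quick orientation, here is a minimal sketch of what flipping that flag looks like in practice with `GRPOTrainer`. The dataset, reward function, and model name are illustrative placeholders; only `use_vllm` is the option discussed above.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative reward: prefer completions close to 100 characters.
def reward_len(completions, **kwargs):
    return [-abs(100 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO",
    use_vllm=True,  # offload generation to vLLM instead of unwrapped_model.generate()
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```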
What we name `num_iterations` in GRPO isn't the number of iterations between two gradient updates. You should use another term to avoid confusion.
> meaning that it will block the execution of the program until the generation is complete

I'm not sure why you mention this? Put like this, it seems to imply that this is the major limitation that will be eliminated with vLLM.
As far as the general structure is concerned, I think that when users come across this page, the first question they have in mind is *How can I use vLLM with TRL to make things go faster?* (a fortiori with GRPO). So my advice is to get straight to this point. And further down, explain in more detail why generation is the major bottleneck, how vLLM optimizes generation, etc.
In short, I see this page more as a practical guide.
@@ -40,8 +98,74 @@ options:
    feature. (default: None)

### Find the best distributed setup
# 🥸 Okay, now that we have the server running, how can we use it to generate completions?
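To make this concrete, here is a rough sketch of the intended flow, assuming the server runs on GPUs separate from the training process. The `vllm_server_host` / `vllm_server_port` parameter names and values are assumptions and may differ between TRL versions; only `use_vllm` and the `trl vllm-serve` command come from the text above.

```python
from trl import GRPOConfig

# Step 1 (inference node), run in a separate process / on separate GPUs:
#   trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2
#
# Step 2 (training node): tell the trainer to send prompts to that server instead
# of calling unwrapped_model.generate() locally.
training_args = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO",
    use_vllm=True,
    vllm_server_host="192.168.1.10",  # assumed parameter name: address of the machine running `trl vllm-serve`
    vllm_server_port=8000,            # assumed parameter name: port the server listens on
)
```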
There seem to be heading level issues.
# 🔎 What exactly happens when you run `trl vllm-serve --model <model_name>`?
When you run, for example, `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, the following happens:

I have the impression that on the figure you're using DP=4, and not DP=2.
1. First, it will spawn multiple workers to handle loads of requests in parallel. To figure out exactly how many workers to spawn, it uses the `--tensor-parallel-size` and `--data-parallel-size` arguments from the command. For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, it will spawn 4 workers (2 for data parallelism and 2 for tensor parallelism). The tricky point here is that you need to think of it as 4 independent workers processing chunks of the incoming requests in parallel, at the same time. Here the requests are basically the prompts that are sent to the model on the server to generate completions. Each of these workers (GPUs) is therefore responsible for processing a chunk of the incoming requests.
> For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, it will spawn 4 workers (2 for data parallelism and 2 for tensor parallelism)

When you put it like that, I think it's a bit misleading. It seems to imply this can also be true, which it is not:

> For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 3 --data-parallel-size 2`, it will spawn 5 workers (2 for data parallelism and 3 for tensor parallelism)
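To make the intended relationship explicit, here is a tiny illustrative sketch (the helper below is hypothetical, not a TRL or vLLM API): assuming one worker per GPU, the worker count is the product of the two sizes, not their sum.

```python
def total_vllm_workers(tensor_parallel_size: int, data_parallel_size: int) -> int:
    # Hypothetical helper: one worker per GPU, so the total is the product
    # of the two sizes, not their sum.
    return tensor_parallel_size * data_parallel_size

print(total_vllm_workers(2, 2))  # 4 workers, as in the example above
print(total_vllm_workers(3, 2))  # 6 workers, not 5
```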
2. Now that we have the requests (prompts) ready on each of the workers, the model will start generating the completions. Note that the model itself (the model's weights, actually) is split across multiple GPUs on the vLLM side (`--tensor-parallel-size`), and each GPU will be responsible for processing a chunk of the incoming requests (`--data-parallel-size`); a sketch of this split follows the list below.
3. Although the GPUs process the requests in parallel and independently of one another, they still need to communicate with each other, because each of them processes only a chunk of the incoming prompts (e.g. if you have 4 GPUs and 8 requests, with dp=2, each GPU will process 2 requests). This GPU-to-GPU communication is handled by NVIDIA's NCCL library, and it is just there to make sure each GPU gets its slice of the prompts/requests. Note that you can set `num_generations`, the number of completions to generate for each request. So if you have 4 GPUs and 8 requests/prompts, with dp=2 and num_generations=2, each GPU will process 2 prompts and generate 2 completions for each of them. So in total, you will have 16 completions.
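The sketch below illustrates the split described in items 2 and 3: weights are sharded across the GPUs of a TP group, while prompts are sharded across DP replicas. The contiguous-chunk split and the GPU numbering are assumptions made for the example, not the actual vLLM scheduling logic.

```python
# Illustrative only: 8 prompts split across 2 DP replicas, where each replica
# is a TP group of 2 GPUs that share the same prompts but hold different
# shards of the model weights.
prompts = [f"prompt_{i}" for i in range(8)]
data_parallel_size = 2
tensor_parallel_size = 2

chunk = len(prompts) // data_parallel_size
for dp_rank in range(data_parallel_size):
    shard = prompts[dp_rank * chunk : (dp_rank + 1) * chunk]
    tp_gpus = [dp_rank * tensor_parallel_size + r for r in range(tensor_parallel_size)]
    print(f"DP replica {dp_rank} (GPUs {tp_gpus}) generates completions for {shard}")
```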
> So if you have 4 GPUs and 8 requests/prompts, with dp=2 and num_generations=2, each GPU will process 2 prompts and generate 2 completions for each of them. So in total, you will have 16 completions.

When it's formulated like this ("So in total, you will have 16 completions"), it seems to imply that this number is derived from the ones you gave before (number of GPUs, DP). However, this number only comes from the fact that you have 8 prompts and 2 completions per prompt.
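For clarity, a small sketch of that arithmetic (the variable names are illustrative): the total depends only on the number of prompts and `num_generations`.

```python
# Illustrative arithmetic only: the total number of completions depends on the
# number of prompts and num_generations, not on the GPU count or DP size.
num_prompts = 8
num_generations = 2  # completions generated per prompt

total_completions = num_prompts * num_generations
print(total_completions)  # 16, regardless of how many GPUs or DP replicas are used
```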
What does this PR do?
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.