
📝 vLLM integration doc #3358


Open

wants to merge 3 commits into main
Conversation

shirinyamani
Member

What does this PR do?

Fixes # (issue)

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a GitHub issue? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@shirinyamani changed the title from Vllm document to 📝 vLLM integration doc on Apr 25, 2025
@@ -1,14 +1,72 @@
# vLLM Integration


Section under construction. Feel free to contribute!
Online methods such as GRPO or Online DPO require the model to generate completions. Because these completions are used to compute the reward signal, they need to be generated at regular intervals during training, typically every `gradient_accumulation_steps * num_iterations` steps, where `num_iterations` is the number of iterations between two gradient updates. The problem is that generating completions is time-consuming, especially with large models. By default, generation is done using the [(unwrapped) model's `generate` method](https://github.com/huggingface/trl/blob/f3e8c2304428ef16e9ae5de9e5741ed84d533b7b/trl/trainer/grpo_trainer.py#L965C39-L965C66) (that is, when `use_vllm` is set to `False`). This `unwrapped_model.generate` method is technically a synchronous function, meaning that it will block the execution of the program until the generation is complete. This can lead to inefficiencies, especially when generating large batches of completions or when using large models, so generation can become a bottleneck in the training process, leading to longer training times and reduced efficiency. This is why we need a better way to do this: using vLLM for faster generation.
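To make this concrete, here is a minimal sketch of what switching to vLLM generation looks like on the training side. It follows the GRPO quickstart-style setup; the dataset, reward function, and model name are illustrative, and the key field `use_vllm=True` in `GRPOConfig` should be checked against the TRL version you are using:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative reward: favor completions that are close to 20 characters long.
def reward_len(completions, **kwargs):
    return [-abs(20 - len(completion)) for completion in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2.5-0.5B-GRPO",
    use_vllm=True,  # offload generation to vLLM instead of unwrapped_model.generate
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-0.5B-Instruct",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```

With `use_vllm=True`, the trainer offloads generation to vLLM rather than calling `unwrapped_model.generate` itself; the rest of this page explains how that vLLM server is launched and what happens inside it.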
Member

What we name num_iterations in GRPO isn't the number of iterations between two gradient updates. You should use another term to avoid confusion

Member

meaning that it will block the execution of the program until the generation is complete

I'm not sure why you mention this. Put like this, it seems to imply that this is the major limitation that will be eliminated with vLLM.

Member

As far as the general structure is concerned, I think that when users come across this page, the first question they have in mind is: How can I use vLLM with TRL to make things go faster? (a fortiori with GRPO). So my advice is to get straight to that point, and further down, explain in more detail why generation is the major bottleneck, how vLLM optimizes generation, etc.

In short, I see this page more as a practical guide.

@@ -40,8 +98,74 @@ options:
feature. (default: None)
```

### Find the best distributed setup
# 🥸 Okay, now that we have the server running, how can we use it to generate completions?
Member

There seem to be heading level issues
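For readers looking for a concrete answer to the question in the heading above, here is a minimal sketch of querying the running server directly from Python. It assumes the `VLLMClient` helper in `trl.extras.vllm_client`; the exact import path, constructor arguments, and the shape of the returned completions should be verified against the TRL version in use:

```python
from trl.extras.vllm_client import VLLMClient

# Assumes a server was started beforehand with, e.g.:
#   trl vllm-serve --model Qwen/Qwen2.5-7B
# The default constructor targets a local server; pass the host/port
# arguments of your TRL version if the server runs elsewhere.
client = VLLMClient()

# Request completions for a small batch of prompts.
completions = client.generate(["Hello, AI!", "Tell me a joke"])
print(len(completions))  # one entry per prompt
```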

# 🔎 What exactly happens when you run `trl vllm-serve --model <model_name>`?
When you run, for example, `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, the following happens:
![vllm](https://huggingface.co/datasets/trl-lib/documentation-images/resolve/main/vllm_integration_dp2_tp2.png)
Member

I have the impression that on the figure you're using DP=4, and not DP=2.

1. First, it spawns multiple workers to handle incoming requests in parallel. To figure out exactly how many workers to spawn, it uses the `--tensor-parallel-size` and `--data-parallel-size` arguments from the command. For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, it will spawn 4 workers (2 for data parallelism and 2 for tensor parallelism). The tricky point here is that you need to think of it as 4 independent workers, all processing a chunk of the incoming requests in parallel at the same time. Here the requests are basically the prompts sent to the model on the server to generate completions. So each of these workers (GPUs) is responsible for processing a chunk of the incoming requests.
Member

For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, it will spawn 4 workers (2 for data parallelism and 2 for tensor parallelism)

When you put it like that, I think it's a bit misleading. It seems to imply this can also be true, which it is not:

For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 3 --data-parallel-size 2`, it will spawn 5 workers (2 for data parallelism and 3 for tensor parallelism)


2. Now that we have the requests (prompts) ready on each of the workers, the model will start generating the completions. Note that the model itself (the model's weights, actually) is split across multiple GPUs on the vLLM side (`--tensor-parallel-size`), and each GPU is responsible for processing a chunk of the incoming requests (`--data-parallel-size`).

3. Although the GPUs process the requests in parallel and independently of one another, they still need to communicate with each other. Recall that each of them processes a chunk of the incoming prompts (e.g. if you have 4 GPUs and 8 requests, with dp=2, each GPU will process 2 requests). This GPU-to-GPU communication is handled by NVIDIA's NCCL library, and it is only there to make sure each GPU gets its slice of the prompts/requests. Note that you can set `num_generations`, the number of completions to generate for each prompt. So if you have 4 GPUs and 8 requests/prompts, with dp=2 and num_generations=2, each GPU will process 2 prompts and generate 2 completions for each of them. So in total, you will have 16 completions (see the short sketch after this list).
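To make the arithmetic in the steps above concrete, here is a tiny plain-Python sketch (illustrative bookkeeping only, not TRL or vLLM code) of how the worker count and the completion count are derived in this example:

```python
# Illustrative arithmetic only; not TRL or vLLM code.
tensor_parallel_size = 2  # GPUs that share one copy of the model weights
data_parallel_size = 2    # independent replicas, each serving a chunk of the requests

# The two sizes multiply (they do not add): 2 * 2 = 4 GPUs in total.
num_gpus = tensor_parallel_size * data_parallel_size

num_prompts = 8       # prompts sent to the server in one batch
num_generations = 2   # completions requested per prompt

# Each data-parallel replica receives an equal share of the prompts.
prompts_per_replica = num_prompts // data_parallel_size  # 8 / 2 = 4

# The total number of completions depends only on the prompt count and
# num_generations, not on the number of GPUs: 8 * 2 = 16.
total_completions = num_prompts * num_generations
```

Note that `total_completions` depends only on the number of prompts and `num_generations`; the GPU and parallelism settings change how the work is split, not how many completions are produced.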
Member

So if you have 4 gpus and 8 requests/prompts, with dp=2 and num_generations=2, each gpu will process 2 prompts and generate 2 completions for each of them. So in total, you will have 16 completions.

When it's formulated like this ("So in total, you will have 16 completions"), it seems to imply that this number is derived from the ones you gave before (number of GPUs, DP). However, this number only comes from the fact that you have 8 prompts and 2 completions per prompt.
