📝 vLLM integration doc #3358
base: main
Conversation
@@ -1,14 +1,72 @@
# vLLM Integration
Section under construction. Feel free to contribute!
Online methods such as GRPO or Online DPO require the model to generate completions. Because these completions are used to compute the reward signal, they need to be generated at regular intervals during training. This is typically done every `gradient_accumulation_steps * num_iterations` steps, where `num_iterations` is the number of iterations between two gradient updates. The problem is that generating completions is a time-consuming process, especially when using large models. By default, generation is done using the [(unwrapped) model's `generate` method](https://github.com/huggingface/trl/blob/f3e8c2304428ef16e9ae5de9e5741ed84d533b7b/trl/trainer/grpo_trainer.py#L965C39-L965C66) (that is, when `use_vllm` is set to `False`). This `unwrapped_model.generate` method is technically a synchronous function, meaning that it will block the execution of the program until the generation is complete. This can lead to inefficiencies, especially when generating large batches of completions or when using large models. The generation process can therefore become a bottleneck during training, leading to longer training times and reduced efficiency. This is why we need a better approach, which is to use vLLM for faster generation.
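As a quick orientation, here is a minimal sketch of what flipping that flag looks like in practice with `GRPOTrainer`. The dataset, reward function, and model name are illustrative placeholders; only `use_vllm` is the option discussed above.

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# Illustrative reward: prefer completions close to 100 characters.
def reward_len(completions, **kwargs):
    return [-abs(100 - len(c)) for c in completions]

dataset = load_dataset("trl-lib/tldr", split="train")

training_args = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO",
    use_vllm=True,  # offload generation to vLLM instead of unwrapped_model.generate()
)

trainer = GRPOTrainer(
    model="Qwen/Qwen2.5-7B",
    reward_funcs=reward_len,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
```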
What we name `num_iterations` in GRPO isn't the number of iterations between two gradient updates. You should use another term to avoid confusion.
> meaning that it will block the execution of the program until the generation is complete

I'm not sure why you mention this? Put like this, it seems to imply that this is the major limitation that will be eliminated with vLLM.
As far as the general structure is concerned, I think that when users come across this page, the first question they have in mind is *How can I use vLLM with TRL to make things go faster?* (a fortiori with GRPO). So my advice is to get straight to this point. And further down, explain in more detail why generation is the major bottleneck, how vLLM optimizes generation, etc.
In short, I see this page more as a practical guide.
@@ -40,8 +98,74 @@ options:
    feature. (default: None)

### Find the best distributed setup
# 🥸 Okay, now that we have the server running, how can we use it to generate completions?
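To make this concrete, here is a rough sketch of the intended flow, assuming the server runs on GPUs separate from the training process. The `vllm_server_host` / `vllm_server_port` parameter names and values are assumptions and may differ between TRL versions; only `use_vllm` and the `trl vllm-serve` command come from the text above.

```python
from trl import GRPOConfig

# Step 1 (inference node), run in a separate process / on separate GPUs:
#   trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2
#
# Step 2 (training node): tell the trainer to send prompts to that server instead
# of calling unwrapped_model.generate() locally.
training_args = GRPOConfig(
    output_dir="Qwen2.5-7B-GRPO",
    use_vllm=True,
    vllm_server_host="192.168.1.10",  # assumed parameter name: address of the machine running `trl vllm-serve`
    vllm_server_port=8000,            # assumed parameter name: port the server listens on
)
```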
There seem to be heading level issues.
# 🔎 What exactly happens when you run `trl vllm-serve --model <model_name>`?
When you run, for example, `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, the following happens:

I have the impression that on the figure you're using DP=4, and not DP=2.
1. First, it will spawn multiple workers to handle loads of requests in parallel. To figure out exactly how many workers to spawn, it uses the `--tensor-parallel-size` and `--data-parallel-size` arguments from the command. For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, it will spawn 4 workers (2 for data parallelism and 2 for tensor parallelism). The tricky point here is that you need to think of it as 4 independent workers processing chunks of the incoming requests in parallel, at the same time. Here the requests are basically the prompts that are sent to the model on the server to generate completions. Each of these workers (GPUs) is therefore responsible for processing a chunk of the incoming requests.
> For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 2 --data-parallel-size 2`, it will spawn 4 workers (2 for data parallelism and 2 for tensor parallelism)

When you put it like that, I think it's a bit misleading. It seems to imply this can also be true, which it is not:

> For example, if you run `trl vllm-serve --model Qwen/Qwen2.5-7B --tensor-parallel-size 3 --data-parallel-size 2`, it will spawn 5 workers (2 for data parallelism and 3 for tensor parallelism)
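To make the intended relationship explicit, here is a tiny illustrative sketch (the helper below is hypothetical, not a TRL or vLLM API): assuming one worker per GPU, the worker count is the product of the two sizes, not their sum.

```python
def total_vllm_workers(tensor_parallel_size: int, data_parallel_size: int) -> int:
    # Hypothetical helper: one worker per GPU, so the total is the product
    # of the two sizes, not their sum.
    return tensor_parallel_size * data_parallel_size

print(total_vllm_workers(2, 2))  # 4 workers, as in the example above
print(total_vllm_workers(3, 2))  # 6 workers, not 5
```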
2. Now that we have the requests (prompts) ready on each of the workers, the model will start generating the completions. Note that the model itself (the model's weights, actually) is split across multiple GPUs on the vLLM side (`--tensor-parallel-size`), and each GPU will be responsible for processing a chunk of the incoming requests (`--data-parallel-size`); a sketch of this split follows the list below.
3. Although the GPUs process the requests in parallel and independently of one another, they still need to communicate with each other, because each of them processes only a chunk of the incoming prompts (e.g. if you have 4 GPUs and 8 requests, with dp=2, each GPU will process 2 requests). This GPU-to-GPU communication is handled by NVIDIA's NCCL library, and it is just there to make sure each GPU gets its slice of the prompts/requests. Note that you can set `num_generations`, the number of completions to generate for each request. So if you have 4 GPUs and 8 requests/prompts, with dp=2 and num_generations=2, each GPU will process 2 prompts and generate 2 completions for each of them. So in total, you will have 16 completions.
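The sketch below illustrates the split described in items 2 and 3: weights are sharded across the GPUs of a TP group, while prompts are sharded across DP replicas. The contiguous-chunk split and the GPU numbering are assumptions made for the example, not the actual vLLM scheduling logic.

```python
# Illustrative only: 8 prompts split across 2 DP replicas, where each replica
# is a TP group of 2 GPUs that share the same prompts but hold different
# shards of the model weights.
prompts = [f"prompt_{i}" for i in range(8)]
data_parallel_size = 2
tensor_parallel_size = 2

chunk = len(prompts) // data_parallel_size
for dp_rank in range(data_parallel_size):
    shard = prompts[dp_rank * chunk : (dp_rank + 1) * chunk]
    tp_gpus = [dp_rank * tensor_parallel_size + r for r in range(tensor_parallel_size)]
    print(f"DP replica {dp_rank} (GPUs {tp_gpus}) generates completions for {shard}")
```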
> So if you have 4 GPUs and 8 requests/prompts, with dp=2 and num_generations=2, each GPU will process 2 prompts and generate 2 completions for each of them. So in total, you will have 16 completions.

When it's formulated like this ("So in total, you will have 16 completions"), it seems to imply that this number is derived from the ones you gave before (number of GPUs, DP). However, this number only comes from the fact that you have 8 prompts and 2 completions per prompt.
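For clarity, a small sketch of that arithmetic (the variable names are illustrative): the total depends only on the number of prompts and `num_generations`.

```python
# Illustrative arithmetic only: the total number of completions depends on the
# number of prompts and num_generations, not on the GPU count or DP size.
num_prompts = 8
num_generations = 2  # completions generated per prompt

total_completions = num_prompts * num_generations
print(total_completions)  # 16, regardless of how many GPUs or DP replicas are used
```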
What does this PR do?
Fixes # (issue)
Before submitting
- Did you read the contributor guideline, Pull Request section?
- Was this discussed/approved via a GitHub issue? Please add a link to it if that's the case.
Who can review?
Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.