vLLM multi-GPU inference issue #18

Open
BarryAlllen opened this issue Nov 27, 2024 · 5 comments
Comments


BarryAlllen commented Nov 27, 2024

My command:

vllm serve TeleChat2-7B \
    --trust-remote-code \
    --max-model-len 2000 \
    --tensor-parallel-size 2 \
    --dtype float16 --port 10000

After running this, it gets stuck at one step and never continues loading the model:

INFO 11-27 02:16:22 api_server.py:495] vLLM API server version 0.6.1.post2
INFO 11-27 02:16:22 api_server.py:496] args: Namespace(model_tag='TeleChat2-7B', config='', host=None, port=10000,

......

INFO 11-27 02:16:26 custom_cache_manager.py:17] Setting Triton cache manager to: vllm.triton_utils.custom_cache_manager:CustomCacheManager
INFO 11-27 02:16:26 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
INFO 11-27 02:16:26 selector.py:116] Using XFormers backend.
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:26 selector.py:217] Cannot use FlashAttention-2 backend for Volta and Turing GPUs.
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:26 selector.py:116] Using XFormers backend.
/opt/conda/envs/telechat/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_fwd")
/opt/conda/envs/telechat/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
  @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=23694) /opt/conda/envs/telechat/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:211: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=23694)   @torch.library.impl_abstract("xformers_flash::flash_fwd")
(VllmWorkerProcess pid=23694) /opt/conda/envs/telechat/lib/python3.11/site-packages/xformers/ops/fmha/flash.py:344: FutureWarning: `torch.library.impl_abstract` was renamed to `torch.library.register_fake`. Please use that instead; we will remove `torch.library.impl_abstract` in a future version of PyTorch.
(VllmWorkerProcess pid=23694)   @torch.library.impl_abstract("xformers_flash::flash_bwd")
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:27 multiproc_worker_utils.py:215] Worker ready; awaiting tasks
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:27 utils.py:981] Found nccl from library libnccl.so.2
INFO 11-27 02:16:27 utils.py:981] Found nccl from library libnccl.so.2
(VllmWorkerProcess pid=23694) INFO 11-27 02:16:27 pynccl.py:63] vLLM is using nccl==2.20.5
INFO 11-27 02:16:27 pynccl.py:63] vLLM is using nccl==2.20.5 <------- it stays stuck at this step and never proceeds

Each GPU has only loaded a little over 400 MB of memory.
[Screenshot: 微信截图_20241127102238]

I don't know what the problem is yet. I'd like to know whether my arguments are wrong, or whether some configuration still needs to be changed.

PS: Loading the model and running inference on a single GPU works fine.
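
For reference, a generic way to narrow down where the NCCL initialization hangs (not something from this thread, just a standard NCCL debugging step; NCCL_DEBUG is a stock NCCL environment variable) is to enable NCCL's own logging before launching the server:

# Standard NCCL setting: print transport/topology details during init,
# so the log shows which step (P2P check, shared memory, IB, ...) stalls.
export NCCL_DEBUG=INFO

vllm serve TeleChat2-7B \
    --trust-remote-code \
    --max-model-len 2000 \
    --tensor-parallel-size 2 \
    --dtype float16 --port 10000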


shunxing12345 commented Nov 28, 2024

vLLM already supports TeleChat2. You can pull the latest vLLM code from the official repository, install it, and use TeleChat2 with it.

@BarryAlllen (Author)

vLLM already supports TeleChat2. You can pull the latest vLLM code from the official repository, install it, and use TeleChat2 with it.

I'm using version 0.6.1.post2 as recommended in the documentation, but the problem above persists.

spunk166 commented Dec 9, 2024

When loading a model with vLLM, the first multi-GPU launch starts normally, but once the model process exits and I try to launch it again, it hangs. I also tried the method from 《VLLM启动时NCCL遇到显卡P2P通信问题》 ("NCCL GPU P2P communication problems when starting vLLM"), but that did not solve it either. In the end, loading the model on multiple GPUs with SGLang worked without any problems; the only drawback is that SGLang does not support Tool calls.
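
For readers hitting the same hang: the workaround usually described for the P2P problem referenced above is to disable NCCL's GPU peer-to-peer transport so that inter-GPU communication falls back to copies through host memory. This is only a sketch of that common workaround (NCCL_P2P_DISABLE is a standard NCCL environment variable), and, as the comment above notes, it did not fix the issue in this case:

# Common P2P workaround: slower, but sometimes unblocks NCCL initialization
export NCCL_P2P_DISABLE=1
vllm serve TeleChat2-7B --tensor-parallel-size 2 --trust-remote-code --dtype float16 --port 10000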

@sparkingarthur (Contributor)

When loading a model with vLLM, the first multi-GPU launch starts normally, but once the model process exits and I try to launch it again, it hangs. I also tried the method from 《VLLM启动时NCCL遇到显卡P2P通信问题》 ("NCCL GPU P2P communication problems when starting vLLM"), but that did not solve it either. In the end, loading the model on multiple GPUs with SGLang worked without any problems; the only drawback is that SGLang does not support Tool calls.

TeleChat can actually run on SGLang?

@qinyuenlp

Has this issue been resolved? I ran into the same problem: deploying Qwen and DS with vLLM, the first launch works normally; after exiting, subsequent launches hang once the model has loaded about 500 MB, with GPU utilization stuck at 100%.
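
A generic check when a relaunch hangs even though the first launch worked (not mentioned in this thread, just a common troubleshooting step): make sure no worker processes from the previous run are still alive and holding GPU memory before starting again.

# List processes still attached to the GPUs
nvidia-smi
# If stale vLLM workers remain, terminate them before relaunching
# (adjust the pattern to match your launch command)
pkill -9 -f "vllm serve"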
