Releases: vllm-project/vllm
v0.8.2
This release contains an important bug fix for the V1 engine's memory usage. We highly recommend upgrading!
Highlights
- Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377)
- Remove openvino support in favor of external plugin (#15339)
V1 Engine
- Fix V1 Engine crash while handling requests with duplicate request id (#15043)
- Support FP8 KV Cache (#14570, #15191)
- Add flag to disable cascade attention (#15243)
- Scheduler Refactoring: Add Scheduler Interface (#15250)
- Structured Output
- Spec Decode
- AMD
- Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
- TPU
Features
- Integrate `fastsafetensors` loader for loading model weights (#10647)
- Add guidance backend for structured output (#14589)
Others
- Add Kubernetes deployment guide with CPUs (#14865)
- Support reset prefix cache by specified device (#15003)
- Support tool calling and reasoning parser (#14511)
- Support --disable-uvicorn-access-log parameters (#14754)
- Support Tele-FLM Model (#15023)
- Add pipeline parallel support to `TransformersModel` (#12832)
- Enable CUDA graph support for llama 3.2 vision (#14917)
What's Changed
- [FEAT]Support reset prefix cache by specified device by @maobaolong in #15003
- [BugFix][V1] Update stats.py by @WrRan in #15139
- [V1][TPU] Change kv cache shape. by @vanbasten23 in #15145
- [FrontEnd][Perf] `merge_async_iterators` fast-path for single-prompt requests by @njhill in #15150
- [Docs] Announce Ollama and Singapore Meetups by @simon-mo in #15161
- [V1] TPU - Tensor parallel MP support by @alexm-redhat in #15059
- [BugFix] Lazily import XgrammarBackend to avoid early cuda init by @njhill in #15171
- [Doc] Clarify run vllm only on one node in distributed inference by @ruisearch42 in #15148
- Fix broken tests by @jovsa in #14713
- [Bugfix] Fix embedding assignment for InternVL-based models by @DarkLight1337 in #15086
- fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… by @sywangyi in #14673
- [V1][TPU] Support V1 Sampler for ragged attention by @NickLucche in #14227
- [Benchmark] Allow oversample request in benchmark dataset by @JenZhao in #15170
- [Core][V0] Add guidance backend for structured output by @russellb in #14589
- [Doc] Update Mistral Small 3.1/Pixtral example by @ywang96 in #15184
- [Misc] support --disable-uvicorn-access-log parameters by @chaunceyjiang in #14754
- [Attention] Flash Attention 3 - fp8 by @mickaelseznec in #14570
- [Doc] Update README.md by @DarkLight1337 in #15187
- Enable CUDA graph support for llama 3.2 vision by @mritterfigma in #14917
- typo: Update config.py by @WrRan in #15189
- [Frontend][Bugfix] support prefill decode disaggregation on deepseek by @billishyahao in #14824
- [release] Tag vllm-cpu with latest upon new version released by @khluu in #15193
- Fixing Imprecise Type Annotations by @WrRan in #15192
- [macOS] Upgrade pytorch to 2.6.0 by @linktohack in #15129
- [Bugfix] Multi-video inference on LLaVA-Onevision by @DarkLight1337 in #15082
- Add user forum to README by @hmellor in #15220
- Fix env vars for running Ray distributed backend on GKE by @richardsliu in #15166
- Replace `misc` issues with link to forum by @hmellor in #15226
- [ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 by @vermouth1992 in #15172
- [Bugfix] fix V1 Engine crash while handling requests with duplicate request id by @JasonJ2021 in #15043
- [V1] Add flag to disable cascade attention by @WoosukKwon in #15243
- Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. by @fabianlim in #14617
- [V1] Scheduler Refactoring [1/N] - Add Scheduler Interface by @WoosukKwon in #15250
- [CI/Build] LoRA : make add_lora_test safer by @varun-sundar-rabindranath in #15181
- Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 by @houseroad in #15159
- [Misc] Clean up the BitsAndBytes arguments by @jeejeelee in #15140
- [ROCM] Upgrade torch to 2.6 by @SageMoore in #15244
- [Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation by @Isotr0py in #15200
- Mention `extra_body` as a way to pass vLLM only parameters using the OpenAI client by @hmellor in #15240
- [V1][TPU] Speed up top-k on TPU by using torch.topk by @hyeygit in #15242
- [Bugfix] detect alibi and revert to FA2 by @tjohnson31415 in #15231
- [Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14857
- [Docs] Trim the latest news in README by @WoosukKwon in #15261
- [Misc] Better RayExecutor and multiprocessing compatibility by @comaniac in #14705
- Add an example for reproducibility by @WoosukKwon in #15262
- [Hardware][TPU] Add check for no additional graph compilation during runtime by @lsy323 in #14710
- [V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs by @Isotr0py in #14071
- [Doc] Update LWS docs by @Edwinhr716 in #15163
- [V1] Avoid redundant input processing in n>1 case by @njhill in #14985
- [Feature] specify model in config.yaml by @wayzeng in #14855
- [Bugfix] Add int8 torch dtype for KVCache by @shen-shanshan in #15260
- [Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL by @Isotr0py in #15273
- [Bugfix] Fix incorrect resolving order for transformers fallback by @Isotr0py in #15279
- [V1] Fix wrong import path of get_flash_attn_version by @lhtin in #15280
- [Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend by @Isotr0py in #15282
- [Misc] Add cProfile helpers by @russellb in #15074
- [v1] Refactor KVCacheConfig by @heheda12345 in #14079
- [Bugfix][VLM] fix llava processor by @MengqingCao in #15285
- Revert "[Feature] specify model in config.yaml (#14855)" by @DarkLight1337 in #15293
- [TPU][V1] MHA Pallas backend by @NickLucche in #15288
- [Build/CI] Fix env var typo by @russellb in #15305
- [Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout by @ruisearch42 in #15301
- [Bugfix][V0] Multi-sequence logprobs streaming edge case by @andylolu2 in #15259
- [FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature by @tjtanaa in #14959
- [Doc] add load_format items in docs by @wwl2755 in #14804
- [Bugfix] Fix torch.compile raise FileNotFoundError by @jeejeelee in #15278
- [Bugfix] LoRA V0 - Fix case where `max_num_seqs` is between cudagraph capture sizes by @varun-sundar-rabindranath in #15308
- [Model] Support Tele-FLM Model by @atone in #15023
- [V1] Add `disable-any-whitespace` option support for xgrammar by @russellb in #15316
- [BugFix][Typing] Fix Imprecise Type Annotations by @WrRan in #15...
v0.8.1
This release contains important bug fixes for v0.8.0. We highly recommend upgrading!
- V1 Fixes
- TPU
- Model
What's Changed
- [Bugfix] Fix interface for Olmo2 on V1 by @ywang96 in #14976
- [CI/Build] Use `AutoModelForImageTextToText` to load image models in tests by @DarkLight1337 in #14945
- [V1] Guard Against Main Thread Usage by @robertgshaw2-redhat in #14972
- [V1] TPU - Fix CI/CD runner for V1 and remove V0 tests by @alexm-redhat in #14974
- [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights by @tristanleclercq in #14950
- [Neuron] trim attention kernel tests to fit trn1.2x instance by @liangfu in #14988
- [Doc][V1] Fix V1 APC doc by @shen-shanshan in #14920
- [Kernels] LoRA - Retire SGMV and BGMV Kernels by @varun-sundar-rabindranath in #14685
- [Mistral-Small 3.1] Update docs and tests by @patrickvonplaten in #14977
- [Misc] Embedding model support LoRA by @jeejeelee in #14935
- [Bugfix] torchrun compatibility by @hiyouga in #14899
- [Bugfix][Frontend] Fix validation of `logprobs` in `ChatCompletionRequest` by @schoennenbeck in #14352
- [Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros by @yangsijia-serena in #14347
- [Bugfix] Loosen type check to avoid errors in V1 by @DarkLight1337 in #15021
- [Bugfix] Register serializers for V0 MQ Engine by @simon-mo in #15009
- [TPU][V1][Bugfix] Fix chunked prefill with padding by @NickLucche in #15037
- MI325 configs, fused_moe_kernel bugfix by @ekuznetsov139 in #14987
- [MODEL] Add support for Zamba2 models by @yury-tokpanov in #13185
- [Bugfix] Fix broken CPU quantization due to triton import by @Isotr0py in #15038
- [Bugfix] Fix LoRA extra vocab size by @jeejeelee in #15047
- [V1] Refactor Structured Output for multiple backends by @russellb in #14694
- [V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels by @WoosukKwon in #14930
- [V1] TPU - CI/CD use smaller model by @alexm-redhat in #15054
- fix long dtype in topk sampling by @chujiezheng in #15049
- [Doc] Minor v1_user_guide update by @JenZhao in #15064
- [Misc][V1] Skip device checking if not available by @comaniac in #15061
- [Model] Pixtral: Remove layer instantiation duplication by @juliendenize in #15053
- [Model] Remove duplicated message check in Mistral chat completion request by @b8zhong in #15069
- [Core] Update dtype detection and defaults by @DarkLight1337 in #14858
- [V1] Ensure using int64 for sampled token ids by @WoosukKwon in #15065
- [Bugfix] Re-enable Gemma3 for V1 by @DarkLight1337 in #14980
- [CI][Intel GPU] update XPU dockerfile and CI script by @jikunshang in #15109
- [V1][Bugfix] Fix oracle for device checking by @ywang96 in #15104
- [Misc] Avoid unnecessary HF `do_rescale` warning when passing dummy data by @DarkLight1337 in #15107
- [Bugfix] Fix size calculation of processing cache by @DarkLight1337 in #15114
- [Doc] Update tip info on using latest transformers when creating a custom Dockerfile by @MarcCote in #15070
- [Misc][Benchmark] Add support for different `tokenizer_mode` by @aarnphm in #15040
- [Bugfix] Adjust mllama to regional compilation by @jkaniecki in #15112
- [Doc] Update the "the first vLLM China Meetup" slides link to point to the first page by @imkero in #15134
- [Frontend] Remove custom_cache_manager by @fulvius31 in #13791
- [V1] Minor V1 async engine test refactor by @andoorve in #15075
New Contributors
- @tristanleclercq made their first contribution in #14950
- @hiyouga made their first contribution in #14899
- @ekuznetsov139 made their first contribution in #14987
- @yury-tokpanov made their first contribution in #13185
- @juliendenize made their first contribution in #15053
- @MarcCote made their first contribution in #15070
- @jkaniecki made their first contribution in #15112
- @fulvius31 made their first contribution in #13791
Full Changelog: v0.8.0...v0.8.1
v0.8.0
v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)!
Highlights
V1
We have now enabled the V1 engine by default (#13726) for supported use cases. Please refer to the V1 user guide for more detail. We expect better performance for supported scenarios. If you'd like to disable V1 mode, please set the environment variable `VLLM_USE_V1=0`, and send us a GitHub issue sharing the reason!
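For reference, a minimal sketch of opting out (the model name below is only a placeholder, and the variable should be set before the engine is constructed):

```python
# Hypothetical opt-out sketch: set VLLM_USE_V1=0 before building the engine
# so this process falls back to the V0 code path. "facebook/opt-125m" is
# just a placeholder model for illustration.
import os

os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```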
- Support variety of sampling parameters (#13376, #10980, #13210, #13774)
- Compatibility of prompt logprobs with prefix caching (#13949), and of sliding window with prefix caching (#13069)
- Stability fixes (#14380, #14379, #13298)
- Pluggable scheduler (#14466)
- `SupportsV0Only` protocol for model definitions (#13959)
- Metrics enhancements (#13299, #13504, #14695, #14082)
- V1 user guide (#13991) and design doc (#12745)
- Support for Structured Outputs (#12388, #14590, #14625, #14630, #14851)
- Support for LoRA (#13705, #13096, #14626)
- Enhance Pipeline Parallelism (#14585, #14643)
- Ngram speculative decoding (#13729, #13933)
DeepSeek Improvements
We observe state-of-the-art performance when running DeepSeek models on the latest version of vLLM:
- MLA Enhancements:
- Distributed Expert Parallelism (EP) and Data Parallelism (DP)
- MTP: Expand DeepSeek MTP code to support k > n_predict (#13626)
- Pipeline Parallelism:
- GEMM
New Models
- Gemma 3 (#14660)
- Note: You have to install transformers from the main branch (`pip install git+https://github.com/huggingface/transformers.git`) to use this model. Also, there may be numerical instabilities for the `float16`/`half` dtype. Please use `bfloat16` (preferred by HF) or `float32` dtype (see the sketch after this list).
- Mistral Small 3.1 (#14957)
- Phi-4-multimodal-instruct (#14119)
- Grok1 (#13795)
- QwQ-32B and tool calling (#14479, #14478)
- Zamba2 (#13185)
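As a hedged illustration of the Gemma 3 dtype note above (the exact checkpoint id is an assumption, not taken from the release notes):

```python
# Sketch only: serve Gemma 3 with the recommended bfloat16 dtype.
# Assumes transformers was installed from its main branch; the checkpoint id
# "google/gemma-3-4b-it" is a placeholder for whichever Gemma 3 model you use.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=64)
print(llm.generate("Summarize the v0.8.0 release in one sentence.", params)[0].outputs[0].text)
```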
NVIDIA Blackwell
- Support nvfp4 cutlass gemm (#13571)
- Add cutlass support for blackwell fp8 gemm (#13798)
- Update the flash attn tag to support Blackwell (#14244)
- Add ModelOpt FP4 Checkpoint Support (#12520)
Breaking Changes
- The default value of `seed` is now `None` to align with PyTorch and Hugging Face. Please explicitly set the seed for reproducibility (#14274); see the sketch below.
- The `kv_cache` and `attn_metadata` arguments for the model's forward method have been removed; the attention backend has access to these values via `forward_context`. (#13887)
- vLLM will now default `generation_config` from the model for the chat template, sampling parameters such as temperature, etc. (#12622)
- Several request time metrics (`vllm:time_in_queue_requests`, `vllm:model_forward_time_milliseconds`, `vllm:model_execute_time_milliseconds`) have been deprecated and are subject to removal (#14135)
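A minimal sketch of pinning the seed explicitly under the new default (model and prompt are placeholders):

```python
# Sketch: seed now defaults to None, so pass it explicitly when you need
# reproducible sampling. Both engine-level and per-request seeds are shown.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", seed=42)        # engine-level seed
params = SamplingParams(temperature=0.8, seed=1234)  # per-request sampling seed
print(llm.generate("Tell me a fact about GPUs.", params)[0].outputs[0].text)
```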
Updates
- Update to PyTorch 2.6.0 (#12721, #13860)
- Update to Python 3.9 typing (#14492, #13971)
- Update to CUDA 12.4 as default for release and nightly wheels (#12098)
- Update to Ray 2.43 (#13994)
- Upgrade aiohttp to include CVE fix (#14840)
- Upgrade jinja2 to get 3 moderate CVE fixes (#14839)
Features
Frontend API
- API Server
- Support `return_tokens_as_token_id` as a request param (#14066)
- Support image embedding as input (#13955)
- New /load endpoint for load statistics (#13950)
- New API endpoint `/is_sleeping` (#14312)
- Enables /score endpoint for embedding models (#12846)
- Enable streaming for Transcription API (#13301)
- Make model param optional in request (#13568)
- Support SSL Key Rotation in HTTP Server (#13495)
- Support
- Reasoning
- CLI
- Make LLM API compatible for torchrun launcher (#13642)
Disaggregated Serving
- Support KV cache offloading and disagg prefill with LMCache connector (#12953)
- Support chunked prefill for LMCache connector (#14505)
LoRA
- Add LoRA support for TransformersModel (#13770)
- Make the device profiler include LoRA memory. (#14469)
- Gemma3ForConditionalGeneration supports LoRA (#14797)
- Retire SGMV and BGMV Kernels (#14685)
VLM
- Generalized prompt updates for multi-modal processor (#13964)
- Deprecate legacy input mapper for OOT multimodal models (#13979)
- Refer code examples for common cases in dev multimodal processor (#14278)
Quantization
- BaiChuan SupportsQuant (#13710)
- BartModel SupportsQuant (#14699)
- Bamba SupportsQuant (#14698)
- Deepseek GGUF support (#13167)
- GGUF MoE kernel (#14613)
- Add GPTQAllSpark Quantization (#12931)
- Better performance of gptq marlin kernel when n is small (#14138)
Structured Output
- xgrammar: Expand list of unsupported jsonschema keywords (#13783)
Hardware Support
AMD
- Faster Custom Paged Attention kernels (#12348)
- Improved performance for V1 Triton (ROCm) backend (#14152)
- Chunked prefill/paged attention in MLA on ROCm (#14316)
- Perf improvement for DSv3 on AMD GPUs (#13718)
- MoE fp8 block quant tuning support (#14068)
TPU
- Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
- Support start_profile/stop_profile in TPU worker (#13988)
- Add TPU v1 test (#14834)
- TPU multimodal model support for ragged attention (#14158)
- Add tensor parallel support via Ray (#13618)
- Enable prefix caching by default (#14773)
Neuron
- Add Neuron device communicator for vLLM v1 (#14085)
- Add custom_ops for neuron backend (#13246)
- Add reshape_and_cache (#14391)
- Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)
CPU
s390x
- Adding cpu inference with VXE ISA for s390x architecture (#12613)
- Add documentation for s390x cpu implementation (#14198)
Plugins
Bugfix and Enhancements
- Illegal memory access for MoE On H20 (#13693)
- Fix FP16 overflow for DeepSeek V2 (#13232)
- Illegal Memory Access in the blockwise cutlass fp8 GEMMs (#14396)
- Pass all driver env vars to ray workers unless excluded (#14099)
- Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
- Capture and log the time of loading weights (#13666)
Developer Tooling
Benchmarks
CI and Build
Documentation
- Add RLHF document (#14482)
- Add nsight guide to profiling docs (#14298)
- Add K8s deployment guide (#14084)
- Add developer documentation for `torch.compile` integration (#14437)
What's Changed
- Update `pre-commit`'s `isort` version to remove warnings by @hmellor in #13614
- [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
- fix neuron performance issue by @ajayvohra2005 in #13589
- [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
- [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
- [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
- Add llmaz as another integration by @kerthcet in #13643
- [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
- [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
- Use pre-commit to update `requirements-test.txt` by @hmellor in #13617
- [Bugfix] Add `mm_processor_kwargs` to chat-related protocols by @ywang96 in #13644
- [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
- Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
- [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
- [ci] Fix metrics test model path by @khluu in #13635
- [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
- [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
- fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in https://...
v0.8.0rc2
What's Changed
- [V1] Remove input cache client by @DarkLight1337 in #14864
- [Misc][XPU] Use None as device capacity for XPU by @yma11 in #14932
- [Doc] Add vLLM Beijing meetup slide by @heheda12345 in #14938
- setup.py: drop assumption about local `main` branch by @russellb in #14692
- [MISC] More AMD unused var clean up by @houseroad in #14926
- fix minor miscalled method by @kushanam in #14327
- [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. by @vanbasten23 in #14846
- [Bugfix] Fix Ultravox on V1 by @DarkLight1337 in #14929
- [Misc] Add `--seed` option to offline multi-modal examples by @DarkLight1337 in #14934
- [Bugfix][ROCm] running new process using spawn method for rocm in tests. by @vllmellm in #14810
- [Doc] Fix misleading log during multi-modal profiling by @DarkLight1337 in #14955
- Add patch merger by @patrickvonplaten in #14957
- [V1] Default MLA to V1 by @simon-mo in #14921
- [Bugfix] Fix precommit - line too long in pixtral.py by @tlrmchlsmth in #14960
- [Bugfix][Model] Mixtral: use unused head_dim config argument by @qtrrb in #14961
- [Fix][Structured Output] using vocab_size to construct matcher by @aarnphm in #14868
- [Bugfix] Make Gemma3 MM V0 only for now by @ywang96 in #14971
New Contributors
Full Changelog: v0.8.0rc1...v0.8.0rc2
v0.8.0rc1
Note: vLLM no longer sets the global seed (#14274). Please set the `seed` parameter if you need to reproduce your results.
What's Changed
- Update `pre-commit`'s `isort` version to remove warnings by @hmellor in #13614
- [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
- fix neuron performance issue by @ajayvohra2005 in #13589
- [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
- [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
- [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
- Add llmaz as another integration by @kerthcet in #13643
- [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
- [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
- Use pre-commit to update `requirements-test.txt` by @hmellor in #13617
- [Bugfix] Add `mm_processor_kwargs` to chat-related protocols by @ywang96 in #13644
- [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
- Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
- [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
- [ci] Fix metrics test model path by @khluu in #13635
- [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
- [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
- fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in #13668
- [Attention] MLA with chunked prefill by @LucasWilkinson in #12639
- [Misc] Fix yapf linting tools etc not running on pre-commit by @Isotr0py in #13695
- docs: Add a note on full CI run in contributing guide by @terrytangyuan in #13646
- [HTTP Server] Make model param optional in request by @youngkent in #13568
- [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… by @WangErXiao in #13672
- [Misc] Capture and log the time of loading weights by @waltforme in #13666
- [ROCM] fix native attention function call by @gongdao123 in #13650
- [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA by @2015aroras in #13687
- [Misc] Bump compressed-tensors by @dsikka in #13619
- [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len by @WangErXiao in #13691
- [v1] Support allowed_token_ids in v1 Sampler by @houseroad in #13210
- [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler by @JenZhao in #13594
- Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size by @fabianlim in #13660
- [V1][Metrics] Support `vllm:cache_config_info` by @markmc in #13299
- [Metrics] Add `--show-hidden-metrics-for-version` CLI arg by @markmc in #13295
- [Misc] Reduce LoRA-related static variable by @jeejeelee in #13166
- [CI/Build] Fix pre-commit errors by @DarkLight1337 in #13696
- [core] set up data parallel communication by @youkaichao in #13591
- [ci] fix linter by @youkaichao in #13701
- Support SSL Key Rotation in HTTP Server by @youngkent in #13495
- [NVIDIA] Support nvfp4 cutlass gemm by @kaixih in #13571
- [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths by @SageMoore in #13095
- [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm by @gshtras in #13231
- [Doc] Dockerfile instructions for optional dependencies and dev transformers by @DarkLight1337 in #13699
- [Bugfix] Fix boolean conversion for OpenVINO env variable by @helena-intel in #13615
- [XPU]fix setuptools version for xpu by @yma11 in #13548
- [CI/Build] fix uv caching in Dockerfile by @dtrifiro in #13611
- [CI/Build] Fix pre-commit errors from #13571 by @ywang96 in #13709
- [BugFix] Minor: logger import in attention backend by @andylolu2 in #13706
- [ci] Use env var to control whether to use S3 bucket in CI by @khluu in #13634
- [Quant] BaiChuan SupportsQuant by @kylesayrs in #13710
- [LMM] Implement merged multimodal processor for whisper by @Isotr0py in #13278
- [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms by @njhill in #13688
- [Misc] Deprecate `--dataset` from `benchmark_serving.py` by @ywang96 in #13708
- [v1] torchrun compatibility by @youkaichao in #13642
- [V1][BugFix] Fix engine core client shutdown hangs by @njhill in #13298
- Fix some issues with benchmark data output by @huydhn in #13641
- [ci] Add logic to change model to S3 path only when S3 CI env var is on by @khluu in #13727
- [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13721
- [model][refactor] remove cuda hard code in models and layers by @MengqingCao in #13658
- [Bugfix] fix(logging): add missing opening square bracket by @bufferoverflow in #13011
- [CI/Build] add python-json-logger to requirements-common by @bufferoverflow in #12842
- Expert Parallelism (EP) Support for DeepSeek Models by @cakeng in #12583
- [BugFix] Illegal memory access for MoE On H20 by @Abatom in #13693
- [Misc][Docs] Raise error when flashinfer is not installed and `VLLM_ATTENTION_BACKEND` is set by @NickLucche in #12513
- [V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) by @afeldman-nm in #10980
- Revert "[V1][Core] Fix memory issue with logits & sampling" by @ywang96 in #13775
- Fix precommit fail in fused_moe intermediate_cache2 chunking by @mgoin in #13772
- [Misc] Clean Up `EngineArgs.create_engine_config` by @robertgshaw2-redhat in #13734
- [Misc][Chore] Clean Up `AsyncOutputProcessing` Logs by @robertgshaw2-redhat in #13780
- Remove unused kwargs from model definitions by @hmellor in #13555
- [Doc] arg_utils.py: fixed a typo by @eli-b in #13785
- [Misc] set single whitespace between log sentences by @cjackal in #13771
- [Bugfix][Quantization] Fix FP8 + EP by @tlrmchlsmth in #13784
- [Misc][Attention][Quantization] init property earlier by @wangxiyuan in #13733
- [V1][Metrics] Implement vllm:lora_requests_info metric by @markmc in #13504
- [Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" by @LucasWilkinson in #13802
- [Bugfix] Support MLA for CompressedTensorsWNA16 by @mgoin in #13725
- Fix CompressedTensorsWNA16MoE with grouped scales by @mgoin in #13769
- [Core] LoRA V1 - Add add/pin/list/remove_lora functions by @varun-sundar-rabindranath in #13705
- [Misc] Check that the model can be inspected upon registration by @DarkLight1337 in #13743
- [Core] xgrammar: Expand list of unsupported jsonschema keywords by @russellb in #13783
- [Bugf...
v0.7.3
Highlights
🎉 253 commits from 93 contributors, including 29 new contributors!
- Deepseek enhancements:
- Support for DeepSeek Multi-Token Prediction, 1.69x speedup in low QPS scenarios (#12755)
- AMD support: DeepSeek tunings, yielding 17% latency reduction (#13199)
- Using FlashAttention3 for MLA (#12807)
- Align the expert selection code path with official implementation (#13474)
- Optimize moe_align_block_size for deepseek_v3 (#12850)
- Expand MLA to support most types of quantization (#13181)
- V1 Engine:
- LoRA Support (#10957, #12883)
- Logprobs and prompt logprobs support (#9880), min_p sampling support (#13191), logit_bias in v1 Sampler (#13079)
- Use msgpack for core request serialization (#12918)
- Pipeline parallelism support (#12996, #13353, #13472, #13417, #13315)
- Metrics enhancements: GPU prefix cache hit rate % gauge (#12592), iteration_tokens_total histogram (#13288), several request timing histograms (#12644)
- Initial speculative decoding support with ngrams (#12193, #13365)
Model Support
- Enhancement to Qwen2.5-VL: BNB support (#12944), LoRA (#13261), Optimizations (#13155)
- Support GPTQModel Dynamic [2,3,4,8]bit GPTQ quantization (#7086)
- Support Unsloth Dynamic 4bit BnB quantization (#12974)
- IBM/NASA Prithvi Geospatial model (#12830)
- Support Mamba2 (Codestral Mamba) (#9292), Bamba Model (#10909)
- Ultravox Model: Support v0.5 Release (#12912)
- `transformers` backend
- VLM:
Hardware Support
- Pluggable platform-specific scheduler (#13161)
- NVIDIA: Support nvfp4 quantization (#12784)
- AMD:
- TPU: V1 Support (#13049)
- Neuron: Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
- Gaudi:
Engine Feature
- Add sleep and wake up endpoint and v1 support (#12987)
- Add `/v1/audio/transcriptions` OpenAI API endpoint (#12909)
Performance
Others
- Make vLLM compatible with veRL (#12824)
- Fixes for cases of FA2 illegal memory access error (#12848)
- choice-based structured output with xgrammar (#12632)
- Run v1 benchmark and integrate with PyTorch OSS benchmark database (#13068)
What's Changed
- [Misc] Update w2 scale loading for GPTQMarlinMoE by @dsikka in #12757
- [Docs] Add Google Cloud Slides by @simon-mo in #12814
- [Attention] Use FA3 for MLA on Hopper by @LucasWilkinson in #12807
- [misc] Reduce number of config file requests to HuggingFace by @khluu in #12797
- [Misc] Remove unnecessary decode call by @DarkLight1337 in #12833
- [Kernel] Make rotary_embedding ops more flexible with input shape by @Isotr0py in #12777
- [torch.compile] PyTorch 2.6 and nightly compatibility by @youkaichao in #12393
- [Doc] double quote cmake package in build.inc.md by @jitseklomp in #12840
- [Bugfix] Fix unsupported FA version check for Turing GPU by @Isotr0py in #12828
- [V1] LoRA Support by @varun-sundar-rabindranath in #10957
- Add Bamba Model by @fabianlim in #10909
- [MISC] Check space in the file names in the pre commit checks by @houseroad in #12804
- [misc] Revert # 12833 by @khluu in #12857
- [Bugfix] FA2 illegal memory access by @LucasWilkinson in #12848
- Make vllm compatible with verl by @ZSL98 in #12824
- [Bugfix] Missing quant_config in deepseek embedding layer by @SzymonOzog in #12836
- Prevent unnecessary requests to huggingface hub by @maxdebayser in #12837
- [MISC][EASY] Break check file names into entry and args in the pre-commit hooks by @houseroad in #12880
- [Misc] Remove unnecessary detokenization in multimodal processing by @DarkLight1337 in #12868
- [Model] Add support for partial rotary embeddings in Phi3 model by @garg-amit in #12718
- [V1] Logprobs and prompt logprobs support by @afeldman-nm in #9880
- [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing by @tjtanaa in #12501
- [V1] LM Eval With Streaming Integration Tests by @robertgshaw2-redhat in #11590
- [Bugfix] Fix disagg hang caused by the prefill and decode communication issues by @houseroad in #12723
- [V1][Minor] Remove outdated comment by @WoosukKwon in #12928
- [V1] Move KV block hashes from Request to KVCacheManager by @WoosukKwon in #12922
- [Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping by @jeejeelee in #12905
- [Misc] Fix typo in the example file by @DK-DARKmatter in #12896
- [Bugfix] Fix multi-round chat error when mistral tokenizer is used by @zifeitong in #12859
- [bugfix] respect distributed_executor_backend in world_size=1 by @youkaichao in #12934
- [Misc] Add offline test for disaggregated prefill by @Shaoting-Feng in #12418
- [V1][Minor] Move cascade attn logic outside _prepare_inputs by @WoosukKwon in #12943
- [Build] Make pypi install work on CPU platform by @wangxiyuan in #12874
- [Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi by @sanjucsudhakaran in #12812
- [misc] Add LoRA to benchmark_serving by @varun-sundar-rabindranath in #12898
- [Misc] Log time consumption on weight downloading by @waltforme in #12926
- [CI] Resolve transformers-neuronx version conflict by @liangfu in #12925
- [Doc] Correct HF repository for TeleChat2 models by @waltforme in #12949
- [Misc] Add qwen2.5-vl BNB support by @Isotr0py in #12944
- [CI/Build] Auto-fix Markdown files by @DarkLight1337 in #12941
- [Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU by @ShangmingCai in #12935
- [bugfix] fix early import of flash attention by @youkaichao in #12959
- [VLM] Merged multi-modal processor for GLM4V by @jeejeelee in #12449
- [V1][Minor] Remove outdated comment by @WoosukKwon in #12968
- [RFC] [Mistral] FP8 format by @patrickvonplaten in #10130
- [V1] Cache `uses_mrope` in GPUModelRunner by @WoosukKwon in #12969
- [core] port pynvml into vllm codebase by @youkaichao in #12963
- [MISC] Always import version library first in the vllm package by @houseroad in #12979
- [core] improve error handling when wake up from sleep mode by @youkaichao in #12981
- [core][rlhf] add colocate example for RLHF by @youkaichao in #12984
- [V1] Use msgpack for core request serialization by @njhill in #12918
- [Bugfix][Platform] Check whether selected backend is None in get_attn_backend_cls() by @terrytangyuan in #12975
- [core] fix sleep mode and pytorch checkpoint compatibility by @youkaichao in #13001
- [Doc] Add link to tool_choice tracking issue in tool_calling.md by @terrytangyuan in #13003
- [misc] Add retries with exponential backoff for HF file existence check by @khluu in #13008
- [Bugfix] Clean up and fix multi-modal processors by @DarkLight1337 in #13012
- Fix seed parameter behavior in vLLM by @SmartManoj in #13007
- [Model] Ultravox Model: Support v0.5 Release by @farzadab in #12912
- [misc] Fix setup.py condition to avoid AMD from being mistaken with CPU by @khluu in https://github.com/vllm-proje...
v0.7.2
Highlights
- Qwen2.5-VL is now supported in vLLM. Please note that it requires a source installation of the Hugging Face `transformers` library at the moment (#12604)
- Add `transformers` backend support via `--model-impl=transformers`. This allows vLLM to run arbitrary Hugging Face text models (#11330, #12785, #12727); see the sketch after this list.
- Performance enhancements to DeepSeek models.
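A hedged sketch of the `transformers` fallback from Python (the checkpoint id is a placeholder, and `model_impl` is assumed to be the Python-side counterpart of the `--model-impl` CLI flag):

```python
# Sketch: force the transformers backend for a Hugging Face text model that
# has no native vLLM implementation. The model id is a placeholder, and
# model_impl is assumed to mirror --model-impl=transformers.
from vllm import LLM

llm = LLM(model="my-org/my-custom-hf-model", model_impl="transformers")
print(llm.generate("The transformers backend lets vLLM run")[0].outputs[0].text)
```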
Core Engine
- Use `VLLM_LOGITS_PROCESSOR_THREADS` to speed up structured decoding in high batch size scenarios (#12368)
Security Update
- Improve hash collision avoidance in prefix caching (#12621)
- Add SPDX-License-Identifier headers to python source files (#12628)
Other
- Enable FusedSDPA support for Intel Gaudi (HPU) (#12359)
What's Changed
- Apply torch.compile to fused_moe/grouped_topk by @mgoin in #12637
- doc: fixing minor typo in readme.md by @vicenteherrera in #12643
- [Bugfix] fix moe_wna16 get_quant_method by @jinzhen-lin in #12648
- [Core] Silence unnecessary deprecation warnings by @russellb in #12620
- [V1][Minor] Avoid frequently creating ConstantList by @WoosukKwon in #12653
- [Core][v1] Unify allocating slots in prefill and decode in KV cache manager by @ShawnD200 in #12608
- [Hardware][Intel GPU] add XPU bf16 support by @jikunshang in #12392
- [Misc] Add SPDX-License-Identifier headers to python source files by @russellb in #12628
- [doc][misc] clarify VLLM_HOST_IP for multi-node inference by @youkaichao in #12667
- [Doc] Deprecate Discord by @zhuohan123 in #12668
- [Kernel] port sgl moe_align_block_size kernels by @chenyang78 in #12574
- make sure mistral_common not imported for non-mistral models by @youkaichao in #12669
- Properly check if all fused layers are in the list of targets by @eldarkurtic in #12666
- Fix for attention layers to remain unquantized during moe_wn16 quant by @srikanthsrnvs in #12570
- [cuda] manually import the correct pynvml module by @youkaichao in #12679
- [ci/build] fix gh200 test by @youkaichao in #12681
- [Model]: Add `transformers` backend support by @ArthurZucker in #11330
- [Misc] Fix improper placement of SPDX header in scripts by @russellb in #12694
- [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm by @tlrmchlsmth in #12696
- Squelch MLA warning for Compressed-Tensors Models by @kylesayrs in #12704
- [Model] Add Deepseek V3 fp8_w8a8 configs for B200 by @kushanam in #12707
- [MISC] Remove model input dumping when exception by @comaniac in #12582
- [V1] Revert `uncache_blocks` and support recaching full blocks by @comaniac in #12415
- [Core] Improve hash collision avoidance in prefix caching by @russellb in #12621
- Support Pixtral-Large HF by using llava multimodal_projector_bias config by @mgoin in #12710
- [Doc] Replace ibm-fms with ibm-ai-platform by @tdoublep in #12709
- [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs by @kylesayrs in #12711
- [AMD][ROCm] Enable DeepSeek model on ROCm by @hongxiayang in #12662
- [Misc] Add BNB quantization for Whisper by @jeejeelee in #12381
- [VLM] Merged multi-modal processor for InternVL-based models by @DarkLight1337 in #12553
- [V1] Remove constraints on partial requests by @WoosukKwon in #12674
- [VLM] Implement merged multimodal processor and V1 support for idefics3 by @Isotr0py in #12660
- [Model] [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small by @mgtk77 in #12689
- Avoid unnecessary multi-modal input data copy when len(batch) == 1 by @imkero in #12722
- [Build] update requirements of no-device for plugin usage by @sducouedic in #12630
- [Bugfix] Fix CI failures for InternVL and Mantis models by @DarkLight1337 in #12728
- [V1][Metrics] Add request_success_total counter, labelled with finish reason by @markmc in #12579
- [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) by @LucasWilkinson in #12676
- [Core] add and implement `VLLM_LOGITS_PROCESSOR_THREADS` by @akeshet in #12368
- [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling by @maleksan85 in #12713
- Refactor `Linear` handling in `TransformersModel` by @hmellor in #12727
- [VLM] Add MLA with pure RoPE support for deepseek-vl2 models by @Isotr0py in #12729
- [Misc] Bump the compressed-tensors version by @dsikka in #12736
- [Model][Quant] Fix GLM, Fix fused module mappings for quantization by @kylesayrs in #12634
- [Doc] Update PR Reminder with link to Developer Slack by @mgoin in #12748
- [Bugfix] Fix OpenVINO model runner by @hmellor in #12750
- [V1][Misc] Shorten `FinishReason` enum and use constant strings by @njhill in #12760
- [Doc] Remove performance warning for auto_awq.md by @mgoin in #12743
- [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 by @Akashcodes732 in #12546
- [core][distributed] exact ray placement control by @youkaichao in #12732
- [Kernel] Use self.kv_cache and forward_context.attn_metadata in Attention.forward by @heheda12345 in #12536
- [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) by @sanjucsudhakaran in #12359
- Add: Support for Sparse24Bitmask Compressed Models by @rahul-tuli in #12097
- [VLM] Use shared field to pass token ids to model by @DarkLight1337 in #12767
- [Docs] Drop duplicate [source] links by @russellb in #12780
- [VLM] Qwen2.5-VL by @ywang96 in #12604
- [VLM] Update compatibility with transformers 4.49 by @DarkLight1337 in #12781
- Quantization and MoE configs for GH200 machines by @arvindsun in #12717
- [ROCm][Kernel] Using the correct warp_size value by @gshtras in #12789
- [Bugfix] Better FP8 supported defaults by @LucasWilkinson in #12796
- [Misc][Easy] Remove the space from the file name by @houseroad in #12799
- [Model] LoRA Support for Ultravox model by @thedebugger in #11253
- [Bugfix] Fix the test_ultravox.py's license by @houseroad in #12806
- Improve `TransformersModel` UX by @hmellor in #12785
- [Misc] Remove duplicated DeepSeek V2/V3 model definition by @mgoin in #12793
- [Misc] Improve error message for incorrect pynvml by @youkaichao in #12809
New Contributors
- @vicenteherrera made their first contribution in #12643
- @chenyang78 made their first contribution in #12574
- @srikanthsrnvs made their first contribution in #12570
- @ArthurZucker made their first contribution in #11330
- @mgtk77 made their first contribution in #12689
- @sducouedic made their first contribution in #12630
- @akeshet made their first contribution in #12368
- @arvindsun made their first contribution in #12717
- @thedebugger made their first contribution in ht...
v0.7.1
Highlights
This release features MLA optimization for the DeepSeek family of models. Compared to v0.7.0 released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
V1
For the V1 architecture, we:
- Added a new design document for zero-overhead prefix caching (#12598)
- Added metrics and enhanced logging for the V1 engine (#12569, #12561, #12416, #12516, #12530, #12478)
Models
- New Model: MiniCPM-o (text outputs only) (#12069)
Hardware
- Neuron: NKI-based flash-attention kernel with paged KV cache (#11277)
- AMD: llama 3.2 support upstreaming (#12421)
Others
- Support override generation config in engine arguments (#12409)
- Support reasoning content in API for deepseek R1 (#12473)
What's Changed
- [Bugfix] Fix missing seq_start_loc in xformers prefill metadata by @Isotr0py in #12464
- [V1][Minor] Minor optimizations for update_from_output by @WoosukKwon in #12454
- [Bugfix] Fix gpt2 GGUF inference by @Isotr0py in #12467
- [Build] Only build 9.0a for scaled_mm and sparse kernels by @LucasWilkinson in #12339
- [V1][Metrics] Add initial Prometheus logger by @markmc in #12416
- [V1][CI/Test] Do basic test for top-p & top-k sampling by @WoosukKwon in #12469
- [FlashInfer] Upgrade to 0.2.0 by @abmfy in #11194
- [Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logprobs` with ChunkedPrefill by @NickLucche in #10132
- Update `pre-commit` hooks by @hmellor in #12475
- [Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache by @liangfu in #11277
- Fix bad path in prometheus example by @mgoin in #12481
- [CI/Build] Fixed the xla nightly issue report in #12451 by @hosseinsarshar in #12453
- [FEATURE] Enables offline /score for embedding models by @gmarinho2 in #12021
- [CI] fix pre-commit error by @MengqingCao in #12494
- Update README.md with V1 alpha release by @ywang96 in #12495
- [V1] Include Engine Version in Logs by @robertgshaw2-redhat in #12496
- [Core] Make raw_request optional in ServingCompletion by @schoennenbeck in #12503
- [VLM] Merged multi-modal processor and V1 support for Qwen-VL by @DarkLight1337 in #12504
- [Doc] Fix typo for x86 CPU installation by @waltforme in #12514
- [V1][Metrics] Hook up IterationStats for Prometheus metrics by @markmc in #12478
- Replace missed warning_once for rerank API by @mgoin in #12472
- Do not run `suggestion` `pre-commit` hook multiple times by @hmellor in #12521
- [V1][Metrics] Add per-request prompt/generation_tokens histograms by @markmc in #12516
- [Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels by @fenghuizhang in #12482
- [TPU] Add example for profiling TPU inference by @mgoin in #12531
- [Frontend] Support reasoning content for deepseek r1 by @gaocegege in #12473
- [Doc] Convert docs to use colon fences by @hmellor in #12471
- [V1][Metrics] Add TTFT and TPOT histograms by @markmc in #12530
- Bugfix for whisper quantization due to fake k_proj bias by @mgoin in #12524
- [V1] Improve Error Message for Unsupported Config by @robertgshaw2-redhat in #12535
- Fix the pydantic logging validator by @maxdebayser in #12420
- [Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense by @tjohnson31415 in #12347
- [Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM by @HwwwwwwwH in #12069
- [Frontend] Support override generation config in args by @liuyanyi in #12409
- [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. by @pavanimajety in #11787
- [Kernel] add triton fused moe kernel for gptq/awq by @jinzhen-lin in #12185
- Revert "[Build/CI] Fix libcuda.so linkage" by @tlrmchlsmth in #12552
- [V1][BugFix] Free encoder cache for aborted requests by @WoosukKwon in #12545
- [Misc][MoE] add Deepseek-V3 moe tuning support by @divakar-amd in #12558
- [V1][Metrics] Add GPU cache usage % gauge by @markmc in #12561
- Set `?device={device}` when changing tab in installation guides by @hmellor in #12560
- [Misc] fix typo: add missing space in lora adapter error message by @Beim in #12564
- [Kernel] Triton Configs for Fp8 Block Quantization by @robertgshaw2-redhat in #11589
- [CPU][PPC] Updated torch, torchvision, torchaudio dependencies by @npanpaliya in #12555
- [V1][Log] Add max request concurrency log to V1 by @mgoin in #12569
- [Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling by @LucasWilkinson in #11868
- [ROCm][AMD][Model] llama 3.2 support upstreaming by @maleksan85 in #12421
- [Attention] MLA decode optimizations by @LucasWilkinson in #12528
- [Bugfix] Gracefully handle huggingface hub http error by @ywang96 in #12571
- Add favicon to docs by @hmellor in #12611
- [BugFix] Fix Torch.Compile For DeepSeek by @robertgshaw2-redhat in #12594
- [Git] Automatically sign-off commits by @comaniac in #12595
- [Docs][V1] Prefix caching design by @comaniac in #12598
- [v1][Bugfix] Add extra_keys to block_hash for prefix caching by @heheda12345 in #12603
- [release] Add input step to ask for Release version by @khluu in #12631
- [Bugfix] Revert MoE Triton Config Default by @robertgshaw2-redhat in #12629
- [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 by @tlrmchlsmth in #12587
- [Feature] Fix guided decoding blocking bitmask memcpy by @xpbowler in #12563
- [Doc] Improve installation signposting by @hmellor in #12575
- [Doc] int4 w4a16 example by @brian-dellabetta in #12585
- [V1] Bugfix: Validate Model Input Length by @robertgshaw2-redhat in #12600
- [BugFix] fix wrong output when using lora and num_scheduler_steps=8 by @sleepwalker2017 in #11161
- Fix target matching for fused layers with compressed-tensors by @eldarkurtic in #12617
- [ci] Upgrade transformers to 4.48.2 in CI dependencies by @khluu in #12599
- [Bugfix/CI] Fixup benchmark_moe.py by @tlrmchlsmth in #12562
- Fix: Respect `sparsity_config.ignore` in Cutlass Integration by @rahul-tuli in #12517
- [Attention] Deepseek v3 MLA support with FP8 compute by @LucasWilkinson in #12601
- [CI/Build] Add label automation for structured-output, speculative-decoding, v1 by @russellb in #12280
- Disable chunked prefill and/or prefix caching when MLA is enabled by @simon-mo in #12642
New Contributors
- @abmfy made their first contribution in #11194
- @hosseinsarshar made their first contribution in #12453
- @gmarinho2 made their first contribution in #12021
- @waltforme made their first contribution in #12514
- @fenghuizhang made their first contribution in #12482
- @gaocegege made their first contribution in #12473
- @Beim made their first contribution in https://github.com/vllm-pro...
v0.7.0
Highlights
- vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting the environment variable `VLLM_USE_V1=1`. See our blog for more details. (44 commits)
- New methods (`LLM.sleep`, `LLM.wake_up`, `LLM.collective_rpc`, `LLM.reset_prefix_cache`) in vLLM for post-training frameworks! (#12361, #12084, #12284); see the sketch below.
- `torch.compile` is now fully integrated in vLLM and enabled by default in V1. You can turn it on via the `-O3` engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246)
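A hedged sketch of the sleep/wake-up flow for post-training loops (placeholder model; assumes the `enable_sleep_mode` engine flag is required for these calls):

```python
# Sketch: free accelerator memory between rollout phases and invalidate the
# prefix cache after a weight update. The model id is a placeholder, and
# enable_sleep_mode is assumed to be needed for LLM.sleep()/wake_up().
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)

llm.generate("rollout prompt")   # normal inference
llm.sleep(level=1)               # release memory while the trainer runs
# ... training framework updates weights here ...
llm.wake_up()                    # restore engine state
llm.reset_prefix_cache()         # drop cached prefixes computed with old weights
```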
This release features
- 400 commits from 132 contributors, including 57 new contributors.
- 28 CI and build enhancements, including testing for nightly torch (#12270) and inclusion of genai-perf for benchmark (#10704).
- 58 documentation enhancements, including reorganized documentation structure (#11645, #11755, #11766, #11843, #11896).
- more than 161 bug fixes and miscellaneous enhancements
Features
Models
- New generative models: CogAgent (#11742), Deepseek-VL2 (#11578, #12068, #12169), fairseq2 Llama (#11442), InternLM3 (#12037), Whisper (#11280)
- New pooling models: Qwen2 PRM (#12202), InternLM2 reward models (#11571)
- VLM: Merged multi-modal processor is now ready for model developers! (#11620, #11900, #11682, #11717, #11669, #11396)
- Any model that implements the merged multi-modal processor and the `get_*_embeddings` methods according to this guide is automatically supported by the V1 engine.
Hardwares
- Apple: Native support for macOS Apple Silicon (#11696)
- AMD: MI300 FP8 format for block_quant (#12134), Tuned MoE configurations for multiple models (#12408, #12049), block size heuristic for avg 2.8x speedup for int8 models (#11698)
- TPU: support for `W8A8` (#11785)
- x86: Multi-LoRA (#11100) and MoE Support (#11831)
- Progress in out-of-tree hardware support (#12009, #11981, #11948, #11609, #12264, #11516, #11503, #11369, #11602)
Features
- Distributed:
- API Server: Jina- and Cohere-compatible Rerank API (#12376)
- Kernels:
Others
- Benchmark: new script for CPU offloading (#11533)
- Security: Set `weights_only=True` when using `torch.load()` (#12366)
What's Changed
- [Docs] Document Deepseek V3 support by @simon-mo in #11535
- Update openai_compatible_server.md by @robertgshaw2-redhat in #11536
- [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
- [V1] Fix yapf by @WoosukKwon in #11538
- [CI] Fix broken CI by @robertgshaw2-redhat in #11543
- [misc] fix typing by @youkaichao in #11540
- [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-redhat in #11534
- [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-redhat in #11547
- [Platform] Move model arch check to platform by @MengqingCao in #11503
- Update deploying_with_k8s.md with AMD ROCm GPU example by @AlexHe99 in #11465
- [Bugfix] Fix TeleChat2ForCausalLM weights mapper by @jeejeelee in #11546
- [Misc] Abstract out the logic for reading and writing media content by @DarkLight1337 in #11527
- [Doc] Add xgrammar in doc by @Chen-0210 in #11549
- [VLM] Support caching in merged multi-modal processor by @DarkLight1337 in #11396
- [MODEL] Update LoRA modules supported by Jamba by @ErezSC42 in #11209
- [Misc]Add BNB quantization for MolmoForCausalLM by @jeejeelee in #11551
- [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix by @Isotr0py in #11566
- [Bugfix] Fix for ROCM compressed tensor support by @selalipop in #11561
- [Doc] Update mllama example based on official doc by @heheda12345 in #11567
- [V1] [4/N] API Server: ZMQ/MP Utilities by @robertgshaw2-redhat in #11541
- [Bugfix] Last token measurement fix by @rajveerb in #11376
- [Model] Support InternLM2 Reward models by @Isotr0py in #11571
- [Model] Remove hardcoded image tokens ids from Pixtral by @ywang96 in #11582
- [Hardware][AMD]: Replace HIPCC version with more precise ROCm version by @hj-wei in #11515
- [V1][Minor] Set pin_memory=False for token_ids_cpu tensor by @WoosukKwon in #11581
- [Doc] Minor documentation fixes by @DarkLight1337 in #11580
- [bugfix] interleaving sliding window for cohere2 model by @youkaichao in #11583
- [V1] [5/N] API Server: unify `Detokenizer` and `EngineCore` input by @robertgshaw2-redhat in #11545
- [Doc] Convert list tables to MyST by @DarkLight1337 in #11594
- [v1][bugfix] fix cudagraph with inplace buffer assignment by @youkaichao in #11596
- [Misc] Use registry-based initialization for KV cache transfer connector. by @KuntaiDu in #11481
- Remove print statement in DeepseekScalingRotaryEmbedding by @mgoin in #11604
- [v1] fix compilation cache by @youkaichao in #11598
- [Docker] bump up neuron sdk v2.21 by @liangfu in #11593
- [Build][Kernel] Update CUTLASS to v3.6.0 by @tlrmchlsmth in #11607
- [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels by @bigPYJ1151 in #11618
- [platforms] enable platform plugins by @youkaichao in #11602
- [VLM] Abstract out multi-modal data parsing in merged processor by @DarkLight1337 in #11620
- [V1] [6/N] API Server: Better Shutdown by @robertgshaw2-redhat in #11586
- [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel by @whyiug in #11631
- [benchmark] Remove dependency for H100 benchmark step by @khluu in #11572
- [Model][LoRA]LoRA support added for MolmoForCausalLM by @ayylemao in #11439
- [Bugfix] Fix OpenAI parallel sampling when using xgrammar by @mgoin in #11637
- [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) by @JohnGiorgi in #6909
- [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. by @sakunkun in #11565
- [V1] Simpify vision block hash for prefix caching by removing offset from hash by @heheda12345 in #11646
- [V1][VLM] V1 support for selected single-image models. by @ywang96 in #11632
- [Benchmark] Add benchmark script for CPU offloading by @ApostaC in #11533
- [Bugfix][Refactor] Unify model management in frontend by @joerunde in #11660
- [VLM] Add max-count checking in data parser for single image models by @DarkLight1337 in #11661
- [Misc] Optimize Qwen2-VL LoRA test by @jeejeelee in #11663
- [Misc] Replace space with - in the file names by @houseroad in #11667
- [Doc] Fix typo by @serihiro in #11666
- [V1] Implement Cascade Attention by @WoosukKwon in #11635
- [VLM] Move supported limits and max tokens to merged multi-modal processor by @DarkLight1337 in #11669
- [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input by @DarkLight1337 in #11674
- [mypy] Pass type checking in vllm/inputs by @CloseChoice in #11680
- [VLM] Merged multi-modal processor for LLaVA-NeXT by @DarkLight1337 in #11682
- According to vllm.EngineArgs, the name should be distributed_executor_backend by @chunyang-wen in #11689
- [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. by @kathyyu-google in #10013
- [V1]...
v0.6.6.post1
This release restores functionality for other quantized MoEs, which was broken as part of the initial DeepSeek V3 support 🙇.
What's Changed
- [Docs] Document Deepseek V3 support by @simon-mo in #11535
- Update openai_compatible_server.md by @robertgshaw2-neuralmagic in #11536
- [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
- [V1] Fix yapf by @WoosukKwon in #11538
- [CI] Fix broken CI by @robertgshaw2-neuralmagic in #11543
- [misc] fix typing by @youkaichao in #11540
- [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-neuralmagic in #11534
- [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-neuralmagic in #11547
Full Changelog: v0.6.6...v0.6.6.post1