Releases: vllm-project/vllm
v0.8.2
This release contains an important bug fix for the V1 engine's memory usage. We highly recommend upgrading!
Highlights
- Revert "Use uv python for docker rather than ppa:deadsnakess/ppa (#13569)" (#15377)
- Remove openvino support in favor of external plugin (#15339)
V1 Engine
- Fix V1 Engine crash while handling requests with duplicate request id (#15043)
- Support FP8 KV Cache (#14570, #15191)
- Add flag to disable cascade attention (#15243)
- Scheduler Refactoring: Add Scheduler Interface (#15250)
- Structured Output
- Spec Decode
- AMD
- Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
- TPU
Features
- Integrate `fastsafetensors` loader for loading model weights (#10647)
- Add guidance backend for structured output (#14589)
Others
- Add Kubernetes deployment guide with CPUs (#14865)
- Support reset prefix cache by specified device (#15003)
- Support tool calling and reasoning parser (#14511)
- Support --disable-uvicorn-access-log parameters (#14754)
- Support Tele-FLM Model (#15023)
- Add pipeline parallel support to `TransformersModel` (#12832)
- Enable CUDA graph support for llama 3.2 vision (#14917)
What's Changed
- [FEAT]Support reset prefix cache by specified device by @maobaolong in #15003
- [BugFix][V1] Update stats.py by @WrRan in #15139
- [V1][TPU] Change kv cache shape. by @vanbasten23 in #15145
- [FrontEnd][Perf] `merge_async_iterators` fast-path for single-prompt requests by @njhill in #15150
- [Docs] Announce Ollama and Singapore Meetups by @simon-mo in #15161
- [V1] TPU - Tensor parallel MP support by @alexm-redhat in #15059
- [BugFix] Lazily import XgrammarBackend to avoid early cuda init by @njhill in #15171
- [Doc] Clarify run vllm only on one node in distributed inference by @ruisearch42 in #15148
- Fix broken tests by @jovsa in #14713
- [Bugfix] Fix embedding assignment for InternVL-based models by @DarkLight1337 in #15086
- fix "Total generated tokens:" is 0 if using --backend tgi and --endpo… by @sywangyi in #14673
- [V1][TPU] Support V1 Sampler for ragged attention by @NickLucche in #14227
- [Benchmark] Allow oversample request in benchmark dataset by @JenZhao in #15170
- [Core][V0] Add guidance backend for structured output by @russellb in #14589
- [Doc] Update Mistral Small 3.1/Pixtral example by @ywang96 in #15184
- [Misc] support --disable-uvicorn-access-log parameters by @chaunceyjiang in #14754
- [Attention] Flash Attention 3 - fp8 by @mickaelseznec in #14570
- [Doc] Update README.md by @DarkLight1337 in #15187
- Enable CUDA graph support for llama 3.2 vision by @mritterfigma in #14917
- typo: Update config.py by @WrRan in #15189
- [Frontend][Bugfix] support prefill decode disaggregation on deepseek by @billishyahao in #14824
- [release] Tag vllm-cpu with latest upon new version released by @khluu in #15193
- Fixing Imprecise Type Annotations by @WrRan in #15192
- [macOS] Upgrade pytorch to 2.6.0 by @linktohack in #15129
- [Bugfix] Multi-video inference on LLaVA-Onevision by @DarkLight1337 in #15082
- Add user forum to README by @hmellor in #15220
- Fix env vars for running Ray distributed backend on GKE by @richardsliu in #15166
- Replace `misc` issues with link to forum by @hmellor in #15226
- [ci] feat: make the test_torchrun_example run with tp=2, external_dp=2 by @vermouth1992 in #15172
- [Bugfix] fix V1 Engine crash while handling requests with duplicate request id by @JasonJ2021 in #15043
- [V1] Add flag to disable cascade attention by @WoosukKwon in #15243
- Enforce that TP > 1 is not supported for Mamba2 if Quantization is Enabled. by @fabianlim in #14617
- [V1] Scheduler Refactoring [1/N] - Add Scheduler Interface by @WoosukKwon in #15250
- [CI/Build] LoRA : make add_lora_test safer by @varun-sundar-rabindranath in #15181
- Fix CUDA kernel index data type in vllm/csrc/quantization/fused_kernels/layernorm_utils.cuh +10 by @houseroad in #15159
- [Misc] Clean up the BitsAndBytes arguments by @jeejeelee in #15140
- [ROCM] Upgrade torch to 2.6 by @SageMoore in #15244
- [Bugfix] Fix incorrect qwen2.5-vl attention mask pre-computation by @Isotr0py in #15200
- Mention `extra_body` as a way to pass vLLM only parameters using the OpenAI client by @hmellor in #15240
- [V1][TPU] Speed up top-k on TPU by using torch.topk by @hyeygit in #15242
- [Bugfix] detect alibi and revert to FA2 by @tjohnson31415 in #15231
- [Model] RE: Mamba2 Prefill Performance Tweaks: Fixing Flurry of Unnecessary Memory Copies by @cyang49 in #14857
- [Docs] Trim the latest news in README by @WoosukKwon in #15261
- [Misc] Better RayExecutor and multiprocessing compatibility by @comaniac in #14705
- Add an example for reproducibility by @WoosukKwon in #15262
- [Hardware][TPU] Add check for no additional graph compilation during runtime by @lsy323 in #14710
- [V1] Enable Triton(ROCm) Attention backend for Nvidia GPUs by @Isotr0py in #14071
- [Doc] Update LWS docs by @Edwinhr716 in #15163
- [V1] Avoid redundant input processing in n>1 case by @njhill in #14985
- [Feature] specify model in config.yaml by @wayzeng in #14855
- [Bugfix] Add int8 torch dtype for KVCache by @shen-shanshan in #15260
- [Misc] Add attention mask pre-computation optimization back to Qwen2.5-VL by @Isotr0py in #15273
- [Bugfix] Fix incorrect resolving order for transformers fallback by @Isotr0py in #15279
- [V1] Fix wrong import path of get_flash_attn_version by @lhtin in #15280
- [Bugfix] Fix broken kernel test due to missing rename for v1 Triton backend by @Isotr0py in #15282
- [Misc] Add cProfile helpers by @russellb in #15074
- [v1] Refactor KVCacheConfig by @heheda12345 in #14079
- [Bugfix][VLM] fix llava processor by @MengqingCao in #15285
- Revert "[Feature] specify model in config.yaml (#14855)" by @DarkLight1337 in #15293
- [TPU][V1] MHA Pallas backend by @NickLucche in #15288
- [Build/CI] Fix env var typo by @russellb in #15305
- [Misc] Increase RayDistributedExecutor RAY_CGRAPH_get_timeout by @ruisearch42 in #15301
- [Bugfix][V0] Multi-sequence logprobs streaming edge case by @andylolu2 in #15259
- [FEAT] [ROCm]: Add AITER RMS Norm (Layer Norm) Feature by @tjtanaa in #14959
- [Doc] add load_format items in docs by @wwl2755 in #14804
- [Bugfix] Fix torch.compile raise FileNotFoundError by @jeejeelee in #15278
- [Bugfix] LoRA V0 - Fix case where `max_num_seqs` is between cudagraph capture sizes by @varun-sundar-rabindranath in #15308
- [Model] Support Tele-FLM Model by @atone in #15023
- [V1] Add `disable-any-whitespace` option support for xgrammar by @russellb in #15316
- [BugFix][Typing] Fix Imprecise Type Annotations by @WrRan in #15...
v0.8.1
This release contains important bug fixes for v0.8.0. We highly recommend upgrading!
- V1 Fixes
- TPU
- Model
What's Changed
- [Bugfix] Fix interface for Olmo2 on V1 by @ywang96 in #14976
- [CI/Build] Use `AutoModelForImageTextToText` to load image models in tests by @DarkLight1337 in #14945
- [V1] Guard Against Main Thread Usage by @robertgshaw2-redhat in #14972
- [V1] TPU - Fix CI/CD runner for V1 and remove V0 tests by @alexm-redhat in #14974
- [Bugfix] Fix bnb quantization for models with both HF-format and Mistral-format weights by @tristanleclercq in #14950
- [Neuron] trim attention kernel tests to fit trn1.2x instance by @liangfu in #14988
- [Doc][V1] Fix V1 APC doc by @shen-shanshan in #14920
- [Kernels] LoRA - Retire SGMV and BGMV Kernels by @varun-sundar-rabindranath in #14685
- [Mistral-Small 3.1] Update docs and tests by @patrickvonplaten in #14977
- [Misc] Embedding model support LoRA by @jeejeelee in #14935
- [Bugfix] torchrun compatibility by @hiyouga in #14899
- [Bugfix][Frontend] Fix validation of `logprobs` in `ChatCompletionRequest` by @schoennenbeck in #14352
- [Misc][Docs] fix the comments of KV_T and CACHE_T in CALL_RESHAPE_AND_CACHE_XX macros by @yangsijia-serena in #14347
- [Bugfix] Loosen type check to avoid errors in V1 by @DarkLight1337 in #15021
- [Bugfix] Register serializers for V0 MQ Engine by @simon-mo in #15009
- [TPU][V1][Bugfix] Fix chunked prefill with padding by @NickLucche in #15037
- MI325 configs, fused_moe_kernel bugfix by @ekuznetsov139 in #14987
- [MODEL] Add support for Zamba2 models by @yury-tokpanov in #13185
- [Bugfix] Fix broken CPU quantization due to triton import by @Isotr0py in #15038
- [Bugfix] Fix LoRA extra vocab size by @jeejeelee in #15047
- [V1] Refactor Structured Output for multiple backends by @russellb in #14694
- [V1][Spec Decode] Optimize Rejection Sampler with Triton Kernels by @WoosukKwon in #14930
- [V1] TPU - CI/CD use smaller model by @alexm-redhat in #15054
- fix long dtype in topk sampling by @chujiezheng in #15049
- [Doc] Minor v1_user_guide update by @JenZhao in #15064
- [Misc][V1] Skip device checking if not available by @comaniac in #15061
- [Model] Pixtral: Remove layer instantiation duplication by @juliendenize in #15053
- [Model] Remove duplicated message check in Mistral chat completion request by @b8zhong in #15069
- [Core] Update dtype detection and defaults by @DarkLight1337 in #14858
- [V1] Ensure using int64 for sampled token ids by @WoosukKwon in #15065
- [Bugfix] Re-enable Gemma3 for V1 by @DarkLight1337 in #14980
- [CI][Intel GPU] update XPU dockerfile and CI script by @jikunshang in #15109
- [V1][Bugfix] Fix oracle for device checking by @ywang96 in #15104
- [Misc] Avoid unnecessary HF `do_rescale` warning when passing dummy data by @DarkLight1337 in #15107
- [Bugfix] Fix size calculation of processing cache by @DarkLight1337 in #15114
- [Doc] Update tip info on using latest transformers when creating a custom Dockerfile by @MarcCote in #15070
- [Misc][Benchmark] Add support for different `tokenizer_mode` by @aarnphm in #15040
- [Bugfix] Adjust mllama to regional compilation by @jkaniecki in #15112
- [Doc] Update the "the first vLLM China Meetup" slides link to point to the first page by @imkero in #15134
- [Frontend] Remove custom_cache_manager by @fulvius31 in #13791
- [V1] Minor V1 async engine test refactor by @andoorve in #15075
New Contributors
- @tristanleclercq made their first contribution in #14950
- @hiyouga made their first contribution in #14899
- @ekuznetsov139 made their first contribution in #14987
- @yury-tokpanov made their first contribution in #13185
- @juliendenize made their first contribution in #15053
- @MarcCote made their first contribution in #15070
- @jkaniecki made their first contribution in #15112
- @fulvius31 made their first contribution in #13791
Full Changelog: v0.8.0...v0.8.1
v0.8.0
v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)!
Highlights
V1
We have now enabled the V1 engine by default (#13726) for supported use cases. Please refer to the V1 user guide for more detail. We expect better performance for supported scenarios. If you'd like to disable V1 mode, please set the environment variable `VLLM_USE_V1=0`, and send us a GitHub issue sharing the reason!
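For reference, a minimal sketch of opting out (the model name below is only a placeholder, and the variable should be set before the engine is constructed):

```python
# Hypothetical opt-out sketch: set VLLM_USE_V1=0 before building the engine
# so this process falls back to the V0 code path. "facebook/opt-125m" is
# just a placeholder model for illustration.
import os

os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM

llm = LLM(model="facebook/opt-125m")
print(llm.generate("Hello, my name is")[0].outputs[0].text)
```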
- Support variety of sampling parameters (#13376, #10980, #13210, #13774)
- Compatibility of prompt logprobs with prefix caching (#13949), and of sliding window with prefix caching (#13069)
- Stability fixes (#14380, #14379, #13298)
- Pluggable scheduler (#14466)
- `SupportsV0Only` protocol for model definitions (#13959)
- Metrics enhancements (#13299, #13504, #14695, #14082)
- V1 user guide (#13991) and design doc (#12745)
- Support for Structured Outputs (#12388, #14590, #14625, #14630, #14851)
- Support for LoRA (#13705, #13096, #14626)
- Enhance Pipeline Parallelism (#14585, #14643)
- Ngram speculative decoding (#13729, #13933)
DeepSeek Improvements
We observe state-of-the-art performance when running DeepSeek models on the latest version of vLLM:
- MLA Enhancements:
- Distributed Expert Parallelism (EP) and Data Parallelism (DP)
- MTP: Expand DeepSeek MTP code to support k > n_predict (#13626)
- Pipeline Parallelism:
- GEMM
New Models
- Gemma 3 (#14660)
- Note: You have to install transformers from the main branch (`pip install git+https://github.com/huggingface/transformers.git`) to use this model. Also, there may be numerical instabilities for the `float16`/`half` dtype. Please use `bfloat16` (preferred by HF) or `float32` dtype (see the sketch after this list).
- Mistral Small 3.1 (#14957)
- Phi-4-multimodal-instruct (#14119)
- Grok1 (#13795)
- QwQ-32B and tool calling (#14479, #14478)
- Zamba2 (#13185)
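As a hedged illustration of the Gemma 3 dtype note above (the exact checkpoint id is an assumption, not taken from the release notes):

```python
# Sketch only: serve Gemma 3 with the recommended bfloat16 dtype.
# Assumes transformers was installed from its main branch; the checkpoint id
# "google/gemma-3-4b-it" is a placeholder for whichever Gemma 3 model you use.
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-3-4b-it", dtype="bfloat16")
params = SamplingParams(temperature=0.7, max_tokens=64)
print(llm.generate("Summarize the v0.8.0 release in one sentence.", params)[0].outputs[0].text)
```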
NVIDIA Blackwell
- Support nvfp4 cutlass gemm (#13571)
- Add cutlass support for blackwell fp8 gemm (#13798)
- Update the flash attn tag to support Blackwell (#14244)
- Add ModelOpt FP4 Checkpoint Support (#12520)
Breaking Changes
- The default value of `seed` is now `None` to align with PyTorch and Hugging Face. Please explicitly set the seed for reproducibility (#14274); see the sketch below.
- The `kv_cache` and `attn_metadata` arguments for the model's forward method have been removed; the attention backend has access to these values via `forward_context`. (#13887)
- vLLM will now default `generation_config` from the model for the chat template, sampling parameters such as temperature, etc. (#12622)
- Several request time metrics (`vllm:time_in_queue_requests`, `vllm:model_forward_time_milliseconds`, `vllm:model_execute_time_milliseconds`) have been deprecated and are subject to removal (#14135)
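A minimal sketch of pinning the seed explicitly under the new default (model and prompt are placeholders):

```python
# Sketch: seed now defaults to None, so pass it explicitly when you need
# reproducible sampling. Both engine-level and per-request seeds are shown.
from vllm import LLM, SamplingParams

llm = LLM(model="facebook/opt-125m", seed=42)        # engine-level seed
params = SamplingParams(temperature=0.8, seed=1234)  # per-request sampling seed
print(llm.generate("Tell me a fact about GPUs.", params)[0].outputs[0].text)
```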
Updates
- Update to PyTorch 2.6.0 (#12721, #13860)
- Update to Python 3.9 typing (#14492, #13971)
- Update to CUDA 12.4 as default for release and nightly wheels (#12098)
- Update to Ray 2.43 (#13994)
- Upgrade aiohttp to include CVE fix (#14840)
- Upgrade jinja2 to get 3 moderate CVE fixes (#14839)
Features
Frontend API
- API Server
- Support `return_tokens_as_token_id` as a request param (#14066)
- Support image embedding as input (#13955)
- New /load endpoint for load statistics (#13950)
- New API endpoint `/is_sleeping` (#14312)
- Enables /score endpoint for embedding models (#12846)
- Enable streaming for Transcription API (#13301)
- Make model param optional in request (#13568)
- Support SSL Key Rotation in HTTP Server (#13495)
- Support
- Reasoning
- CLI
- Make LLM API compatible for torchrun launcher (#13642)
Disaggregated Serving
- Support KV cache offloading and disagg prefill with LMCache connector (#12953)
- Support chunked prefill for LMCache connector (#14505)
LoRA
- Add LoRA support for TransformersModel (#13770)
- Make the device profiler include LoRA memory. (#14469)
- Gemma3ForConditionalGeneration supports LoRA (#14797)
- Retire SGMV and BGMV Kernels (#14685)
VLM
- Generalized prompt updates for multi-modal processor (#13964)
- Deprecate legacy input mapper for OOT multimodal models (#13979)
- Refer code examples for common cases in dev multimodal processor (#14278)
Quantization
- BaiChuan SupportsQuant (#13710)
- BartModel SupportsQuant (#14699)
- Bamba SupportsQuant (#14698)
- Deepseek GGUF support (#13167)
- GGUF MoE kernel (#14613)
- Add GPTQAllSpark Quantization (#12931)
- Better performance of gptq marlin kernel when n is small (#14138)
Structured Output
- xgrammar: Expand list of unsupported jsonschema keywords (#13783)
Hardware Support
AMD
- Faster Custom Paged Attention kernels (#12348)
- Improved performance for V1 Triton (ROCm) backend (#14152)
- Chunked prefill/paged attention in MLA on ROCm (#14316)
- Perf improvement for DSv3 on AMD GPUs (#13718)
- MoE fp8 block quant tuning support (#14068)
TPU
- Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
- Support start_profile/stop_profile in TPU worker (#13988)
- Add TPU v1 test (#14834)
- TPU multimodal model support for ragged attention (#14158)
- Add tensor parallel support via Ray (#13618)
- Enable prefix caching by default (#14773)
Neuron
- Add Neuron device communicator for vLLM v1 (#14085)
- Add custom_ops for neuron backend (#13246)
- Add reshape_and_cache (#14391)
- Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)
CPU
s390x
- Adding cpu inference with VXE ISA for s390x architecture (#12613)
- Add documentation for s390x cpu implementation (#14198)
Plugins
Bugfix and Enhancements
- Illegal memory access for MoE On H20 (#13693)
- Fix FP16 overflow for DeepSeek V2 (#13232)
- Illegal Memory Access in the blockwise cutlass fp8 GEMMs (#14396)
- Pass all driver env vars to ray workers unless excluded (#14099)
- Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
- Capture and log the time of loading weights (#13666)
Developer Tooling
Benchmarks
CI and Build
Documentation
- Add RLHF document (#14482)
- Add nsight guide to profiling docs (#14298)
- Add K8s deployment guide (#14084)
- Add developer documentation for `torch.compile` integration (#14437)
What's Changed
- Update `pre-commit`'s `isort` version to remove warnings by @hmellor in #13614
- [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
- fix neuron performance issue by @ajayvohra2005 in #13589
- [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
- [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
- [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
- Add llmaz as another integration by @kerthcet in #13643
- [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
- [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
- Use pre-commit to update `requirements-test.txt` by @hmellor in #13617
- [Bugfix] Add `mm_processor_kwargs` to chat-related protocols by @ywang96 in #13644
- [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
- Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
- [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
- [ci] Fix metrics test model path by @khluu in #13635
- [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
- [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
- fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in https://...
v0.8.0rc2
What's Changed
- [V1] Remove input cache client by @DarkLight1337 in #14864
- [Misc][XPU] Use None as device capacity for XPU by @yma11 in #14932
- [Doc] Add vLLM Beijing meetup slide by @heheda12345 in #14938
- setup.py: drop assumption about local `main` branch by @russellb in #14692
- [MISC] More AMD unused var clean up by @houseroad in #14926
- fix minor miscalled method by @kushanam in #14327
- [V1][TPU] Apply the ragged paged attention kernel fix and remove the padding. by @vanbasten23 in #14846
- [Bugfix] Fix Ultravox on V1 by @DarkLight1337 in #14929
- [Misc] Add `--seed` option to offline multi-modal examples by @DarkLight1337 in #14934
- [Bugfix][ROCm] running new process using spawn method for rocm in tests. by @vllmellm in #14810
- [Doc] Fix misleading log during multi-modal profiling by @DarkLight1337 in #14955
- Add patch merger by @patrickvonplaten in #14957
- [V1] Default MLA to V1 by @simon-mo in #14921
- [Bugfix] Fix precommit - line too long in pixtral.py by @tlrmchlsmth in #14960
- [Bugfix][Model] Mixtral: use unused head_dim config argument by @qtrrb in #14961
- [Fix][Structured Output] using vocab_size to construct matcher by @aarnphm in #14868
- [Bugfix] Make Gemma3 MM V0 only for now by @ywang96 in #14971
New Contributors
Full Changelog: v0.8.0rc1...v0.8.0rc2
v0.8.0rc1
Note: vLLM no longer sets the global seed (#14274). Please set the `seed` parameter if you need to reproduce your results.
What's Changed
- Update `pre-commit`'s `isort` version to remove warnings by @hmellor in #13614
- [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
- fix neuron performance issue by @ajayvohra2005 in #13589
- [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
- [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
- [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
- Add llmaz as another integration by @kerthcet in #13643
- [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
- [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
- Use pre-commit to update `requirements-test.txt` by @hmellor in #13617
- [Bugfix] Add `mm_processor_kwargs` to chat-related protocols by @ywang96 in #13644
- [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
- Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
- [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
- [ci] Fix metrics test model path by @khluu in #13635
- [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
- [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
- fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in #13668
- [Attention] MLA with chunked prefill by @LucasWilkinson in #12639
- [Misc] Fix yapf linting tools etc not running on pre-commit by @Isotr0py in #13695
- docs: Add a note on full CI run in contributing guide by @terrytangyuan in #13646
- [HTTP Server] Make model param optional in request by @youngkent in #13568
- [Bugfix][API Server] Fix invalid usage of 'ge' and 'le' in port valid… by @WangErXiao in #13672
- [Misc] Capture and log the time of loading weights by @waltforme in #13666
- [ROCM] fix native attention function call by @gongdao123 in #13650
- [Bugfix][Model] OLMo 2: split qkv correctly for GQA and MQA by @2015aroras in #13687
- [Misc] Bump compressed-tensors by @dsikka in #13619
- [Bugfix] Fix benchmark script bug: inaccurate stats for vllm backend when max_model_len < input_len + output_len by @WangErXiao in #13691
- [v1] Support allowed_token_ids in v1 Sampler by @houseroad in #13210
- [Bugfix] V1 Memory Profiling: V0 Sampler Integration without Rejection Sampler by @JenZhao in #13594
- Correction to TP logic for Mamba Mixer 2 when Num Groups not divisible by TP Size by @fabianlim in #13660
- [V1][Metrics] Support `vllm:cache_config_info` by @markmc in #13299
- [Metrics] Add `--show-hidden-metrics-for-version` CLI arg by @markmc in #13295
- [Misc] Reduce LoRA-related static variable by @jeejeelee in #13166
- [CI/Build] Fix pre-commit errors by @DarkLight1337 in #13696
- [core] set up data parallel communication by @youkaichao in #13591
- [ci] fix linter by @youkaichao in #13701
- Support SSL Key Rotation in HTTP Server by @youngkent in #13495
- [NVIDIA] Support nvfp4 cutlass gemm by @kaixih in #13571
- [V1][Kernel] Refactor the prefix_prefill kernel so that the caller no longer has to pass in the context lengths by @SageMoore in #13095
- [ROCm] Apply FP8 weights padding to values not divisible by 512 bytes on ROCm by @gshtras in #13231
- [Doc] Dockerfile instructions for optional dependencies and dev transformers by @DarkLight1337 in #13699
- [Bugfix] Fix boolean conversion for OpenVINO env variable by @helena-intel in #13615
- [XPU]fix setuptools version for xpu by @yma11 in #13548
- [CI/Build] fix uv caching in Dockerfile by @dtrifiro in #13611
- [CI/Build] Fix pre-commit errors from #13571 by @ywang96 in #13709
- [BugFix] Minor: logger import in attention backend by @andylolu2 in #13706
- [ci] Use env var to control whether to use S3 bucket in CI by @khluu in #13634
- [Quant] BaiChuan SupportsQuant by @kylesayrs in #13710
- [LMM] Implement merged multimodal processor for whisper by @Isotr0py in #13278
- [Core][Distributed] Use IPC (domain socket) ZMQ socket for local comms by @njhill in #13688
- [Misc] Deprecate `--dataset` from `benchmark_serving.py` by @ywang96 in #13708
- [v1] torchrun compatibility by @youkaichao in #13642
- [V1][BugFix] Fix engine core client shutdown hangs by @njhill in #13298
- Fix some issues with benchmark data output by @huydhn in #13641
- [ci] Add logic to change model to S3 path only when S3 CI env var is on by @khluu in #13727
- [V1][Core] Fix memory issue with logits & sampling by @ywang96 in #13721
- [model][refactor] remove cuda hard code in models and layers by @MengqingCao in #13658
- [Bugfix] fix(logging): add missing opening square bracket by @bufferoverflow in #13011
- [CI/Build] add python-json-logger to requirements-common by @bufferoverflow in #12842
- Expert Parallelism (EP) Support for DeepSeek Models by @cakeng in #12583
- [BugFix] Illegal memory access for MoE On H20 by @Abatom in #13693
- [Misc][Docs] Raise error when flashinfer is not installed and `VLLM_ATTENTION_BACKEND` is set by @NickLucche in #12513
- [V1] V1 engine implements parallel sampling (AsyncLLM and LLMEngine) by @afeldman-nm in #10980
- Revert "[V1][Core] Fix memory issue with logits & sampling" by @ywang96 in #13775
- Fix precommit fail in fused_moe intermediate_cache2 chunking by @mgoin in #13772
- [Misc] Clean Up `EngineArgs.create_engine_config` by @robertgshaw2-redhat in #13734
- [Misc][Chore] Clean Up `AsyncOutputProcessing` Logs by @robertgshaw2-redhat in #13780
- Remove unused kwargs from model definitions by @hmellor in #13555
- [Doc] arg_utils.py: fixed a typo by @eli-b in #13785
- [Misc] set single whitespace between log sentences by @cjackal in #13771
- [Bugfix][Quantization] Fix FP8 + EP by @tlrmchlsmth in #13784
- [Misc][Attention][Quantization] init property earlier by @wangxiyuan in #13733
- [V1][Metrics] Implement vllm:lora_requests_info metric by @markmc in #13504
- [Bugfix] Fix deepseek-v2 error: "missing 1 required positional argument: 'residual'" by @LucasWilkinson in #13802
- [Bugfix] Support MLA for CompressedTensorsWNA16 by @mgoin in #13725
- Fix CompressedTensorsWNA16MoE with grouped scales by @mgoin in #13769
- [Core] LoRA V1 - Add add/pin/list/remove_lora functions by @varun-sundar-rabindranath in #13705
- [Misc] Check that the model can be inspected upon registration by @DarkLight1337 in #13743
- [Core] xgrammar: Expand list of unsupported jsonschema keywords by @russellb in #13783
- [Bugf...
v0.7.3
Highlights
🎉 253 commits from 93 contributors, including 29 new contributors!
- Deepseek enhancements:
- Support for DeepSeek Multi-Token Prediction, 1.69x speedup in low QPS scenarios (#12755)
- AMD support: DeepSeek tunings, yielding 17% latency reduction (#13199)
- Using FlashAttention3 for MLA (#12807)
- Align the expert selection code path with official implementation (#13474)
- Optimize moe_align_block_size for deepseek_v3 (#12850)
- Expand MLA to support most types of quantization (#13181)
- V1 Engine:
- LoRA Support (#10957, #12883)
- Logprobs and prompt logprobs support (#9880), min_p sampling support (#13191), logit_bias in v1 Sampler (#13079)
- Use msgpack for core request serialization (#12918)
- Pipeline parallelism support (#12996, #13353, #13472, #13417, #13315)
- Metrics enhancements: GPU prefix cache hit rate % gauge (#12592), iteration_tokens_total histogram (#13288), several request timing histograms (#12644)
- Initial speculative decoding support with ngrams (#12193, #13365)
Model Support
- Enhancement to Qwen2.5-VL: BNB support (#12944), LoRA (#13261), Optimizations (#13155)
- Support GPTQModel Dynamic [2,3,4,8]bit GPTQ quantization (#7086)
- Support Unsloth Dynamic 4bit BnB quantization (#12974)
- IBM/NASA Prithvi Geospatial model (#12830)
- Support Mamba2 (Codestral Mamba) (#9292), Bamba Model (#10909)
- Ultravox Model: Support v0.5 Release (#12912)
- `transformers` backend
- VLM:
Hardware Support
- Pluggable platform-specific scheduler (#13161)
- NVIDIA: Support nvfp4 quantization (#12784)
- AMD:
- TPU: V1 Support (#13049)
- Neuron: Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
- Gaudi:
Engine Feature
- Add sleep and wake up endpoint and v1 support (#12987)
- Add `/v1/audio/transcriptions` OpenAI API endpoint (#12909)
Performance
Others
- Make vLLM compatible with veRL (#12824)
- Fixes for cases of FA2 illegal memory access error (#12848)
- choice-based structured output with xgrammar (#12632)
- Run v1 benchmark and integrate with PyTorch OSS benchmark database (#13068)
What's Changed
- [Misc] Update w2 scale loading for GPTQMarlinMoE by @dsikka in #12757
- [Docs] Add Google Cloud Slides by @simon-mo in #12814
- [Attention] Use FA3 for MLA on Hopper by @LucasWilkinson in #12807
- [misc] Reduce number of config file requests to HuggingFace by @khluu in #12797
- [Misc] Remove unnecessary decode call by @DarkLight1337 in #12833
- [Kernel] Make rotary_embedding ops more flexible with input shape by @Isotr0py in #12777
- [torch.compile] PyTorch 2.6 and nightly compatibility by @youkaichao in #12393
- [Doc] double quote cmake package in build.inc.md by @jitseklomp in #12840
- [Bugfix] Fix unsupported FA version check for Turing GPU by @Isotr0py in #12828
- [V1] LoRA Support by @varun-sundar-rabindranath in #10957
- Add Bamba Model by @fabianlim in #10909
- [MISC] Check space in the file names in the pre commit checks by @houseroad in #12804
- [misc] Revert # 12833 by @khluu in #12857
- [Bugfix] FA2 illegal memory access by @LucasWilkinson in #12848
- Make vllm compatible with verl by @ZSL98 in #12824
- [Bugfix] Missing quant_config in deepseek embedding layer by @SzymonOzog in #12836
- Prevent unnecessary requests to huggingface hub by @maxdebayser in #12837
- [MISC][EASY] Break check file names into entry and args in the pre-commit hooks by @houseroad in #12880
- [Misc] Remove unnecessary detokenization in multimodal processing by @DarkLight1337 in #12868
- [Model] Add support for partial rotary embeddings in Phi3 model by @garg-amit in #12718
- [V1] Logprobs and prompt logprobs support by @afeldman-nm in #9880
- [ROCm] [Feature] [Doc] [Dockerfile] [BugFix] Support Per-Token-Activation Per-Channel-Weight FP8 Quantization Inferencing by @tjtanaa in #12501
- [V1] LM Eval With Streaming Integration Tests by @robertgshaw2-redhat in #11590
- [Bugfix] Fix disagg hang caused by the prefill and decode communication issues by @houseroad in #12723
- [V1][Minor] Remove outdated comment by @WoosukKwon in #12928
- [V1] Move KV block hashes from Request to KVCacheManager by @WoosukKwon in #12922
- [Bugfix] Fix Qwen2_5_VLForConditionalGeneration packed_modules_mapping by @jeejeelee in #12905
- [Misc] Fix typo in the example file by @DK-DARKmatter in #12896
- [Bugfix] Fix multi-round chat error when mistral tokenizer is used by @zifeitong in #12859
- [bugfix] respect distributed_executor_backend in world_size=1 by @youkaichao in #12934
- [Misc] Add offline test for disaggregated prefill by @Shaoting-Feng in #12418
- [V1][Minor] Move cascade attn logic outside _prepare_inputs by @WoosukKwon in #12943
- [Build] Make pypi install work on CPU platform by @wangxiyuan in #12874
- [Hardware][Intel-Gaudi] Enable long-contexts + LoRA support for Intel Gaudi by @sanjucsudhakaran in #12812
- [misc] Add LoRA to benchmark_serving by @varun-sundar-rabindranath in #12898
- [Misc] Log time consumption on weight downloading by @waltforme in #12926
- [CI] Resolve transformers-neuronx version conflict by @liangfu in #12925
- [Doc] Correct HF repository for TeleChat2 models by @waltforme in #12949
- [Misc] Add qwen2.5-vl BNB support by @Isotr0py in #12944
- [CI/Build] Auto-fix Markdown files by @DarkLight1337 in #12941
- [Bugfix] Remove unused seq_group_metadata_list from ModelInputForGPU by @ShangmingCai in #12935
- [bugfix] fix early import of flash attention by @youkaichao in #12959
- [VLM] Merged multi-modal processor for GLM4V by @jeejeelee in #12449
- [V1][Minor] Remove outdated comment by @WoosukKwon in #12968
- [RFC] [Mistral] FP8 format by @patrickvonplaten in #10130
- [V1] Cache `uses_mrope` in GPUModelRunner by @WoosukKwon in #12969
- [core] port pynvml into vllm codebase by @youkaichao in #12963
- [MISC] Always import version library first in the vllm package by @houseroad in #12979
- [core] improve error handling when wake up from sleep mode by @youkaichao in #12981
- [core][rlhf] add colocate example for RLHF by @youkaichao in #12984
- [V1] Use msgpack for core request serialization by @njhill in #12918
- [Bugfix][Platform] Check whether selected backend is None in get_attn_backend_cls() by @terrytangyuan in #12975
- [core] fix sleep mode and pytorch checkpoint compatibility by @youkaichao in #13001
- [Doc] Add link to tool_choice tracking issue in tool_calling.md by @terrytangyuan in #13003
- [misc] Add retries with exponential backoff for HF file existence check by @khluu in #13008
- [Bugfix] Clean up and fix multi-modal processors by @DarkLight1337 in #13012
- Fix seed parameter behavior in vLLM by @SmartManoj in #13007
- [Model] Ultravox Model: Support v0.5 Release by @farzadab in #12912
- [misc] Fix setup.py condition to avoid AMD from being mistaken with CPU by @khluu in https://github.com/vllm-proje...
v0.7.2
Highlights
- Qwen2.5-VL is now supported in vLLM. Please note that it requires a source installation of the Hugging Face `transformers` library at the moment (#12604)
- Add `transformers` backend support via `--model-impl=transformers`. This allows vLLM to run arbitrary Hugging Face text models (#11330, #12785, #12727); see the sketch after this list.
- Performance enhancements to DeepSeek models.
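A hedged sketch of the `transformers` fallback from Python (the checkpoint id is a placeholder, and `model_impl` is assumed to be the Python-side counterpart of the `--model-impl` CLI flag):

```python
# Sketch: force the transformers backend for a Hugging Face text model that
# has no native vLLM implementation. The model id is a placeholder, and
# model_impl is assumed to mirror --model-impl=transformers.
from vllm import LLM

llm = LLM(model="my-org/my-custom-hf-model", model_impl="transformers")
print(llm.generate("The transformers backend lets vLLM run")[0].outputs[0].text)
```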
Core Engine
- Use `VLLM_LOGITS_PROCESSOR_THREADS` to speed up structured decoding in high batch size scenarios (#12368)
Security Update
- Improve hash collision avoidance in prefix caching (#12621)
- Add SPDX-License-Identifier headers to python source files (#12628)
Other
- Enable FusedSDPA support for Intel Gaudi (HPU) (#12359)
What's Changed
- Apply torch.compile to fused_moe/grouped_topk by @mgoin in #12637
- doc: fixing minor typo in readme.md by @vicenteherrera in #12643
- [Bugfix] fix moe_wna16 get_quant_method by @jinzhen-lin in #12648
- [Core] Silence unnecessary deprecation warnings by @russellb in #12620
- [V1][Minor] Avoid frequently creating ConstantList by @WoosukKwon in #12653
- [Core][v1] Unify allocating slots in prefill and decode in KV cache manager by @ShawnD200 in #12608
- [Hardware][Intel GPU] add XPU bf16 support by @jikunshang in #12392
- [Misc] Add SPDX-License-Identifier headers to python source files by @russellb in #12628
- [doc][misc] clarify VLLM_HOST_IP for multi-node inference by @youkaichao in #12667
- [Doc] Deprecate Discord by @zhuohan123 in #12668
- [Kernel] port sgl moe_align_block_size kernels by @chenyang78 in #12574
- make sure mistral_common not imported for non-mistral models by @youkaichao in #12669
- Properly check if all fused layers are in the list of targets by @eldarkurtic in #12666
- Fix for attention layers to remain unquantized during moe_wn16 quant by @srikanthsrnvs in #12570
- [cuda] manually import the correct pynvml module by @youkaichao in #12679
- [ci/build] fix gh200 test by @youkaichao in #12681
- [Model]: Add `transformers` backend support by @ArthurZucker in #11330
- [Misc] Fix improper placement of SPDX header in scripts by @russellb in #12694
- [Bugfix][Kernel] Fix per-token/per-channel quantization for Hopper scaled mm by @tlrmchlsmth in #12696
- Squelch MLA warning for Compressed-Tensors Models by @kylesayrs in #12704
- [Model] Add Deepseek V3 fp8_w8a8 configs for B200 by @kushanam in #12707
- [MISC] Remove model input dumping when exception by @comaniac in #12582
- [V1] Revert `uncache_blocks` and support recaching full blocks by @comaniac in #12415
- [Core] Improve hash collision avoidance in prefix caching by @russellb in #12621
- Support Pixtral-Large HF by using llava multimodal_projector_bias config by @mgoin in #12710
- [Doc] Replace ibm-fms with ibm-ai-platform by @tdoublep in #12709
- [Quant] Fix use_mla TypeError and support loading pure-sparsity Compressed Tensors configs by @kylesayrs in #12711
- [AMD][ROCm] Enable DeepSeek model on ROCm by @hongxiayang in #12662
- [Misc] Add BNB quantization for Whisper by @jeejeelee in #12381
- [VLM] Merged multi-modal processor for InternVL-based models by @DarkLight1337 in #12553
- [V1] Remove constraints on partial requests by @WoosukKwon in #12674
- [VLM] Implement merged multimodal processor and V1 support for idefics3 by @Isotr0py in #12660
- [Model] [Bugfix] Fix loading of fine-tuned models based on Phi-3-Small by @mgtk77 in #12689
- Avoid unnecessary multi-modal input data copy when len(batch) == 1 by @imkero in #12722
- [Build] update requirements of no-device for plugin usage by @sducouedic in #12630
- [Bugfix] Fix CI failures for InternVL and Mantis models by @DarkLight1337 in #12728
- [V1][Metrics] Add request_success_total counter, labelled with finish reason by @markmc in #12579
- [Perf] Mem align KV caches for CUDA devices (MLA perf improvement) by @LucasWilkinson in #12676
- [Core] add and implement `VLLM_LOGITS_PROCESSOR_THREADS` by @akeshet in #12368
- [ROCM][AMD][TRITON] Halving warps number for fw_prefill to reduce spilling by @maleksan85 in #12713
- Refactor `Linear` handling in `TransformersModel` by @hmellor in #12727
- [VLM] Add MLA with pure RoPE support for deepseek-vl2 models by @Isotr0py in #12729
- [Misc] Bump the compressed-tensors version by @dsikka in #12736
- [Model][Quant] Fix GLM, Fix fused module mappings for quantization by @kylesayrs in #12634
- [Doc] Update PR Reminder with link to Developer Slack by @mgoin in #12748
- [Bugfix] Fix OpenVINO model runner by @hmellor in #12750
- [V1][Misc] Shorten `FinishReason` enum and use constant strings by @njhill in #12760
- [Doc] Remove performance warning for auto_awq.md by @mgoin in #12743
- [Bugfix] Fix 'ModuleNotFoundError: No module named 'intel_extension_for_pytorch'' for --tensor-parallel-size more than 1 by @Akashcodes732 in #12546
- [core][distributed] exact ray placement control by @youkaichao in #12732
- [Kernel] Use self.kv_cache and forward_context.attn_metadata in Attention.forward by @heheda12345 in #12536
- [Hardware][Intel-Gaudi] Enable FusedSDPA support for Intel Gaudi (HPU) by @sanjucsudhakaran in #12359
- Add: Support for Sparse24Bitmask Compressed Models by @rahul-tuli in #12097
- [VLM] Use shared field to pass token ids to model by @DarkLight1337 in #12767
- [Docs] Drop duplicate [source] links by @russellb in #12780
- [VLM] Qwen2.5-VL by @ywang96 in #12604
- [VLM] Update compatibility with transformers 4.49 by @DarkLight1337 in #12781
- Quantization and MoE configs for GH200 machines by @arvindsun in #12717
- [ROCm][Kernel] Using the correct warp_size value by @gshtras in #12789
- [Bugfix] Better FP8 supported defaults by @LucasWilkinson in #12796
- [Misc][Easy] Remove the space from the file name by @houseroad in #12799
- [Model] LoRA Support for Ultravox model by @thedebugger in #11253
- [Bugfix] Fix the test_ultravox.py's license by @houseroad in #12806
- Improve `TransformersModel` UX by @hmellor in #12785
- [Misc] Remove duplicated DeepSeek V2/V3 model definition by @mgoin in #12793
- [Misc] Improve error message for incorrect pynvml by @youkaichao in #12809
New Contributors
- @vicenteherrera made their first contribution in #12643
- @chenyang78 made their first contribution in #12574
- @srikanthsrnvs made their first contribution in #12570
- @ArthurZucker made their first contribution in #11330
- @mgtk77 made their first contribution in #12689
- @sducouedic made their first contribution in #12630
- @akeshet made their first contribution in #12368
- @arvindsun made their first contribution in #12717
- @thedebugger made their first contribution in ht...
v0.7.1
Highlights
This release features MLA optimization for the DeepSeek family of models. Compared to v0.7.0 released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.
V1
For the V1 architecture, we:
- Added a new design document for zero-overhead prefix caching (#12598)
- Added metrics and enhanced logging for the V1 engine (#12569, #12561, #12416, #12516, #12530, #12478)
Models
- New Model: MiniCPM-o (text outputs only) (#12069)
Hardware
- Neuron: NKI-based flash-attention kernel with paged KV cache (#11277)
- AMD: llama 3.2 support upstreaming (#12421)
Others
- Support override generation config in engine arguments (#12409)
- Support reasoning content in API for deepseek R1 (#12473)
What's Changed
- [Bugfix] Fix missing seq_start_loc in xformers prefill metadata by @Isotr0py in #12464
- [V1][Minor] Minor optimizations for update_from_output by @WoosukKwon in #12454
- [Bugfix] Fix gpt2 GGUF inference by @Isotr0py in #12467
- [Build] Only build 9.0a for scaled_mm and sparse kernels by @LucasWilkinson in #12339
- [V1][Metrics] Add initial Prometheus logger by @markmc in #12416
- [V1][CI/Test] Do basic test for top-p & top-k sampling by @WoosukKwon in #12469
- [FlashInfer] Upgrade to 0.2.0 by @abmfy in #11194
- [Feature] [Spec decode]: Enable MLPSpeculator/Medusa and `prompt_logprobs` with ChunkedPrefill by @NickLucche in #10132
- Update `pre-commit` hooks by @hmellor in #12475
- [Neuron][Kernel] NKI-based flash-attention kernel with paged KV cache by @liangfu in #11277
- Fix bad path in prometheus example by @mgoin in #12481
- [CI/Build] Fixed the xla nightly issue report in #12451 by @hosseinsarshar in #12453
- [FEATURE] Enables offline /score for embedding models by @gmarinho2 in #12021
- [CI] fix pre-commit error by @MengqingCao in #12494
- Update README.md with V1 alpha release by @ywang96 in #12495
- [V1] Include Engine Version in Logs by @robertgshaw2-redhat in #12496
- [Core] Make raw_request optional in ServingCompletion by @schoennenbeck in #12503
- [VLM] Merged multi-modal processor and V1 support for Qwen-VL by @DarkLight1337 in #12504
- [Doc] Fix typo for x86 CPU installation by @waltforme in #12514
- [V1][Metrics] Hook up IterationStats for Prometheus metrics by @markmc in #12478
- Replace missed warning_once for rerank API by @mgoin in #12472
- Do not run `suggestion` `pre-commit` hook multiple times by @hmellor in #12521
- [V1][Metrics] Add per-request prompt/generation_tokens histograms by @markmc in #12516
- [Kernel] Pipe attn_logits_soft_cap through paged attention TPU kernels by @fenghuizhang in #12482
- [TPU] Add example for profiling TPU inference by @mgoin in #12531
- [Frontend] Support reasoning content for deepseek r1 by @gaocegege in #12473
- [Doc] Convert docs to use colon fences by @hmellor in #12471
- [V1][Metrics] Add TTFT and TPOT histograms by @markmc in #12530
- Bugfix for whisper quantization due to fake k_proj bias by @mgoin in #12524
- [V1] Improve Error Message for Unsupported Config by @robertgshaw2-redhat in #12535
- Fix the pydantic logging validator by @maxdebayser in #12420
- [Bugfix] handle alignment of arguments in convert_sparse_cross_attention_mask_to_dense by @tjohnson31415 in #12347
- [Model] Refactoring of MiniCPM-V and add MiniCPM-o-2.6 support for vLLM by @HwwwwwwwH in #12069
- [Frontend] Support override generation config in args by @liuyanyi in #12409
- [Hardware][NV] Fix Modelopt model loading for k-v-scales for Llama models. by @pavanimajety in #11787
- [Kernel] add triton fused moe kernel for gptq/awq by @jinzhen-lin in #12185
- Revert "[Build/CI] Fix libcuda.so linkage" by @tlrmchlsmth in #12552
- [V1][BugFix] Free encoder cache for aborted requests by @WoosukKwon in #12545
- [Misc][MoE] add Deepseek-V3 moe tuning support by @divakar-amd in #12558
- [V1][Metrics] Add GPU cache usage % gauge by @markmc in #12561
- Set `?device={device}` when changing tab in installation guides by @hmellor in #12560
- [Misc] fix typo: add missing space in lora adapter error message by @Beim in #12564
- [Kernel] Triton Configs for Fp8 Block Quantization by @robertgshaw2-redhat in #11589
- [CPU][PPC] Updated torch, torchvision, torchaudio dependencies by @npanpaliya in #12555
- [V1][Log] Add max request concurrency log to V1 by @mgoin in #12569
- [Kernel] Update `cutlass_scaled_mm` to support 2d group (blockwise) scaling by @LucasWilkinson in #11868
- [ROCm][AMD][Model] llama 3.2 support upstreaming by @maleksan85 in #12421
- [Attention] MLA decode optimizations by @LucasWilkinson in #12528
- [Bugfix] Gracefully handle huggingface hub http error by @ywang96 in #12571
- Add favicon to docs by @hmellor in #12611
- [BugFix] Fix Torch.Compile For DeepSeek by @robertgshaw2-redhat in #12594
- [Git] Automatically sign-off commits by @comaniac in #12595
- [Docs][V1] Prefix caching design by @comaniac in #12598
- [v1][Bugfix] Add extra_keys to block_hash for prefix caching by @heheda12345 in #12603
- [release] Add input step to ask for Release version by @khluu in #12631
- [Bugfix] Revert MoE Triton Config Default by @robertgshaw2-redhat in #12629
- [Kernel][Quantization] Integrate block-quantized CUTLASS kernels for DeepSeekV3 by @tlrmchlsmth in #12587
- [Feature] Fix guided decoding blocking bitmask memcpy by @xpbowler in #12563
- [Doc] Improve installation signposting by @hmellor in #12575
- [Doc] int4 w4a16 example by @brian-dellabetta in #12585
- [V1] Bugfix: Validate Model Input Length by @robertgshaw2-redhat in #12600
- [BugFix] fix wrong output when using lora and num_scheduler_steps=8 by @sleepwalker2017 in #11161
- Fix target matching for fused layers with compressed-tensors by @eldarkurtic in #12617
- [ci] Upgrade transformers to 4.48.2 in CI dependencies by @khluu in #12599
- [Bugfix/CI] Fixup benchmark_moe.py by @tlrmchlsmth in #12562
- Fix: Respect `sparsity_config.ignore` in Cutlass Integration by @rahul-tuli in #12517
- [Attention] Deepseek v3 MLA support with FP8 compute by @LucasWilkinson in #12601
- [CI/Build] Add label automation for structured-output, speculative-decoding, v1 by @russellb in #12280
- Disable chunked prefill and/or prefix caching when MLA is enabled by @simon-mo in #12642
New Contributors
- @abmfy made their first contribution in #11194
- @hosseinsarshar made their first contribution in #12453
- @gmarinho2 made their first contribution in #12021
- @waltforme made their first contribution in #12514
- @fenghuizhang made their first contribution in #12482
- @gaocegege made their first contribution in #12473
- @Beim made their first contribution in https://github.com/vllm-pro...
v0.7.0
Highlights
- vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting the environment variable `VLLM_USE_V1=1`. See our blog for more details. (44 commits)
- New methods (`LLM.sleep`, `LLM.wake_up`, `LLM.collective_rpc`, `LLM.reset_prefix_cache`) in vLLM for post-training frameworks! (#12361, #12084, #12284); see the sketch below.
- `torch.compile` is now fully integrated in vLLM and enabled by default in V1. You can turn it on via the `-O3` engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246)
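A hedged sketch of the sleep/wake-up flow for post-training loops (placeholder model; assumes the `enable_sleep_mode` engine flag is required for these calls):

```python
# Sketch: free accelerator memory between rollout phases and invalidate the
# prefix cache after a weight update. The model id is a placeholder, and
# enable_sleep_mode is assumed to be needed for LLM.sleep()/wake_up().
from vllm import LLM

llm = LLM(model="facebook/opt-125m", enable_sleep_mode=True)

llm.generate("rollout prompt")   # normal inference
llm.sleep(level=1)               # release memory while the trainer runs
# ... training framework updates weights here ...
llm.wake_up()                    # restore engine state
llm.reset_prefix_cache()         # drop cached prefixes computed with old weights
```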
This release features
- 400 commits from 132 contributors, including 57 new contributors.
- 28 CI and build enhancements, including testing for nightly torch (#12270) and inclusion of genai-perf for benchmark (#10704).
- 58 documentation enhancements, including reorganized documentation structure (#11645, #11755, #11766, #11843, #11896).
- more than 161 bug fixes and miscellaneous enhancements
Features
Models
- New generative models: CogAgent (#11742), Deepseek-VL2 (#11578, #12068, #12169), fairseq2 Llama (#11442), InternLM3 (#12037), Whisper (#11280)
- New pooling models: Qwen2 PRM (#12202), InternLM2 reward models (#11571)
- VLM: Merged multi-modal processor is now ready for model developers! (#11620, #11900, #11682, #11717, #11669, #11396)
- Any model that implements the merged multi-modal processor and the `get_*_embeddings` methods according to this guide is automatically supported by the V1 engine.
Hardwares
- Apple: Native support for macOS Apple Silicon (#11696)
- AMD: MI300 FP8 format for block_quant (#12134), Tuned MoE configurations for multiple models (#12408, #12049), block size heuristic for avg 2.8x speedup for int8 models (#11698)
- TPU: support for `W8A8` (#11785)
- x86: Multi-LoRA (#11100) and MoE Support (#11831)
- Progress in out-of-tree hardware support (#12009, #11981, #11948, #11609, #12264, #11516, #11503, #11369, #11602)
Features
- Distributed:
- API Server: Jina- and Cohere-compatible Rerank API (#12376)
- Kernels:
Others
- Benchmark: new script for CPU offloading (#11533)
- Security: Set `weights_only=True` when using `torch.load()` (#12366)
What's Changed
- [Docs] Document Deepseek V3 support by @simon-mo in #11535
- Update openai_compatible_server.md by @robertgshaw2-redhat in #11536
- [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
- [V1] Fix yapf by @WoosukKwon in #11538
- [CI] Fix broken CI by @robertgshaw2-redhat in #11543
- [misc] fix typing by @youkaichao in #11540
- [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-redhat in #11534
- [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-redhat in #11547
- [Platform] Move model arch check to platform by @MengqingCao in #11503
- Update deploying_with_k8s.md with AMD ROCm GPU example by @AlexHe99 in #11465
- [Bugfix] Fix TeleChat2ForCausalLM weights mapper by @jeejeelee in #11546
- [Misc] Abstract out the logic for reading and writing media content by @DarkLight1337 in #11527
- [Doc] Add xgrammar in doc by @Chen-0210 in #11549
- [VLM] Support caching in merged multi-modal processor by @DarkLight1337 in #11396
- [MODEL] Update LoRA modules supported by Jamba by @ErezSC42 in #11209
- [Misc]Add BNB quantization for MolmoForCausalLM by @jeejeelee in #11551
- [Misc] Improve BNB loader to handle mixture of sharded and merged weights with same suffix by @Isotr0py in #11566
- [Bugfix] Fix for ROCM compressed tensor support by @selalipop in #11561
- [Doc] Update mllama example based on official doc by @heheda12345 in #11567
- [V1] [4/N] API Server: ZMQ/MP Utilities by @robertgshaw2-redhat in #11541
- [Bugfix] Last token measurement fix by @rajveerb in #11376
- [Model] Support InternLM2 Reward models by @Isotr0py in #11571
- [Model] Remove hardcoded image tokens ids from Pixtral by @ywang96 in #11582
- [Hardware][AMD]: Replace HIPCC version with more precise ROCm version by @hj-wei in #11515
- [V1][Minor] Set pin_memory=False for token_ids_cpu tensor by @WoosukKwon in #11581
- [Doc] Minor documentation fixes by @DarkLight1337 in #11580
- [bugfix] interleaving sliding window for cohere2 model by @youkaichao in #11583
- [V1] [5/N] API Server: unify `Detokenizer` and `EngineCore` input by @robertgshaw2-redhat in #11545
- [Doc] Convert list tables to MyST by @DarkLight1337 in #11594
- [v1][bugfix] fix cudagraph with inplace buffer assignment by @youkaichao in #11596
- [Misc] Use registry-based initialization for KV cache transfer connector. by @KuntaiDu in #11481
- Remove print statement in DeepseekScalingRotaryEmbedding by @mgoin in #11604
- [v1] fix compilation cache by @youkaichao in #11598
- [Docker] bump up neuron sdk v2.21 by @liangfu in #11593
- [Build][Kernel] Update CUTLASS to v3.6.0 by @tlrmchlsmth in #11607
- [CI/Build][CPU] Fix CPU CI by lazy importing triton FP8 kernels by @bigPYJ1151 in #11618
- [platforms] enable platform plugins by @youkaichao in #11602
- [VLM] Abstract out multi-modal data parsing in merged processor by @DarkLight1337 in #11620
- [V1] [6/N] API Server: Better Shutdown by @robertgshaw2-redhat in #11586
- [Bugfix] Validate and concatenate image embeddings in MiniCPMVBaseModel by @whyiug in #11631
- [benchmark] Remove dependency for H100 benchmark step by @khluu in #11572
- [Model][LoRA]LoRA support added for MolmoForCausalLM by @ayylemao in #11439
- [Bugfix] Fix OpenAI parallel sampling when using xgrammar by @mgoin in #11637
- [Misc][LoRA] Support Rank Stabilized LoRA (RSLoRA) by @JohnGiorgi in #6909
- [Bugfix] Move the _touch(computed_blocks) call in the allocate_slots method to after the check for allocating new blocks. by @sakunkun in #11565
- [V1] Simpify vision block hash for prefix caching by removing offset from hash by @heheda12345 in #11646
- [V1][VLM] V1 support for selected single-image models. by @ywang96 in #11632
- [Benchmark] Add benchmark script for CPU offloading by @ApostaC in #11533
- [Bugfix][Refactor] Unify model management in frontend by @joerunde in #11660
- [VLM] Add max-count checking in data parser for single image models by @DarkLight1337 in #11661
- [Misc] Optimize Qwen2-VL LoRA test by @jeejeelee in #11663
- [Misc] Replace space with - in the file names by @houseroad in #11667
- [Doc] Fix typo by @serihiro in #11666
- [V1] Implement Cascade Attention by @WoosukKwon in #11635
- [VLM] Move supported limits and max tokens to merged multi-modal processor by @DarkLight1337 in #11669
- [VLM][Bugfix] Multi-modal processor compatible with V1 multi-input by @DarkLight1337 in #11674
- [mypy] Pass type checking in vllm/inputs by @CloseChoice in #11680
- [VLM] Merged multi-modal processor for LLaVA-NeXT by @DarkLight1337 in #11682
- According to vllm.EngineArgs, the name should be distributed_executor_backend by @chunyang-wen in #11689
- [Bugfix] Free cross attention block table for preempted-for-recompute sequence group. by @kathyyu-google in #10013
- [V1]...
v0.6.6.post1
This release restores functionality for other quantized MoEs, which was broken as part of the initial DeepSeek V3 support 🙇.
What's Changed
- [Docs] Document Deepseek V3 support by @simon-mo in #11535
- Update openai_compatible_server.md by @robertgshaw2-neuralmagic in #11536
- [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
- [V1] Fix yapf by @WoosukKwon in #11538
- [CI] Fix broken CI by @robertgshaw2-neuralmagic in #11543
- [misc] fix typing by @youkaichao in #11540
- [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-neuralmagic in #11534
- [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-neuralmagic in #11547
Full Changelog: v0.6.6...v0.6.6.post1