
Releases: vllm-project/vllm

v0.8.2

23 Mar 21:05
25f560a

This release contains an important bug fix for the V1 engine's memory usage. We highly recommend upgrading!

Highlights

  • Revert "Use uv python for docker rather than ppa:deadsnakes/ppa (#13569)" (#15377)
  • Remove openvino support in favor of external plugin (#15339)

V1 Engine

  • Fix V1 Engine crash while handling requests with duplicate request id (#15043)
  • Support FP8 KV Cache (#14570, #15191)
  • Add flag to disable cascade attention (#15243)
  • Scheduler Refactoring: Add Scheduler Interface (#15250)
  • Structured Output
    • Add disable-any-whitespace option support for xgrammar (#15316)
    • guidance backend for structured output + auto fallback mode (#14779)
  • Spec Decode
    • Enable spec decode for top-p & top-k sampling (#15063)
    • Use better defaults for N-gram (#15358)
    • Update target_logits in place for rejection sampling (#15427)
  • AMD
    • Enable Triton(ROCm) Attention backend for Nvidia GPUs (#14071)
  • TPU
    • Support V1 Sampler for ragged attention (#14227)
    • Tensor parallel MP support (#15059)
    • MHA Pallas backend (#15288)

Features

  • Integrate fastsafetensors loader for loading model weights (#10647)
  • Add guidance backend for structured output (#14589)

Others

  • Add Kubernetes deployment guide with CPUs (#14865)
  • Support reset prefix cache by specified device (#15003)
  • Support tool calling and reasoning parser (#14511)
  • Support --disable-uvicorn-access-log parameters (#14754)
  • Support Tele-FLM Model (#15023)
  • Add pipeline parallel support to TransformersModel (#12832)
  • Enable CUDA graph support for llama 3.2 vision (#14917)

What's Changed


v0.8.1

19 Mar 17:40
61c7a1b

This release contains important bug fixes for v0.8.0. We highly recommend upgrading!

  • V1 Fixes

    • Ensure using int64 for sampled token ids (#15065)
    • Fix long dtype in topk sampling (#15049)
    • Refactor Structured Output for multiple backends (#14694)
    • Fix size calculation of processing cache (#15114)
    • Optimize Rejection Sampler with Triton Kernels (#14930)
    • Fix oracle for device checking (#15104)
  • TPU

    • Fix chunked prefill with padding (#15037)
    • Enhanced CI/CD (#15054, #14974)
  • Model

    • Re-enable Gemma3 for V1 (#14980)
    • Embedding model support LoRA (#14935)
    • Pixtral: Remove layer instantiation duplication (#15053)

What's Changed

New Contributors

Full Changelog: v0.8.0...v0.8.1

v0.8.0

18 Mar 17:52

v0.8.0 featured 523 commits from 166 total contributors (68 new contributors)!

Highlights

V1

We have now enabled the V1 engine by default (#13726) for supported use cases. Please refer to the V1 user guide for more details. We expect better performance for supported scenarios. If you'd like to disable V1 mode, please set the environment variable VLLM_USE_V1=0, and send us a GitHub issue sharing the reason!
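A minimal sketch of the opt-out: the snippet below sets the variable before vLLM is imported (the model name is only an example); exporting VLLM_USE_V1=0 in the shell before running vllm serve has the same effect.

```python
# Minimal sketch: opt out of the V1 engine for a single script by setting
# VLLM_USE_V1=0 before vLLM is imported. The model name is only an example.
import os

os.environ["VLLM_USE_V1"] = "0"

from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
out = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=16))
print(out[0].outputs[0].text)
```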

DeepSeek Improvements

We observe state-of-the-art performance when running DeepSeek models on the latest version of vLLM:

  • MLA Enhancements:
  • Distributed Expert Parallelism (EP) and Data Parallelism (DP)
    • EP Support for DeepSeek Models (#12583)
    • Add enable_expert_parallel arg (#14305)
    • EP/TP MoE + DP Attention (#13931)
    • Set up data parallel communication (#13591)
  • MTP: Expand DeepSeek MTP code to support k > n_predict (#13626)
  • Pipeline Parallelism:
    • DeepSeek V2/V3/R1 only place lm_head on last pp rank (#13833)
    • Improve pipeline partitioning (#13839)
  • GEMM
    • Add streamK for block-quantized CUTLASS kernels (#12978)
    • Add benchmark for DeepGEMM and vLLM Block FP8 Dense GEMM (#13917)
    • Add more tuned configs for H20 and others (#14877)

New Models

  • Gemma 3 (#14660)
    • Note: You have to install transformers from the main branch (pip install git+https://github.com/huggingface/transformers.git) to use this model. Also, there may be numerical instabilities with the float16/half dtype, so please use bfloat16 (preferred by HF) or float32. A minimal loading sketch follows this list.
  • Mistral Small 3.1 (#14957)
  • Phi-4-multimodal-instruct (#14119)
  • Grok1 (#13795)
  • QwQ-32B and tool calling (#14479, #14478)
  • Zamba2 (#13185)
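For the Gemma 3 note above, a minimal loading sketch assuming transformers is installed from main and bfloat16 is used; the model ID is an assumption, so substitute the checkpoint you intend to serve.

```python
# Minimal sketch: load Gemma 3 with the recommended bfloat16 dtype.
# Assumes transformers is installed from the main branch as noted above;
# the model ID is an assumption, swap in your own checkpoint.
from vllm import LLM

llm = LLM(model="google/gemma-3-4b-it", dtype="bfloat16")
print(llm.generate(["The capital of France is"])[0].outputs[0].text)
```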

NVIDIA Blackwell

  • Support nvfp4 cutlass gemm (#13571)
  • Add cutlass support for blackwell fp8 gemm (#13798)
  • Update the flash attn tag to support Blackwell (#14244)
  • Add ModelOpt FP4 Checkpoint Support (#12520)

Breaking Changes

  • The default value of seed is now None to align with PyTorch and Hugging Face. Please explicitly set seed for reproducibility (see the sketch after this list). (#14274)
  • The kv_cache and attn_metadata arguments for the model's forward method have been removed, as the attention backend has access to these values via forward_context. (#13887)
  • vLLM now defaults to the model's generation_config for the chat template and sampling parameters such as temperature. (#12622)
  • Several request time metrics (vllm:time_in_queue_requests, vllm:model_forward_time_milliseconds, vllm:model_execute_time_milliseconds) have been deprecated and are subject to removal (#14135)
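For the seed change flagged above, a minimal sketch of pinning the seed explicitly for reproducible sampling; the model name and sampling values are just examples.

```python
# Minimal sketch: seed now defaults to None, so pass it explicitly when you
# need reproducible sampling. The model name is only an example.
from vllm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", seed=42)
params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=32)
print(llm.generate(["Write a haiku about GPUs."], params)[0].outputs[0].text)
```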

Updates

  • Update to PyTorch 2.6.0 (#12721, #13860)
  • Update to Python 3.9 typing (#14492, #13971)
  • Update to CUDA 12.4 as default for release and nightly wheels (#12098)
  • Update to Ray 2.43 (#13994)
  • Upgrade aiohttp to include CVE fix (#14840)
  • Upgrade jinja2 to get 3 moderate CVE fixes (#14839)

Features

Frontend API

  • API Server
    • Support return_tokens_as_token_id as a request param (#14066)
    • Support Image Embedding as input (#13955)
    • New /load endpoint for load statistics (#13950)
    • New API endpoint /is_sleeping (#14312)
    • Enables /score endpoint for embedding models (#12846)
    • Enable streaming for Transcription API (#13301)
    • Make model param optional in request (#13568)
    • Support SSL Key Rotation in HTTP Server (#13495)
  • Reasoning
    • Support reasoning output (#12955)
    • Support outlines engine with reasoning outputs (#14114)
    • Update reasoning with stream example to use OpenAI library (#14077)
  • CLI
    • Ensure out-of-tree quantization methods are recognized by CLI args (#14328)
    • Add vllm bench CLI (#13993)
  • Make LLM API compatible for torchrun launcher (#13642)

Disaggregated Serving

  • Support KV cache offloading and disagg prefill with LMCache connector (#12953)
  • Support chunked prefill for LMCache connector (#14505)

LoRA

  • Add LoRA support for TransformersModel (#13770)
  • Make the device profiler include LoRA memory (#14469)
  • Gemma3ForConditionalGeneration supports LoRA (#14797)
  • Retire SGMV and BGMV Kernels (#14685)

VLM

  • Generalized prompt updates for multi-modal processor (#13964)
  • Deprecate legacy input mapper for OOT multimodal models (#13979)
  • Refer to code examples for common cases in the dev multimodal processor docs (#14278)

Quantization

  • BaiChuan SupportsQuant (#13710)
  • BartModel SupportsQuant (#14699)
  • Bamba SupportsQuant (#14698)
  • Deepseek GGUF support (#13167)
  • GGUF MoE kernel (#14613)
  • Add GPTQAllSpark Quantization (#12931)
  • Better performance of gptq marlin kernel when n is small (#14138)

Structured Output

  • xgrammar: Expand list of unsupported jsonschema keywords (#13783)

Hardware Support

AMD

  • Faster Custom Paged Attention kernels (#12348)
  • Improved performance for V1 Triton (ROCm) backend (#14152)
  • Chunked prefill/paged attention in MLA on ROCm (#14316)
  • Perf improvement for DSv3 on AMD GPUs (#13718)
  • MoE fp8 block quant tuning support (#14068)

TPU

  • Integrate the new ragged paged attention kernel with vLLM v1 on TPU (#13379)
  • Support start_profile/stop_profile in TPU worker (#13988)
  • Add TPU v1 test (#14834)
  • TPU multimodal model support for ragged attention (#14158)
  • Add tensor parallel support via Ray (#13618)
  • Enable prefix caching by default (#14773)

Neuron

  • Add Neuron device communicator for vLLM v1 (#14085)
  • Add custom_ops for neuron backend (#13246)
  • Add reshape_and_cache (#14391)
  • Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth (#13245)

CPU

  • Upgrade CPU backend to torch-2.6 (#13381)
  • Support FP8 KV cache in CPU Backend (#14741)

s390x

  • Add CPU inference with VXE ISA for s390x architecture (#12613)
  • Add documentation for s390x cpu implementation (#14198)

Plugins

  • Remove cuda hard code in models and layers (#13658)
  • Move use allgather to platform (#14010)

Bugfix and Enhancements

  • Fix illegal memory access for MoE on H20 (#13693)
  • Fix FP16 overflow for DeepSeek V2 (#13232)
  • Fix illegal memory access in the blockwise CUTLASS FP8 GEMMs (#14396)
  • Pass all driver env vars to ray workers unless excluded (#14099)
  • Use xgrammar shared context to avoid copy overhead for offline engine (#13837)
  • Capture and log the time of loading weights (#13666)

Developer Tooling

Benchmarks

  • Consolidate performance benchmark datasets (#14036)
  • Update benchmarks README (#14646)

CI and Build

  • Add RELEASE.md (#13926)
  • Use env var to control whether to use S3 bucket in CI (#13634)

Documentation

  • Add RLHF document (#14482)
  • Add nsight guide to profiling docs (#14298)
  • Add K8s deployment guide (#14084)
  • Add developer documentation for torch.compile integration (#14437)

What's Changed

  • Update pre-commit's isort version to remove warnings by @hmellor in #13614
  • [V1][Minor] Print KV cache size in token counts by @WoosukKwon in #13596
  • fix neuron performance issue by @ajayvohra2005 in #13589
  • [Frontend] Add backend-specific options for guided decoding by @joerunde in #13505
  • [Bugfix] Fix max_num_batched_tokens for MLA by @mgoin in #13620
  • [Neuron][Kernel] Vectorize KV cache load in FlashPagedAttention to maximize DMA bandwidth by @lingfanyu in #13245
  • Add llmaz as another integration by @kerthcet in #13643
  • [Misc] Adding script to setup ray for multi-node vllm deployments by @Edwinhr716 in #12913
  • [NVIDIA] Fix an issue to use current stream for the nvfp4 quant by @kaixih in #13632
  • Use pre-commit to update requirements-test.txt by @hmellor in #13617
  • [Bugfix] Add mm_processor_kwargs to chat-related protocols by @ywang96 in #13644
  • [V1][Sampler] Avoid an operation during temperature application by @njhill in #13587
  • Missing comment explaining VDR variable in GGUF kernels by @SzymonOzog in #13290
  • [FEATURE] Enables /score endpoint for embedding models by @gmarinho2 in #12846
  • [ci] Fix metrics test model path by @khluu in #13635
  • [Kernel]Add streamK for block-quantized CUTLASS kernels by @Hongbosherlock in #12978
  • [Bugfix][CPU] Fix cpu all-reduce using native pytorch implementation by @Isotr0py in #13586
  • fix typo of grafana dashboard, with correct datasource by @johnzheng1975 in https://...

v0.8.0rc2

17 Mar 17:08
37e3806
v0.8.0rc2 Pre-release

What's Changed

New Contributors

Full Changelog: v0.8.0rc1...v0.8.0rc2

v0.8.0rc1

17 Mar 05:13
8d6cf89
v0.8.0rc1 Pre-release

Note: vLLM no longer sets the global seed (#14274). Please set the seed parameter if you need to reproduce your results.

What's Changed


v0.7.3

20 Feb 17:08
ed6e907

Highlights

🎉 253 commits from 93 contributors, including 29 new contributors!

  • Deepseek enhancements:
    • Support for DeepSeek Multi-Token Prediction, 1.69x speedup in low QPS scenarios (#12755)
    • AMD support: DeepSeek tunings, yielding 17% latency reduction (#13199)
    • Using FlashAttention3 for MLA (#12807)
    • Align the expert selection code path with official implementation (#13474)
    • Optimize moe_align_block_size for deepseek_v3 (#12850)
    • Expand MLA to support most types of quantization (#13181)
  • V1 Engine:
    • LoRA Support (#10957, #12883)
    • Logprobs and prompt logprobs support (#9880), min_p sampling support (#13191), logit_bias in v1 Sampler (#13079)
    • Use msgpack for core request serialization (#12918)
    • Pipeline parallelism support (#12996, #13353, #13472, #13417, #13315)
    • Metrics enhancements: GPU prefix cache hit rate % gauge (#12592), iteration_tokens_total histogram (#13288), several request timing histograms (#12644)
    • Initial speculative decoding support with ngrams (#12193, #13365)

Model Support

  • Enhancement to Qwen2.5-VL: BNB support (#12944), LoRA (#13261), Optimizations (#13155)
  • Support GPTQModel Dynamic [2,3,4,8]bit GPTQ quantization (#7086)
  • Support Unsloth Dynamic 4bit BnB quantization (#12974)
  • IBM/NASA Prithvi Geospatial model (#12830)
  • Support Mamba2 (Codestral Mamba) (#9292), Bamba Model (#10909)
  • Ultravox Model: Support v0.5 Release (#12912)
  • transformers backend
    • Enable quantization support for transformers backend (#12960)
    • Set torch_dtype in TransformersModel (#13088)
  • VLM:
    • Implement merged multimodal processor for Mllama (#11427), GLM4V (#12449), Molmo (#12966)
    • Separate text-only and vision variants of the same model architecture (#13157)

Hardware Support

  • Pluggable platform-specific scheduler (#13161)
  • NVIDIA: Support nvfp4 quantization (#12784)
  • AMD:
    • Per-Token-Activation Per-Channel-Weight FP8 (#12501)
    • Tuning for Mixtral on MI325 and Qwen MoE on MI300 (#13503), Mixtral8x7B on MI300 (#13577)
    • Add initial ROCm support to V1 (#12790)
  • TPU: V1 Support (#13049)
  • Neuron: Support Longer Sequences in NKI-based Flash PagedAttention and Improve Efficiency (#12921)
  • Gaudi:
    • Support Contiguous Cache Fetch (#12139)
    • Enable long-contexts + LoRA support (#12812)

Engine Feature

  • Add sleep and wake up endpoint and v1 support (#12987)
  • Add /v1/audio/transcriptions OpenAI API endpoint (#12909)
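The transcription endpoint follows the OpenAI API shape, so a client call might look like the sketch below; the base URL, API key, Whisper model ID, and audio file are assumptions about your own deployment.

```python
# Minimal sketch: call vLLM's /v1/audio/transcriptions endpoint through the
# OpenAI Python client. Base URL, API key, model ID, and the audio file are
# placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("sample.wav", "rb") as audio:
    transcription = client.audio.transcriptions.create(
        model="openai/whisper-large-v3",
        file=audio,
    )
print(transcription.text)
```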

Performance

  • Reduce TTFT with concurrent partial prefills (#10235)
  • LoRA - Refactor sgmv kernels (#13110)

Others

  • Make vLLM compatible with veRL (#12824)
  • Fixes for cases of FA2 illegal memory access error (#12848)
  • choice-based structured output with xgrammar (#12632)
  • Run v1 benchmark and integrate with PyTorch OSS benchmark database (#13068)

What's Changed


v0.7.2

06 Feb 07:30
0408efc

Highlights

  • Qwen2.5-VL is now supported in vLLM. Please note that it currently requires a source installation of the Hugging Face transformers library (#12604)
  • Add transformers backend support via --model-impl=transformers. This allows vLLM to be run with arbitrary Hugging Face text models (#11330, #12785, #12727); see the sketch after this list.
  • Performance enhancements to DeepSeek models.
    • Align KV cache entries to start at 256-byte boundaries, yielding a 43% throughput enhancement (#12676)
    • Apply torch.compile to fused_moe/grouped_topk, yielding 5% throughput enhancement (#12637)
    • Enable MLA for DeepSeek VL2 (#12729)
    • Enable DeepSeek model on ROCm (#12662)
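As referenced above for the transformers backend, a minimal sketch for the offline API; the model name is a placeholder, and the model_impl keyword is assumed to mirror the --model-impl=transformers server flag.

```python
# Minimal sketch: ask vLLM to fall back to Transformers modeling code for a
# Hugging Face text model it does not natively implement. The model name is
# a placeholder; model_impl is assumed to mirror --model-impl on the server.
from vllm import LLM

llm = LLM(model="your-org/your-hf-text-model", model_impl="transformers")
print(llm.generate(["Hello, my name is"])[0].outputs[0].text)
```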

Core Engine

  • Use VLLM_LOGITS_PROCESSOR_THREADS to speed up structured decoding in high batch size scenarios (#12368)
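A minimal sketch of opting into that thread pool; the thread count is an arbitrary example to tune against your batch sizes, and the model name is a placeholder.

```python
# Minimal sketch: give guided-decoding logits processors their own thread
# pool by setting VLLM_LOGITS_PROCESSOR_THREADS before vLLM is imported.
# The thread count is an arbitrary example; tune it for your workload.
import os

os.environ["VLLM_LOGITS_PROCESSOR_THREADS"] = "8"

from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
```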

Security Update

  • Improve hash collision avoidance in prefix caching (#12621)
  • Add SPDX-License-Identifier headers to python source files (#12628)

Other

  • Enable FusedSDPA support for Intel Gaudi (HPU) (#12359)

What's Changed

New Contributors


v0.7.1

01 Feb 18:02
4f4d427

Highlights

This release features MLA optimization for the DeepSeek family of models. Compared to v0.7.0, released this Monday, we offer ~3x the generation throughput, ~10x the memory capacity for tokens, and horizontal context scalability with pipeline parallelism.

V1

For the V1 architecture, we

Models

  • New Model: MiniCPM-o (text outputs only) (#12069)

Hardwares

  • Neuron: NKI-based flash-attention kernel with paged KV cache (#11277)
  • AMD: llama 3.2 support upstreaming (#12421)

Others

  • Support override generation config in engine arguments (#12409)
  • Support reasoning content in API for deepseek R1 (#12473)

What's Changed

New Contributors


v0.7.0

27 Jan 05:50
5204ff5

Highlights

  • vLLM's V1 engine is ready for testing! This is a rewritten engine designed for performance and architectural simplicity. You can turn it on by setting the environment variable VLLM_USE_V1=1. See our blog for more details. (44 commits).
  • New methods (LLM.sleep, LLM.wake_up, LLM.collective_rpc, LLM.reset_prefix_cache) in vLLM for post-training frameworks (see the sketch after this list)! (#12361, #12084, #12284).
  • torch.compile is now fully integrated in vLLM, and enabled by default in V1. You can turn it on via the -O3 engine parameter. (#11614, #12243, #12043, #12191, #11677, #12182, #12246).
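As referenced above, a minimal sketch of the new post-training helpers around a weight update; enable_sleep_mode, the sleep level, and the model name are assumptions for illustration, not a prescribed integration.

```python
# Minimal sketch: the new post-training helpers around a weight update.
# enable_sleep_mode, the sleep level, and the model name are assumptions;
# wire this into your own RLHF / fine-tuning loop.
from vllm import LLM

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct", enable_sleep_mode=True)

llm.sleep(level=1)        # free GPU memory while the trainer takes over
# ... run a training step and produce updated weights here ...
llm.wake_up()             # bring the engine back for the next generation round
llm.reset_prefix_cache()  # drop cached prefixes invalidated by the new weights
```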

This release features

  • 400 commits from 132 contributors, including 57 new contributors.
    • 28 CI and build enhancements, including testing for nightly torch (#12270) and inclusion of genai-perf for benchmark (#10704).
    • 58 documentation enhancements, including reorganized documentation structure (#11645, #11755, #11766, #11843, #11896).
    • more than 161 bug fixes and miscellaneous enhancements

Features

Models

Hardwares

Features

  • Distributed:
    • Support torchrun and SPMD-style offline inference (#12071)
    • New collective_rpc abstraction (#12151, #11256)
  • API Server: Jina- and Cohere-compatible Rerank API (#12376)
  • Kernels:
    • Flash Attention 3 Support (#12093)
    • Punica prefill kernels fusion (#11234)
    • For Deepseek V3: optimize moe_align_block_size for cuda graph and large num_experts (#12222)

Others

  • Benchmark: new script for CPU offloading (#11533)
  • Security: Set weights_only=True when using torch.load() (#12366)

What's Changed


v0.6.6.post1

27 Dec 06:24
2339d59

This release restores functionality for other quantized MoEs, which was broken as part of the initial DeepSeek V3 support 🙇.

What's Changed

  • [Docs] Document Deepseek V3 support by @simon-mo in #11535
  • Update openai_compatible_server.md by @robertgshaw2-neuralmagic in #11536
  • [V1] Use FlashInfer Sampling Kernel for Top-P & Top-K Sampling by @WoosukKwon in #11394
  • [V1] Fix yapf by @WoosukKwon in #11538
  • [CI] Fix broken CI by @robertgshaw2-neuralmagic in #11543
  • [misc] fix typing by @youkaichao in #11540
  • [V1][3/N] API Server: Reduce Task Switching + Handle Abort Properly by @robertgshaw2-neuralmagic in #11534
  • [BugFix] Deepseekv3 broke quantization for all other methods by @robertgshaw2-neuralmagic in #11547

Full Changelog: v0.6.6...v0.6.6.post1