Releases: EricLBuehler/mistral.rs
v0.4.0
New features
- 🔥 New models!
- DeepSeek V2
- DeepSeek V3 and R1
- MiniCPM-o 2.6
- 🧮 Imatrix quantization
- ⚙️ Automatic device mapping
- BNB quantization
- Support blockwise FP8 dequantization and FP8 on Metal
- Integrate the llguidance library (@mmoskal)
- Metal PagedAttention
- Many fixes and improvements from contributors!
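The imatrix feature above weights quantization error by how strongly each channel is activated on calibration data. As a rough illustration (not the mistral.rs implementation, which works on GGML-style quant formats), here is a minimal sketch of importance-weighted symmetric quantization:

```python
import numpy as np

def importance_matrix(activations):
    # Importance = mean squared activation per input channel,
    # accumulated over a calibration set.
    return np.mean(activations ** 2, axis=0)

def quantize_rows(w, imatrix, bits=8):
    # Symmetric per-row quantization. A small scale search weights the
    # squared rounding error by channel importance, so channels that see
    # large activations are reproduced more faithfully.
    qmax = 2 ** (bits - 1) - 1
    out = np.empty_like(w)
    for i, row in enumerate(w):
        best_err, best = np.inf, row
        for trial in np.linspace(0.8, 1.2, 9):  # grid around the abs-max scale
            scale = trial * np.max(np.abs(row)) / qmax
            q = np.clip(np.round(row / scale), -qmax, qmax)
            deq = q * scale
            err = np.sum(imatrix * (row - deq) ** 2)
            if err < best_err:
                best_err, best = err, deq
        out[i] = best
    return out

rng = np.random.default_rng(0)
acts = rng.normal(size=(256, 64)) * np.linspace(0.1, 3.0, 64)  # skewed channel magnitudes
w = rng.normal(size=(16, 64))
im = importance_matrix(acts)
deq = quantize_rows(w, im)
```

By construction the imatrix-aware choice never does worse than the plain abs-max choice on the weighted error, which is the metric the perplexity comparisons in this release are meant to track.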
Breaking changes
- The Rust device mapping API has changed.
MSRV
The MSRV of this release is 1.83.0.
What's Changed
- Use CUDA_COMPUTE_CAP if nvidia-smi not found by @EricLBuehler in #944
- fix(docs): fix broken link by @sammcj in #945
- Better diffusion interactive mode by @EricLBuehler in #948
- Implement Imatrix for ISQ by @EricLBuehler in #949
- Support imatrix quantization for vision models by @EricLBuehler in #950
- Perplexity calculations with imatrix by @EricLBuehler in #952
- set minimum rustc version to 1.82 by @mmoskal in #957
- Fix append_sliding_window by @EricLBuehler in #958
- Fix completion api behavior of best_of by @EricLBuehler in #959
- Ensure support for cuda cc 5.3 by @EricLBuehler in #960
- Improve test speeds on Windows by @EricLBuehler in #961
- use llguidance library for constraints (including json schemas) by @mmoskal in #899
- Fix metal fp8 quantization by @EricLBuehler in #962
- Fix example gguf_locally to match chat template requirements by @msk in #966
- Bitsandbytes quantization: loading and kernels by @EricLBuehler in #967
- updated the tokenizers dependency of core to 0.21 by @vkomenda in #975
- Remove outdated binaries mention in the readme by @BafS in #973
- Improve error handling by @cdoko in #974
- Add None check to prevent panic in evict_all_to_cpu in prefix_cacher.rs by @cdoko in #979
- Include start offset for metal bitwise ops by @EricLBuehler in #978
- Fail fast on TcpListener bind errors by @cdoko in #982
- Inplace softmax long-seqlen attention optimizations by @EricLBuehler in #984
- Fix cuda cublaslt when using vllama mask by @EricLBuehler in #985
- Add cross attn quantization for mllama by @EricLBuehler in #987
- fix mistralrs-server ignoring interactive_mode arg by @haricot in #990
- Adding streaming function to mistralrs server. by @Narsil in #986
- Fixes for bnb and more apis in mistralrs-quant by @EricLBuehler in #972
- Support send + sync in loader by @EricLBuehler in #991
- More vllama optimizations by @EricLBuehler in #992
- Update docs by @EricLBuehler in #993
- Use metal autorelease to optimize memory usage by @EricLBuehler in #996
- Partial Fix for Sliding Window Attention by @cdoko in #994
- Only dep on objc when building on metal by @EricLBuehler in #998
- Prefix cacher v2 by @EricLBuehler in #1000
- Add `--cpu` flag to `mistralrs-server` by @cdoko in #997
- Metal PagedAttention support by @EricLBuehler in #1001
- Fix cross attention + prefix cacher v2 support by @EricLBuehler in #1006
- Support for normal cache for mllama, phi3v, qwen2vl by @EricLBuehler in #1007
- Cleaner creation of dummy pa input metadata by @EricLBuehler in #1014
- Support BF16 kvcache, rope and attentions for inference of GGUF/GGML models by @guoqingbao in #1009
- Support device mapping for Paged Attention by @cdoko in #1011
- Prefix cacher fixes by @EricLBuehler in #1018
- More fixes for the prefix cacher by @EricLBuehler in #1019
- Support uqff for idefics3 by @EricLBuehler in #1020
- Prepare for v0.3.5 by @EricLBuehler in #1021
- Cleaner pipeline no prefix cache setting by @EricLBuehler in #1022
- Support uqff load/save for idefics3 by @EricLBuehler in #1023
- Update license for 2025 by @EricLBuehler in #1024
- Implement DeepSeekV2 by @EricLBuehler in #1010
- Use cudarc fork to fix CUDA build on Windows by @EricLBuehler in #1032
- Fix metal paged attn phi3 by @EricLBuehler in #1033
- Use float8 mistralrs_cudarc_fork feature by @EricLBuehler in #1034
- Patch prefix caching to fix incorrect outputs by @EricLBuehler in #1035
- Allocate paged attn cache as empty instead of zeros by @EricLBuehler in #1036
- Remove ug and cudarc transient dep by @EricLBuehler in #1037
- Rename MemoryGpuConfig::Amount->MbAmount by @EricLBuehler in #1038
- CUDA dequant kernels conditional compilation by @EricLBuehler in #1039
- F16 support for mllama, introduce FloatInfo by @EricLBuehler in #1041
- Automatic device mapping support by @EricLBuehler in #1042
- Support automatic device mapping for gguf models by @EricLBuehler in #1044
- Support loading models without ISQ using device map by @EricLBuehler in #1045
- Fix GGUF auto device mapping by @EricLBuehler in #1047
- More efficient loading of safetensors when casting by @EricLBuehler in #1048
- Fix Loading and Running on CPU by @cdoko in #1052
- Work on better device mapping for mllama by @EricLBuehler in #1049
- Mention interactive mode or server port in readme for gguf by @EricLBuehler in #1055
- Fix panic in mistralrs-server by @cdoko in #981
- Include device memory avail in device map err by @EricLBuehler in #1060
- Fix `--cpu` on cuda by @cdoko in #1056
- Improve pagedattn support in mistralrs bench by @EricLBuehler in #1063
- Paged attention support for multi gpu by @EricLBuehler in #1059
- Ergonomic automatic device mapping support by @EricLBuehler in #1054
- Examples for automatic device mapping by @EricLBuehler in #1065
- Fix metal pagedattn half8 vec impl by @EricLBuehler in #1067
- Improve support for GGUF auto device map by @EricLBuehler in #1069
- Fix missing field in idefics3 during loading by @EricLBuehler in #1070
- Fix missing field in idefics3 during loading by @EricLBuehler in #1072
- Fix paged attention for vision models on multiple devices by @cdoko in #1071
- Fixes for idefics3 and idefics2 by @EricLBuehler in #1073
- Improve automatic device map by @EricLBuehler in #1076
- Implement the DeepSeekV3 model (support full DeepSeek R1) by @EricLBuehler in #1077
- Don't print GGUF model metadata when silent=true by @Jeadie in #1079
- Allow `ChatCompletionChunkResponse` (and therefore streaming) to have `Usage`. by @Jeadie in #1078
- Support loading blockwise...
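Several of the PRs above build out automatic device mapping. The core idea can be sketched as a greedy assignment of consecutive layers to devices by memory budget; this is an illustrative toy, not the actual mistral.rs algorithm:

```python
def map_layers(layer_sizes, device_budgets):
    """Greedy automatic device mapping sketch: fill each device with
    consecutive layers until its memory budget is exhausted, then spill
    to the next device. Returns one device index per layer."""
    mapping, dev, used = [], 0, 0
    for size in layer_sizes:
        while dev < len(device_budgets) and used + size > device_budgets[dev]:
            dev, used = dev + 1, 0  # this device is full; move to the next
        if dev == len(device_budgets):
            raise MemoryError("model does not fit in the given devices")
        mapping.append(dev)
        used += size
    return mapping

# e.g. 8 layers of 2 GB across a 10 GB GPU and a 100 GB CPU budget
print(map_layers([2] * 8, [10, 100]))  # → [0, 0, 0, 0, 0, 1, 1, 1]
```

The error path matters in practice: reporting available memory per device in the failure message is exactly what #1060 above adds.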
v0.3.4
New features
- Qwen2-VL support
- Idefics 3/SmolVLM support
- 🔥 6x prompt performance boost (all benchmarks faster than or comparable to MLX and llama.cpp)!
- 🗂️ More efficient non-PagedAttention KV cache implementation!
- Public tokenization API
Python wheels
The wheels now support Windows, Linux, and macOS on x86_64 and aarch64.
MSRV
1.79.0
What's Changed
- Update Dockerfile by @Reckon-11 in #895
- Add the Qwen2-VL model by @EricLBuehler in #894
- ISQ for mistralrs-bench by @EricLBuehler in #902
- Use tokenizers v0.20 by @EricLBuehler in #904
- Fix metal sdpa for v stride by @EricLBuehler in #905
- Better parsing of the image path by @EricLBuehler in #906
- Add some Metal kernels for HQQ dequant by @EricLBuehler in #907
- Handle assistant messages with 'tool_calls' by @Jeadie in #824
- Attention-fused softmax for Metal by @EricLBuehler in #908
- Metal qmatmul mat-mat product (5.4x performance increase) by @EricLBuehler in #909
- Support --dtype in mistralrs bench by @EricLBuehler in #911
- Metal: Use mtl resource shared to avoid one copy by @EricLBuehler in #914
- Preallocated KV cache by @EricLBuehler in #916
- Fixes for kv cache grow by @EricLBuehler in #917
- Don't always compile with fp8, bf16 for cuda by @EricLBuehler in #920
- Expand attnmask on cuda by @EricLBuehler in #923
- Faster CUDA prompt speeds by @EricLBuehler in #925
- Paged Attention alibi support by @EricLBuehler in #926
- Default to SDPA for faster VLlama PP T/s by @EricLBuehler in #927
- VLlama vision model ISQ support by @EricLBuehler in #928
- Support fp8 on Metal by @EricLBuehler in #930
- Bump rustls from 0.23.15 to 0.23.18 by @dependabot in #932
- Calculate perplexity of ISQ models by @EricLBuehler in #931
- Integrate fast MLX kernel for SDPA with long seqlen by @EricLBuehler in #933
- Always cast image to rgb8 for qwenvl2 by @EricLBuehler in #936
- Fix etag missing in hf hub by @EricLBuehler in #934
- Fix some examples for vllama 3.2 by @EricLBuehler in #937
- Improve memory efficiency of vllama by @EricLBuehler in #938
- Implement the Idefics 3 models (Idefics 3, SmolVLM-Instruct) by @EricLBuehler in #939
- Expose a public tokenization API by @EricLBuehler in #940
- Prepare for v0.3.4 by @EricLBuehler in #942
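Perplexity (used above in #931 to check ISQ model quality) is just the exponent of the average negative log-likelihood per token. A minimal sketch:

```python
import math

def perplexity(token_logprobs):
    # Perplexity = exp of the mean negative log-likelihood per token.
    # Comparing this between an FP model and its quantized (e.g. ISQ)
    # counterpart measures how much quality the quantization lost.
    nll = -sum(token_logprobs) / len(token_logprobs)
    return math.exp(nll)

# A model that assigns each token probability 1/4 has perplexity 4.
print(perplexity([math.log(0.25)] * 10))  # → 4.0
```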
New Contributors
- @Reckon-11 made their first contribution in #895
Full Changelog: v0.3.2...v0.3.4
v0.3.2
Key changes
- General improvements and fixes
- ISQ FP8
- GPTQ Marlin
- 26% performance boost on Metal
- Python package wheels are available; see below and the various PyPI packages.
What's Changed
- Update docs and deps by @EricLBuehler in #804
- Support Qwen 2.5 by @EricLBuehler in #805
- Update docs with clarifications and notes by @EricLBuehler in #806
- Improved inverting for Attention Mask by @EricLBuehler in #811
- Fix `repeat_interleave` by @EricLBuehler in #812
- Use f32 for neg inf in cross attn mask by @EricLBuehler in #814
- Improve UQFF memory efficiency by @EricLBuehler in #813
- Update Metal, CUDA Candle impls and ISQ by @EricLBuehler in #816
- chore: update pagedattention.cu by @eltociear in #822
- MLlama - if f16, load vision model in f32 by @EricLBuehler in #820
- ci: Upgrade actions by @polarathene in #823
- docs: added a top button because of readme length by @bhargavshirin in #833
- Typo in error of model architecture enum by @nikolaydubina in #835
- Expose config for Rust api, tweak modekind by @EricLBuehler in #841
- Add ISQ FP8 by @EricLBuehler in #832
- Fix Metal F8 build errors by @EricLBuehler in #846
- Bump pyo3 from 0.22.3 to 0.22.4 by @dependabot in #854
- Generate standalone UQFF models by @EricLBuehler in #849
- Update README.MD by @kaleaditya779 in #848
- Add GPTQ Marlin support for 4 and 8 bit by @EricLBuehler in #856
- Adds wrap_help feature to clap by @DaveTJones in #858
- Patch UQFF metal generation by @EricLBuehler in #857
- Add GGUF Qwen 2 by @EricLBuehler in #860
- Avoid duplicate Metal command buffer encodings during ISQ by @EricLBuehler in #861
- Fix for isnanf by @EricLBuehler in #859
- Fix some metal warnings by @EricLBuehler in #862
- Support interactive mode markdown bold/italics via ANSI codes by @EricLBuehler in #879
- Even better V-Llama accuracy by @EricLBuehler in #881
- Trim whitespace (such as carriage returns) from nvidia-smi output. by @asaddi in #880
- MODEL_ID not "MODEL_ID" by @simonw in #863
- Sync ggml metal kernels by @EricLBuehler in #885
- Increase Metal decoding T/s by 26% by @EricLBuehler in #887
- Remove pretty-printer by @EricLBuehler in #889
- Fix typo in documentation by @msk in #888
- fix Half-Quadratic Quantization and Dequantization on CPU by @haricot in #873
- Prepare for v0.3.2 by @EricLBuehler in #891
New Contributors
- @bhargavshirin made their first contribution in #833
- @nikolaydubina made their first contribution in #835
- @kaleaditya779 made their first contribution in #848
- @DaveTJones made their first contribution in #858
- @asaddi made their first contribution in #880
- @simonw made their first contribution in #863
- @msk made their first contribution in #888
- @haricot made their first contribution in #873
Full Changelog: v0.3.1...v0.3.2
v0.3.1
Highlights
- UQFF
- FLUX model
- Llama 3.2 Vision model
MSRV
The MSRV of this release is 1.79.0.
What's Changed
- Enable automatic determination of normal loader type by @EricLBuehler in #742
- Add the `ForwardInputsResult` api by @EricLBuehler in #745
- Implement Mixture of Quantized Experts (MoQE) by @EricLBuehler in #747
- Bump quinn-proto from 0.11.6 to 0.11.8 by @dependabot in #748
- Fix f64-f32 type mismatch for Metal/Accelerate by @EricLBuehler in #752
- Nicer error when misconfigured PagedAttention input metadata by @EricLBuehler in #753
- Update deps, support CUDA 12.6 by @EricLBuehler in #755
- Patch bug when not using PagedAttention by @EricLBuehler in #759
- Fix `MistralRs` Drop impl in tokio runtime by @EricLBuehler in #762
- Use nicer Candle Error APIs by @EricLBuehler in #767
- Support setting seed by @EricLBuehler in #766
- Fix Metal build error with seed by @EricLBuehler in #771
- Fix and add checks for no kv cache by @EricLBuehler in #776
- UQFF: The uniquely powerful quantized file format. by @EricLBuehler in #770
- Add `Scheduler::running_len` by @EricLBuehler in #780
- Deduplicate RoPE caches by @EricLBuehler in #787
- Easier and simpler Rust-side API by @EricLBuehler in #785
- Add some examples for AnyMoE by @EricLBuehler in #788
- Rust API for sampling by @EricLBuehler in #790
- Our first Diffusion model: FLUX by @EricLBuehler in #758
- Fix build bugs with metal, NSUInteger by @EricLBuehler in #792
- Support weight tying in Llama 3.2 GGUF models by @EricLBuehler in #801
- Implement the Llama 3.2 vision models by @EricLBuehler in #796
Full Changelog: v0.3.0...v0.3.1
v0.3.0
Highlights
- New model topology feature: ISQ and device mapping
- 🔥Faster FlashAttention support when batching
- Removed `plotly` and associated JS dependencies
- φ³ Support Phi 3.5, Phi 3.5 vision, Phi 3.5 MoE
- Improved Rust API ergonomics
- Support multiple (sharded) GGUF files
MSRV
The Rust MSRV of this release is 1.79.0.
What's Changed
- Fixes for auto dtype selection with RUST_BACKTRACE=1 by @EricLBuehler in #690
- Add support multiple GGUF files by @EricLBuehler in #692
- Refactor normal and vision loaders by @EricLBuehler in #693
- Fix `split.count` GGUF duplication handling by @EricLBuehler in #695
- Batching example by @EricLBuehler in #694
- Some fixes by @EricLBuehler in #697
- Improve vision rust examples by @EricLBuehler in #698
- Add ISQ topology by @EricLBuehler in #701
- Add custom logits processor API by @EricLBuehler in #702
- Add Gemma 2 PagedAttention support by @EricLBuehler in #704
- Faster RmsNorm in Gemma/Gemma2 by @EricLBuehler in #703
- Fix bug in Metal ISQ by @EricLBuehler in #706
- Support GGUF BF16 tensors by @EricLBuehler in #691
- Better support for FlashAttention: real batching + sliding window + softcap by @EricLBuehler in #707
- Remove some usages of `pub` in models by @EricLBuehler in #708
- Support the Phi 3.5 V model by @EricLBuehler in #710
- Implement the Phi 3.5 MoE model by @EricLBuehler in #709
- Device map topology by @EricLBuehler in #717
- Implement DRY penalty by @EricLBuehler in #637
- Remove plotly and just output CSV loss file by @EricLBuehler in #700
- Using once_cell to reduce MSRV by @EricLBuehler in #724
- Fixes for Windows build by @EricLBuehler in #729
- Even more phi3.5moe fix attempts by @EricLBuehler in #731
- Add example for Phi 3.5 MoE by @EricLBuehler in #733
- Add Phi 3.5 chat template by @EricLBuehler in #734
- Patch ISQ for Mixtral by @EricLBuehler in #730
- Gracefully handle Engine Drop with termination request by @EricLBuehler in #735
- feat(vision): add support for proper file and data image URLs by @Schuwi in #727
- Add new parsing to Python API by @EricLBuehler in #737
- Remove test and add custom error type to Python API by @EricLBuehler in #738
- Update kernels for metal bf16 by @EricLBuehler in #719
- Better `Response` Result API by @EricLBuehler in #739
- More Metal quantized kernel fixes by @EricLBuehler in #740
- [Breaking] Bump version to v0.3.0 by @EricLBuehler in #736
- Final changes for v0.3.0 by @EricLBuehler in #741
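The DRY penalty added in #637 discourages the model from continuing a verbatim repetition of earlier context. A simplified sketch of the idea (the real implementation operates on logits inside the sampler; token IDs and constants here are illustrative):

```python
def dry_penalties(context, multiplier=0.8, base=1.75, allowed_length=2):
    """DRY repetition penalty sketch: for each earlier position, measure how
    long a match the current context suffix has with the text preceding that
    position. If the match exceeds `allowed_length`, the token that followed
    it (i.e. the token that would extend the repetition now) is penalized by
    multiplier * base ** (match_len - allowed_length)."""
    penalties = {}
    for end in range(1, len(context)):
        z = context[end]  # token that previously followed this prefix
        n = 0             # length of the suffix match ending just before `end`
        while n < end and context[end - 1 - n] == context[len(context) - 1 - n]:
            n += 1
        if n >= allowed_length:
            pen = multiplier * base ** (n - allowed_length)
            penalties[z] = max(penalties.get(z, 0.0), pen)
    return penalties

ctx = [1, 2, 3, 1, 2, 3, 1, 2]
print(dry_penalties(ctx))  # token 3 would extend the repeated [1, 2, 3] loop
```

The exponential growth in match length is what lets DRY break long verbatim loops while leaving short, natural repetitions alone.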
Full Changelog: v0.2.5...v0.3.0
v0.2.5
What's Changed
- Refactor ISQ quant parsing by @EricLBuehler in #664
- Refactor server examples to use OpenAI Python client by @EricLBuehler in #665
- Implement prompt chunking by @EricLBuehler in #623
- Python example and server example cleanup by @EricLBuehler in #668
- Implement GPTQ quantization by @EricLBuehler in #467
- Update deps by @EricLBuehler in #672
- Rework the automatic dtype selection feature by @EricLBuehler in #676
- Fix backend Candle fork Metal, flash attn, also Llama linear by @EricLBuehler in #681
- Use converted tokenizer.json in tests by @EricLBuehler in #682
- Refactor ISQ and mistralrs-quant by @EricLBuehler in #683
- Fix metal build for isq by @EricLBuehler in #686
- Add missing error case in automatic dtype selection feature by @ac3xx in #685
- fix null in tool type response by @wseaton in #687
- Implement HQQ quantization by @EricLBuehler in #677
- Bump version to 0.2.5 by @EricLBuehler in #688
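This release adds GPTQ (#467) and HQQ (#677) alongside the ISQ refactor. For intuition, here is a bare-bones group-wise symmetric 4-bit quantize/dequantize sketch; real GPTQ and HQQ solve for scales and zero points far more cleverly, this is abs-max only:

```python
import numpy as np

def quant_dequant_int4(w, group_size=32):
    """Group-wise symmetric 4-bit quantization sketch. This is the general
    flavor of in-situ quantization (ISQ): load weights, quantize in place,
    then serve the dequantized approximation."""
    flat = w.reshape(-1, group_size)
    scales = np.max(np.abs(flat), axis=1, keepdims=True) / 7  # int4 range [-7, 7]
    scales[scales == 0] = 1.0                                 # avoid div-by-zero
    q = np.clip(np.round(flat / scales), -7, 7).astype(np.int8)
    return (q * scales).reshape(w.shape)

rng = np.random.default_rng(1)
w = rng.normal(size=(64, 64)).astype(np.float32)
deq = quant_dequant_int4(w)
err = np.mean((w - deq) ** 2)  # small relative to the weight variance
```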
Full Changelog: v0.2.4...v0.2.5
Install mistralrs-server 0.2.5
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.5/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.5
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.4
What's Changed
- fix build on metal by returning Device by @rgbkrk in #642
- Add invite to Matrix chatroom by @EricLBuehler in #644
- Make sure we don't have dead links by @EricLBuehler in #647
- Fix more links by @EricLBuehler in #648
- Throughput for interactive mode by @EricLBuehler in #655
- Implement tool calling by @EricLBuehler in #649
- Fix device map check for paged attn by @EricLBuehler in #656
- Fix for mistral nemo in gguf by @EricLBuehler in #657
- Fix check of cache config when device mapping + PA by @EricLBuehler in #658
- Biollama in tool calling example by @EricLBuehler in #659
- Biollama in tool calling example by @EricLBuehler in #660
- Examples for simple tool calling by @EricLBuehler in #661
- Bump version to 0.2.4 by @EricLBuehler in #662
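Tool calling (#649) means the model emits a structured call that the client routes to a local function. As a hypothetical illustration using the OpenAI-style shape (`name` plus a JSON-encoded `arguments` string; the names here are not mistral.rs APIs):

```python
import json

def dispatch_tool_call(raw, tools):
    """Parse an OpenAI-style tool call emitted by the model and route it
    to a local Python function. Illustrative sketch only."""
    call = json.loads(raw)
    fn = tools[call["name"]]
    args = json.loads(call["arguments"])  # arguments arrive as a JSON string
    return fn(**args)

tools = {"add": lambda a, b: a + b}
raw = '{"name": "add", "arguments": "{\\"a\\": 2, \\"b\\": 3}"}'
print(dispatch_tool_call(raw, tools))  # → 5
```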
Full Changelog: v0.2.3...v0.2.4
MSRV
MSRV is 1.75.
Install mistralrs-server 0.2.4
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.4/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.4
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.3
What's Changed
- Implement min-p sampling by @EricLBuehler in #625
- Tweak handling when PA cannot allocate by @EricLBuehler in #632
- Update deps by @EricLBuehler in #633
- Improve penalty context window calculation by @EricLBuehler in #636
- Allow setting PagedAttention KV cache allocation from context size by @EricLBuehler in #640
- Bump version to 0.2.3 by @EricLBuehler in #638
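Min-p sampling (#625) keeps every token whose probability is at least `min_p` times the probability of the most likely token, then renormalizes, so the candidate set shrinks when the model is confident and widens when it is not. A minimal sketch:

```python
import numpy as np

def min_p_filter(logits, min_p=0.05):
    """Min-p filtering sketch: drop tokens whose probability is below
    min_p * max(probability), then renormalize the survivors."""
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    keep = probs >= min_p * probs.max()
    filtered = np.where(keep, probs, 0.0)
    return filtered / filtered.sum()

p = min_p_filter(np.log(np.array([0.5, 0.2, 0.2, 0.1])), min_p=0.3)
# threshold is 0.3 * 0.5 = 0.15, so the 0.1 token is dropped
```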
Full Changelog: v0.2.2...v0.2.3
Install mistralrs-server 0.2.3
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.3/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.3
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.2
What's Changed
- Fix ctrlc handling for scheduler v2 by @EricLBuehler in #614
- Make `sliding_window` optional for mixtral by @csicar in #616
- Support Llama 3.1 scaled rope by @EricLBuehler in #618
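The Llama 3.1 scaled rope (#618) stretches only the low-frequency RoPE components to extend context, leaving high frequencies untouched and smoothly interpolating in between. A sketch using the constants published with Llama 3.1 (factor 8, low/high frequency factors 1 and 4, original context 8192):

```python
import math

def llama31_scale_freq(freq, factor=8.0, low_freq_factor=1.0,
                       high_freq_factor=4.0, orig_ctx=8192):
    """Llama 3.1-style RoPE frequency scaling sketch (constants per the
    released Llama 3.1 config). Short wavelengths are kept, very long
    wavelengths are divided by `factor`, and the band between is blended."""
    wavelength = 2 * math.pi / freq
    if wavelength < orig_ctx / high_freq_factor:
        return freq              # high frequency: leave as-is
    if wavelength > orig_ctx / low_freq_factor:
        return freq / factor     # low frequency: full stretch
    smooth = (orig_ctx / wavelength - low_freq_factor) / (high_freq_factor - low_freq_factor)
    return (1 - smooth) * freq / factor + smooth * freq
```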
Full Changelog: v0.2.1...v0.2.2
MSRV
MSRV is 1.75.
Install mistralrs-server 0.2.2
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.2/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.2
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |
v0.2.1
What's Changed
- Fix path normalize for mistralrs-paged-attn by @EricLBuehler in #592
- ISQ python example by @EricLBuehler in #593
- Add support for mistral nemo by @EricLBuehler in #595
- Fix dtype with QLinear by @EricLBuehler in #600
- Update paged-attn build.rs with NVCC flags by @joshpopelka20 in #604
- Bump openssl from 0.10.64 to 0.10.66 by @dependabot in #605
- Update GitHub issue templates by @EricLBuehler in #607
- Add server throughput logging by @EricLBuehler in #608
- Make the plotly feature optional by @EricLBuehler in #597
- Use OnceLock for Python bindings device by @EricLBuehler in #602
- Topk for X-LoRA scalings by @EricLBuehler in #609
- Fix server cross-origin errors by @openmynet in #610
- Refactor sampler by @EricLBuehler in #611
- Bump version to 0.2.1 by @EricLBuehler in #613
New Contributors
- @dependabot made their first contribution in #605
- @openmynet made their first contribution in #610
Full Changelog: v0.2.0...v0.2.1
Install mistralrs-server 0.2.1
Install prebuilt binaries via shell script
curl --proto '=https' --tlsv1.2 -LsSf https://github.com/EricLBuehler/mistral.rs/releases/download/v0.2.1/mistralrs-server-installer.sh | sh
Download mistralrs-server 0.2.1
File | Platform | Checksum |
---|---|---|
mistralrs-server-aarch64-apple-darwin.tar.xz | Apple Silicon macOS | checksum |
mistralrs-server-x86_64-apple-darwin.tar.xz | Intel macOS | checksum |
mistralrs-server-x86_64-unknown-linux-gnu.tar.xz | x64 Linux | checksum |