ONNX Runtime v1.20.0
Release Manager: @apsonawane
Announcements
- All ONNX Runtime Training packages have been deprecated. ORT 1.19.2 was the last release for which onnxruntime-training (PyPI), onnxruntime-training-cpu (PyPI), Microsoft.ML.OnnxRuntime.Training (Nuget), onnxruntime-training-c (CocoaPods), onnxruntime-training-objc (CocoaPods), and onnxruntime-training-android (Maven Central) were published.
- ONNX Runtime packages will stop supporting Python 3.8 and Python 3.9. This decision aligns with NumPy's Python version support. To continue using ORT with Python 3.8 or Python 3.9, use ORT 1.19.2 or earlier.
- ONNX Runtime 1.20 CUDA packages will include new dependencies that were not required in 1.19 packages. The following dependencies are new: libcudnn_adv.so.9, libcudnn_cnn.so.9, libcudnn_engines_precompiled.so.9, libcudnn_engines_runtime_compiled.so.9, libcudnn_graph.so.9, libcudnn_heuristic.so.9, libcudnn_ops.so.9, libnvrtc.so.12, and libz.so.1.
Build System & Packages
- Python 3.13 support is included in PyPI packages.
- ONNX 1.17 support will be delayed until a future release, but the ONNX version used by ONNX Runtime has been patched to include a shape inference change to the Einsum op.
- DLLs in the Maven build are now digitally signed (fix for issue reported here).
- (Experimental) vcpkg support added for the CPU EP. The DML EP does not yet support vcpkg, and other EPs have not been tested.
Core
- MultiLoRA support.
- Reduced memory utilization.
- Fixed alignment that was causing mmap to fail for external weights.
- Eliminated double allocations when deserializing external weights.
- Added ability to serialize pre-packed weights so that they don’t cause an increase in memory utilization when the model is loaded.
- Support for bfloat16 and float8 data types in the Python I/O binding API.
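The bfloat16 type mentioned above is easier to reason about with the format in view: bfloat16 is simply a float32 with the low 16 mantissa bits dropped, keeping the full 8-bit exponent. A minimal pure-Python sketch of the conversion (illustrative only; these function names are hypothetical and this is not the ORT I/O binding API):

```python
import struct

def float32_to_bfloat16_bits(x: float) -> int:
    """Truncate an IEEE-754 float32 to its top 16 bits (bfloat16).

    bfloat16 keeps float32's 8-bit exponent but only 7 mantissa bits,
    so it preserves dynamic range while halving storage.
    """
    (bits,) = struct.unpack("<I", struct.pack("<f", x))
    return bits >> 16

def bfloat16_bits_to_float32(bits: int) -> float:
    """Expand 16 bfloat16 bits back into a float32 value by zero-padding
    the dropped mantissa bits."""
    return struct.unpack("<f", struct.pack("<I", bits << 16))[0]
```

Powers of two round-trip exactly; other values lose up to ~2 decimal digits of mantissa precision.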
Performance
- INT4 quantized embedding support on CPU and CUDA EPs.
- Miscellaneous performance improvements and bug fixes.
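For intuition on the INT4 quantized embedding support above, here is a minimal sketch of symmetric 4-bit quantization with two codes packed per byte. This illustrates the general storage scheme (roughly 8x smaller than fp32), not ORT's actual kernels; all names are hypothetical:

```python
def quantize_int4(values):
    """Symmetric 4-bit quantization: map floats to integer codes in [-8, 7].

    Returns (packed_bytes, scale); two 4-bit codes are packed per byte,
    which is the storage trick that shrinks embedding tables.
    """
    scale = max(abs(v) for v in values) / 7.0 or 1.0  # avoid zero scale
    codes = [max(-8, min(7, round(v / scale))) for v in values]
    packed = bytearray()
    for i in range(0, len(codes), 2):
        lo = codes[i] & 0x0F
        hi = (codes[i + 1] & 0x0F) if i + 1 < len(codes) else 0
        packed.append(lo | (hi << 4))
    return bytes(packed), scale

def dequantize_int4(packed, scale, n):
    """Unpack n 4-bit codes and rescale them back to floats."""
    out = []
    for byte in packed:
        for nib in (byte & 0x0F, byte >> 4):
            if nib >= 8:          # sign-extend the 4-bit code
                nib -= 16
            out.append(nib * scale)
    return out[:n]
```

The round-trip error per element is bounded by one quantization step (the scale).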
EPs
CPU
- FP16 support for MatMulNbits, Clip, and LayerNormalization ops.
CUDA
- cuDNN frontend integration for convolution operators.
- Added support for cuDNN Flash Attention and Lean Attention in the MultiHeadAttention op.
TensorRT
QNN
- QNN HTP support for weight sharing across multiple ORT inference sessions. (See ORT QNN EP documentation for more information.)
- Support for QNN SDK 2.27.
OpenVINO
- Added support for OpenVINO versions up to 2024.4.1.
- Compile-time memory optimizations.
- Enhanced the ORT EPContext session option to optimize first-inference latency.
- Added remote tensors to ensure direct memory access for inference on NPU.
DirectML
- DirectML 1.15.2 support.
Mobile
- Improved Android QNN support, including a pre-built Maven package and various performance improvements.
- FP16 support for ML Program models with CoreML EP.
- FP16 XNNPACK kernels to provide a fallback option if CoreML is not available at runtime.
- Initial support for using the native WebGPU EP on Android and iOS. Note: The set of initial operators is limited, and the code is available from the main branch, not ORT 1.20 packages. See #22591 for more information.
Web
- Quantized embedding support.
- On-demand weight loading support (offloads weights from the Wasm32 heap, enabling 8B-parameter LLMs).
- Integrated Intel GPU performance improvements.
- Opset-21 support (Reshape, Shape, Gelu).
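For reference, the Gelu activation in the opset support above can be sketched from its standard definition: the exact erf form, plus the common tanh approximation selected by an `approximate` attribute. This is a hedged sketch of the math, not ORT's Web implementation:

```python
import math

def gelu(x: float, approximate: str = "none") -> float:
    """Gelu: x * Phi(x), where Phi is the standard normal CDF.

    approximate="tanh" uses the widely used tanh-based approximation;
    the default computes the exact erf form.
    """
    if approximate == "tanh":
        return 0.5 * x * (1.0 + math.tanh(
            math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))
```

The two forms agree to within a few thousandths for typical activation ranges.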
GenAI
- MultiLoRA support.
- Generations can now be terminated mid-loop.
- Logit soft capping support in Group Query Attention (GQA).
- Additional model support, including Phi-3.5 Vision Multi-Frame, ChatGLM3, and Nemotron-Mini.
- Python package now available for Mac.
- Mac / iOS now available in NuGet packages.
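Logit soft capping, mentioned above for GQA, bounds attention logits smoothly by passing them through a scaled tanh: near zero it is approximately the identity, while large magnitudes saturate at the cap. A minimal sketch (the cap value of 50.0 is an illustrative assumption, not a GenAI default):

```python
import math

def soft_cap(logit: float, cap: float = 50.0) -> float:
    """Soft capping: bound a logit smoothly to [-cap, cap].

    Small inputs pass through nearly unchanged; large magnitudes
    saturate, which keeps attention scores from overflowing in fp16.
    """
    return cap * math.tanh(logit / cap)
```

This is applied to the raw attention scores before softmax.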
Full release notes for ONNX Runtime generate() API v0.5.0 can be found here.
Extensions
- Tokenization performance improvements.
- Support for latest Hugging Face tokenization JSON format (transformers>=4.45).
- Unigram tokenization model support.
- OpenCV dependency removed from C API build.
Full release notes for ONNX Runtime Extensions v0.13 can be found here.
Olive
- Olive command line interface (CLI) now available, with support for executing well-defined, concrete workflows without manually creating or editing configs.
- Additional improvements, including support for YAML-based workflow configs, streamlined DataConfig management, simplified workflow configuration, and more.
- Llama and Phi-3 model updates, including an updated MultiLoRA example using the ORT generate() API.
Full release notes for Olive v0.7.0 can be found here.
Contributors
Big thank you to the release manager @apsonawane, as well as @snnn, @jchen351, @sheetalarkadam, and everyone else who made this release possible!
Tianlei Wu, Yi Zhang, Yulong Wang, Scott McKay, Edward Chen, Adrian Lizarraga, Wanming Lin, Changming Sun, Dmitri Smirnov, Jian Chen, Jiajia Qin, Jing Fang, George Wu, Caroline Zhu, Hector Li, Ted Themistokleous, mindest, Yang Gu, jingyanwangms, liqun Fu, Adam Pocock, Patrice Vignola, Yueqing Zhang, Prathik Rao, Satya Kumar Jandhyala, Sumit Agarwal, Xu Xing, aciddelgado, duanshengliu, Guenther Schmuelling, Kyle, Ranjit Ranjan, Sheil Kumar, Ye Wang, kunal-vaishnavi, mingyueliuh, xhcao, zz002, 0xdr3dd, Adam Reeve, Arne H Juul, Atanas Dimitrov, Chen Feiyue, Chester Liu, Chi Lo, Erick Muñoz, Frank Dong, Jake Mathern, Julius Tischbein, Justin Chu, Xavier Dupré, Yifan Li, amarin16, anujj, chenduan-amd, saurabh, sfatimar, sheetalarkadam, wejoncy, Akshay Sonawane, AlbertGuan9527, Bin Miao, Christian Bourjau, Claude, Clément Péron, Emmanuel, Enrico Galli, Fangjun Kuang, Hann Wang, Indy Zhu, Jagadish Krishnamoorthy, Javier Martinez, Jeff Daily, Justin Beavers, Kevin Chen, Krishna Bindumadhavan, Lennart Hannink, Luis E. P., Mauricio A Rovira Galvez, Michael Tyler, PARK DongHa, Peishen Yan, PeixuanZuo, Po-Wei (Vincent), Pranav Sharma, Preetha Veeramalai, Sophie Schoenmeyer, Vishnudas Thaniel S, Xiang Zhang, Yi-Hong Lyu, Yufeng Li, goldsteinn, mcollinswisc, mguynn-intc, mingmingtasd, raoanag, shiyi, stsokolo, vraspar, wangshuai09
Full changelog: v1.19.2...v1.20.0