Post-training Quantization:
Features:
- Introduced `backup_mode` optional parameter in `nncf.compress_weights()` to specify the data type for embeddings, convolutions, and last linear layers during 4-bit weight compression. Available options are INT8_ASYM (the default), INT8_SYM, and NONE, which retains the original floating-point precision of the model weights.
- Added the `quantizer_propagation_rule` parameter, providing fine-grained control over quantizer propagation. This advanced option is designed to improve accuracy for models where quantizers with different granularity could be merged to per-tensor, potentially affecting model accuracy.
- Introduced the `nncf.data.generate_text_data` API method that utilizes an LLM to generate data for further data-aware optimization. See the example for details.
- (OpenVINO) Extended support of data-free and data-aware weight compression methods for `nncf.compress_weights()` with NF4 per-channel quantization, which makes compressed LLMs more accurate and faster on NPU.
- (OpenVINO) Introduced a new option `statistics_path` to cache and reuse statistics for `nncf.compress_weights()`, reducing the time required to find optimal compression configurations. See the TinyLlama example for details.
- (TorchFX, Experimental) Added support for quantization and weight compression of Torch FX models. The compressed models can be directly executed via `torch.compile(compressed_model, backend="openvino")` (see details here). Added an INT8 quantization example. The list of supported features:
  - INT8 quantization with SmoothQuant, MinMax, FastBiasCorrection, and BiasCorrection algorithms via `nncf.quantize()`.
  - Data-free INT8, INT4, and mixed-precision weight compression with `nncf.compress_weights()`.
- (PyTorch, Experimental) Added model tracing and execution pre- and post-hooks based on TorchFunctionMode.
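The INT8_ASYM and INT8_SYM options for `backup_mode` refer to two different ways of mapping weights onto an 8-bit grid. A minimal pure-Python sketch of the two schemes (a round-trip quantize/dequantize for illustration only, not NNCF's implementation):

```python
def int8_asym_roundtrip(w):
    # Asymmetric INT8: a zero point shifts the [0, 255] grid so it
    # covers [min(w), max(w)] exactly; suits skewed distributions.
    lo, hi = min(w), max(w)
    scale = (hi - lo) / 255.0 or 1.0
    zero_point = round(-lo / scale)
    q = [max(0, min(255, round(x / scale) + zero_point)) for x in w]
    return [(v - zero_point) * scale for v in q]

def int8_sym_roundtrip(w):
    # Symmetric INT8: zero point fixed at 0, grid spans [-127, 127];
    # simpler, but wastes range when weights are skewed.
    scale = max(abs(x) for x in w) / 127.0 or 1.0
    q = [max(-127, min(127, round(x / scale))) for x in w]
    return [v * scale for v in q]

weights = [-1.0, 0.0, 0.5, 1.0]
print(int8_asym_roundtrip(weights))
print(int8_sym_roundtrip(weights))
```

The NONE option corresponds to skipping this step entirely: the affected layers keep their original floating-point weights.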
Fixes:
- Resolved an issue with redundant quantizer insertion before elementwise operations, reducing noise introduced by quantization.
- Fixed type mismatch issue for `nncf.quantize_with_accuracy_control()`.
- Fixed BiasCorrection algorithm for specific branching cases.
- (OpenVINO) Fixed GPTQ weight compression method for Stable Diffusion models.
- (OpenVINO) Fixed issue with the variational statistics processing for `nncf.compress_weights()`.
- (PyTorch, ONNX) Scaled dot product attention pattern quantization setup is aligned with OpenVINO.
Improvements:
- Reduced peak memory by 30-50% for data-aware `nncf.compress_weights()` with AWQ, Scale Estimation, LoRA, and mixed-precision algorithms.
- Reduced compression time by 10-20% for `nncf.compress_weights()` with the AWQ algorithm.
- Aligned behavior for ignored subgraphs between different `networkx` versions.
- Extended ignored patterns with the RoPE block for the `nncf.ModelType.TRANSFORMER` scheme.
- (OpenVINO) Extended the ignored scope for the `nncf.ModelType.TRANSFORMER` scheme with the GroupNorm metatype.
- (ONNX) Extended the SE-block ignored pattern variant for `torchvision` mobilenet_v3.
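Several items in this release hinge on quantization granularity (the NF4 per-channel support, and the per-tensor merging that `quantizer_propagation_rule` controls). A minimal pure-Python sketch, not tied to NNCF's internals, of why per-channel scales preserve small-magnitude channels that a single per-tensor scale would wash out:

```python
def per_tensor_scale(rows):
    # One scale for the whole weight matrix: a single large-magnitude
    # row widens the INT8 grid for every other row.
    return max(abs(x) for row in rows for x in row) / 127.0

def per_channel_scales(rows):
    # One scale per output channel (row): each row uses the full
    # INT8 range independently.
    return [max(abs(x) for x in row) / 127.0 for row in rows]

def roundtrip(row, scale):
    # Symmetric INT8 quantize/dequantize with a given scale.
    return [max(-127, min(127, round(x / scale))) * scale for x in row]

w = [[0.01, -0.02],   # small-magnitude channel
     [5.0, -4.0]]     # large-magnitude channel
pt = per_tensor_scale(w)   # dominated by the 5.0 outlier
pc = per_channel_scales(w)
err_pt = max(abs(a - b) for a, b in zip(w[0], roundtrip(w[0], pt)))
err_pc = max(abs(a - b) for a, b in zip(w[0], roundtrip(w[0], pc[0])))
```

Here the small channel is reconstructed far more accurately with its own scale (`err_pc` is orders of magnitude below `err_pt`), which is the intuition behind per-channel NF4 compression and behind being careful when merging quantizers to per-tensor granularity.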
Tutorials:
- Post-Training Optimization of Llama-3.2-11B-Vision Model
- Post-Training Optimization of YOLOv11 Model
- Post-Training Optimization of Whisper in Automatic Speech Recognition with OpenVINO Generate API
- Post-Training Optimization of Pixtral Model
- Post-Training Optimization of LLM ReAct Agent Model
- Post-Training Optimization of CatVTON Model
- Post-Training Optimization of Stable Diffusion v3 Model in Torch FX Representation
Known issues:
- (ONNX) The `nncf.quantize()` method can generate inaccurate INT8 results for MobileNet models with the BiasCorrection algorithm.
Deprecations/Removals:
- Migrated from `setup.py` to `pyproject.toml` for the build and package configuration, in line with the Python packaging standards outlined in PEP 517 and PEP 518. Installation through `setup.py` no longer works; there is no impact on installation from PyPI and Conda.
- Removed support for Python 3.8.
- (PyTorch) The `nncf.torch.create_compressed_model()` function has been deprecated.
Requirements:
- Updated ONNX (1.17.0) and ONNXRuntime (1.19.2) versions.
- Updated PyTorch (2.5.1) and Torchvision (0.20.1) versions.
- Updated NumPy (<2.2.0) version support.
- Updated Ultralytics (8.3.22) version.
Acknowledgements:
Thanks for contributions from the OpenVINO developer community:
@rk119
@zina-cs