What's Changed
Major Changes
Target Platform Capabilities (TPC) Changes
TPC Schema
- Introduced a new schema mechanism (version v1) to establish the language for building a target platform capabilities description.
- The schema defines the `TargetPlatformCapabilities` class, which can be built to describe the platform's capabilities.
- The `OperatorSetNames` enum provides a closed set of operator set names, allowing quantization configuration options to be set for commonly used operators. Using a custom operator set name is also supported.
- All schema classes use pydantic's `BaseModel` for enhanced validation and schema flexibility.
- MCT has a new dependency on `pydantic < 2.0`.
- In addition, a new versioning system was introduced, using minor and patch versions.
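To illustrate what a validated, versioned schema class provides, here is a minimal stdlib sketch. MCT's actual schema classes are pydantic `BaseModel`s; the class names, fields, and allowed bit-widths below are hypothetical, chosen only to show the validation-on-construction idea:

```python
from dataclasses import dataclass

# Hypothetical sketch of a versioned, validated schema class.
# MCT's real schema classes are pydantic BaseModels; this only
# illustrates validation on construction plus minor/patch versioning.

@dataclass(frozen=True)
class SchemaVersion:
    minor: int
    patch: int

    def __post_init__(self):
        if self.minor < 0 or self.patch < 0:
            raise ValueError("version parts must be non-negative")

@dataclass
class OperatorConfig:
    op_set_name: str           # e.g. a name from a closed enum, or a custom name
    activation_n_bits: int = 8

    def __post_init__(self):
        # Reject configurations the (hypothetical) schema does not allow.
        if self.activation_n_bits not in (2, 4, 8, 16):
            raise ValueError(f"unsupported bit-width: {self.activation_n_bits}")

cfg = OperatorConfig(op_set_name="Conv", activation_n_bits=8)
version = SchemaVersion(minor=1, patch=0)
print(cfg.op_set_name, version.minor, version.patch)
```

Invalid descriptions fail at construction time (e.g. `OperatorConfig("Conv", 3)` raises `ValueError`), which is the kind of early validation the pydantic-based schema gives MCT.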
Naming Refactor
- Introducing the schema mechanism was accompanied by renaming several classes:
  - `TargetPlatformModel` → `TargetPlatformCapabilities`
  - `TargetPlatformCapabilities` → `FrameworkQuantizationCapabilities`
  - `OperatorSetConcat` → `OperatorSetGroup`
Attach TPC to Framework
- A new module named `AttachTpcToFramework` handles the conversion from a framework-independent `TargetPlatformCapabilities` description to a framework-specific `FrameworkQuantizationCapabilities`, which maps each framework operator to its possible quantization configurations.
- Available for TensorFlow and PyTorch via `AttachTpcToKeras` and `AttachTpcToPytorch`, respectively.
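Conceptually, the attach step resolves each framework-independent operator-set name to the concrete framework operators it covers. A minimal sketch of that idea, where the operator tables and config dicts are illustrative stand-ins rather than MCT's actual data structures:

```python
# Hypothetical sketch of attaching a framework-independent operator-set
# description to framework-specific operators. MCT's real mechanism lives
# in AttachTpcToKeras / AttachTpcToPytorch; these tables are illustrative.

# Framework-independent description: opset name -> quantization config.
tpc_opsets = {
    "Conv": {"weights_n_bits": 8, "activation_n_bits": 8},
    "FullyConnected": {"weights_n_bits": 8, "activation_n_bits": 8},
}

# Framework-specific attach table: opset name -> PyTorch operator names.
pytorch_attach = {
    "Conv": ["torch.nn.Conv2d", "torch.nn.functional.conv2d"],
    "FullyConnected": ["torch.nn.Linear"],
}

def attach(tpc, fw_table):
    """Map each framework operator to its quantization configuration."""
    fw_caps = {}
    for opset_name, qconfig in tpc.items():
        for fw_op in fw_table.get(opset_name, []):
            fw_caps[fw_op] = qconfig
    return fw_caps

fw_caps = attach(tpc_opsets, pytorch_attach)
print(fw_caps["torch.nn.Linear"])  # the config attached to Linear layers
```

The same framework-independent description can be attached to different frameworks by swapping the attach table, which is why the TPC itself no longer needs to be framework-specific.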
API changes
- All MCT APIs now expect a target_platform_capabilities object (`TargetPlatformCapabilities`), which contains the framework-independent platform capabilities description.
- This changes the previous behaviour, which expected an initialized framework-specific object.
- Note: the default behavior of MCT's APIs has not changed! Calling an API function without passing a TPC object, or passing an object obtained via `get_target_platform_capabilities(<FW_NAME>, DEFAULT_TP_MODEL)`, uses the same default TPC as in the previous release.
- Regardless, users who accessed TPC-related classes outside the published API may encounter breaking changes due to class renaming and file hierarchy changes.
Tighter activation memory estimation via Max-Cut (Experimental)
- Replaced Max-Tensor with Max-Cut as the activation memory estimation method in the mixed precision algorithm.
- The Max-Cut metric considers the model operators' execution schedule for a more precise estimation of activation memory (#1295).
- Note: this is an estimation of runtime memory usage; the actual memory consumed at runtime may differ.
- 16-bit Activation Quantization (experimental)
- The new activation memory estimation allows flexible usage of the mixed precision algorithm to enable 16-bit activation quantization (dependent on a TPC that supports 16-bit quantization for different operators).
- 16-bit quantization can be enabled either via the Manual Bit-width Selection API or automatically, by running mixed precision with an appropriate activation or total memory constraint.
- Note that when running mixed precision with an activation memory constraint to enable 16-bit allocation, shift negative correction should be disabled.
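The intuition behind a cut-based estimate can be shown on a toy schedule. This sketch assumes a simplified model of the metric (tensors with known lifetimes over a linear execution schedule); it is not MCT's implementation:

```python
# Toy illustration of cut-based activation memory estimation (assumed,
# simplified semantics; not MCT's implementation). Given an execution
# schedule and each activation tensor's lifetime, peak activation memory
# is the maximum, over schedule steps, of the total size of tensors
# alive at that step.

# (tensor_size_bytes, first_step_alive, last_step_alive)
tensors = [
    (400, 0, 1),  # network input, freed after step 1
    (300, 1, 3),  # e.g. a skip-connection tensor alive across steps
    (200, 2, 3),
    (100, 3, 4),
]

def peak_activation_memory(tensors, n_steps):
    peak = 0
    for step in range(n_steps):
        # Sum the sizes of all tensors whose lifetime covers this step.
        live = sum(size for size, start, end in tensors if start <= step <= end)
        peak = max(peak, live)
    return peak

print(peak_activation_memory(tensors, 5))  # 700, reached at step 1
```

Here the cut-based peak (700) is tighter than naively summing every activation (1000) while still accounting for tensors that stay alive simultaneously, unlike a single-largest-tensor bound (400).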
Improved GPTQ algorithm via Sample Layer Attention (SLA):
- Enabled SLA by default in both Keras and PyTorch (#1287, #1260)
- Added gradual activation quantization support for enhanced results when quantizing activations (#1244, #1237)
- Implemented Rademacher distribution for Hessian estimation (#1250)
- For more details, please see our paper.
Resource Utilization (RU) calculation:
- Use the max-cut activation method for activation and total resource utilization computation.
- Compute the total target from weights and activation utilization, instead of using it as a separate metric.
- Weights memory computation now includes all quantized weights in the model, instead of considering only kernel attributes. This may change the results of existing mixed precision scenarios.
- Note that the `ResourceUtilization` API did not change.
Minor Changes
- Added Activation Bias Correction feature to potentially enhance quantization results of vision transformers (#1256)
- Added a substitution to decompose the MatMul operation into baseline components in PyTorch (#1313)
- Added a substitution to decompose the scaled dot product attention operator in PyTorch (#1229)
- Converted core configuration classes to dataclasses for simpler usage and strict behavior verification (`CoreConfig`, `QuantizationConfig`, etc.) (#1203)
- Trainable Infrastructure changes:
- Moved STE/LSQ activation quantizers from QAT to trainable infrastructure.
- Renamed Trainable QAT quantizer to Weight Trainable quantizer (#1240)
- Added support for PyTorch 2.4, PyTorch 2.5, and Python 3.12
Bug Fixes
- Fixed activation gradient backpropagation in GPTQ for PyTorch models: it now uses STE Activation Trainable quantizers with frozen quantization parameters instead of Activation Inferable quantizers, which did not propagate gradients (#1197)
- Fixed ONNX export when PyTorch models have multiple inputs/outputs (#1223)
- Fixed the issue of duplicating reused layers in PyTorch models (#1217)
- Fixed HMSE being overridden by MSE after resource utilization computation (#1253)
- Resolved duplicate QCOs error handling (#1282, #1149)
- Fixed tf.nn.{conv2d,convolution} substitution to handle attributes with default values that were not passed explicitly (#1275)
- Fixed handling errors in PyTorch graphs by managing nodes with missing outputs and ensuring robust extraction of output shapes (#1186)
New Contributors
Welcome @ambitious-octopus and @itai-berman for their first contributions! (#1186, #1266)