
Release 2.3.0

@reuvenperetz released this on 12 Feb 11:31 · 33c45ff

What's Changed

Major Changes

Target Platform Capabilities (TPC) Changes

TPC Schema

  • Introduced a new schema (version v1) mechanism that establishes the language for building a target platform capabilities description.
    • The schema defines the TargetPlatformCapabilities class, which can be built to describe the platform's capabilities (see the sketch below).
    • The OperatorSetNames enum provides a closed set of operator set names, allowing quantization configuration options to be set for commonly used operators.
    • Using a custom operator set name is also supported.
    • All schema classes use pydantic's BaseModel for enhanced validation and schema flexibility.
      • As a result, MCT has a new dependency: "pydantic < 2.0".
  • In addition, a new versioning system was introduced, using minor and patch versions.
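To make the mechanism concrete, here is a toy sketch of the idea: pydantic BaseModel classes validate a declarative capabilities description. These are illustrative stand-ins, not MCT's actual schema classes, and the field names are hypothetical.

```python
from enum import Enum
from typing import List

from pydantic import BaseModel  # MCT now requires "pydantic < 2.0"


class OperatorSetNames(str, Enum):
    # A closed set of common operator set names (illustrative subset).
    CONV = "Conv"
    FULLY_CONNECTED = "FullyConnected"


class OperatorsSet(BaseModel):
    name: str  # an OperatorSetNames value or a custom name
    activation_n_bits: int = 8


class TargetPlatformCapabilities(BaseModel):
    # Minor/patch versioning per the new versioning system.
    tpc_minor_version: int
    tpc_patch_version: int
    operator_sets: List[OperatorsSet]


tpc = TargetPlatformCapabilities(
    tpc_minor_version=1,
    tpc_patch_version=0,
    operator_sets=[
        OperatorsSet(name=OperatorSetNames.CONV.value),
        OperatorsSet(name="MyCustomOp", activation_n_bits=16),  # custom name
    ],
)
```

Pydantic validates field types at construction time, which is the "enhanced validation" the schema classes gain from BaseModel.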

Naming Refactor

  • Creating the schema mechanism was accompanied by the renaming of several classes:
    • TargetPlatformModel → TargetPlatformCapabilities
    • TargetPlatformCapabilities → FrameworkQuantizationCapabilities
    • OperatorSetConcat → OperatorSetGroup

Attach TPC to Framework

  • A new module named AttachTpcToFramework handles the conversion from a framework-independent TargetPlatformCapabilities description to a framework-specific FrameworkQuantizationCapabilities, which maps each framework operator to its possible quantization configurations.
  • Available for TensorFlow and PyTorch via AttachTpcToKeras and AttachTpcToPytorch, respectively (see the sketch below).
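A rough usage sketch for PyTorch follows. Only the class names appear in these notes, so the import path and the attach method shown are assumptions; in normal flows MCT performs this attachment internally, and most users will not call it directly.

```python
import model_compression_toolkit as mct
# NOTE: this module path and the attach() method are assumptions.
from model_compression_toolkit.target_platform_capabilities.targetplatform2framework.attach2pytorch import (
    AttachTpcToPytorch,
)

# Framework-independent TPC (same default as previous releases).
tpc = mct.get_target_platform_capabilities("pytorch", "default")

# Convert to framework-specific capabilities mapping PyTorch operators
# to their possible quantization configurations.
fqc = AttachTpcToPytorch().attach(tpc)
```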

API changes

  • All of MCT's APIs now expect a target_platform_capabilities object (TargetPlatformCapabilities), which contains the framework-independent platform capabilities description.
  • This changes the previous behavior, which expected an initialized framework-specific object.
  • Note: the default behavior of MCT's APIs has not changed! Calling an API function without passing a TPC object, or passing an object obtained using the get_target_platform_capabilities(<FW_NAME>, DEFAULT_TP_MODEL) API, uses the same default TPC as in previous releases (see the sketch below).
    • Regardless, users who accessed TPC-related classes outside the published API may encounter breaking changes due to class renaming and file hierarchy changes.
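For illustration, a minimal PTQ call under the new API might look like this sketch; the toy model and data generator are placeholders, and keyword names follow MCT's published PTQ API.

```python
import torch
import model_compression_toolkit as mct

# Placeholder float model and representative dataset generator.
float_model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, 3), torch.nn.ReLU())

def representative_gen():
    for _ in range(2):
        yield [torch.randn(1, 3, 32, 32)]

# Framework-independent TPC; omitting target_platform_capabilities entirely
# uses the same default TPC as in previous releases.
tpc = mct.get_target_platform_capabilities("pytorch", "default")

quantized_model, quant_info = mct.ptq.pytorch_post_training_quantization(
    float_model,
    representative_gen,
    target_platform_capabilities=tpc,
)
```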

Tighter activation memory estimation via Max-Cut (Experimental)

  • Replaced Max-Tensor with Max-Cut as the activation memory estimation method in the mixed precision algorithm.
  • The Max-Cut metric considers the execution schedule of the model's operators for a more precise estimation of activation memory (#1295).
  • Note: this is an estimate of memory usage during runtime; the actual memory used at runtime may differ.
  • 16-bit Activation Quantization (experimental)
    • The new activation memory estimation allows flexible usage of the mixed precision algorithm to enable 16-bit activation quantization (provided the TPC supports 16-bit quantization for the relevant operators).
    • 16-bit quantization can be enabled either via the Manual Bit-width Selection API or automatically, by executing mixed precision with a suitable activation or total memory constraint (see the sketch below).
    • Note that when running mixed precision with an activation memory constraint to enable 16-bit allocation, shift negative correction should be disabled.
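A sketch of the automatic path follows, reusing the placeholder model and data generator from the earlier PTQ sketch; the memory budget is a placeholder value, and the TPC must offer 16-bit activation options for this to take effect.

```python
import model_compression_toolkit as mct

core_config = mct.core.CoreConfig(
    # Disable shift negative correction when driving 16-bit allocation
    # through an activation memory constraint (see the note above).
    quantization_config=mct.core.QuantizationConfig(
        shift_negative_activation_correction=False),
    mixed_precision_config=mct.core.MixedPrecisionQuantizationConfig(),
)

# Activation memory budget in bytes (placeholder value).
ru = mct.core.ResourceUtilization(activation_memory=400_000)

quantized_model, _ = mct.ptq.pytorch_post_training_quantization(
    float_model,
    representative_gen,
    target_resource_utilization=ru,
    core_config=core_config,
)
```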

Improved GPTQ algorithm via Sample Layer Attention (SLA):

  • Enabled SLA by default in both Keras and PyTorch (#1287, #1260)
  • Added gradual activation quantization support for enhanced results when quantizing activations (#1244, #1237)
  • Implemented Rademacher distribution for Hessian estimation (#1250)
  • For more details, please visit our paper.
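A configuration sketch for PyTorch GPTQ follows; the gradual activation quantization flag name is taken from the MCT API and should be treated as an assumption if your version differs.

```python
import model_compression_toolkit as mct

# SLA is now the default GPTQ loss, so no extra flag is needed for it.
gptq_config = mct.gptq.get_pytorch_gptq_config(
    n_epochs=5,
    gradual_activation_quantization=True,  # assumed flag name
)

# Reuses float_model / representative_gen from the earlier sketches.
quantized_model, _ = mct.gptq.pytorch_gradient_post_training_quantization(
    float_model,
    representative_gen,
    gptq_config=gptq_config,
)
```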

Resource Utilization (RU) calculation:

  • Use the Max-Cut method for activation and total resource utilization computation.
  • Compute the total target from the weights and activation utilization, instead of treating it as a separate metric.
  • Weights memory computation now includes all quantized weights in the model, instead of considering only kernel attributes. This may change the results of existing mixed precision executions.
  • Note that the ResourceUtilization API has not changed (see the sketch below).
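To inspect the updated numbers for a model, the RU data utility can be used as in this sketch, reusing the placeholder model and generator from above (function and keyword names per MCT's published API; treat them as assumptions for your version).

```python
import model_compression_toolkit as mct

# Returns a ResourceUtilization object with weights / activation / total
# memory (activation and total now computed via Max-Cut) and BOPs.
ru_data = mct.core.pytorch_resource_utilization_data(
    float_model,
    representative_gen,
    core_config=mct.core.CoreConfig(),
)
print(ru_data)
```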

Minor Changes

  • Added Activation Bias Correction feature to potentially enhance quantization results of vision transformers (#1256)
  • Added a substitution to decompose the MatMul operation into baseline components in PyTorch (#1313)
  • Added a substitution to decompose the scaled dot product attention operator in PyTorch (#1229)
  • Converted core configuration classes (CoreConfig, QuantizationConfig, etc.) to dataclasses for simpler usage and strict behavior verification (#1203); see the sketch after this list.
  • Trainable Infrastructure changes:
    • Moved STE/LSQ activation quantizers from QAT to trainable infrastructure.
    • Renamed Trainable QAT quantizer to Weight Trainable quantizer (#1240)
  • Added support for PyTorch 2.4, PyTorch 2.5, and Python 3.12
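As dataclasses, the configuration objects are constructed explicitly and reject unknown fields at construction time. A small sketch (the error-method enum path is an assumption):

```python
import model_compression_toolkit as mct

core_config = mct.core.CoreConfig(
    quantization_config=mct.core.QuantizationConfig(
        # A misspelled field name here now fails immediately with a TypeError.
        weights_error_method=mct.core.QuantizationErrorMethod.MSE,
    ),
)
```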

Bug Fixes

  • Fixed activation gradient backpropagation in GPTQ for PyTorch models: it now uses STE Activation Trainable quantizers with frozen quantization parameters instead of Activation Inferable quantizers, which did not propagate gradients (#1197)
  • Fixed ONNX export for PyTorch models with multiple inputs/outputs (#1223)
  • Fixed the issue of duplicating reused layers in PyTorch models (#1217)
  • Fixed HMSE being overridden by MSE after resource utilization computation (#1253)
  • Resolved duplicate QCOs error handling (#1282, #1149)
  • Fixed tf.nn.{conv2d,convolution} substitution to handle attributes with default values that were not passed explicitly (#1275)
  • Fixed handling errors in PyTorch graphs by managing nodes with missing outputs and ensuring robust extraction of output shapes (#1186)

New Contributors

Welcome @ambitious-octopus and @itai-berman for their first contributions! (#1186, #1266)