
[OV] Add support for nf4_f8e4m3 quantization mode #1148

Merged: 23 commits merged into huggingface:main on Feb 18, 2025

Conversation

@nikita-savelyevv (Collaborator) commented on Feb 6, 2025

What does this PR do?

Changes

  • Added OVMixedQuantizationConfig for the mixed-precision quantization scenario. It is initialized with an instance of OVWeightQuantizationConfig and an instance of OVQuantizationConfig.
  • Added nf4_f8e4m3, int4_f8e4m3, nf4_f8e5m2 and int4_f8e5m2 as possible values of the --quant-mode CLI argument. These modes perform mixed-precision quantization, compressing weights to nf4/int4 precision and quantizing activations to f8e4m3/f8e5m2.
  • Refactored the quantization configs. OVQuantizationConfigBase now contains only model-related parameters. Added a to_nncf_dict() convenience method to the quantization configs (see the sketch after this list).
  • Renamed OVWeightQuantizationConfig.weight_format to OVWeightQuantizationConfig.dtype and OVQuantizationConfig.activation_format to OVQuantizationConfig.dtype. The latter rename is needed because when OVQuantizationConfig is used, not only activations but also weights are quantized, so activation_format did not correctly describe what actually happens. OVWeightQuantizationConfig.weight_format is renamed for consistency.
  • OVBaseModel._prepare_quantization_config() can now create instances of configs other than OVWeightQuantizationConfig.
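
As a quick illustration of the new to_nncf_dict() helper, here is a hypothetical usage sketch (the exact set of returned keys is internal to NNCF and may differ):

from optimum.intel import OVWeightQuantizationConfig

# Build a weight-compression config (values mirror the example below).
config = OVWeightQuantizationConfig(bits=4, dtype="nf4")

# to_nncf_dict() translates the config into keyword arguments for NNCF,
# e.g. roughly nncf.compress_weights(model, **config.to_nncf_dict()).
nncf_kwargs = config.to_nncf_dict()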

Examples

CLI

optimum-cli export openvino -m meta-llama/Llama-3.1-8B --quant-mode nf4_f8e4m3 --dataset wikitext2 ./llama-3.1-8b_nf4_f8e4m3

Python API:

from optimum.intel import (
    OVMixedQuantizationConfig,
    OVModelForCausalLM,
    OVQuantizationConfig,
    OVWeightQuantizationConfig,
)

model = OVModelForCausalLM.from_pretrained(
    model_id="meta-llama/Llama-3.1-8B",
    quantization_config=OVMixedQuantizationConfig(
        weight_quantization_config=OVWeightQuantizationConfig(bits=4, dtype="nf4"),
        full_quantization_config=OVQuantizationConfig(dtype="f8e4m3"),
        dataset="wikitext2",
    ),
)
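
Assuming the standard optimum/transformers saving API (not shown in this PR), the quantized model can then be written to disk:

# Save the quantized OpenVINO model (directory name mirrors the CLI example).
model.save_pretrained("llama-3.1-8b_nf4_f8e4m3")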

Some of these changes were implemented thanks to @nikita-malininn.

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-savelyevv nikita-savelyevv marked this pull request as ready for review February 7, 2025 17:11
@AlexKoff88 (Collaborator):

@nikita-malininn, please take a look as well.

@nikita-savelyevv nikita-savelyevv marked this pull request as draft February 10, 2025 15:55
@nikita-savelyevv nikita-savelyevv marked this pull request as ready for review February 12, 2025 17:34
@@ -389,8 +379,8 @@ class OVWeightQuantizationConfig(OVQuantizationConfigBase):
         scale_estimation (`bool`, *optional*):
             Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and
             compressed layers. Providing a dataset is required to run scale estimation.
-        weight_format (`str`, *optional*):
-            Data format weights are compressed to. Possible values: ['int4', 'int8', 'mxfp4', 'nf4'].
+        dtype (`str`, *optional*):
Collaborator:
@eaidova, we hope that this change will not have a negative impact on the OpenVINO Notebooks as it is not backward compatible.

Collaborator:

Yes, it could make sense to add a warning (and potentially keep compatibility for one or two releases by setting dtype when weight_format is provided).
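
A minimal sketch of the compatibility path suggested above (hypothetical; the PR may implement it differently):

import warnings

def _resolve_dtype(dtype=None, weight_format=None):
    # Backward compatibility: map the deprecated `weight_format` argument
    # onto `dtype` and warn the caller before the alias is removed.
    if weight_format is not None:
        warnings.warn(
            "`weight_format` is deprecated and will be removed in a future release. "
            "Please use `dtype` instead.",
            DeprecationWarning,
        )
        if dtype is None:
            dtype = weight_format
    return dtype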

@AlexKoff88 (Collaborator):

@l-bat, can you please review this PR as well?

Resolved review threads: optimum/intel/openvino/quantization.py (3), optimum/intel/openvino/configuration.py (2)
@AlexKoff88 (Collaborator):

Overall, it looks good, thanks.

@AlexKoff88 (Collaborator):

@IlyasMoutawwakil, @echarlaix, the PR is ready for your review.

@echarlaix (Collaborator) left a comment:

LGTM, thanks for the addition @nikita-savelyevv

Resolved review threads: optimum/intel/openvino/modeling_base.py (2), optimum/intel/openvino/configuration.py (2)

@echarlaix (Collaborator):

The failing tests are unrelated, so merging. Thanks @nikita-savelyevv!

@echarlaix echarlaix merged commit 235294d into huggingface:main Feb 18, 2025
17 of 22 checks passed