[OV] Add support for nf4_f8e4m3 quantization mode #1148
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@nikita-malininn, please take a look as well.
```diff
@@ -389,8 +379,8 @@ class OVWeightQuantizationConfig(OVQuantizationConfigBase):
         scale_estimation (`bool`, *optional*):
             Indicates whether to apply a scale estimation algorithm that minimizes the L2 error between the original and
             compressed layers. Providing a dataset is required to run scale estimation.
-        weight_format (`str`, *optional*):
-            Data format weights are compressed to. Possible values: ['int4', 'int8', 'mxfp4', 'nf4'].
+        dtype (`str`, *optional*):
```
@eaidova, we hope that this change will not have a negative impact on the OpenVINO Notebooks as it is not backward compatible.
Yes, it could make sense to add a warning (and potentially keep compatibility for one or two releases by setting `dtype` in case `weight_format` is provided).
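A compatibility shim along those lines could look like the following minimal sketch; the warning wording and fallback logic are illustrative assumptions, not the PR's actual implementation:

```python
import warnings


class OVWeightQuantizationConfig:
    # Sketch: accept the legacy `weight_format` argument for one or two
    # releases, emit a deprecation warning, and map it onto the new `dtype`.
    def __init__(self, dtype: str = None, weight_format: str = None, **kwargs):
        if weight_format is not None:
            warnings.warn(
                "`weight_format` is deprecated and will be removed in a future "
                "release. Please use `dtype` instead.",
                FutureWarning,
            )
            # Only fall back to the legacy value when `dtype` was not given.
            dtype = dtype if dtype is not None else weight_format
        self.dtype = dtype
```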
@l-bat, can you please review this PR as well?

Overall, it looks good, thanks.

@IlyasMoutawwakil, @echarlaix, the PR is ready for your review.
LGTM, thanks for the addition @nikita-savelyevv
Failing tests are unrelated, so merging. Thanks @nikita-savelyevv
What does this PR do?
Changes

- Added `OVMixedQuantizationConfig` for the mixed precision quantization scenario. It is initialized with an instance of `OVWeightQuantizationConfig` and an instance of `OVQuantizationConfig`.
- Added `nf4_f8e4m3`, `int4_f8e4m3`, `nf4_f8e5m2`, `int4_f8e5m2` as possible values of the `--quant-mode` CLI argument. This performs mixed precision quantization, compressing weights to `nf4`/`int4` precision and activations to `f8e4m3`/`f8e5m2`.
- `OVQuantizationConfigBase` now contains only model-related parameters. Added a `to_nncf_dict()` method to quantization configs for convenience (see the sketch after this list).
- Renamed `OVWeightQuantizationConfig.weight_format` to `OVWeightQuantizationConfig.dtype` and `OVQuantizationConfig.activation_format` to `OVQuantizationConfig.dtype`. The latter is done because when `OVQuantizationConfig` is used, not only activations but also weights are quantized, so `activation_format` does not correctly represent what actually happens. `OVWeightQuantizationConfig.weight_format` is renamed for consistency.
- `OVBaseModel._prepare_quantization_config()` can now create instances of configs other than `OVWeightQuantizationConfig`.
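As an illustration of the new `to_nncf_dict()` method, here is a hedged usage sketch; the exact keyword arguments it produces and the pairing with `nncf.compress_weights()` are assumptions inferred from the description above, not code taken from the PR:

```python
import nncf
import openvino as ov
from optimum.intel import OVWeightQuantizationConfig

# Assumed usage: the config expands into keyword arguments for the matching
# NNCF call, so it can be passed straight through.
config = OVWeightQuantizationConfig(bits=4, dtype="nf4", group_size=128)

ov_model = ov.Core().read_model("model.xml")  # placeholder model path
compressed_model = nncf.compress_weights(ov_model, **config.to_nncf_dict())
```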
Examples
CLI:

```bash
optimum-cli export openvino -m meta-llama/Llama-3.1-8B --quant-mode nf4_f8e4m3 --dataset wikitext2 ./llama-3.1-8b_nf4_f8e4m3
```
Python API:
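The original code block here did not survive extraction; the following is a minimal sketch of what the mixed precision API could look like based on the description above. The keyword names `weight_quantization_config` and `full_quantization_config` are assumptions, not verified against the PR:

```python
from optimum.intel import (
    OVMixedQuantizationConfig,
    OVModelForCausalLM,
    OVQuantizationConfig,
    OVWeightQuantizationConfig,
)

# Per the PR description: weights are compressed to nf4, while activations
# (and remaining weights) are quantized to f8e4m3 using a calibration dataset.
quantization_config = OVMixedQuantizationConfig(
    weight_quantization_config=OVWeightQuantizationConfig(bits=4, dtype="nf4"),  # assumed kwarg name
    full_quantization_config=OVQuantizationConfig(dtype="f8e4m3", dataset="wikitext2"),  # assumed kwarg name
)

model = OVModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",
    quantization_config=quantization_config,
)
```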
Some of these changes were implemented thanks to @nikita-malininn.
Before submitting