
Added quantization for OUTETTS #2662

Merged: 33 commits into openvinotoolkit:latest on Feb 18, 2025

Conversation

@nikita-malininn (Contributor) commented Jan 16, 2025

  • Added quantization via nncf.quantize to the notebook (a sketch of the call is shown below, after the ticket reference)
  • Quantization uses the mixed preset and the transformer model type, since the model architecture is an LLM
  • Quantization uses an ignored scope because of OpenVINO issues with the optimized SDPA inference
  • Quality validation of the quantized model is possible only through expert listening, given the nature of the task
  • Performance validation is possible only through the generate pipeline, due to the model architecture

Ticket: 157133
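
A minimal sketch of how the resulting call could look, assembled from the bullet points above and the code excerpts later in this thread; hf_model, libritts, transform_fn, and interface are assumed to be defined in the notebook's earlier cells:

```python
from functools import partial

import nncf

# Calibration dataset built from LibriTTS samples.
dataset = nncf.Dataset(libritts, partial(transform_fn, interface=interface))

quantized_model = nncf.quantize(
    hf_model,
    dataset,
    preset=nncf.QuantizationPreset.MIXED,   # mixed preset for the LLM backbone
    model_type=nncf.ModelType.TRANSFORMER,  # transformer-specific quantization rules
    ignored_scope=nncf.IgnoredScope(
        # Keep SDPA nodes unquantized to work around the OpenVINO issue
        # with optimized SDPA inference mentioned above.
        patterns=[
            "__module.model.layers.*.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention"
        ],
    ),
)
```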


@nikita-malininn nikita-malininn marked this pull request as ready for review January 16, 2025 13:04
@nikita-malininn nikita-malininn marked this pull request as draft January 16, 2025 13:12
@nikita-malininn nikita-malininn marked this pull request as ready for review January 27, 2025 11:38
@MaximProshin (Contributor)

@KodiaqQ , what results do you get with the quantized model vs original on your machine?

@nikita-malininn (Contributor, Author) commented Jan 28, 2025

> @KodiaqQ , what results do you get with the quantized model vs original on your machine?

FP model generate time: 3.926095366012305
INT model generate time: 2.8104791679652408

Update: times recalculated with the ignored scope applied.
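
For context, a minimal sketch of how such generate-pipeline timings could be collected; interface.generate and its text argument are assumptions about the OuteTTS API surface, not code from the notebook:

```python
import time

def measure_generate_time(interface, text: str, n_runs: int = 3) -> float:
    """Average wall-clock time of one full TTS generation."""
    timings = []
    for _ in range(n_runs):
        start = time.perf_counter()
        interface.generate(text=text)  # assumed OuteTTS generation entry point
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)

# Hypothetical usage comparing the original and quantized models:
# fp_time = measure_generate_time(fp_interface, "Hello world")
# int_time = measure_generate_time(int8_interface, "Hello world")
```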

@nikita-malininn nikita-malininn marked this pull request as draft January 28, 2025 10:30
@nikita-malininn nikita-malininn marked this pull request as ready for review February 3, 2025 14:12
@nikita-malininn nikita-malininn marked this pull request as draft February 6, 2025 13:10
"hf_model = OVHFModel(model_dir, device.value).model\n",
"dataset = nncf.Dataset(libritts, partial(transform_fn, interface=interface))\n",
"\n",
"quantized_model = nncf.quantize(\n",
Contributor

I would suggest using INT4 weight compression with dynamic quantization (A8W4). @KodiaqQ claims that the performance of such a model is equal to the performance of the quantized model, while the compression rate is higher for the A8W4 model. (A sketch of that setup follows below.)

cc @MaximProshin
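
For reference, a hedged sketch of what the suggested A8W4 setup could look like; the ratio and group_size values are illustrative placeholders, and the dynamic-quantization property reflects my reading of the OpenVINO runtime option rather than code from this PR:

```python
import nncf
import openvino as ov

# W4: compress the weights of the OpenVINO model to INT4.
compressed_model = nncf.compress_weights(
    hf_model,
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    ratio=1.0,      # placeholder: fraction of layers compressed to INT4
    group_size=64,  # placeholder: quantization group size
)

# A8: enable dynamic 8-bit activation quantization at compile time.
core = ov.Core()
compiled_model = core.compile_model(
    compressed_model,
    device_name="CPU",
    config={"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"},
)
```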

Contributor

Please share the numbers for both cases. If INT4 is better, I'm OK with using that method.

Contributor

We discussed it offline and agreed to keep INT8. In the meantime, the SDPA issue will be analyzed in #16177; if it is resolved, we will update the notebook afterwards.

@nikita-malininn nikita-malininn marked this pull request as ready for review February 10, 2025 12:10
@nikita-malininn (Contributor, Author)

@l-bat, can you review, please? Thanks.

@@ -22,6 +22,11 @@
"- [Run model inference](#Run-model-inference)\n",
@l-bat (Collaborator) commented Feb 10, 2025

Line #36: "__module.model.layers.*.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention"

Could you please provide a reason in the description for why we should add this pattern to IgnoredScope?


@nikita-malininn (Contributor, Author)

Done.

@@ -22,6 +22,11 @@
"- [Run model inference](#Run-model-inference)\n",
@l-bat (Collaborator) commented Feb 10, 2025

Line #3: demo = make_demo(interface)

I think it would be nice to use the quantized model in the interactive demo. Could you please add an option for the user to choose between the original and optimized models?
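
A hedged sketch of one way such an option could look, assuming Gradio is the notebook's demo framework; fp_interface and int8_interface are assumed wrappers around the original and quantized models, and output.save() follows the public OuteTTS examples rather than this PR's code:

```python
import gradio as gr

def make_demo(fp_interface, int8_interface):
    def synthesize(text, model_choice):
        # Route the request to the selected model.
        interface = int8_interface if model_choice.startswith("INT8") else fp_interface
        output = interface.generate(text=text)
        output.save("output.wav")
        return "output.wav"

    return gr.Interface(
        fn=synthesize,
        inputs=[
            gr.Textbox(label="Text to synthesize"),
            gr.Radio(
                ["FP (original)", "INT8 (quantized)"],
                value="INT8 (quantized)",
                label="Model",
            ),
        ],
        outputs=gr.Audio(label="Generated speech"),
    )
```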


@nikita-malininn (Contributor, Author)

Done.

@@ -22,6 +22,11 @@
"- [Run model inference](#Run-model-inference)\n",
@eaidova (Collaborator) commented Feb 12, 2025

Line #4: r = requests.get(

Please add a check that skip_kernel_extension.py already exists, so the notebook can be rerun without an internet connection.
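
A minimal sketch of such a check; the raw.githubusercontent.com URL is my assumption about where the helper lives in the openvino_notebooks repository:

```python
from pathlib import Path

import requests

helper_path = Path("skip_kernel_extension.py")

# Download the helper only when it is missing, so the notebook can be
# rerun without an internet connection.
if not helper_path.exists():
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py"
    )
    helper_path.write_text(r.text)
```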


@nikita-malininn (Contributor, Author)

Done.

@eaidova (Collaborator) commented Feb 12, 2025

@nikita-malininn, could you please also fix the formatting of the updated notebook? You can find info on how to do that here.

P.S. Once the code formatting and the file downloading are fixed, I'm ready to merge your PR.

@eaidova eaidova merged commit 0b4b204 into openvinotoolkit:latest Feb 18, 2025
16 checks passed