Added quantization for OUTETTS #2662
Conversation
Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.
@KodiaqQ, what results do you get with the quantized model vs. the original on your machine?
FP model generate time: 3.926095366012305. Upd.: recalculated the times with the ignored scope.
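For reference, a minimal sketch of how such a generate-time comparison can be measured; the interface objects and the generate call signature here are placeholders, not the notebook's exact API:

```python
import time

def measure_generate(tts_interface, text, n_runs=3):
    # Average wall-clock time of the TTS generate call over a few runs.
    times = []
    for _ in range(n_runs):
        start = time.perf_counter()
        tts_interface.generate(text=text)  # hypothetical call; adapt to the notebook's interface
        times.append(time.perf_counter() - start)
    return sum(times) / len(times)

# fp_time = measure_generate(fp_interface, "OuteTTS is a text-to-speech model.")
# int8_time = measure_generate(int8_interface, "OuteTTS is a text-to-speech model.")
# print(f"FP: {fp_time:.2f} s, INT8: {int8_time:.2f} s")
```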
"hf_model = OVHFModel(model_dir, device.value).model\n", | ||
"dataset = nncf.Dataset(libritts, partial(transform_fn, interface=interface))\n", | ||
"\n", | ||
"quantized_model = nncf.quantize(\n", |
I would suggest using INT4 weight compression with dynamic quantization (A8W4). @KodiaqQ claims that the performance of such a model is equal to the performance of the quantized model, but the compression rate is higher for the A8W4 model.
cc @MaximProshin
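For reference, a rough sketch of what the suggested A8W4 alternative could look like: 4-bit NNCF weight compression plus dynamic 8-bit activation quantization enabled at compile time. The mode, group size, and the OpenVINO property value are placeholders, not numbers from this PR:

```python
import nncf
import openvino as ov

# W4: compress weights to 4 bit instead of running full INT8 quantization
compressed_model = nncf.compress_weights(
    hf_model,                                 # same OpenVINO model object as above (assumed)
    mode=nncf.CompressWeightsMode.INT4_ASYM,
    group_size=64,                            # placeholder group size
    ratio=1.0,
)

# A8: dynamic 8-bit activation quantization is a runtime hint, set when compiling the model
core = ov.Core()
compiled_model = core.compile_model(
    compressed_model,
    device.value,
    {"DYNAMIC_QUANTIZATION_GROUP_SIZE": "32"},  # property name/value assumed for recent OpenVINO releases
)
```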
Please share the numbers for both cases. If INT4 is better, I'm OK with using that method.
We discussed it offline and agreed to keep INT8. At the same time, the issue with SDPA will be analyzed in #16177. If it is resolved, we will update the notebook afterwards.
@l-bat, can you review, please? Thanks.
@@ -22,6 +22,11 @@
"- [Run model inference](#Run-model-inference)\n",
Line #36. "__module.model.layers.*.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention"
Could you please explain in the description why we should add this pattern to IgnoredScope?
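For context, this is roughly how such a node-name pattern can be handed to NNCF so the matching SDPA operations are excluded from quantization; whether the notebook passes it via patterns or names is an assumption:

```python
import nncf

ignored_scope = nncf.IgnoredScope(
    patterns=[
        # matches the ScaledDotProductAttention node in every decoder layer
        "__module.model.layers.*.self_attn/aten::scaled_dot_product_attention/ScaledDotProductAttention",
    ]
)
# quantized_model = nncf.quantize(..., ignored_scope=ignored_scope)
```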
Done.
@@ -22,6 +22,11 @@
"- [Run model inference](#Run-model-inference)\n",
Line #3. demo = make_demo(interface)
I think it would be nice to use the quantized model in the interactive demo. Could you please add an option for the user to choose between the original and optimized models?
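A minimal sketch of one way to offer that choice, assuming an ipywidgets dropdown and that make_demo accepts whichever interface the user picks; the interface variable names are placeholders:

```python
import ipywidgets as widgets
from IPython.display import display

# Let the user pick which model backs the interactive demo
model_choice = widgets.Dropdown(
    options=["Original (FP)", "Quantized (INT8)"],
    value="Quantized (INT8)",
    description="Model:",
)
display(model_choice)

# interface = int8_interface if model_choice.value == "Quantized (INT8)" else fp_interface
# demo = make_demo(interface)
```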
Done.
@@ -22,6 +22,11 @@
"- [Run model inference](#Run-model-inference)\n",
Line #4. r = requests.get(
Please add a check that "skip_kernel_extension.py" already exists, to allow rerunning the notebook without an internet connection.
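A sketch of the requested guard: download the helper only when it is not already on disk, so the notebook can be rerun offline. The URL shown is illustrative and may not match the one used in the notebook:

```python
from pathlib import Path

import requests

utility_file = Path("skip_kernel_extension.py")
if not utility_file.exists():
    # Fetch the helper only on the first run; later reruns work without internet access.
    r = requests.get(
        url="https://raw.githubusercontent.com/openvinotoolkit/openvino_notebooks/latest/utils/skip_kernel_extension.py"
    )
    utility_file.write_text(r.text)
```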
Done.
@nikita-malininn, could you please also fix the formatting of the updated notebook? P.S. After the code formatting and the file downloading are fixed, I'm ready to merge your PR.
- Added nncf.quantize to the notebook
- Used the mixed preset and the transformer model type due to the model architecture (LLM)
- Used an ignored scope due to OpenVINO issues with the optimized SDPA inference
- Reusing the generate pipeline is possible because of the model architecture

Ticket: 157133