diff --git a/README.md b/README.md
index 9f25eefd94..54d8371b5b 100644
--- a/README.md
+++ b/README.md
@@ -75,12 +75,13 @@ It is possible to export your model to the [OpenVINO](https://docs.openvino.ai/2
 ```plain
 optimum-cli export openvino --model gpt2 ov_model
 ```
 
-If you add `--int8`, the weights will be quantized to INT8, the activations will be kept in floating point precision.
+If you add `--int8`, the model's linear and embedding weights will be quantized to INT8, while the activations will be kept in floating point precision.
 
 ```plain
 optimum-cli export openvino --model gpt2 --int8 ov_model
 ```
 
+To apply quantization on both weights and activations, please refer to the [documentation](https://huggingface.co/docs/optimum/main/en/intel/optimization_ov).
 
 #### Inference:
diff --git a/docs/source/inference.mdx b/docs/source/inference.mdx
index f526492a12..e93c39882a 100644
--- a/docs/source/inference.mdx
+++ b/docs/source/inference.mdx
@@ -102,7 +102,7 @@ You can also apply INT8 quantization on your models weights when exporting your
 ```plain
 optimum-cli export openvino --model gpt2 --int8 ov_model
 ```
 
-This will results in the exported model linear and embedding layers to be quanrtized to INT8, the activations will be kept in floating point precision.
+This will result in the linear and embedding layers of the exported model being quantized to INT8, while the activations will be kept in floating point precision.
 
 This can also be done when loading your model by setting the `load_in_8bit` argument when calling the `from_pretrained()` method.
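
For reference, a minimal sketch of the `load_in_8bit` loading path mentioned in the inference.mdx hunk, assuming the `OVModelForCausalLM` class from optimum-intel and the `gpt2` checkpoint used throughout the diff; the `export=True` flag and the `pipeline` wiring are illustrative assumptions, not part of the diff itself:

```python
from optimum.intel import OVModelForCausalLM
from transformers import AutoTokenizer, pipeline

# export=True converts the Transformers checkpoint to OpenVINO IR at load time;
# load_in_8bit=True quantizes the linear and embedding weights to INT8 while
# the activations stay in floating point, mirroring the CLI's --int8 flag.
model = OVModelForCausalLM.from_pretrained("gpt2", export=True, load_in_8bit=True)
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Run the quantized model through a standard transformers pipeline.
pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)
print(pipe("OpenVINO is")[0]["generated_text"])
```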