Explain torch compile error messages, improve PTQ vs QAT doc #730
Merged

Changes from all commits (17 commits)
- 346f997 docs: add error message explanation (andrei-stoian-zama)
- becad10 docs: improve torch support explanations (andrei-stoian-zama)
- 663c8c2 fix: review changes (andrei-stoian-zama)
- 2dcacd5 fix: review changes (andrei-stoian-zama)
- b118fde fix: spacing (andrei-stoian-zama)
- 65ba564 fix: Update docs/deep-learning/torch_support.md (andrei-stoian-zama)
- 27b84c0 fix: Update docs/deep-learning/fhe_assistant.md (andrei-stoian-zama)
- cd67f9d fix: Update docs/deep-learning/fhe_assistant.md (andrei-stoian-zama)
- b0c5c34 fix: Update docs/deep-learning/fhe_assistant.md (andrei-stoian-zama)
- 7db02ea fix: Update docs/deep-learning/torch_support.md (andrei-stoian-zama)
- 4fe35a2 fix: Update docs/deep-learning/torch_support.md (andrei-stoian-zama)
- 8ab41e3 fix: Update docs/deep-learning/torch_support.md (andrei-stoian-zama)
- feff4f6 fix: Update docs/deep-learning/torch_support.md (andrei-stoian-zama)
- c70609b fix: Update docs/deep-learning/torch_support.md (andrei-stoian-zama)
- 6746d60 fix: Update docs/deep-learning/fhe_assistant.md (andrei-stoian-zama)
- a33d063 fix: formatting (andrei-stoian-zama)
- 1d3fbd6 fix: bitwidth (andrei-stoian-zama)
docs/deep-learning/fhe_assistant.md

@@ -53,12 +53,60 @@ concrete_clf.fit(X, y)
concrete_clf.compile(X, debug_config)
```

## Compilation error debugging
## Common compilation errors

Compilation errors that signal that the ML model is not FHE compatible are usually of two types:
#### 1. TLU input maximum bit-width is exceeded

1. TLU input maximum bit-width is exceeded
1. No crypto-parameters can be found for the ML model: `RuntimeError: NoParametersFound` is raised by the compiler
**Error message**: `this [N]-bit value is used as an input to a table lookup`

**Cause**: This error can occur when `rounding_threshold_bits` is not used and accumulated intermediate values in the computation exceed 16 bits.

**Possible solutions**:
- Reduce quantization `n_bits`. However, this may reduce accuracy. When quantization `n_bits` must be below 6, it is best to use [Quantization Aware Training](../deep-learning/fhe_friendly_models.md).
- Use `rounding_threshold_bits`. This feature is described [here](../explanations/advanced_features.md#rounded-activations-and-quantizers). It is recommended to use the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) setting, and to set the rounding bits to 1 or 2 bits higher than the quantization `n_bits`, as in the sketch below.
- Use [pruning](../explanations/pruning.md).
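
For illustration, a minimal sketch of the `rounding_threshold_bits` solution is shown below. The `torch_model` and `torch_input` names are placeholders for the failing model and a representative input-set, and the bit-width values are only examples; the call mirrors the `compile_torch_model` usage shown in the Torch support documentation.

<!--pytest-codeblocks:skip-->

```python
from concrete.ml.torch.compile import compile_torch_model

# Sketch: `torch_model` and `torch_input` are placeholders for the model that
# triggered the TLU bit-width error and a representative input-set.
quantized_module = compile_torch_model(
    torch_model,
    torch_input,
    n_bits=4,  # illustrative quantization bit-width
    # Round TLU inputs 2 bits above n_bits; "approximate" corresponds to
    # the fhe.Exactness.APPROXIMATE setting
    rounding_threshold_bits={"n_bits": 6, "method": "approximate"},
)
```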

#### 2. No crypto-parameters can be found

**Error message**: `RuntimeError: NoParametersFound`

**Cause**: This error occurs when using `rounding_threshold_bits` in the `compile_torch_model` function.

**Possible solutions**: The solutions in this case are similar to the ones for the previous error.

#### 3. Quantization import failed

**Error message**: `Error occurred during quantization aware training (QAT) import [...] Could not determine a unique scale for the quantization!`.

**Cause**: This error occurs when the model imported as a quantization-aware training model lacks quantization operators. See [this guide](../deep-learning/fhe_friendly_models.md) on how to use Brevitas layers. This error message indicates that some layers do not take inputs quantized through `QuantIdentity` layers.

A common example is related to the concatenation operator. Suppose two tensors `x` and `y` are produced by two layers and need to be concatenated:

<!--pytest-codeblocks:skip-->

```python
x = self.dense1(x)
y = self.dense2(y)
z = torch.cat([x, y])
```

In the example above, the tensors `x` and `y` need to be quantized before being concatenated.

**Possible solutions**:

1. If the error occurs for the first layer of the model: add a `QuantIdentity` layer in your model and apply it on the input of the `forward` function, before the first layer is computed.
1. If the error occurs for a concatenation or addition layer: add a new `QuantIdentity` layer in your model. Suppose it is called `quant_concat`. In the `forward` function, before concatenating `x` and `y`, apply it to both tensors. Using a common `QuantIdentity` layer to quantize both tensors that are concatenated ensures that they have the same scale:

<!--pytest-codeblocks:skip-->

```python
z = torch.cat([self.quant_concat(x), self.quant_concat(y)])
```
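
As an illustration, below is a minimal, self-contained sketch of a Brevitas module that applies a shared `QuantIdentity` to both branches before concatenation. The layer sizes and bit-widths are arbitrary choices, not values taken from the documentation.

<!--pytest-codeblocks:skip-->

```python
import brevitas.nn as qnn
import torch
import torch.nn as nn

N_BITS = 4  # illustrative bit-width


class ConcatNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Quantize each input of the forward function
        self.quant_x = qnn.QuantIdentity(bit_width=N_BITS, return_quant_tensor=True)
        self.quant_y = qnn.QuantIdentity(bit_width=N_BITS, return_quant_tensor=True)
        self.dense1 = qnn.QuantLinear(8, 4, bias=True, weight_bit_width=N_BITS)
        self.dense2 = qnn.QuantLinear(8, 4, bias=True, weight_bit_width=N_BITS)
        # A single QuantIdentity applied to both branches, so that the
        # concatenated tensors share the same quantization scale
        self.quant_concat = qnn.QuantIdentity(bit_width=N_BITS)

    def forward(self, x, y):
        x = self.dense1(self.quant_x(x))
        y = self.dense2(self.quant_y(y))
        return torch.cat([self.quant_concat(x), self.quant_concat(y)], dim=1)
```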

## Debugging compilation errors

Compilation errors due to FHE-incompatible models, such as exceeding the maximum bit-width or `NoParametersFound`, can be debugged by examining the bit-widths associated with various intermediate values of the FHE computation.

The following produces a neural network that is not FHE-compatible:

@@ -116,6 +164,8 @@ Function you are trying to compile cannot be compiled:

The error `this 17-bit value is used as an input to a table lookup` indicates that the 16-bit limit on the input of the Table Lookup (TLU) operation has been exceeded. To pinpoint the model layer that causes the error, Concrete ML provides the [bitwidth_and_range_report](../references/api/concrete.ml.quantization.quantized_module.md#method-bitwidth_and_range_report) helper function. First, the model must be compiled so that it can be [simulated](fhe_assistant.md#simulation).
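
A possible usage sketch is shown below. It assumes a `quantized_module` obtained from a successful compilation, for example after lowering `n_bits` or enabling rounding, and only relies on the helper function named above.

<!--pytest-codeblocks:skip-->

```python
# Sketch: `quantized_module` is the result of compile_torch_model or
# compile_brevitas_qat_model on the network above.
# The report associates layer names with the value range and bit-width of
# their outputs, which helps locate the layer exceeding the 16-bit TLU limit.
print(quantized_module.bitwidth_and_range_report())
```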

On the other hand, `NoParametersFound` is encountered when using `rounding_threshold_bits`. When using this setting, the 16-bit accumulator limit is relaxed. However, reducing the quantization bit-width, reducing `rounding_threshold_bits`, or using the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) rounding method can help.

### Fixing compilation errors

To make this network FHE-compatible, one can apply several techniques:
docs/deep-learning/torch_support.md

@@ -2,14 +2,21 @@

In addition to the built-in models, Concrete ML supports generic machine learning models implemented with Torch, or [exported as ONNX graphs](onnx_support.md).

As [Quantization Aware Training (QAT)](../explanations/quantization.md) is the most appropriate method of training neural networks that are compatible with [FHE constraints](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints), Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library providing QAT support for PyTorch.
There are two approaches to build [FHE-compatible deep networks](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints):

The following example uses a simple QAT PyTorch model that implements a fully connected neural network with two hidden layers. Due to its small size, making this model respect FHE constraints is relatively easy.
- [Quantization Aware Training (QAT)](../explanations/quantization.md) requires using custom layers, but can quantize weights and activations to low bit-widths. Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library providing QAT support for PyTorch. To use this mode, compile models using `compile_brevitas_qat_model`.

- **Post-training Quantization**: This mode allows a vanilla PyTorch model to be compiled. However, when quantizing weights and activations to fewer than 7 bits, the accuracy can drop significantly. Conversely, depending on the model size, quantizing with 6-8 bits can be incompatible with FHE constraints. To use this mode, compile models with `compile_torch_model`.

Both approaches require the `rounding_threshold_bits` parameter to be set accordingly. The best values for this parameter need to be determined through experimentation. A good initial value to try is `6`. See [here](../explanations/advanced_features.md#rounded-activations-and-quantizers) for more details.

{% hint style="info" %}
Converting neural networks to use FHE can be done with `compile_brevitas_qat_model` or with `compile_torch_model` for post-training quantization. If the model cannot be converted to FHE, two types of errors can be raised: (1) crypto-parameters cannot be found and (2) the table look-up bit-width limit is exceeded. See the [debugging section](fhe_assistant.md#compilation-error-debugging) if you encounter these errors.
**See the [common compilation errors page](./fhe_assistant.md#common-compilation-errors) for an explanation of some error messages that the compilation function may raise.**
{% endhint %}

## Quantization-aware training

The following example uses a simple QAT PyTorch model that implements a fully connected neural network with two hidden layers. Due to its small size, making this model respect FHE constraints is relatively easy. To use QAT, Brevitas `QuantIdentity` nodes must be inserted in the PyTorch model, including one that quantizes the input of the `forward` function.

```python
import brevitas.nn as qnn
import torch.nn as nn

@@ -51,38 +58,60 @@ torch_model = QATSimpleNet(30)
quantized_module = compile_brevitas_qat_model(
    torch_model,  # our model
    torch_input,  # a representative input-set to be used for both quantization and compilation
    rounding_threshold_bits={"n_bits": 6, "method": "approximate"}
)

```

## Configuring quantization parameters
{% hint style="warning" %}
If `QuantIdentity` layers are missing for any input or intermediate value, the compile function will raise an error. See the [common compilation errors page](./fhe_assistant.md#common-compilation-errors) for an explanation.
{% endhint %}

The PyTorch/Brevitas models, created following the example above, require the user to configure quantization parameters such as `bit_width` (activation bit-width) and `weight_bit_width`. The quantization parameters, along with the number of neurons on each layer, will determine the accumulator bit-width of the network. Larger accumulator bit-widths result in higher accuracy but slower FHE inference time.
## Post-training quantization

The following configurations were determined through experimentation for convolutional and dense layers.
The following example uses a simple PyTorch model that implements a fully connected neural network with two hidden layers. The model is compiled to use FHE using `compile_torch_model`.

```python
import torch.nn as nn
import torch

N_FEAT = 12
n_bits = 6

class PTQSimpleNet(nn.Module):
    def __init__(self, n_hidden):
        super().__init__()

        self.fc1 = nn.Linear(N_FEAT, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_hidden)
        self.fc3 = nn.Linear(n_hidden, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

from concrete.ml.torch.compile import compile_torch_model
import numpy

torch_input = torch.randn(100, N_FEAT)
torch_model = PTQSimpleNet(5)
quantized_module = compile_torch_model(
    torch_model,  # our model
    torch_input,  # a representative input-set to be used for both quantization and compilation
    n_bits=6,
    rounding_threshold_bits={"n_bits": 6, "method": "approximate"}
)
```

| target accumulator bit-width | activation bit-width | weight bit-width | number of active neurons |
| ---------------------------- | -------------------- | ---------------- | ------------------------ |
| 8                            | 3                    | 3                | 80                       |
| 10                           | 4                    | 3                | 90                       |
| 12                           | 5                    | 5                | 110                      |
| 14                           | 6                    | 6                | 110                      |
| 16                           | 7                    | 6                | 120                      |

Using the templates above, the probability of obtaining the target accumulator bit-width, for a single layer, was determined experimentally by training 10 models for each of the following data-sets.
## Configuring quantization parameters

| <p>probability of obtaining<br>the accumulator bit-width</p> | 8   | 10   | 12  | 14  | 16   |
| ------------------------------------------------------------ | --- | ---- | --- | --- | ---- |
| mnist,fashion                                                 | 72% | 100% | 72% | 85% | 100% |
| cifar10                                                       | 88% | 88%  | 75% | 75% | 88%  |
| cifar100                                                      | 73% | 88%  | 61% | 66% | 100% |
With QAT (the PyTorch/Brevitas models created following the example above), you need to configure quantization parameters such as `bit_width` (activation bit-width) and `weight_bit_width`. When using this mode, set `n_bits=None` in the `compile_brevitas_qat_model` function.
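
As a hedged illustration, the snippet below shows where these parameters appear: the bit-widths are set on the Brevitas layers themselves, while the compile call receives `n_bits=None`. The layer sizes and bit-width values are arbitrary.

<!--pytest-codeblocks:skip-->

```python
import brevitas.nn as qnn
import torch
import torch.nn as nn

from concrete.ml.torch.compile import compile_brevitas_qat_model


class TinyQATNet(nn.Module):
    def __init__(self):
        super().__init__()
        # bit_width controls the activation quantization,
        # weight_bit_width controls the weight quantization
        self.quant_input = qnn.QuantIdentity(bit_width=4, return_quant_tensor=True)
        self.fc1 = qnn.QuantLinear(10, 2, bias=True, weight_bit_width=4)

    def forward(self, x):
        return self.fc1(self.quant_input(x))


torch_input = torch.randn(100, 10)
quantized_module = compile_brevitas_qat_model(
    TinyQATNet(),
    torch_input,
    n_bits=None,  # bit-widths are read from the Brevitas layers
)
```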

Note that the accuracy on larger data-sets, when the accumulator size is low, is also strongly reduced.
With PTQ, you need to set the `n_bits` value in the `compile_torch_model` function and must manually determine the trade-off between accuracy, FHE compatibility, and latency.

| <p>accuracy for target<br>accumulator bit-width</p> | 8   | 10  | 12  | 14  | 16  |
| --------------------------------------------------- | --- | --- | --- | --- | --- |
| cifar10                                              | 20% | 37% | 89% | 90% | 90% |
| cifar100                                             | 6%  | 30% | 67% | 69% | 69% |
The quantization parameters, along with the number of neurons on each layer, will determine the accumulator bit-width of the network. Larger accumulator bit-widths result in higher accuracy but slower FHE inference time.

## Running encrypted inference

@@ -100,7 +129,7 @@ In this example, the input values `x_test` and the predicted values `y_pred` are

## Simulated FHE Inference in the clear

The user can also perform the inference on clear data. Two approaches exist:
You can perform the inference on clear data in order to evaluate the impact of quantization and of FHE computation on the accuracy of your model. See [this section](../deep-learning/fhe_assistant.md#simulation) for more details. Two approaches exist:

- `quantized_module.forward(quantized_x, fhe="simulate")`: simulates FHE execution taking into account Table Lookup errors.\
  De-quantization must be done in a second step as for actual FHE execution. Simulation takes into account the `p_error`/`global_p_error` parameters

@@ -110,34 +139,6 @@ The user can also perform the inference on clear data. Two approaches exist:

FHE simulation makes it possible to measure the impact of the Table Lookup error on the model accuracy. The Table Lookup error can be adjusted using `p_error`/`global_p_error`, as described in the [approximate computation](../explanations/advanced_features.md#approximate-computations) section.
{% endhint %}
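
A possible sketch of the first approach, assuming `x_test` is a NumPy array of clear inputs and `quantized_module` is the result of one of the compilation examples above:

<!--pytest-codeblocks:skip-->

```python
# Quantize the clear inputs, simulate FHE execution, then de-quantize
q_x = quantized_module.quantize_input(x_test)
q_y_simulated = quantized_module.forward(q_x, fhe="simulate")
y_simulated = quantized_module.dequantize_output(q_y_simulated)
```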

## Generic Quantization Aware Training import

While the example above shows how to import a Brevitas/PyTorch model, Concrete ML also provides an option to import generic QAT models implemented in PyTorch or through ONNX. Deep learning models made with TensorFlow or Keras should be usable by first converting them to ONNX.

QAT models contain quantizers in the PyTorch graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized.

Suppose that `n_bits_qat` is the bit-width of activations and weights during the QAT process. To import a PyTorch QAT network, you can use the [`compile_torch_model`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) library function, passing `import_qat=True`:

<!--pytest-codeblocks:skip-->

```python
from concrete.ml.torch.compile import compile_torch_model
n_bits_qat = 3

quantized_module = compile_torch_model(
    torch_model,
    torch_input,
    import_qat=True,
    n_bits=n_bits_qat,
)
```


Alternatively, if you want to import an ONNX model directly, please see [the ONNX guide](onnx_support.md). The [`compile_onnx_model`](../references/api/concrete.ml.torch.compile.md#function-compile_onnx_model) function also supports the `import_qat` parameter.
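
For illustration, a hedged sketch of the ONNX path is shown below. The file name `qat_model.onnx` is a placeholder, and `torch_input` and `n_bits_qat` are assumed to be defined as in the snippet above.

<!--pytest-codeblocks:skip-->

```python
import onnx

from concrete.ml.torch.compile import compile_onnx_model

onnx_model = onnx.load("qat_model.onnx")  # placeholder path

quantized_module = compile_onnx_model(
    onnx_model,
    torch_input,  # representative calibration input-set
    import_qat=True,
    n_bits=n_bits_qat,
)
```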

{% hint style="warning" %}
When importing QAT models using this generic pipeline, a representative calibration set should be given, as the quantization parameters in the model need to be inferred from the statistics of the values encountered during inference.
{% endhint %}

## Supported operators and activations

Concrete ML supports a variety of PyTorch operators that can be used to build fully connected or convolutional neural networks, with normalization and activation layers. Moreover, many element-wise operators are supported.