Explain torch compile error messages, improve PTQ vs QAT doc #730

Merged
merged 17 commits into from
Jun 17, 2024
58 changes: 54 additions & 4 deletions docs/deep-learning/fhe_assistant.md
@@ -53,12 +53,60 @@ concrete_clf.fit(X, y)
concrete_clf.compile(X, debug_config)
```

## Compilation error debugging
## Common compilation errors

Compilation errors that signal that the ML model is not FHE compatible are usually of two types:
#### 1. TLU input maximum bit-width is exceeded

1. TLU input maximum bit-width is exceeded
1. No crypto-parameters can be found for the ML model: `RuntimeError: NoParametersFound` is raised by the compiler
**Error message**: `this [N]-bit value is used as an input to a table lookup`

**Cause**: This error can occur when `rounding_threshold_bits` is not used and accumulated intermediate values in the computation exceed 16 bits.

**Possible solutions**:

- Reduce quantization `n_bits`. However, this may reduce accuracy. When quantization `n_bits` must be below 6, it is best to use [Quantization Aware Training](../deep-learning/fhe_friendly_models.md).
- Use `rounding_threshold_bits`. This feature is described [here](../explanations/advanced_features.md#rounded-activations-and-quantizers). It is recommended to use the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) setting and to set the rounding bits to 1 or 2 bits higher than the quantization `n_bits`.
- Use [pruning](../explanations/pruning.md)
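
The rounding recommendation above can be sketched as a small helper. This is only an illustration: `suggest_rounding_config` is a hypothetical name that is not part of Concrete ML, and the dictionary layout is assumed to be the `{"n_bits": ..., "method": ...}` form passed as `rounding_threshold_bits` to the compile functions.

```python
def suggest_rounding_config(quant_n_bits, margin=1):
    """Build a candidate `rounding_threshold_bits` value (hypothetical helper).

    Follows the rule of thumb above: round to 1 or 2 bits more than the
    quantization `n_bits`, using the approximate rounding method.
    """
    if margin not in (1, 2):
        raise ValueError("margin should be 1 or 2 bits above the quantization n_bits")
    return {"n_bits": quant_n_bits + margin, "method": "approximate"}


# With 5-bit quantization and a 1-bit margin, the candidate rounds to 6 bits.
config = suggest_rounding_config(quant_n_bits=5, margin=1)
print(config)  # {'n_bits': 6, 'method': 'approximate'}
```

The resulting dictionary is only a starting point; the value that works best for a given model still has to be found by experimentation.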

#### 2. No crypto-parameters can be found

**Error message**: `RuntimeError: NoParametersFound`

**Cause**: This error occurs when using `rounding_threshold_bits` in the `compile_torch_model` function.

**Possible solutions**: The solutions in this case are similar to the ones for the previous error.

#### 3. Quantization import failed

**Error message**: `Error occurred during quantization aware training (QAT) import [...] Could not determine a unique scale for the quantization!`.

**Cause**: This error occurs when a model imported as a quantization-aware training model lacks quantization operators. It indicates that some layers receive inputs that were not quantized through `QuantIdentity` layers. See [this guide](../deep-learning/fhe_friendly_models.md) on how to use Brevitas layers.

A common example is related to the concatenation operator. Suppose two tensors `x` and `y` are produced by two layers and need to be concatenated:

<!--pytest-codeblocks:skip-->

```python
x = self.dense1(x)
y = self.dense2(y)
z = torch.cat([x, y])
```

In the example above, the `x` and `y` tensors need to be quantized before being concatenated.

**Possible solutions**:

1. If the error occurs for the first layer of the model: Add a `QuantIdentity` layer in your model and apply it on the input of the `forward` function, before the first layer is computed.
1. If the error occurs for a concatenation or addition layer: Add a new `QuantIdentity` layer to your model, for example one called `quant_concat`. In the `forward` function, apply it to both `x` and `y` before concatenating them. Using a common `QuantIdentity` layer to quantize both tensors ensures that they have the same scale:

<!--pytest-codeblocks:skip-->

```python
z = torch.cat([self.quant_concat(x), self.quant_concat(y)])
```
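
Why a shared quantizer matters can be illustrated in plain Python. This is a simplified sketch of uniform quantization, not Brevitas' implementation: the `quantize` and `scale_for` helpers are hypothetical and only show that one scale is needed to interpret a concatenated integer tensor.

```python
def scale_for(values, n_bits=4):
    """Choose a scale so that the largest magnitude maps to the largest integer."""
    q_max = 2 ** (n_bits - 1) - 1
    return max(abs(v) for v in values) / q_max


def quantize(values, scale, n_bits=4):
    """Uniform signed quantization: map floats to clipped integers."""
    q_max = 2 ** (n_bits - 1) - 1
    q_min = -(2 ** (n_bits - 1))
    return [max(q_min, min(q_max, round(v / scale))) for v in values]


x = [0.5, -1.0, 0.25]
y = [4.0, -2.0, 7.0]

# Separate quantizers: each tensor gets its own scale, so an integer value of 7
# stands for 1.0 in x but for 7.0 in y. The concatenation has no single scale.
separate = quantize(x, scale_for(x)) + quantize(y, scale_for(y))

# Shared quantizer: one scale for both tensors, so every integer in the
# concatenated result can be de-quantized with the same factor.
shared_scale = scale_for(x + y)
shared = quantize(x, shared_scale) + quantize(y, shared_scale)
print(shared_scale, shared)
```

With the shared scale, multiplying any element of `shared` by `shared_scale` gives an approximation of the original value, which is the property the import step needs in order to determine a unique scale.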

## Debugging compilation errors

Compilation errors due to FHE incompatible models, such as maximum bit-width exceeded or `NoParametersFound` can be debugged by examining the bit-widths associated with various intermediate values of the FHE computation.
Suggested change
Compilation errors due to FHE incompatible models, such as maximum bit-width exceeded or `NoParametersFound` can be debugged by examining the bit-widths associated with various intermediate values of the FHE computation.
For compilation errors due to FHE-incompatible models, such as maximum bit-width exceeded or `NoParametersFound`, you can debug them by examining the bit-widths associated with various intermediate values of the FHE computation.


The following produces a neural network that is not FHE-compatible:

@@ -116,6 +164,8 @@ Function you are trying to compile cannot be compiled:

The error `this 17-bit value is used as an input to a table lookup` indicates that the 16-bit limit on the input of the Table Lookup (TLU) operation has been exceeded. To pinpoint the model layer that causes the error, Concrete ML provides the [bitwidth_and_range_report](../references/api/concrete.ml.quantization.quantized_module.md#method-bitwidth_and_range_report) helper function. First, the model must be compiled so that it can be [simulated](fhe_assistant.md#simulation).
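
Once a report is available, finding the offending layers amounts to a simple filter. The sketch below assumes the report is a dictionary that maps layer names to entries with a `bitwidth` key; that layout is an assumption and should be checked against the API reference. The data is hand-written for illustration and does not come from a real model, and `layers_over_limit` is a hypothetical helper.

```python
TLU_MAX_BITS = 16  # table lookup input limit quoted in the text above


def layers_over_limit(report, limit=TLU_MAX_BITS):
    """Return (layer_name, bitwidth) pairs whose bit-width exceeds the limit."""
    return [
        (name, entry["bitwidth"])
        for name, entry in report.items()
        if entry["bitwidth"] > limit
    ]


# Hand-written example data in the assumed layout.
example_report = {
    "fc1": {"range": (-1200, 1350), "bitwidth": 12},
    "fc2": {"range": (-60000, 58000), "bitwidth": 17},
    "fc3": {"range": (-300, 280), "bitwidth": 10},
}

print(layers_over_limit(example_report))  # [('fc2', 17)]
```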

On the other hand, `NoParametersFound` is encountered when using `rounding_threshold_bits`. When using this setting, the 16-bit accumulator limit is relaxed. However, reducing bit-width, or reducing the `rounding_threshold_bits`, or using the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) rounding method can help.
Suggested change
On the other hand, `NoParametersFound` is encountered when using `rounding_threshold_bits`. When using this setting, the 16-bit accumulator limit is relaxed. However, reducing bit-width, or reducing the `rounding_threshold_bits`, or using the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) rounding method can help.
On the other hand, `NoParametersFound` occurs when using `rounding_threshold_bits`. With this setting, the 16-bit accumulator limit is relaxed. However, the following solutions could help:
- Reduce bit width
- Reduce the `rounding_threshold_bits`
- Use the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) rounding method.


### Fixing compilation errors

To make this network FHE-compatible one can apply several techniques:
107 changes: 54 additions & 53 deletions docs/deep-learning/torch_support.md
@@ -2,14 +2,21 @@

In addition to the built-in models, Concrete ML supports generic machine learning models implemented with Torch, or [exported as ONNX graphs](onnx_support.md).

As [Quantization Aware Training (QAT)](../explanations/quantization.md) is the most appropriate method of training neural networks that are compatible with [FHE constraints](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints), Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library providing QAT support for PyTorch.
There are two approaches to build [FHE-compatible deep networks](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints):

The following example uses a simple QAT PyTorch model that implements a fully connected neural network with two hidden layers. Due to its small size, making this model respect FHE constraints is relatively easy.
- [Quantization Aware Training (QAT)](../explanations/quantization.md) requires using custom layers, but can quantize weights and activations to low bit-widths. Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library providing QAT support for PyTorch. To use this mode, compile models using `compile_brevitas_qat_model`
Suggested change
- [Quantization Aware Training (QAT)](../explanations/quantization.md) requires using custom layers, but can quantize weights and activations to low bit-widths. Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library providing QAT support for PyTorch. To use this mode, compile models using `compile_brevitas_qat_model`
- [**Quantization Aware Training (QAT)**](../explanations/quantization.md): QAT requires using custom layers, but can quantize weights and activations to low bit widths. Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library providing QAT support for PyTorch. To use this mode, compile models using `compile_brevitas_qat_model`.

- **Post-training Quantization**: This mode allows a vanilla PyTorch model to be compiled. However, when quantizing weights & activations to fewer than 7 bits, the accuracy can decrease strongly. On the other hand, depending on the model size, quantizing with 6-8 bits can be incompatible with FHE constraints. To use this mode, compile models with `compile_torch_model`.

Both approaches require the `rounding_threshold_bits` parameter to be set accordingly. The best values for this parameter need to be determined through experimentation. A good initial value to try is `6`. See [here](../explanations/advanced_features.md#rounded-activations-and-quantizers) for more details.
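
The experimentation mentioned above can be organized as a small search loop. This is a sketch under stated assumptions: `pick_rounding_bits` is a hypothetical helper, and `evaluate` stands for a user-supplied function that compiles the model with the given rounding bits and returns an accuracy, for example measured with FHE simulation. The loop prefers the smallest acceptable value on the assumption that fewer rounding bits give lower latency.

```python
def pick_rounding_bits(candidates, evaluate, min_accuracy):
    """Return the smallest candidate whose measured accuracy is acceptable.

    `evaluate` is a user-supplied callable (hypothetical here) that takes the
    number of rounding bits and returns an accuracy between 0 and 1.
    Returns None when no candidate reaches `min_accuracy`.
    """
    for n_bits in sorted(candidates):
        if evaluate(n_bits) >= min_accuracy:
            return n_bits
    return None


# Made-up accuracies standing in for real compile-and-evaluate runs.
fake_accuracy = {4: 0.71, 5: 0.88, 6: 0.93, 7: 0.94}

best = pick_rounding_bits([6, 4, 5, 7], fake_accuracy.get, min_accuracy=0.9)
print(best)  # 6
```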

{% hint style="info" %}
Converting neural networks to use FHE can be done with `compile_brevitas_qat_model` or with `compile_torch_model` for post-training quantization. If the model can not be converted to FHE two types of errors can be raised: (1) crypto-parameters can not be found and, (2) table look-up bit-width limit is exceeded. See the [debugging section](fhe_assistant.md#compilation-error-debugging) if you encounter these errors.
**See the [common compilation errors page](./fhe_assistant.md#common-compilation-errors) for an explanation of some error messages that the compilation function may raise.**
{% endhint %}

## Quantization-aware training

The following example uses a simple QAT PyTorch model that implements a fully connected neural network with two hidden layers. Due to its small size, making this model respect FHE constraints is relatively easy. To use QAT, Brevitas `QuantIdentity` nodes must be inserted in the PyTorch model, including one that quantizes the input of the `forward` function.

```python
import brevitas.nn as qnn
import torch.nn as nn
@@ -51,38 +58,60 @@ torch_model = QATSimpleNet(30)
quantized_module = compile_brevitas_qat_model(
    torch_model,  # our model
    torch_input,  # a representative input-set to be used for both quantization and compilation
    rounding_threshold_bits={"n_bits": 6, "method": "approximate"}
)

```

## Configuring quantization parameters
{% hint style="warning" %}
If `QuantIdentity` layers are missing for any input or intermediate value, the compile function will raise an error. See the [common compilation errors page](./fhe_assistant.md#common-compilation-errors) for an explanation.
{% endhint %}

The PyTorch/Brevitas models, created following the example above, require the user to configure quantization parameters such as `bit_width` (activation bit-width) and `weight_bit_width`. The quantization parameters, along with the number of neurons on each layer, will determine the accumulator bit-width of the network. Larger accumulator bit-widths result in higher accuracy but slower FHE inference time.
## Post-training quantization

The following configurations were determined through experimentation for convolutional and dense layers.
The following example uses a simple PyTorch model that implements a fully connected neural network with two hidden layers. The model is compiled to use FHE using `compile_torch_model`.

```python
import torch.nn as nn
import torch

N_FEAT = 12
n_bits = 6

class PTQSimpleNet(nn.Module):
    def __init__(self, n_hidden):
        super().__init__()

        # Plain PyTorch layers: no quantization operators are needed for PTQ
        self.fc1 = nn.Linear(N_FEAT, n_hidden)
        self.fc2 = nn.Linear(n_hidden, n_hidden)
        self.fc3 = nn.Linear(n_hidden, 2)

    def forward(self, x):
        x = torch.relu(self.fc1(x))
        x = torch.relu(self.fc2(x))
        x = self.fc3(x)
        return x

from concrete.ml.torch.compile import compile_torch_model
import numpy

| target accumulator bit-width | activation bit-width | weight bit-width | number of active neurons |
| ---------------------------- | -------------------- | ---------------- | ------------------------ |
| 8 | 3 | 3 | 80 |
| 10 | 4 | 3 | 90 |
| 12 | 5 | 5 | 110 |
| 14 | 6 | 6 | 110 |
| 16 | 7 | 6 | 120 |
torch_input = torch.randn(100, N_FEAT)
torch_model = PTQSimpleNet(5)
quantized_module = compile_torch_model(
    torch_model,  # our model
    torch_input,  # a representative input-set to be used for both quantization and compilation
    n_bits=6,
    rounding_threshold_bits={"n_bits": 6, "method": "approximate"}
)
```

Using the templates above, the probability of obtaining the target accumulator bit-width, for a single layer, was determined experimentally by training 10 models for each of the following data-sets.
## Configuring quantization parameters

| <p>probability of obtaining<br>the accumulator bit-width</p> | 8 | 10 | 12 | 14 | 16 |
| ------------------------------------------------------------ | --- | ---- | --- | --- | ---- |
| mnist,fashion | 72% | 100% | 72% | 85% | 100% |
| cifar10 | 88% | 88% | 75% | 75% | 88% |
| cifar100 | 73% | 88% | 61% | 66% | 100% |
With QAT (the PyTorch/Brevitas models created following the example above), you need to configure quantization parameters such as `bit_width` (activation bit-width) and `weight_bit_width`. When using this mode, set `n_bits=None` in the `compile_brevitas_qat_model`.

Note that the accuracy on larger data-sets, when the accumulator size is low, is also reduced strongly.
With PTQ, you need to set the `n_bits` value in the `compile_torch_model` function and must manually determine the trade-off between accuracy, FHE compatibility, and latency.

| <p>accuracy for target<br>accumulator bit-width</p> | 8 | 10 | 12 | 14 | 16 |
| --------------------------------------------------- | --- | --- | --- | --- | --- |
| cifar10 | 20% | 37% | 89% | 90% | 90% |
| cifar100 | 6% | 30% | 67% | 69% | 69% |
The quantization parameters, along with the number of neurons on each layer, will determine the accumulator bit-width of the network. Larger accumulator bit-widths result in higher accuracy but slower FHE inference time.
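
How these quantities interact can be seen from a worst-case bound. The sketch below is illustrative arithmetic, not a Concrete ML function: it assumes unsigned activations and weights that all take their maximum value at once, which real networks rarely do, so actual accumulators are typically narrower than this bound. `worst_case_accumulator_bits` is a hypothetical helper.

```python
def worst_case_accumulator_bits(n_inputs, activation_bits, weight_bits):
    """Upper bound on the bit-width of a dot product of unsigned integers."""
    max_activation = 2 ** activation_bits - 1
    max_weight = 2 ** weight_bits - 1
    # Largest possible sum when every product takes its maximum value
    max_sum = n_inputs * max_activation * max_weight
    return max_sum.bit_length()


for n_bits in (3, 4, 6):
    bits = worst_case_accumulator_bits(
        n_inputs=100, activation_bits=n_bits, weight_bits=n_bits
    )
    status = "over the 16-bit limit" if bits > 16 else "within the 16-bit limit"
    print(f"{n_bits}-bit quantization, 100 inputs: up to {bits} bits ({status})")
```

For a layer with 100 inputs, the bound is 13 bits at 3-bit quantization, 15 bits at 4 bits, and 19 bits at 6 bits, which shows how quickly higher `n_bits` values or wider layers can push the accumulator toward the table lookup limit.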

## Running encrypted inference

@@ -100,7 +129,7 @@ In this example, the input values `x_test` and the predicted values `y_pred` are

## Simulated FHE Inference in the clear

The user can also perform the inference on clear data. Two approaches exist:
You can perform the inference on clear data in order to evaluate the impact of quantization and of FHE computation on the accuracy of your model. See [this section](../deep-learning/fhe_assistant.md#simulation) for more details. Two approaches exist:

- `quantized_module.forward(quantized_x, fhe="simulate")`: simulates FHE execution taking into account Table Lookup errors.\
De-quantization must be done in a second step as for actual FHE execution. Simulation takes into account the `p_error`/`global_p_error` parameters
@@ -110,34 +139,6 @@ The user can also perform the inference on clear data. Two approaches exist:
FHE simulation makes it possible to measure the impact of the Table Lookup error on the model accuracy. The Table Lookup error can be adjusted using `p_error`/`global_p_error`, as described in the [approximate computation](../explanations/advanced_features.md#approximate-computations) section.
{% endhint %}

## Generic Quantization Aware Training import

While the example above shows how to import a Brevitas/PyTorch model, Concrete ML also provides an option to import generic QAT models implemented in PyTorch or through ONNX. Deep learning models made with TensorFlow or Keras should be usable by preliminary converting them to ONNX.

QAT models contain quantizers in the PyTorch graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized.

Suppose that `n_bits_qat` is the bit-width of activations and weights during the QAT process. To import a PyTorch QAT network, you can use the [`compile_torch_model`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) library function, passing `import_qat=True`:

<!--pytest-codeblocks:skip-->

```python
from concrete.ml.torch.compile import compile_torch_model
n_bits_qat = 3

quantized_module = compile_torch_model(
torch_model,
torch_input,
import_qat=True,
n_bits=n_bits_qat,
)
```

Alternatively, if you want to import an ONNX model directly, please see [the ONNX guide](onnx_support.md). The [`compile_onnx_model`](../references/api/concrete.ml.torch.compile.md#function-compile_onnx_model) also supports the `import_qat` parameter.

{% hint style="warning" %}
When importing QAT models using this generic pipeline, a representative calibration set should be given as quantization parameters in the model need to be inferred from the statistics of the values encountered during inference.
{% endhint %}

## Supported operators and activations

Concrete ML supports a variety of PyTorch operators that can be used to build fully connected or convolutional neural networks, with normalization and activation layers. Moreover, many element-wise operators are supported.
5 changes: 4 additions & 1 deletion src/concrete/ml/onnx/ops_impl.py
@@ -26,7 +26,10 @@


class RawOpOutput(numpy.ndarray):
"""Type construct that marks an ndarray as a raw output of a quantized op."""
"""Type construct that marks an ndarray as a raw output of a quantized op.

A raw output is an output that is a clear constant such as a shape, a constant float, or an index.
"""


# This function is only used for comparison operators that return boolean values by default.