diff --git a/docs/built-in-models/training.md b/docs/built-in-models/training.md index f7675c291..6ab86d686 100644 --- a/docs/built-in-models/training.md +++ b/docs/built-in-models/training.md @@ -2,6 +2,8 @@ This document explains how to train [SGD Logistic Regression](../references/api/concrete.ml.sklearn.linear_model.md#class-sgdclassifier) on encrypted data. +## Introduction + Training on encrypted data is done through an FHE program that is generated by Concrete ML, based on the characteristics of the data that are given to the `fit` function. Once the FHE program associated with the `SGDClassifier` object has fit the encrypted data, it performs specifically to that data's distribution and dimensionality. When deploying encrypted training services, you need to consider the type of data that future users of your services will train on:
diff --git a/docs/deep-learning/fhe_assistant.md b/docs/deep-learning/fhe_assistant.md index 63bba03ec..aac5d4383 100644 --- a/docs/deep-learning/fhe_assistant.md +++ b/docs/deep-learning/fhe_assistant.md @@ -1,6 +1,6 @@ # Debugging models
-This section provides a set of tools and guidelines to help users build optimized FHE-compatible models. It discusses FHE simulation, the key-cache functionality that helps speed-up FHE result debugging, and gives a guide to evaluate circuit complexity.
+This section provides a set of tools and guidelines to help users debug errors and build optimized models that are compatible with Fully Homomorphic Encryption (FHE).
## Simulation @@ -59,7 +59,7 @@ concrete_clf.compile(X, debug_config) **Error message**: `this [N]-bit value is used as an input to a table lookup`
-**Cause**: This error can occur when `rounding_threshold_bits` is not used and accumulated intermediate values in the computation exceed 16 bits.
+**Cause**: This error can occur when `rounding_threshold_bits` is not used and accumulated intermediate values in the computation exceed 16 bits. To pinpoint the model layer that causes the error, Concrete ML provides the [bitwidth_and_range_report](../references/api/concrete.ml.quantization.quantized_module.md#method-bitwidth_and_range_report) helper function. To use this function, the model must be compiled first so that it can be [simulated](fhe_assistant.md#simulation).
**Possible solutions**: @@ -71,9 +71,9 @@ concrete_clf.compile(X, debug_config) **Error message**: `RuntimeError: NoParametersFound`
-**Cause**: This error occurs when using `rounding_threshold_bits` in the `compile_torch_model` function.
+**Cause**: This error occurs when cryptosystem parameters cannot be found for the model bit-width, rounding mode, and requested `p_error` when using `rounding_threshold_bits` in the `compile_torch_model` function. With `rounding_threshold_bits` set, the 16-bit accumulator limit is relaxed, so the `this [N]-bit value is used as an input to a table lookup` error does not occur. However, cryptosystem parameters may still not exist for the model to be compiled.
-**Possible solutions**: The solutions in this case are similar to the ones for the previous error.
+**Possible solutions**: The solutions in this case are similar to the ones for the previous error: reducing the bit-width, reducing `rounding_threshold_bits`, or using the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) rounding method can help (see the sketch below).
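For illustration, a minimal sketch combining these solutions is shown below; `torch_model` and `torch_input` are assumed names standing for an existing trained PyTorch model and a representative input tensor.

```python
from concrete.ml.torch.compile import compile_torch_model

# Sketch only: `torch_model` and `torch_input` are assumed to exist.
quantized_module = compile_torch_model(
    torch_model,
    torch_input,
    n_bits=6,                   # reduce the quantization bit-width
    rounding_threshold_bits=6,  # round table lookup inputs to fewer bits
    p_error=0.01,               # tolerate rare off-by-one table lookup errors
)
```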
Additionally adjusting the tolerance for one-off errors using the `p_error` parameter can help, as explained in [this section](../explanations/advanced_features.md#approximate-computations). #### 3. Quantization import failed @@ -104,130 +104,22 @@ In the example above, the `x` and `y` layers need quantization before being conc z = torch.cat([self.quant_concat(x), self.quant_concat(y)]) ``` -## Debugging compilation errors +## PBS complexity and optimization -Compilation errors due to FHE incompatible models, such as maximum bit-width exceeded or `NoParametersFound` can be debugged by examining the bit-widths associated with various intermediate values of the FHE computation. +In FHE, univariate functions are encoded as Table Lookups, which are then implemented using [Programmable Bootstrapping (PBS)](../getting-started/concepts.md#cryptography-concepts). PBS is a powerful technique but requires significantly more computing resources compared to simpler encrypted operations such as matrix multiplications, convolution, or additions. -The following produces a neural network that is not FHE-compatible: +Furthermore, the cost of PBS depends on the bit-width of the compiled circuit. Every additional bit in the maximum bit-width significantly increase the complexity of the PBS. Therefore, it is important to determine the bit-width of the circuit and the amount of PBS it performs in order to optimize the performance. -```python -import numpy -import torch - -from torch import nn -from concrete.ml.torch.compile import compile_torch_model - -N_FEAT = 2 -class SimpleNet(nn.Module): - """Simple MLP with PyTorch""" - - def __init__(self, n_hidden=30): - super().__init__() - self.fc1 = nn.Linear(in_features=N_FEAT, out_features=n_hidden) - self.fc2 = nn.Linear(in_features=n_hidden, out_features=n_hidden) - self.fc3 = nn.Linear(in_features=n_hidden, out_features=2) - - - def forward(self, x): - """Forward pass.""" - x = torch.relu(self.fc1(x)) - x = torch.relu(self.fc2(x)) - x = self.fc3(x) - return x - - -torch_input = torch.randn(100, N_FEAT) -torch_model = SimpleNet(120) -try: - quantized_numpy_module = compile_torch_model( - torch_model, - torch_input, - n_bits=7, - ) -except RuntimeError as err: - print(err) -``` - -Upon execution, the Compiler will raise the following error within the graph representation: - -``` -Function you are trying to compile cannot be compiled: - -%0 = _x # EncryptedTensor ∈ [-64, 63] -%1 = [[ -9 18 ... 30 34]] # ClearTensor ∈ [-62, 63] @ /fc1/Gemm.matmul -%2 = matmul(%0, %1) # EncryptedTensor ∈ [-5834, 5770] @ /fc1/Gemm.matmul -%3 = subgraph(%2) # EncryptedTensor ∈ [0, 127] -%4 = [[-36 6 ... 27 -11]] # ClearTensor ∈ [-63, 63] @ /fc2/Gemm.matmul -%5 = matmul(%3, %4) # EncryptedTensor ∈ [-34666, 37702] @ /fc2/Gemm.matmul -^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ this 17-bit value is used as an input to a table lookup -``` - -The error `this 17-bit value is used as an input to a table lookup` indicates that the 16-bit limit on the input of the Table Lookup (TLU) operation has been exceeded. To pinpoint the model layer that causes the error, Concrete ML provides the [bitwidth_and_range_report](../references/api/concrete.ml.quantization.quantized_module.md#method-bitwidth_and_range_report) helper function. First, the model must be compiled so that it can be [simulated](fhe_assistant.md#simulation). 
- -On the other hand, `NoParametersFound` is encountered when using `rounding_threshold_bits`. When using this setting, the 16-bit accumulator limit is relaxed. However, reducing bit-width, or reducing the `rounding_threshold_bits`, or using using the [`fhe.Exactness.APPROXIMATE`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) rounding method can help. - -### Fixing compilation errors - -To make this network FHE-compatible one can apply several techniques: - -1. use [rounded accumulators](../explanations/advanced_features.md#rounded-activations-and-quantizers) by specifying the `rounding_threshold_bits` parameter. Please evaluate the accuracy of the model using simulation if you use this feature, as it may impact accuracy. Setting a value 2-bit higher than the quantization `n_bits` should be a good start. - - - -```python -torch_model = SimpleNet(20) - -quantized_numpy_module = compile_torch_model( - torch_model, - torch_input, - n_bits=6, - rounding_threshold_bits=7, -) -``` - -2. reduce the accumulator bit-width of the second layer named `fc2`. To do this, a simple solution is to reduce the number of neurons, as it is proportional to the bit-width. - - +To inspect the MLIR code produced by the compiler, use the following command: -```python -torch_model = SimpleNet(10) - -quantized_numpy_module = compile_torch_model( - torch_model, - torch_input, - n_bits=7, -) -``` - -3. adjust the tolerance for one-off errors using the `p_error` parameter. See [this section for more explanation](../explanations/advanced_features.md#approximate-computations) on this tolerance. - - - -```python -torch_model = SimpleNet(10) - -quantized_numpy_module = compile_torch_model( - torch_model, - torch_input, - n_bits=7, - p_error=0.01 -) -``` - -## Complexity analysis - -In FHE, univariate functions are encoded as table lookups, which are then implemented using Programmable Bootstrapping (PBS). PBS is a powerful technique but will require significantly more computing resources, and thus time, compared to simpler encrypted operations such as matrix multiplications, convolution, or additions. - -Furthermore, the cost of PBS will depend on the bit-width of the compiled circuit. Every additional bit in the maximum bit-width raises the complexity of the PBS by a significant factor. It may be of interest to the model developer, then, to determine the bit-width of the circuit and the amount of PBS it performs. - -This can be done by inspecting the MLIR code produced by the Compiler: - - + ```python print(quantized_numpy_module.fhe_circuit.mlir) ``` +Example output: + ``` MLIR -------------------------------------------------------------------------------- @@ -262,14 +154,14 @@ module { -------------------------------------------------------------------------------- ``` -There are several calls to `FHELinalg.apply_mapped_lookup_table` and `FHELinalg.apply_lookup_table`. These calls apply PBS to the cells of their input tensors. Their inputs in the listing above are: `tensor<1x2x!FHE.eint<8>>` for the first and last call and `tensor<1x50x!FHE.eint<8>>` for the two calls in the middle. Thus, PBS is applied 104 times. +In the MLIR code, there are several calls to `FHELinalg.apply_mapped_lookup_table` and `FHELinalg.apply_lookup_table`. These calls apply PBS to the cells of their input tensors. For example, in the code above, the inputs are: `tensor<1x5x!FHE.eint<15>>` for both the first and last `apply_mapped_lookup_table` call. 
Thus, PBS is applied 10 times in total: each of these two tensors contains 1x5 = 5 cells.
-Retrieving the bit-width of the circuit is then simply:
+To retrieve the bit-width of the circuit, use this command:
- + ```python print(quantized_numpy_module.fhe_circuit.graph.maximum_integer_bit_width()) ```
-Decreasing the number of bits and the number of PBS applications induces large reductions in the computation time of the compiled circuit.
+Reducing the number of bits and the number of PBS applications can significantly decrease the computation time of the compiled circuit.
diff --git a/docs/deep-learning/fhe_friendly_models.md b/docs/deep-learning/fhe_friendly_models.md index 5baee76fe..bda71ffc6 100644 --- a/docs/deep-learning/fhe_friendly_models.md +++ b/docs/deep-learning/fhe_friendly_models.md @@ -1,18 +1,17 @@ # Step-by-step guide
-This guide provides a complete example of converting a PyTorch neural network into its FHE-friendly, quantized counterpart. It focuses on Quantization Aware Training a simple network on a synthetic data-set.
+This guide demonstrates how to convert a PyTorch neural network into a Fully Homomorphic Encryption (FHE)-friendly, quantized version. It focuses on Quantization Aware Training (QAT) using a simple network on a synthetic data-set. This guide is based on a [notebook tutorial](../advanced_examples/QuantizationAwareTraining.ipynb), from which some code blocks are documented.
-In general, quantization can be carried out in two different ways: either during Quantization Aware Training (QAT) or after the training phase with Post-Training Quantization (PTQ).
+## Quantization
-Regarding FHE-friendly neural networks, QAT is the best way to reach optimal accuracy under [FHE constraints](../getting-started/README.md#current-limitations). This technique allows weights and activations to be reduced to very low bit-widths (e.g., 2-3 bits), which, combined with pruning, can keep accumulator bit-widths low.
+In general, quantization can be carried out in two different ways:
-Concrete ML uses the third-party library [Brevitas](https://github.com/Xilinx/brevitas) to perform QAT for PyTorch NNs, but options exist for other frameworks such as Keras/Tensorflow.
+- During the training phase with [Quantization Aware Training (QAT)](../getting-started/concepts.md#i-model-development)
+- After the training phase with [Post Training Quantization (PTQ)](../getting-started/concepts.md#i-model-development)
-Several [demos and tutorials](../tutorials/showcase.md) that use Brevitas are available in the Concrete ML library, such as the [CIFAR classification tutorial](../../use_case_examples/cifar/cifar_brevitas_finetuning/CifarQuantizationAwareTraining.ipynb).
+For FHE-friendly neural networks, QAT is the best method to achieve optimal accuracy under [FHE constraints](../getting-started/README.md#current-limitations). This technique reduces weights and activations to very low bit-widths (for example, 2-3 bits). When combined with pruning, QAT helps keep accumulator bit-widths low.
-This guide is based on a [notebook tutorial](../advanced_examples/QuantizationAwareTraining.ipynb), from which some code blocks are documented.
-
-For a more formal description of the usage of Brevitas to build FHE-compatible neural networks, please see the [Brevitas usage reference](../explanations/inner-workings/external_libraries.md#brevitas).
+Concrete ML uses the third-party library [Brevitas](https://github.com/Xilinx/brevitas) to perform QAT for PyTorch neural networks, but options exist for other frameworks such as Keras/Tensorflow. Concrete ML provides several [demos and tutorials](../tutorials/showcase.md) that use Brevitas, including the [CIFAR classification tutorial](../../use_case_examples/cifar/cifar_brevitas_finetuning/CifarQuantizationAwareTraining.ipynb). For a more formal description of the usage of Brevitas to build FHE-compatible neural networks, please see the [Brevitas usage reference](../explanations/inner-workings/external_libraries.md#brevitas).
{% hint style="info" %} For a formal explanation of the mechanisms that enable FHE-compatible neural networks, please see the the following paper.
@@ -22,7 +21,7 @@ For a formal explanation of the mechanisms that enable FHE-compatible neural net ## Baseline PyTorch model
-In PyTorch, using standard layers, a fully connected neural network (FCNN) would look like this:
+In PyTorch, using standard layers, a Fully Connected Neural Network (FCNN) would look like this:
```python import torch @@ -49,11 +48,11 @@ class SimpleNet(nn.Module): return x ```
-The [notebook tutorial](../advanced_examples/QuantizationAwareTraining.ipynb), example shows how to train a FCNN, similarly to the one above, on a synthetic 2D data-set with a checkerboard grid pattern of 100 x 100 points. The data is split into 9500 training and 500 test samples.
+Similarly to the one above, the [notebook tutorial](../advanced_examples/QuantizationAwareTraining.ipynb) shows how to train an FCNN on a synthetic 2D data-set with a checkerboard grid pattern of 100 x 100 points. The data is split into 9500 training and 500 test samples.
-Once trained, this PyTorch network can be imported using the [`compile_torch_model`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) function. This function uses simple PTQ.
+Once trained, you can import this PyTorch network using the [`compile_torch_model`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model) function, which uses simple PTQ.
-The network was trained using different numbers of neurons in the hidden layers, and quantized using 3-bits weights and activations. The mean accumulator size shown below is measured as the mean over 10 runs of the experiment. An accumulator of 6.6 means that 4 times out of 10 the accumulator measured was 6 bits while 6 times it was 7 bits.
+The network was trained using different numbers of neurons in the hidden layers, and quantized using 3-bit weights and activations. The mean accumulator size, shown below, is measured as the mean over 10 runs of the experiment. An accumulator size of 6.6 means that 4 times out of 10, the accumulator was 6 bits, while 6 times it was 7 bits.
| neurons | 10 | 30 | 100 | | --------------------- | ------ | ------ | ------ | @@ -61,24 +60,24 @@ The network was trained using different numbers of neurons in the hidden layers, | 3-bit accuracy | 56.44% | 55.54% | 56.50% | | mean accumulator size | 6.6 | 6.9 | 7.4 |
-This shows that the fp32 accuracy and accumulator size increases with the number of hidden neurons, while the 3-bits accuracy remains low irrespective of the number of neurons. While all the configurations tried here were FHE-compatible (accumulator \< 16 bits), it is often preferable to have a lower accumulator size in order to speed up inference time.
+This shows that the fp32 accuracy and accumulator size increases with the number of hidden neurons, while the 3-bits accuracy remains low regardless of the number of neurons. Although all configurations tested were FHE-compatible (accumulator \< 16 bits), it is often preferable to have a lower accumulator size to speed up inference time. {% hint style="info" %} -Accumulator size is determined by Concrete as being the maximum bit-width encountered anywhere in the encrypted circuit. +Accumulator size is determined by [Concrete](https://docs.zama.ai/concrete) as the maximum bit-width encountered anywhere in the encrypted circuit. {% endhint %} -## Quantization Aware Training: +## Quantization Aware Training (QAT) -[Quantization Aware Training](../explanations/quantization.md) using [Brevitas](https://github.com/Xilinx/brevitas) is the best way to guarantee a good accuracy for Concrete ML compatible neural networks. +Using [QAT](../explanations/quantization.md) with [Brevitas](https://github.com/Xilinx/brevitas) is the best way to guarantee a good accuracy for Concrete ML compatible neural networks. -Brevitas provides a quantized version of almost all PyTorch layers (`Linear` layer becomes `QuantLinear`, `ReLU` layer becomes `QuantReLU` and so on), plus some extra quantization parameters, such as : +Brevitas provides quantized versions of almost all PyTorch layers. For example, `Linear` layer becomes `QuantLinear`, and `ReLU` layer becomes `QuantReLU`. Brevitas also offers additional quantization parameters, such as: - `bit_width`: precision quantization bits for activations - `act_quant`: quantization protocol for the activations - `weight_bit_width`: precision quantization bits for weights - `weight_quant`: quantization protocol for the weights -In order to use FHE, the network must be quantized from end to end, and thanks to the Brevitas's `QuantIdentity` layer, it is possible to quantize the input by placing it at the entry point of the network. Moreover, it is also possible to combine PyTorch and Brevitas layers, provided that a `QuantIdentity` is placed after this PyTorch layer. The following table gives the replacements to be made to convert a PyTorch NN for Concrete ML compatibility. +To use FHE, the network must be quantized from end to end. With the Brevitas `QuantIdentity` layer, you can quantize the input by placing it at the network's entry point. Moreover, you can combine PyTorch and Brevitas layers, as long as a `QuantIdentity` layer follows the PyTorch layer. The following table lists the replacements needed to convert a PyTorch neural network for Concrete ML compatibility. | PyTorch fp32 layer | Concrete ML model with PyTorch/Brevitas | | -------------------- | ----------------------------------------------------- | @@ -151,7 +150,7 @@ class QuantSimpleNet(nn.Module): In the network above, biases are used for linear layers but are not quantized (`"bias": True, "bias_quant": None`). The addition of the bias is a univariate operation and is fused into the activation function. {% endhint %} -Training this network with pruning (see below) with 30 out of 100 total non-zero neurons gives good accuracy while keeping the accumulator size low. +Training this network with pruning (see [below](#pruning-using-torch)) using 30 out of 100 total non-zero neurons gives good accuracy while keeping the accumulator size low. 
| Non-zero neurons | 30 | | ----------------------------- | ----- | @@ -164,7 +163,7 @@ The PyTorch QAT training loop is the same as the standard floating point trainin {% endhint %} {% hint style="info" %}
-Quantization Aware Training is somewhat slower than normal training. QAT introduces quantization during both the forward and backward passes. The quantization process is inefficient on GPUs as its computational intensity is low with respect to data transfer time.
+QAT is somewhat slower than normal training because it introduces quantization during both the forward and backward passes. The quantization process is inefficient on GPUs due to its low computational intensity relative to data transfer time.
{% endhint %} ### Pruning using Torch @@ -175,7 +174,7 @@ To understand how to overcome this limitation, consider a scenario where 2 bits By default, Concrete ML uses symmetric quantization for model weights, with values in the interval $$\left[-2^{n_{bits}-1}, 2^{n_{bits}-1}-1\right]$$. For example, for $$n_{bits}=2$$ the possible values are $$[-2, -1, 0, 1]$$; for $$n_{bits}=3$$, the values can be $$[-4,-3,-2,-1,0,1,2,3]$$.
-In a typical setting, the weights will not all have the maximum or minimum values (e.g., $$-2^{n_{bits}-1}$$). Weights typically have a normal distribution around 0, which is one of the motivating factors for their symmetric quantization. A symmetric distribution and many zero-valued weights are desirable because opposite sign weights can cancel each other out and zero weights do not increase the accumulator size.
+In a typical setting, the weights will not all have the maximum or minimum values (such as $$-2^{n_{bits}-1}$$). Weights typically have a normal distribution around 0, which is one of the motivating factors for their symmetric quantization. A symmetric distribution and many zero-valued weights are desirable because opposite sign weights can cancel each other out and zero weights do not increase the accumulator size.
This fact can be leveraged to train a network with more neurons, while not overflowing the accumulator, using a technique called [pruning](../explanations/pruning.md) where the developer can impose a number of zero-valued weights. Torch [provides support for pruning](https://pytorch.org/tutorials/intermediate/pruning_tutorial.html) out of the box.
diff --git a/docs/deep-learning/onnx_support.md b/docs/deep-learning/onnx_support.md index bdf426ae0..06f6c08ed 100644 --- a/docs/deep-learning/onnx_support.md +++ b/docs/deep-learning/onnx_support.md @@ -1,15 +1,17 @@ # Using ONNX
-In addition to Concrete ML models and [custom models in torch](torch_support.md), it is also possible to directly compile [ONNX](https://onnx.ai/) models. This can be particularly appealing, notably to import models trained with Keras.
+This document explains how to compile [ONNX](https://onnx.ai/) models in Concrete ML. This is particularly useful for importing models trained with Keras.
-ONNX models can be compiled by directly importing models that are already quantized with Quantization Aware Training (QAT) or by performing Post-Training Quantization (PTQ) with Concrete ML.
+You can compile ONNX models by directly importing models that are already quantized with [Quantization Aware Training (QAT)](../getting-started/concepts.md#i-model-development) or by performing [Post Training Quantization (PTQ)](../getting-started/concepts.md#i-model-development) with Concrete ML.
## Simple example The following example shows how to compile an ONNX model using PTQ.
The model was initially trained using Keras before being exported to ONNX. The training code is not shown here. {% hint style="warning" %}
-This example uses Post-Training Quantization, i.e., the quantization is not performed during training. This model would not have good performance in FHE. Quantization Aware Training should be added by the model developer. Additionally, importing QAT ONNX models can be done [as shown below](onnx_support.md#quantization-aware-training).
+This example uses PTQ, meaning that the quantization is not performed during training. This model does not achieve optimal performance in FHE.
+
+To improve performance in FHE, you should add QAT. Additionally, you can import QAT ONNX models [as shown below](onnx_support.md#quantization-aware-training).
{% endhint %} ```python @@ -56,7 +58,7 @@ While a Keras ONNX model was used in this example, Keras/Tensorflow support in C ## Quantization Aware Training
-Models trained using [Quantization Aware Training](../explanations/quantization.md) contain quantizers in the ONNX graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized. Since these QAT models have quantizers that are configured during training to a specific number of bits, the ONNX graph will need to be imported using the same settings:
+Models trained using QAT contain quantizers in the ONNX graph. These quantizers ensure that the inputs to the Linear/Dense and Conv layers are quantized. Since these QAT models have quantizers configured to a specific number of bits during training, you must import the ONNX graph using the same settings:
@@ -74,7 +76,7 @@ quantized_numpy_module = compile_onnx_model( ## Supported operators
-The following ONNX operators are supported for evaluation and conversion to an equivalent FHE circuit. Other operators were not implemented, either due to FHE constraints or because they are rarely used in PyTorch activations or scikit-learn models.
+Concrete ML supports the following operators for evaluation and conversion to an equivalent FHE circuit. Other operators were not implemented either due to FHE constraints or because they are rarely used in PyTorch activations or scikit-learn models.
diff --git a/docs/deep-learning/optimizing_inference.md b/docs/deep-learning/optimizing_inference.md index 496ad9937..cda08dce1 100644 --- a/docs/deep-learning/optimizing_inference.md +++ b/docs/deep-learning/optimizing_inference.md @@ -1,21 +1,23 @@ # Optimizing inference
-Neural networks pose unique challenges with regards to encrypted inference. Each neuron in a network applies an activation function that requires a PBS operation. The latency of a single PBS depends on the bit-width of the input of the PBS.
+This document introduces several approaches to reduce the overall latency of a neural network.
-Several approaches can be used to reduce the overall latency of a neural network.
+## Introduction
+
+Neural networks are challenging for encrypted inference. Each neuron in a network has to apply an activation function that requires a [Programmable Bootstrapping (PBS)](../getting-started/concepts.md#cryptography-concepts) operation. The latency of a single PBS depends on the bit-width of its input.
## Circuit bit-width optimization
-[Quantization Aware Training](../explanations/quantization.md) and [pruning](../explanations/pruning.md) introduce specific hyper-parameters that influence the accumulator sizes. It is possible to chose quantization and pruning configurations that reduce the accumulator size.
A trade-off between latency and accuracy can be obtained by varying these hyper-parameters as described in the [deep learning design guide](torch_support.md#configuring-quantization-parameters).
+[Quantization Aware Training](../explanations/quantization.md) and [pruning](../explanations/pruning.md) introduce specific hyper-parameters that influence the accumulator sizes. You can choose quantization and pruning configurations to reduce the accumulator size. To obtain a trade-off between latency and accuracy, you can manually set these hyper-parameters as described in the [deep learning design guide](torch_support.md#configuring-quantization-parameters).
## Structured pruning
-While un-structured pruning is used to ensure the accumulator bit-width stays low, [structured pruning](https://pytorch.org/docs/stable/generated/torch.nn.utils.prune.ln_structured.html) can eliminate entire neurons from the network. Many neural networks are over-parametrized (since this enables easier training) and some neurons can be removed. Structured pruning, applied to a trained network as a fine-tuning step, can be applied to built-in neural networks using the [prune](../references/api/concrete.ml.sklearn.base.md#method-prune) helper function as shown in [this example](../advanced_examples/FullyConnectedNeuralNetworkOnMNIST.ipynb). To apply structured pruning to custom models, it is recommended to use the [torch-pruning](https://github.com/VainF/Torch-Pruning) package.
+While using unstructured pruning ensures the accumulator bit-width stays low, [structured pruning](https://pytorch.org/docs/stable/generated/torch.nn.utils.prune.ln_structured.html) can eliminate entire neurons from the network as many neural networks are over-parametrized for easier training. You can apply structured pruning to a trained network as a fine-tuning step. [This example](../advanced_examples/FullyConnectedNeuralNetworkOnMNIST.ipynb) demonstrates how to apply structured pruning to built-in neural networks using the [prune](../references/api/concrete.ml.sklearn.base.md#method-prune) helper function. To apply structured pruning to custom models, it is recommended to use the [torch-pruning](https://github.com/VainF/Torch-Pruning) package.
## Rounded activations and quantizers
-Reducing the bit-width of the inputs to the Table Lookup (TLU) operations is a major source of improvements in the latency. Post-training, it is possible to leverage some properties of the fused activation and quantization functions expressed in the TLUs to further reduce the accumulator. This is achieved through the _rounded PBS_ feature as described in the [rounded activations and quantizers reference](../explanations/advanced_features.md#rounded-activations-and-quantizers). Adjusting the rounding amount, relative to the initial accumulator size, can bring large improvements in latency while maintaining accuracy.
+Reducing the bit-width of inputs to the Table Lookup (TLU) operations significantly improves latency. Post-training, you can leverage properties of the fused activation and quantization functions in the TLUs to further reduce the accumulator size. This is achieved through the _rounded PBS_ feature as described in the [rounded activations and quantizers reference](../explanations/advanced_features.md#rounded-activations-and-quantizers). Adjusting the rounding amount relative to the initial accumulator size can improve latency while maintaining accuracy.
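For illustration, a minimal sketch of enabling rounding and checking its effect on the circuit is shown below; `torch_model` and `x_calib` are assumed names for an existing trained PyTorch model and a representative calibration data-set.

```python
from concrete.ml.torch.compile import compile_torch_model

# Sketch only: `torch_model` and `x_calib` are assumed to exist.
quantized_module = compile_torch_model(
    torch_model,
    x_calib,
    n_bits=6,
    rounding_threshold_bits=6,  # round TLU inputs down to 6 bits
)

# Inspect how rounding affected the maximum bit-width of the compiled circuit
print(quantized_module.fhe_circuit.graph.maximum_integer_bit_width())
```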
## TLU error tolerance adjustment
-Finally, the TFHE scheme exposes a TLU error tolerance parameter that has an impact on crypto-system parameters that influence latency. A higher tolerance of TLU off-by-one errors results in faster computations but may reduce accuracy. One can think of the error of obtaining $$T[x]$$ as a Gaussian distribution centered on $$x$$: $$TLU[x]$$ is obtained with probability of `1 - p_error`, while $$T[x-1]$$, $$T[x+1]$$ are obtained with much lower probability, etc. In Deep NNs, these type of errors can be tolerated up to some point. See the [`p_error` documentation for details](../explanations/advanced_features.md#approximate-computations) and more specifically the usage example of [the API for finding the best `p_error`](../explanations/advanced_features.md#searching-for-the-best-error-probability).
+Finally, the TFHE scheme introduces a TLU error tolerance parameter that has an impact on crypto-system parameters that influence latency. A higher tolerance of TLU off-by-one errors results in faster computations but may reduce accuracy. You can think of the error of obtaining $$T[x]$$ as a Gaussian distribution centered on $$x$$: $$TLU[x]$$ is obtained with probability `1 - p_error`, while $$T[x-1]$$, $$T[x+1]$$ are obtained with much lower probability, etc. In Deep NNs, these types of errors can be tolerated up to some point. See the [`p_error` documentation for details](../explanations/advanced_features.md#approximate-computations) and more specifically [the API for finding the best `p_error`](../explanations/advanced_features.md#searching-for-the-best-error-probability).
diff --git a/docs/deep-learning/torch_support.md b/docs/deep-learning/torch_support.md index db8ee8e65..0bcee4483 100644 --- a/docs/deep-learning/torch_support.md +++ b/docs/deep-learning/torch_support.md @@ -1,19 +1,26 @@ # Using Torch
-In addition to the built-in models, Concrete ML supports generic machine learning models implemented with Torch, or [exported as ONNX graphs](onnx_support.md).
+This document explains how to implement machine learning models with Torch in Concrete ML, leveraging Fully Homomorphic Encryption (FHE).
+
+## Introduction
There are two approaches to build [FHE-compatible deep networks](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints):
-- [Quantization Aware Training (QAT)](../explanations/quantization.md) requires using custom layers, but can quantize weights and activations to low bit-widths. Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library providing QAT support for PyTorch. To use this mode, compile models using `compile_brevitas_qat_model`
-- **Post-training Quantization**: This mode allows a vanilla PyTorch model to be compiled. However, when quantizing weights & activations to fewer than 7 bits, the accuracy can decrease strongly. On the other hand, depending on the model size, quantizing with 6-8 bits can be incompatible with FHE constraints. To use this mode, compile models with `compile_torch_model`.
+- [**Quantization Aware Training (QAT)**](../getting-started/concepts.md#i-model-development): This method requires using custom layers to quantize weights and activations to low bit-widths. Concrete ML works with [Brevitas](../explanations/inner-workings/external_libraries.md#brevitas), a library that provides QAT support for PyTorch.
+
+  - Use `compile_brevitas_qat_model` to compile models in this mode.
+
+- [**Post Training Quantization (PTQ)**](../getting-started/concepts.md#i-model-development): This method allows you to compile a vanilla PyTorch model. However, accuracy may decrease significantly when quantizing weights and activations to fewer than 7 bits. On the other hand, depending on the model size, quantizing with 6-8 bits can be incompatible with FHE constraints. Thus, you need to determine the trade-off between model accuracy and FHE compatibility.
-Both approaches require the `rounding_threshold_bits` parameter to be set accordingly. The best values for this parameter need to be determined through experimentation. A good initial value to try is `6`. See [here](../explanations/advanced_features.md#rounded-activations-and-quantizers) for more details.
+  - Use `compile_torch_model` to compile models in this mode.
+
+Both approaches require setting the `rounding_threshold_bits` parameter accordingly. You should experiment to find the best values, starting with an initial value of `6`. See [here](../explanations/advanced_features.md#rounded-activations-and-quantizers) for more details.
{% hint style="info" %}
-**See the [common compilation errors page](./fhe_assistant.md#common-compilation-errors) for an explanation of some error messages that the compilation function may raise.**
+See the [common compilation errors page](./fhe_assistant.md#common-compilation-errors) for explanations and solutions to some common errors raised by the compilation function.
{% endhint %}
-## Quantization-aware training
+## Quantization Aware Training (QAT)
The following example uses a simple QAT PyTorch model that implements a fully connected neural network with two hidden layers. Due to its small size, making this model respect FHE constraints is relatively easy. To use QAT, Brevitas `QuantIdentity` nodes must be inserted in the PyTorch model, including one that quantizes the input of the `forward` function.
@@ -45,7 +52,7 @@ class QATSimpleNet(nn.Module): ```
-Once the model is trained, calling the [`compile_brevitas_qat_model`](../references/api/concrete.ml.torch.compile.md#function-compile_brevitas_qat_model) from Concrete ML will automatically perform conversion and compilation of a QAT network. Here, 3-bit quantization is used for both the weights and activations. The `compile_brevitas_qat_model` function automatically identifies the number of quantization bits used in the Brevitas model.
+Once the model is trained, use [`compile_brevitas_qat_model`](../references/api/concrete.ml.torch.compile.md#function-compile_brevitas_qat_model) from Concrete ML to perform conversion and compilation of the QAT network. Here, 3-bit quantization is used for both the weights and activations. This function automatically identifies the number of quantization bits used in the Brevitas model.
@@ -67,9 +74,9 @@ quantized_module = compile_brevitas_qat_model( If `QuantIdentity` layers are missing for any input or intermediate value, the compile function will raise an error. See the [common compilation errors page](./fhe_assistant.md#common-compilation-errors) for an explanation. {% endhint %}
-## Post-training quantization
+## Post Training Quantization (PTQ)
-The following example uses a simple PyTorch model that implements a fully connected neural network with two hidden layers. The model is compiled to use FHE using `compile_torch_model`.
+The following example demonstrates a simple PyTorch model that implements a fully connected neural network with two hidden layers.
The model is compiled with `compile_torch_model` to use FHE. ```python import torch.nn as nn import torch @@ -107,11 +114,11 @@ quantized_module = compile_torch_model( ## Configuring quantization parameters
-With QAT (the PyTorch/Brevitas models created following the example above), you need to configure quantization parameters such as `bit_width` (activation bit-width) and `weight_bit_width`. When using this mode, set `n_bits=None` in the `compile_brevitas_qat_model`.
+The quantization parameters, along with the number of neurons in each layer, determine the accumulator bit-width of the network. Larger accumulator bit-widths result in higher accuracy but slower FHE inference time.
-With PTQ, you need to set the `n_bits` value in the `compile_torch_model` function and must manually determine the trade-off between accuracy, FHE compatibility, and latency.
+**QAT**: Configure parameters such as `bit_width` and `weight_bit_width`. Set `n_bits=None` in the `compile_brevitas_qat_model` function.
-The quantization parameters, along with the number of neurons on each layer, will determine the accumulator bit-width of the network. Larger accumulator bit-widths result in higher accuracy but slower FHE inference time.
+**PTQ**: Set the `n_bits` value in the `compile_torch_model` function. Manually determine the trade-off between accuracy, FHE compatibility, and latency.
## Running encrypted inference @@ -125,18 +132,19 @@ x_test = numpy.array([numpy.random.randn(N_FEAT)]) y_pred = quantized_module.forward(x_test, fhe="execute") ```
-In this example, the input values `x_test` and the predicted values `y_pred` are floating points. The quantization (resp. de-quantization) step is done in the clear within the `forward` method, before (resp. after) any FHE computations.
+In this example, the input values `x_test` and the predicted values `y_pred` are floating points. The quantization (respectively de-quantization) step is done in the clear within the `forward` method, before (respectively after) any FHE computations.
## Simulated FHE Inference in the clear
-You can perform the inference on clear data in order to evaluate the impact of quantization and of FHE computation on the accuracy of their model. See [this section](../deep-learning/fhe_assistant.md#simulation) for more details. Two approaches exist:
+You can perform the inference on clear data in order to evaluate the impact of quantization and of FHE computation on the accuracy of the model. See [this section](../deep-learning/fhe_assistant.md#simulation) for more details.
+
+There are two approaches:
-- `quantized_module.forward(quantized_x, fhe="simulate")`: simulates FHE execution taking into account Table Lookup errors.\ De-quantization must be done in a second step as for actual FHE execution. Simulation takes into account the `p_error`/`global_p_error` parameters
-- `quantized_module.forward(quantized_x, fhe="disable")`: computes predictions in the clear on quantized data, and then de-quantize the result. The return value of this function contains the de-quantized (float) output of running the model in the clear. Calling this function on clear data is useful when debugging, but this does not perform actual FHE simulation.
+- `quantized_module.forward(quantized_x, fhe="simulate")`: This method simulates FHE execution taking into account Table Lookup errors. De-quantization must be done in a second step as for actual FHE execution.
Simulation takes into account the `p_error`/`global_p_error` parameters
+- `quantized_module.forward(quantized_x, fhe="disable")`: This method computes predictions in the clear on quantized data, and then de-quantizes the result. The return value of this function contains the de-quantized (float) output of running the model in the clear. Calling this function on clear data is useful when debugging, but this does not perform actual FHE simulation.
{% hint style="info" %}
-FHE simulation allows to measure the impact of the Table Lookup error on the model accuracy. The Table Lookup error can be adjusted using `p_error`/`global_p_error`, as described in the [approximate computation ](../explanations/advanced_features.md#approximate-computations)section.
+FHE simulation allows you to measure the impact of the Table Lookup error on the model accuracy. You can adjust the Table Lookup error using `p_error`/`global_p_error`, as described in the [approximate computation](../explanations/advanced_features.md#approximate-computations) section.
{% endhint %}
## Supported operators and activations
diff --git a/docs/getting-started/concepts.md b/docs/getting-started/concepts.md index 103faf20f..d43374b8e 100644 --- a/docs/getting-started/concepts.md +++ b/docs/getting-started/concepts.md @@ -14,8 +14,8 @@ With Concrete ML, you can train a model on clear or encrypted data, then deploy 1. **Quantization:** Quantization converts inputs, model weights, and all intermediate values of the inference computation to integer equivalents. More information is available [here](../explanations/quantization.md). Concrete ML performs this step in two ways depending on model type:
-   - During training (Quantization Aware Training)
-   - After training (Post-training Quantization)
+   - During training (Quantization Aware Training): by adding quantization layers in the neural network model, weights can be forced to have discrete values and activation quantization parameters are optimized through gradient descent. QAT requires re-training a neural network with these quantization layers.
+   - After training (Post Training Quantization): the floating point neural network is kept as-is and a calibration step determines quantization parameters for each layer. No re-training is necessary, and thus no training data or labels are needed to convert a neural network to FHE using PTQ.
1. **Simulation:** Simulation allows you to execute a model that was quantized, to measure its accuracy in FHE, and to determine the modifications required to make it FHE compatible. Simulation is described in more detail [here](../explanations/compilation.md#fhe-simulation).
diff --git a/docs/guides/client_server.md b/docs/guides/client_server.md index 2fb41bcc2..77af24eac 100644 --- a/docs/guides/client_server.md +++ b/docs/guides/client_server.md @@ -1,29 +1,55 @@ # Production Deployment
-Concrete ML provides functionality to deploy FHE machine learning models in a client/server setting. The deployment workflow and model serving pattern is as follows:
+This document explains the deployment workflow and the model serving pattern for deploying Fully Homomorphic Encryption (FHE) machine learning models in a client/server setting using Concrete ML.
## Deployment
+The steps to prepare a model for encrypted inference in a client/server setting are illustrated as follows:
+ ![](../figures/concretemlgraph1.jpg)
-The diagram above shows the steps that a developer goes through to prepare a model for encrypted inference in a client/server setting.
The training of the model and its compilation to FHE are performed on a development machine. Three different files are created when saving the model: +### Model training and compilation + +The training of the model and its compilation to FHE are performed on a development machine. + +Three different files are created when saving the model: -- `client.zip` contains `client.specs.json` which lists the secure cryptographic parameters needed for the client to generate private and evaluation keys. It also contains `serialized_processing.json` which describes the pre-processing and post-processing required by the machine learning model, such as quantization parameters to quantize the input and de-quantize the output. -- `server.zip` contains the compiled model. This file is sufficient to run the model on a server. The compiled model is machine-architecture specific (i.e., a model compiled on x86 cannot run on ARM). +- `client.zip` contains the following files: + - `client.specs.json` lists the secure cryptographic parameters needed for the client to generate private and evaluation keys. + - `serialized_processing.json` describes the pre-processing and post-processing required by the machine learning model, such as quantization parameters to quantize the input and de-quantize the output. +- `server.zip` contains the compiled model. This file is sufficient to run the model on a server. The compiled model is machine-architecture specific, for example, a model compiled on x86 cannot run on ARM. -The compiled model (`server.zip`) is deployed to a server and the cryptographic parameters (`client.zip`) are shared with the clients. In some settings, such as a phone application, the `client.zip` can be directly deployed on the client device and the server does not need to host it. +### Model deployment -> **Important Note:** In a client-server production using FHE, the server's output format depends on the model type. For regressors, the output matches the `predict()` method from scikit-learn, providing direct predictions. For classifiers, the output uses the `predict_proba()` method format, offering probability scores for each class, which allows clients to determine class membership by applying a threshold (commonly 0.5). +The compiled model (`server.zip`) is deployed to a server. The cryptographic parameters (`client.zip`) are shared with the clients. In some settings, such as a phone application, the `client.zip` can be directly deployed on the client device and the server does not need to host it. + +{% hint style="info" %} +**Important:** In a client-server production using FHE, the server's output format depends on the model type: + +- For regressors, the output matches the `predict()` method from scikit-learn, providing direct predictions. +- For classifiers, the output uses the `predict_proba()` method format, offering probability scores for each class, which allows clients to determine class membership by applying a threshold (commonly 0.5). 
+ {% endhint %} ### Using the API Classes
-The `FHEModelDev`, `FHEModelClient`, and `FHEModelServer` classes in the `concrete.ml.deployment` module make it easy to deploy and interact between the client and server:
+The `FHEModelDev`, `FHEModelClient`, and `FHEModelServer` classes in the `concrete.ml.deployment` module simplify the deployment and interaction between the client and server:
+
+- **`FHEModelDev`**:
-- **`FHEModelDev`**: Use the `save` method of this class during the development phase to prepare and save the model artifacts (`client.zip` and `server.zip`). This class handles the serialization of the underlying FHE circuit as well as the crypto-parameters used for generating the keys. By changing the `mode` parameter of the `save` method, you can deploy a trained model or a [training FHE program](../built-in-models/training.md).
+  - This class handles the serialization of the underlying FHE circuit as well as the crypto-parameters used for generating the keys.
+  - Use the `save` method of this class during the development phase to prepare and save the model artifacts (`client.zip` and `server.zip`). With the `save` method, you can deploy a trained model or a [training FHE program](../built-in-models/training.md).
-- **`FHEModelClient`**: This class is used on the client side to generate and serialize the cryptographic keys, encrypt the data before sending it to the server, and decrypt the results received from the server. It also handles the loading of quantization parameters and pre/post-processing from `serialized_processing.json`.
+- **`FHEModelClient`** is used on the client side for the following actions:
-- **`FHEModelServer`**: This class is used on the server side to load the FHE circuit from `server.zip` and execute the model on encrypted data received from the client.
+  - Generate and serialize the cryptographic keys.
+  - Encrypt the data before sending it to the server.
+  - Decrypt the results received from the server.
+  - Load quantization parameters and pre/post-processing from `serialized_processing.json`.
+
+- **`FHEModelServer`** is used on the server side for the following actions:
+
+  - Load the FHE circuit from `server.zip`.
+  - Execute the model on encrypted data received from the client.
### Example Usage @@ -69,26 +95,39 @@ encrypted_result = server.run(encrypted_data, serialized_evaluation_keys) result = client.deserialize_decrypt_dequantize(encrypted_result) ```
-> **Data Transfer Overview:**
->
-> - **From Client to Server:** `serialized_evaluation_keys` (once), `encrypted_data`.
-> - **From Server to Client:** `encrypted_result`.
+#### Data transfer overview
+
+- **From Client to Server:** `serialized_evaluation_keys` (once), `encrypted_data`.
+- **From Server to Client:** `encrypted_result`.
These objects are serialized into bytes to streamline the data transfer between the client and server.
## Serving
+The client-side deployment of a secured inference machine learning model is illustrated as follows:
+ ![](../figures/concretemlgraph3.jpg)
-The client-side deployment of a secured inference machine learning model follows the schema above. First, the client obtains the cryptographic parameters (stored in `client.zip`) and generates a private encryption/decryption key as well as a set of public evaluation keys. The public evaluation keys are then sent to the server, while the secret key remains on the client.
+The workflow contains the following steps: -The private data is then encrypted by the client as described in the `serialized_processing.json` file in `client.zip`, and it is then sent to the server. Server-side, the FHE model inference is run on encrypted inputs using the public evaluation keys. +1. **Key generation**: The client obtains the cryptographic parameters stored in `client.zip` and generates a private encryption/decryption key as well as a set of public evaluation keys. +1. **Sending public keys**: The public evaluation keys are sent to the server, while the secret key remains on the client. +1. **Data encryption**: The private data is encrypted by the client as described in the `serialized_processing.json` file in `client.zip`. +1. **Data transmission**: The encrypted data is sent to the server. +1. **Encrypted inference**: Server-side, the FHE model inference is run on encrypted inputs using the public evaluation keys. +1. **Data transmission**: The encrypted result is returned by the server to the client. +1. **Data decryption**: The client decrypts it using its private key. +1. **Post-processing**: The client performs any necessary post-processing of the decrypted result as specified in `serialized_processing.json` (part of `client.zip`). -The encrypted result is then returned by the server to the client, which decrypts it using its private key. Finally, the client performs any necessary post-processing of the decrypted result as specified in `serialized_processing.json` (part of `client.zip`). +The server-side implementation of a Concrete ML model is illustrated as follows: ![](../figures/concretemlgraph2.jpg) -The server-side implementation of a Concrete ML model follows the diagram above. The public evaluation keys sent by clients are stored. They are then retrieved for the client that is querying the service and used to evaluate the machine learning model stored in `server.zip`. Finally, the server sends the encrypted result of the computation back to the client. +The workflow contains the following steps: + +1. **Storing the public key**: The public evaluation keys sent by clients are stored. +1. **Model evaluation**: The public evaluation keys are retrieved for the client that is querying the service and used to evaluate the machine learning model stored in `server.zip`. +1. **Sending back the result**: The server sends the encrypted result of the computation back to the client. ## Example notebook diff --git a/docs/guides/hybrid-models.md b/docs/guides/hybrid-models.md index 0cfc107eb..246f3150f 100644 --- a/docs/guides/hybrid-models.md +++ b/docs/guides/hybrid-models.md @@ -1,20 +1,35 @@ # Hybrid models -FHE enables cloud applications to process private user data without running the risk of data leaks. Furthermore, deploying ML models in the cloud is advantageous as it eases model updates, allows to scale to large numbers of users by using large amounts of compute power, and protects model IP by keeping the model on a trusted server instead of the client device. +This document explains how to use Concrete ML API to deploy hybrid models in Fully Homomorphic Encryption (FHE). -However, not all applications can be easily converted to FHE computation and the computation cost of FHE may make a full conversion exceed latency requirements. +## Introduction -Hybrid models provide a balance between on-device deployment and cloud-based deployment. 
This approach entails executing parts of the model directly on the client side, while other parts are securely processed with FHE on the server side. Concrete ML facilitates the hybrid deployment of various neural network models, including MLP (multilayer perceptron), CNN (convolutional neural network), and Large Language Models.
+FHE allows cloud applications to process private user data securely, minimizing the risk of data leaks. Deploying machine learning (ML) models in the cloud offers several advantages:
+
+- Simplifies model updates.
+- Scales to large user bases by leveraging substantial compute power.
+- Protects the model's Intellectual Property (IP) by keeping the model on a trusted server rather than on client devices.
+
+However, not all applications can be easily converted to FHE computation. The high computation cost of FHE might exceed latency requirements for full conversion.
+
+Hybrid models provide a balance between on-device deployment and cloud-based deployment. This approach involves:
+
+- Executing parts of the model on the client side.
+- Securely processing other parts with FHE on the server side.
+
+Concrete ML supports hybrid deployment for various neural network models, including Multilayer Perceptron (MLP), Convolutional Neural Network (CNN), and Large Language Models (LLM).
{% hint style="warning" %}
-If model IP protection is important, care must be taken in choosing the parts of a model to be executed on the cloud. Some black-box model stealing attacks rely on knowledge distillation or on differential methods. As a general rule, the difficulty to steal a machine learning model is proportional to the size of the model, in terms of numbers of parameters and model depth.
+To protect model IP, carefully choose the model parts to execute in the cloud. Some black-box model stealing attacks use knowledge distillation or differential methods. Generally, the difficulty of stealing a machine learning model increases with the model's size, number of parameters, and depth.
{% endhint %}
-The hybrid model deployment API provides an easy way to integrate the [standard deployment procedure](client_server.md) into neural network style models that are compiled with [`compile_brevitas_qat_model`](../references/api/concrete.ml.torch.compile.md#function-compile_brevitas_qat_model) or [`compile_torch_model`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model).
+The hybrid model deployment API simplifies integrating the [standard deployment procedure](client_server.md) into neural network style models that are compiled with [`compile_brevitas_qat_model`](../references/api/concrete.ml.torch.compile.md#function-compile_brevitas_qat_model) or [`compile_torch_model`](../references/api/concrete.ml.torch.compile.md#function-compile_torch_model).
## Compilation
-To use hybrid model deployment, the first step is to define what part of the PyTorch neural network model must be executed in FHE. The model part must be a `nn.Module` and is identified by its key in the original model's `.named_modules()`.
+To use hybrid model deployment, the first step is to define which part of the PyTorch neural network model must be executed in FHE. Ensure the model part is an `nn.Module` and is identified by its key in the original model's `.named_modules()`.
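For instance, a quick way to list the candidate keys is to iterate over `.named_modules()`. In this minimal sketch, `model` is an assumed name standing for the full PyTorch model:

```python
# Print the key and type of every sub-module; any of these keys can be
# selected as the part of the model to run remotely in FHE.
for name, module in model.named_modules():
    print(name, "->", type(module).__name__)
```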
+
+Here is an example:
```python
import numpy as np
@@ -68,9 +83,13 @@ hybrid_model.save_and_clear_private_info(model_dir, via_mlir=True)
## Server Side Deployment
-The [`save_and_clear_private_info`](../references/api/concrete.ml.torch.hybrid_model.md#method-save_and_clear_private_info) function serializes the FHE circuits corresponding to the various parts of the model that were chosen to be moved server-side. It also saves the client-side model, removing the weights of the layers that are transferred server-side. Furthermore it saves all necessary information required to serve these sub-models with FHE, using the [`FHEModelDev`](../references/api/concrete.ml.deployment.fhe_client_server.md#class-fhemodeldev) class.
+The [`save_and_clear_private_info`](../references/api/concrete.ml.torch.hybrid_model.md#method-save_and_clear_private_info) function works as follows:
-The [`FHEModelServer`](../references/api/concrete.ml.deployment.fhe_client_server.md#class-fhemodelserver) class should be used to create a server application that creates end-points to serve these sub-models:
+- Serializes the FHE circuits for the model parts chosen to be moved server-side.
+- Saves the client-side model, removing the weights of the layers transferred to the server.
+- Saves all necessary information required to serve these sub-models with FHE using the [`FHEModelDev`](../references/api/concrete.ml.deployment.fhe_client_server.md#class-fhemodeldev) class.
+
+To create a server application that serves these sub-models, use the [`FHEModelServer`](../references/api/concrete.ml.deployment.fhe_client_server.md#class-fhemodelserver) class:
@@ -84,7 +103,7 @@ For more information about serving FHE models, see the [client/server section](c
## Client Side
-A client application that deploys a model with hybrid deployment can be developed in a very similar manner to on-premise deployment: the model is loaded normally with PyTorch, but an extra step is required to specify the remote endpoint and the model parts that are to be executed remotely.
+You can develop a client application that deploys a model with hybrid deployment in a very similar manner to on-premise deployment: use PyTorch to load the model normally, but specify the remote endpoint and the parts of the model to be executed remotely.
@@ -99,7 +118,7 @@ hybrid_model = HybridFHEModel(
)
```
-Next, the client application must obtain the parameters necessary to encrypt and quantize data, as detailed in the [client/server documentation](client_server.md#production-deployment).
+Next, obtain the parameters necessary to encrypt and quantize data, as detailed in the [client/server documentation](client_server.md#production-deployment).
@@ -108,7 +127,7 @@ path_to_clients = Path(__file__).parent / "clients"
hybrid_model.init_client(path_to_clients=path_to_clients)
```
-When the client application is ready to make inference requests to the server, it must set the operation mode of the `HybridFHEModel` instance to `HybridFHEMode.REMOTE`:
+When the client application is ready to make inference requests to the server, set the operation mode of the `HybridFHEModel` instance to `HybridFHEMode.REMOTE`:
@@ -117,7 +136,7 @@ for module in hybrid_model.remote_modules.values():
module.fhe_local_mode = HybridFHEMode.REMOTE
```
-When performing inference with the `HybridFHEModel` instance, `hybrid_model`, only the regular `forward` method is called, as if the model was fully deployed locally:
+For inference with the `HybridFHEModel` instance, `hybrid_model`, call the regular `forward` method as if the model were fully deployed locally:
@@ -125,4 +144,9 @@ When performing inference with the `HybridFHEModel` instance, `hybrid_model`, on
hybrid_model.forward(torch.randn((dim, )))
```
-When calling `forward`, the `HybridFHEModel` handles, for each model part that is deployed remotely, all the necessary intermediate steps: quantizing the data, encrypting it, makes the request to the server using `requests` Python module, decrypting and de-quantizing the result.
+When calling `forward`, the `HybridFHEModel` handles all the necessary intermediate steps for each model part deployed remotely, including:
+
+- Quantizing the data.
+- Encrypting the data.
+- Making the request to the server using the `requests` Python module.
+- Decrypting and de-quantizing the result.
diff --git a/docs/guides/prediction_with_fhe.md b/docs/guides/prediction_with_fhe.md
index 69e759786..df85a5b08 100644
--- a/docs/guides/prediction_with_fhe.md
+++ b/docs/guides/prediction_with_fhe.md
@@ -1,14 +1,19 @@
# Prediction with FHE
-Concrete ML has APIs that make it easy, during model development and testing, to perform encryption, execution in FHE, and decryption in a single step. For more control, these individual steps can be executed separately. The APIs used to accomplish this are different for:
+This document explains how to perform encryption, execution in Fully Homomorphic Encryption (FHE), and decryption using either a single function call of the Concrete ML API or separate function calls.
+
+The APIs are different for the following:
- [Built-in models](#built-in-models)
- [Custom models](#custom-models)
## Built-in models
+### Using one function
+
+All Concrete ML built-in models have a single `predict` method that performs the encryption, FHE execution, and decryption with only one function call.
+
The following example shows how to create a synthetic data-set and how to use it to train a LogisticRegression model from Concrete ML.
-Next, we will discuss the dedicated functions for encryption, inference, and decryption.
```python
from sklearn.datasets import make_classification
@@ -33,7 +38,7 @@ y_pred_clear = model.predict(x_test)
fhe_circuit = model.compile(x_train)
```
-All Concrete ML built-in models have a monolithic `predict` method that performs the encryption, FHE execution, and decryption with a single function call. Concrete ML models follow the same API as scikit-learn models, transparently performing the steps related to encryption for convenience.
+Concrete ML models follow the same API as scikit-learn models, transparently performing the steps related to encryption for convenience.
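+
+The `fhe` argument of `predict` selects how inference runs. The following is a short sketch reusing the `model` and `x_test` objects defined above; the values shown for `fhe` are the usual ones, but check the built-in model API reference for the exact options:
+
+```python
+# The accepted `fhe` values below are assumed from the built-in model API
+# Clear inference on the quantized model, without any encryption
+y_pred_clear_quantized = model.predict(x_test, fhe="disable")
+
+# FHE simulation: runs the FHE computation graph without actual encryption
+y_pred_simulated = model.predict(x_test, fhe="simulate")
+
+# Actual FHE execution on encrypted data
+y_pred_encrypted = model.predict(x_test, fhe="execute")
+```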
@@ -44,7 +49,9 @@ y_pred_fhe = model.predict(x_test, fhe="execute")
Regarding this LogisticRegression model, as with scikit-learn, it is possible to predict the logits as well as the class probabilities by respectively using the `decision_function` or `predict_proba` methods instead.
-Alternatively, it is possible to execute all main steps (key generation, quantization, encryption, FHE execution, decryption) separately.
+### Using separate functions
+
+Alternatively, you can execute key generation, quantization, encryption, FHE execution, and decryption separately.
@@ -89,7 +96,7 @@ print(f"Similarity: {int((y_pred_fhe_step == y_pred_clear).mean()*100)}%")
## Custom models
-For custom models, the API to execute inference in FHE or simulation is illustrated as:
+For custom models, the API to execute inference in FHE or simulation is as follows:
diff --git a/docs/guides/serialization.md b/docs/guides/serialization.md
index a0661cd02..d1d03c621 100644
--- a/docs/guides/serialization.md
+++ b/docs/guides/serialization.md
@@ -1,17 +1,19 @@
# Serializing Built-In Models
-Concrete ML has support for serializing all available built-in models. Using this feature, one can
-dump a fitted and compiled model into a JSON string or file. The estimator can then be loaded back
-using the JSON object.
+This document explains how to serialize built-in models in Concrete ML.
+
+## Introduction
+
+Serialization allows you to dump a fitted and compiled model into a JSON string or file. You can then load the estimator back using the JSON object.
## Saving Models
All built-in models provide the following methods:
-- `dumps`: dumps the model as a string.
-- `dump`: dumps the model into a file.
+- `dumps`: Dumps the model as a string.
+- `dump`: Dumps the model into a file.
-For example, a logistic regression model can be dumped in a string as below.
+For example, a logistic regression model can be dumped into a string as follows:
```python
from sklearn.datasets import make_classification
@@ -38,7 +40,7 @@ dumped_model_str = model.dumps()
```
-Similarly, it can be dumped into a file.
+Similarly, it can be dumped into a file:
@@ -54,7 +56,7 @@ with dumped_model_path.open("w") as f:
model.dump(f)
```
-Alternatively, Concrete ML provides two equivalent global functions.
+Alternatively, Concrete ML provides two equivalent global functions:
@@ -72,26 +74,23 @@ with dumped_model_path.open("w") as f:
```
{% hint style="warning" %}
-Some parameters used for instantiating Quantized Neural Network models are not supported for
-serialization. In particular, one cannot serialize a model that was instantiated using callable
-objects for the `train_split` and `predict_nonlinearity` parameters or with `callbacks` being
-enabled.
+Some parameters used for instantiating Quantized Neural Network models are not supported for serialization. In particular, you cannot serialize a model that was instantiated using callable objects for the `train_split` and `predict_nonlinearity` parameters or with `callbacks` being enabled.
{% endhint %}
## Loading Models
-Loading a built-in model is possible through the following functions:
+You can load a built-in model using the following functions:
-- `loads`: loads the model from a string.
-- `load`: loads the model from a file.
+- `loads`: Loads the model from a string.
+- `load`: Loads the model from a file.
{% hint style="warning" %}
-A loaded model is required to be compiled once again in order for a user to be able to execute the inference in
-FHE or with simulation.
This is because the underlying FHE circuit is currently not serialized. -There is not required when FHE mode is disabled. +A loaded model must be compiled once again to execute the inference in +FHE or with simulation because the underlying FHE circuit is currently not serialized. +This recompilation is not required when FHE mode is disabled. {% endhint %} -The above logistic regression model can therefore be loaded as below. +The same logistic regression model can be loaded as follows:
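+
+A minimal sketch of loading, assuming the `dumped_model_str` and `dumped_model_path` objects created in the saving examples above, and assuming the global functions are imported from `concrete.ml.common.serialization.loaders`:
+
+```python
+# Import path assumed here; see the serialization API reference for the exact location
+from concrete.ml.common.serialization.loaders import load, loads
+
+# Load the model back from the JSON string produced by `dumps`
+loaded_model = loads(dumped_model_str)
+
+# Alternatively, load it from the JSON file produced by `dump`
+with dumped_model_path.open("r") as f:
+    loaded_model = load(f)
+
+# Remember: the loaded model must be compiled again before FHE inference or simulation
+```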