docs: detail hybrid deployment, optimization of built-in NNs, K-neare…
andrei-stoian-zama committed Sep 20, 2023
1 parent 9716831 commit 78dba60
Showing 3 changed files with 152 additions and 15 deletions.
121 changes: 121 additions & 0 deletions docs/advanced-topics/hybrid-models.md
# Hybrid model deployment

FHE enables cloud applications to process private user data without running the risk of data leaks. Furthermore, deploying ML models in the cloud is advantageous as it eases model updates, allows scaling to large numbers of users through large amounts of compute power, and protects model IP by keeping the model on a trusted server rather than on the client device.

However, not all applications can be easily converted to FHE computation, and the computation cost of FHE may cause a fully converted model to exceed latency requirements.

Hybrid models are a compromise between on-premise or on-device deployment and full cloud deployment. Hybrid deployment means parts of the model are executed on the client side and parts are executed in FHE on the server side. Concrete ML supports hybrid deployment of neural network models such as MLPs, CNNs and large language models.

{% hint style="warning" %}
If model IP protection is important, care must be taken in choosing the parts of a model to be executed in the cloud. Some
black-box model-stealing attacks rely on knowledge distillation
or on differential methods. As a general rule, the difficulty
of stealing a machine learning model is proportional to the size of the model, in terms of the number of parameters and the model depth.
{% endhint %}

The hybrid model deployment API provides an easy way to integrate the [standard deployment procedure](client_server.md) into neural network style models that are compiled with [`compile_brevitas_qat_model`](../developer-guide/api/concrete.ml.torch.compile.md#kbdfunctionkbd-compilebrevitasqatmodel) or [`compile_torch_model`](../developer-guide/api/concrete.ml.torch.compile.md#kbdfunctionkbd-compiletorchmodel).

## Compilation

To use hybrid model deployment, the first step is to define which part of the PyTorch neural network model must be executed in FHE. The model part must be an `nn.Module` and is identified by its key in the original model's `.named_modules()`.

```python
from pathlib import Path

import torch
from torch import nn

from concrete.ml.torch.hybrid_model import HybridFHEModel

class FCSmall(nn.Module):
    """Torch model for the tests."""

    def __init__(self, dim):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.seq(x)

dim = 10
model = FCSmall(dim)
model_name = "FCSmall"
submodule_name = "seq.0"

# Representative inputs, used to calibrate quantization during compilation
inputs = torch.randn((10, dim))

# Prints ['', 'seq', 'seq.0', 'seq.1', 'seq.2']
print([k for (k, _) in model.named_modules()])

# Create a hybrid model
hybrid_model = HybridFHEModel(model, [submodule_name])
hybrid_model.compile_model(
    inputs,
    n_bits=8,
)

models_dir = Path(__file__).parent / "compiled_models"
models_dir.mkdir(exist_ok=True)
model_dir = models_dir / model_name
hybrid_model.save_and_clear_private_info(model_dir, via_mlir=True)
```

## Server Side Deployment

<!--pytest-codeblocks:cont-->

The [`save_and_clear_private_info`](<>) function serializes the FHE circuits
corresponding to the various parts of the model that were chosen to be moved
server-side. Furthermore, it saves all the information required
to serve these sub-models with FHE, using the [`FHEModelDev`](../developer-guide/api/concrete.ml.deployment.fhe_client_server.md#kbdclasskbd-fhemodeldev) class.

The [`FHEModelServer`](../developer-guide/api/concrete.ml.deployment.fhe_client_server.md#kbdclasskbd-fhemodelserver) class should be used to create a server application that exposes endpoints serving these sub-models:

```python
from concrete.ml.deployment import FHEModelServer

# Path where save_and_clear_private_info stored the compiled sub-model
MODULES = {model_name: {submodule_name: {"path": model_dir / "seq_0"}}}
server = FHEModelServer(str(MODULES[model_name][submodule_name]["path"]))
```
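
For example, the application below wraps the `FHEModelServer` instance created above in a web framework. This is a minimal sketch: the choice of FastAPI, the `/compute` endpoint name and the request layout are assumptions for illustration, not an API mandated by Concrete ML.

```python
# Minimal serving sketch (assumptions: FastAPI, a single "/compute" endpoint,
# encrypted input and evaluation keys sent as file uploads)
from fastapi import FastAPI, Response, UploadFile

app = FastAPI()

@app.post("/compute")
async def compute(model_input: UploadFile, evaluation_keys: UploadFile) -> Response:
    # Run the FHE circuit on the serialized encrypted input
    # (`server` is the FHEModelServer instance created above)
    encrypted_result = server.run(
        await model_input.read(),
        await evaluation_keys.read(),
    )
    # The result is still encrypted; only the client can decrypt it
    return Response(content=encrypted_result, media_type="application/octet-stream")
```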

For more information about serving FHE models, see the [client/server section](client_server.md#serving).

## Client Side

A client application that deploys a model with hybrid deployment can be developed
in a manner very similar to on-premise deployment: the model is loaded normally with PyTorch, but an extra step is required to specify the remote endpoint and the model parts that are to be executed remotely.

<!--pytest-codeblocks:cont-->

```python
# Modify the model to use the remote FHE server instead of the local weights
hybrid_model = HybridFHEModel(
    model,
    [submodule_name],
    server_remote_address="http://0.0.0.0:8000",
    model_name=model_name,
    verbose=False,
)
```

Next, the client application must obtain the parameters necessary to encrypt and
quantize data, as detailed in the [client/server documentation](client_server.md#production-deployment).

<!--pytest-codeblocks:cont-->

```python
path_to_clients = Path(__file__).parent / "clients"
hybrid_model.init_client(path_to_clients=path_to_clients)
```

When the client application is ready to make inference requests to the server, it must
set the operation mode of the `HybridFHEModel` instance to `HybridFHEMode.REMOTE`:

<!--pytest-codeblocks:cont-->

```python
from concrete.ml.torch.hybrid_model import HybridFHEMode

for module in hybrid_model.remote_modules.values():
    module.fhe_local_mode = HybridFHEMode.REMOTE
```

When performing inference with the `HybridFHEModel` instance, `hybrid_model`, only the regular `forward` method is called, as if the model were fully deployed locally:

<!--pytest-codeblocks:cont-->

```python
hybrid_model.forward(torch.randn((dim,)))
```

When calling `forward`, the `HybridFHEModel` handles, for each model part that is deployed remotely, all the necessary intermediate steps: quantizing the data, encrypting it, making the request to the server using the `requests` Python module, and decrypting and de-quantizing the result.
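
Conceptually, the steps performed for each remote module are equivalent to the following sketch, in which `client` is an `FHEModelClient` set up for the corresponding sub-model; the function name and endpoint URL are illustrative assumptions, not the actual `HybridFHEModel` internals:

```python
import requests

from concrete.ml.deployment import FHEModelClient

# Illustrative sketch of one remote call (evaluation keys are assumed to have
# been sent to the server beforehand, e.g., during client initialization)
def remote_forward(client: FHEModelClient, x, url: str):
    # Quantize, encrypt and serialize the clear input
    encrypted_input = client.quantize_encrypt_serialize(x)
    # Send the encrypted input to the server, which runs the FHE circuit
    response = requests.post(url, files={"model_input": encrypted_input})
    # Deserialize, decrypt and de-quantize the encrypted result
    return client.deserialize_decrypt_dequantize(response.content)
```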
21 changes: 21 additions & 0 deletions docs/built-in-models/nearest-neighbors.md
# Nearest-neighbors

Concrete ML offers a non-parametric nearest-neighbors classification model with a scikit-learn interface through the `KNeighborsClassifier` class.

| Concrete ML | scikit-learn |
| :------------------------: | --------------------------------------------------------------------------------------------------------------------- |
| [KNeighborsClassifier](<>) | [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) |

## Example usage

```python
from concrete.ml.sklearn import KNeighborsClassifier

concrete_classifier = KNeighborsClassifier(n_bits=2, n_neighbors=3)
```

The `KNeighborsClassifier` class quantizes the training dataset that is given to `.fit` using the specified number of bits, `n_bits`. As this value must be kept low to comply with [accumulator size constraints](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints), the accuracy of the model depends heavily on a well-chosen value of `n_bits` and on the dimensionality of the data.

The FHE inference latency of this model is heavily influenced by `n_bits` and the dimensionality of the data. Furthermore, the size of the training dataset has a linear impact on the complexity of the computation, and the number of nearest neighbors, `n_neighbors`, also plays a role.

The KNN computation executes in FHE in $$O(N \log^2 k)$$ steps, where $$N$$ is the training dataset size and $$k$$ is `n_neighbors`. Each step requires several PBS (programmable bootstrapping) operations at the precision required to represent the distances between the test vectors and the training dataset.
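
Putting it together, a sketch of training, compilation and encrypted inference follows; the toy data and its dimensions are illustrative assumptions:

```python
import numpy

from concrete.ml.sklearn import KNeighborsClassifier

# Toy training data (illustrative); low dimensionality keeps FHE latency manageable
X = numpy.random.rand(30, 4)
y = numpy.random.randint(0, 2, size=30)

model = KNeighborsClassifier(n_bits=2, n_neighbors=3)
model.fit(X, y)

# Compile to an FHE circuit, using the training data as a calibration set
model.compile(X)

# Encrypted inference on a single test vector
prediction = model.predict(X[:1], fhe="execute")
```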
25 changes: 10 additions & 15 deletions docs/built-in-models/neural-networks.md
@@ -17,17 +17,14 @@ While `NeuralNetClassifier` and `NeuralNetRegressor` provide scikit-learn-like
Good quantization parameter values are critical to make models [respect FHE constraints](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints). Weights and activations should be quantized to low precision (e.g., 2-4 bits). The sparsity of the network can be tuned [as described below](neural-networks.md#overflow-errors) to avoid accumulator overflow.
{% endhint %}

{% hint style="warning" %}
Using `nn.ReLU` as the activation function benefits from an optimization where quantization uses powers-of-two scales. This results in much faster inference times in FHE, thanks to a TFHE primitive that performs fast division by powers of two.
{% endhint %}

## Example usage

To create an instance of a Fully Connected Neural Network (FCNN), you need to instantiate one of the `NeuralNetClassifier` or `NeuralNetRegressor` classes and configure a number of parameters that are passed to their constructor. Note that some parameters need to be prefixed by `module__`, while others don't. The parameters related to the model (i.e., the underlying `nn.Module`) must have the prefix. The parameters related to training options do not require the prefix.

<!--
FIXME: Restore the test for this codeblock in the next RC
see: https://github.com/zama-ai/concrete-ml-internal/issues/2807
-->

<!-- pytest-codeblocks:skip -->

```python
from concrete.ml.sklearn import NeuralNetClassifier
import torch.nn as nn

n_inputs = 10
n_outputs = 2
params = {
    "module__n_layers": 2,
    "module__n_w_bits": 2,
    "module__n_a_bits": 2,
    "module__n_accum_bits": 8,
    "module__n_hidden_neurons_multiplier": 1,
    "module__activation_function": nn.ReLU,
    "max_epochs": 10,
}
```

@@ -56,13 +48,14 @@ The figure above right shows the Concrete ML neural network, trained with Quantization Aware Training
### Architecture parameters

- `module__n_layers`: number of layers in the FCNN; must be at least 1. Note that this is the total number of layers. For an NN model with a single hidden layer, set `module__n_layers=2`
- `module__activation_function`: can be one of the Torch activations (e.g., `nn.ReLU`; see the full list [here](../deep-learning/torch_support.md#activations)). Neural networks with `nn.ReLU` activation benefit from specific optimizations that make them around 10x faster than networks with other activation functions.

### Quantization parameters

- `n_w_bits` (default 3): number of bits for weights
- `n_a_bits` (default 3): number of bits for activations and inputs
- `n_accum_bits`: maximum desired accumulator bit-width. By default, this is unbounded which, depending on the weight and activation bit-widths, [may make the trained networks fail compilation](#overflow-errors). When set, the implementation will attempt to keep accumulators under this bit-width through [pruning](../advanced-topics/pruning.md) (i.e., setting some weights to zero)
- `power_of_two_scaling`: forces quantization scales to be powers-of-two which, when coupled with the ReLU activation, benefits from strong FHE inference-time optimization (see the sketch after this list)
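
As a sketch, these quantization options map to constructor parameters with the `module__` prefix; the values below, and the `module__` prefix for `power_of_two_scaling`, are assumptions for illustration rather than tuned recommendations:

```python
import torch.nn as nn
from concrete.ml.sklearn import NeuralNetClassifier

# Quantization configuration sketch (illustrative values)
quant_params = {
    "module__n_layers": 2,
    "module__n_w_bits": 4,                  # bits for weights
    "module__n_a_bits": 4,                  # bits for activations and inputs
    "module__n_accum_bits": 8,              # bound accumulators through pruning
    "module__power_of_two_scaling": True,   # pairs with nn.ReLU for faster FHE
    "module__activation_function": nn.ReLU,
    "max_epochs": 10,
}

concrete_classifier = NeuralNetClassifier(**quant_params)
```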

### Training parameters (from skorch)

@@ -89,4 +82,6 @@ You can give weights to each class to use in training. Note that this must be su

### Overflow errors

The `n_hidden_neurons_multiplier` parameter influences training accuracy as it controls the number of non-zero neurons that are allowed in each layer. Increasing `n_hidden_neurons_multiplier` improves accuracy, but should take into account precision limitations to avoid an overflow in the accumulator. The default value is a good compromise that avoids an overflow in most cases, but you may want to change the value of this parameter to reduce the breadth of the network if you have overflow errors. A value of 1 should be completely safe with respect to overflow.
The `n_accum_bits` parameter influences training accuracy as it controls the number of non-zero neurons that are allowed in each layer. Increasing `n_accum_bits` improves accuracy, but should take into account precision limitations to avoid an overflow in the accumulator. The default value is a good compromise that avoids an overflow in most cases, but you may want to change the value of this parameter to reduce the breadth of the network if you have overflow errors.

Furthermore, the number of neurons on intermediate layers is controlled through the `n_hidden_neurons_multiplier` parameter - a value of 1 will make intermediate layers have the same number of neurons as the number of dimensions of the input data.
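
As a rough worked example of why pruning controls overflow (an approximation for intuition, not the exact rule used by the framework): summing $$N$$ products of weights quantized to $$n_w$$ bits with activations quantized to $$n_a$$ bits requires roughly $$n_w + n_a + \lceil \log_2(N) \rceil$$ accumulator bits. With 2-bit weights, 2-bit activations and 100 active neurons per layer, the accumulator needs about $$2 + 2 + 7 = 11$$ bits, so reducing the number of active neurons is what brings the accumulator under `n_accum_bits`.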
