docs: detail hybrid deployment, optimization of built-in NNs, K-neare…
andrei-stoian-zama committed Sep 20, 2023
1 parent 9716831 commit 78dba60
Showing 3 changed files with 152 additions and 15 deletions.
121 changes: 121 additions & 0 deletions docs/advanced-topics/hybrid-models.md
# Hybrid model deployment

FHE enables cloud applications to process private user data without running the risk of data leaks. Furthermore, deploying ML models in the cloud is advantageous as it eases model updates, allows scaling to large numbers of users through large amounts of compute power, and protects model IP by keeping the model on a trusted server rather than on the client device.

However, not all applications can be easily converted to FHE computation, and the computation cost of FHE may cause a fully converted model to exceed latency requirements.

Hybrid models are a compromise between on-premise or on-device deployment and full cloud deployment. Hybrid deployment means parts of the model are executed on the client side and parts are executed in FHE on the server side. Concrete ML supports hybrid deployment of neural network models such as MLPs, CNNs and large language models.

{% hint style="warning" %}
If model IP protection is important, care must be taken in choosing the parts of a model to be executed in the cloud. Some
black-box model-stealing attacks rely on knowledge distillation
or on differential methods. As a general rule, the difficulty
of stealing a machine learning model is proportional to the size of the model, in terms of the number of parameters and the model depth.
{% endhint %}

The hybrid model deployment API provides an easy way to integrate the [standard deployment procedure](client_server.md) into neural network style models that are compiled with [`compile_brevitas_qat_model`](../developer-guide/api/concrete.ml.torch.compile.md#kbdfunctionkbd-compilebrevitasqatmodel) or [`compile_torch_model`](../developer-guide/api/concrete.ml.torch.compile.md#kbdfunctionkbd-compiletorchmodel).

## Compilation

To use hybrid model deployment, the first step is to define which part of the PyTorch neural network model must be executed in FHE. The model part must be an `nn.Module` and is identified by its key in the original model's `.named_modules()`.

```python
from pathlib import Path

import torch
from torch import nn

from concrete.ml.torch.hybrid_model import HybridFHEModel

class FCSmall(nn.Module):
    """Torch model for the tests."""

    def __init__(self, dim):
        super().__init__()
        self.seq = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.seq(x)

dim = 10
model = FCSmall(dim)
model_name = "FCSmall"
submodule_name = "seq.0"

# Representative inputs, used to calibrate quantization during compilation
inputs = torch.randn((10, dim))

# Prints ['', 'seq', 'seq.0', 'seq.1', 'seq.2']
print([k for (k, _) in model.named_modules()])

# Create a hybrid model
hybrid_model = HybridFHEModel(model, [submodule_name])
hybrid_model.compile_model(
    inputs,
    n_bits=8,
)

models_dir = Path(__file__).parent / "compiled_models"
models_dir.mkdir(exist_ok=True)
model_dir = models_dir / model_name
hybrid_model.save_and_clear_private_info(model_dir, via_mlir=True)
```

## Server Side Deployment

<!--pytest-codeblocks:cont-->

The [`save_and_clear_private_info`](<>) function serializes the FHE circuits
corresponding to the various parts of the model that were chosen to be moved
server-side. Furthermore, it saves all the information required
to serve these sub-models with FHE, using the [`FHEModelDev`](../developer-guide/api/concrete.ml.deployment.fhe_client_server.md#kbdclasskbd-fhemodeldev) class.

The [`FHEModelServer`](../developer-guide/api/concrete.ml.deployment.fhe_client_server.md#kbdclasskbd-fhemodelserver) class should be used to create a server application that exposes endpoints serving these sub-models:

```python
from concrete.ml.deployment import FHEModelServer

# Path where save_and_clear_private_info stored the compiled sub-model
MODULES = {model_name: {submodule_name: {"path": model_dir / "seq_0"}}}
server = FHEModelServer(str(MODULES[model_name][submodule_name]["path"]))
```
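
For example, the application below wraps the `FHEModelServer` instance created above in a web framework. This is a minimal sketch: the choice of FastAPI, the `/compute` endpoint name and the request layout are assumptions for illustration, not an API mandated by Concrete ML.

```python
# Minimal serving sketch (assumptions: FastAPI, a single "/compute" endpoint,
# encrypted input and evaluation keys sent as file uploads)
from fastapi import FastAPI, Response, UploadFile

app = FastAPI()

@app.post("/compute")
async def compute(model_input: UploadFile, evaluation_keys: UploadFile) -> Response:
    # Run the FHE circuit on the serialized encrypted input
    # (`server` is the FHEModelServer instance created above)
    encrypted_result = server.run(
        await model_input.read(),
        await evaluation_keys.read(),
    )
    # The result is still encrypted; only the client can decrypt it
    return Response(content=encrypted_result, media_type="application/octet-stream")
```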

For more information about serving FHE models, see the [client/server section](client_server.md#serving).

## Client Side

A client application that deploys a model with hybrid deployment can be developed
in a manner very similar to on-premise deployment: the model is loaded normally with PyTorch, but an extra step is required to specify the remote endpoint and the model parts that are to be executed remotely.

<!--pytest-codeblocks:cont-->

```python
# Modify the model to use the remote FHE server instead of the local weights
hybrid_model = HybridFHEModel(
    model,
    [submodule_name],
    server_remote_address="http://0.0.0.0:8000",
    model_name=model_name,
    verbose=False,
)
```

Next, the client application must obtain the parameters necessary to encrypt and
quantize data, as detailed in the [client/server documentation](client_server.md#production-deployment).

<!--pytest-codeblocks:cont-->

```python
path_to_clients = Path(__file__).parent / "clients"
hybrid_model.init_client(path_to_clients=path_to_clients)
```

When the client application is ready to make inference requests to the server, it must
set the operation mode of the `HybridFHEModel` instance to `HybridFHEMode.REMOTE`:

<!--pytest-codeblocks:cont-->

```python
from concrete.ml.torch.hybrid_model import HybridFHEMode

for module in hybrid_model.remote_modules.values():
    module.fhe_local_mode = HybridFHEMode.REMOTE
```

When performing inference with the `HybridFHEModel` instance, `hybrid_model`, only the regular `forward` method is called, as if the model were fully deployed locally:

<!--pytest-codeblocks:cont-->

```python
hybrid_model.forward(torch.randn((dim,)))
```

When calling `forward`, the `HybridFHEModel` handles, for each model part that is deployed remotely, all the necessary intermediate steps: quantizing the data, encrypting it, making the request to the server using the `requests` Python module, and decrypting and de-quantizing the result.
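
Conceptually, the steps performed for each remote module are equivalent to the following sketch, in which `client` is an `FHEModelClient` set up for the corresponding sub-model; the function name and endpoint URL are illustrative assumptions, not the actual `HybridFHEModel` internals:

```python
import requests

from concrete.ml.deployment import FHEModelClient

# Illustrative sketch of one remote call (evaluation keys are assumed to have
# been sent to the server beforehand, e.g., during client initialization)
def remote_forward(client: FHEModelClient, x, url: str):
    # Quantize, encrypt and serialize the clear input
    encrypted_input = client.quantize_encrypt_serialize(x)
    # Send the encrypted input to the server, which runs the FHE circuit
    response = requests.post(url, files={"model_input": encrypted_input})
    # Deserialize, decrypt and de-quantize the encrypted result
    return client.deserialize_decrypt_dequantize(response.content)
```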
21 changes: 21 additions & 0 deletions docs/built-in-models/nearest-neighbors.md
# Nearest-neighbors

Concrete ML offers a non-parametric nearest-neighbors classification model with a scikit-learn interface through the `KNeighborsClassifier` class.

| Concrete ML | scikit-learn |
| :------------------------: | --------------------------------------------------------------------------------------------------------------------- |
| [KNeighborsClassifier](<>) | [KNeighborsClassifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html) |

## Example usage

```python
from concrete.ml.sklearn import KNeighborsClassifier

concrete_classifier = KNeighborsClassifier(n_bits=2, n_neighbors=3)
```

The `KNeighborsClassifier` class quantizes the training dataset that is given to `.fit` using the specified number of bits, `n_bits`. As this value must be kept low to comply with [accumulator size constraints](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints), the accuracy of the model depends heavily on a well-chosen value of `n_bits` and on the dimensionality of the data.

The FHE inference latency of this model is heavily influenced by `n_bits` and the dimensionality of the data. Furthermore, the size of the training dataset has a linear impact on the complexity of the computation, and the number of nearest neighbors, `n_neighbors`, also plays a role.

The KNN computation executes in FHE in $$O(N \log^2 k)$$ steps, where $$N$$ is the training dataset size and $$k$$ is `n_neighbors`. Each step requires several PBS (programmable bootstrapping) operations at the precision required to represent the distances between the test vectors and the training dataset.
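
Putting it together, a sketch of training, compilation and encrypted inference follows; the toy data and its dimensions are illustrative assumptions:

```python
import numpy

from concrete.ml.sklearn import KNeighborsClassifier

# Toy training data (illustrative); low dimensionality keeps FHE latency manageable
X = numpy.random.rand(30, 4)
y = numpy.random.randint(0, 2, size=30)

model = KNeighborsClassifier(n_bits=2, n_neighbors=3)
model.fit(X, y)

# Compile to an FHE circuit, using the training data as a calibration set
model.compile(X)

# Encrypted inference on a single test vector
prediction = model.predict(X[:1], fhe="execute")
```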
25 changes: 10 additions & 15 deletions docs/built-in-models/neural-networks.md
@@ -17,17 +17,14 @@ While `NeuralNetClassifier` and `NeuralNetRegressor` provide scikit-learn-like
Good quantization parameter values are critical to make models [respect FHE constraints](../getting-started/concepts.md#model-accuracy-considerations-under-fhe-constraints). Weights and activations should be quantized to low precision (e.g., 2-4 bits). The sparsity of the network can be tuned [as described below](neural-networks.md#overflow-errors) to avoid accumulator overflow.
{% endhint %}

{% hint style="warning" %}
Using `nn.ReLU` as the activation function benefits from an optimization where quantization uses powers-of-two scales. This results in much faster inference times in FHE, thanks to a TFHE primitive that performs fast division by powers of two.
{% endhint %}

## Example usage

To create an instance of a Fully Connected Neural Network (FCNN), you need to instantiate one of the `NeuralNetClassifier` or `NeuralNetRegressor` classes and configure a number of parameters that are passed to their constructor. Note that some parameters need to be prefixed by `module__`, while others don't. The parameters related to the model (i.e., the underlying `nn.Module`) must have the prefix. The parameters related to training options do not require the prefix.

<!--
FIXME: Restore the test for this codeblock in the next RC
see: https://github.com/zama-ai/concrete-ml-internal/issues/2807
-->

<!-- pytest-codeblocks:skip -->

```python
from concrete.ml.sklearn import NeuralNetClassifier
import torch.nn as nn

n_inputs = 10
n_outputs = 2
params = {
    "module__n_layers": 2,
    "module__n_w_bits": 2,
    "module__n_a_bits": 2,
    "module__n_accum_bits": 8,
    "module__n_hidden_neurons_multiplier": 1,
    "module__activation_function": nn.ReLU,
    "max_epochs": 10,
}
```

@@ -56,13 +48,14 @@ The figure above right shows the Concrete ML neural network, trained with Quantization Aware Training
### Architecture parameters

- `module__n_layers`: number of layers in the FCNN; must be at least 1. Note that this is the total number of layers. For an NN model with a single hidden layer, set `module__n_layers=2`
- `module__activation_function`: can be one of the Torch activations (e.g., `nn.ReLU`; see the full list [here](../deep-learning/torch_support.md#activations)). Neural networks with `nn.ReLU` activation benefit from specific optimizations that make them around 10x faster than networks with other activation functions.

### Quantization parameters

- `n_w_bits` (default 3): number of bits for weights
- `n_a_bits` (default 3): number of bits for activations and inputs
- `n_accum_bits`: maximum desired accumulator bit-width. By default, this is unbounded which, depending on the weight and activation bit-widths, [may make the trained networks fail compilation](#overflow-errors). When set, the implementation will attempt to keep accumulators under this bit-width through [pruning](../advanced-topics/pruning.md) (i.e., setting some weights to zero)
- `power_of_two_scaling`: forces quantization scales to be powers-of-two which, when coupled with the ReLU activation, benefits from strong FHE inference-time optimization (see the sketch after this list)
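
As a sketch, these quantization options map to constructor parameters with the `module__` prefix; the values below, and the `module__` prefix for `power_of_two_scaling`, are assumptions for illustration rather than tuned recommendations:

```python
import torch.nn as nn
from concrete.ml.sklearn import NeuralNetClassifier

# Quantization configuration sketch (illustrative values)
quant_params = {
    "module__n_layers": 2,
    "module__n_w_bits": 4,                  # bits for weights
    "module__n_a_bits": 4,                  # bits for activations and inputs
    "module__n_accum_bits": 8,              # bound accumulators through pruning
    "module__power_of_two_scaling": True,   # pairs with nn.ReLU for faster FHE
    "module__activation_function": nn.ReLU,
    "max_epochs": 10,
}

concrete_classifier = NeuralNetClassifier(**quant_params)
```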

### Training parameters (from skorch)

@@ -89,4 +82,6 @@ You can give weights to each class to use in training. Note that this must be su

### Overflow errors

The `n_hidden_neurons_multiplier` parameter influences training accuracy as it controls the number of non-zero neurons that are allowed in each layer. Increasing `n_hidden_neurons_multiplier` improves accuracy, but should take into account precision limitations to avoid an overflow in the accumulator. The default value is a good compromise that avoids an overflow in most cases, but you may want to change the value of this parameter to reduce the breadth of the network if you have overflow errors. A value of 1 should be completely safe with respect to overflow.
The `n_accum_bits` parameter influences training accuracy as it controls the number of non-zero neurons that are allowed in each layer. Increasing `n_accum_bits` improves accuracy, but should take into account precision limitations to avoid an overflow in the accumulator. The default value is a good compromise that avoids an overflow in most cases, but you may want to change the value of this parameter to reduce the breadth of the network if you have overflow errors.

Furthermore, the number of neurons on intermediate layers is controlled through the `n_hidden_neurons_multiplier` parameter - a value of 1 will make intermediate layers have the same number of neurons as the number of dimensions of the input data.
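
As a rough worked example of why pruning controls overflow (an approximation for intuition, not the exact rule used by the framework): summing $$N$$ products of weights quantized to $$n_w$$ bits with activations quantized to $$n_a$$ bits requires roughly $$n_w + n_a + \lceil \log_2(N) \rceil$$ accumulator bits. With 2-bit weights, 2-bit activations and 100 active neurons per layer, the accumulator needs about $$2 + 2 + 7 = 11$$ bits, so reducing the number of active neurons is what brings the accumulator under `n_accum_bits`.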
