Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

docs: add explanation of encrypted training and federated learning #437

Merged
merged 10 commits into from
Jan 18, 2024
6 changes: 4 additions & 2 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,11 +23,11 @@
</p>
<hr>

Concrete ML is a Privacy-Preserving Machine Learning (PPML) open-source set of tools built on top of [Concrete](https://github.com/zama-ai/concrete) by [Zama](https://github.com/zama-ai). It aims to simplify the use of fully homomorphic encryption (FHE) for data scientists to help them automatically turn machine learning models into their homomorphic equivalent. Concrete ML was designed with ease-of-use in mind, so that data scientists can use it without knowledge of cryptography. Notably, the Concrete ML model classes are similar to those in scikit-learn and it is also possible to convert PyTorch models to FHE.
Concrete ML is a Privacy-Preserving Machine Learning (PPML) open-source set of tools built on top of [Concrete](https://github.com/zama-ai/concrete) by [Zama](https://github.com/zama-ai). It simplifies the use of fully homomorphic encryption (FHE) for data scientists to help them automatically turn machine learning models into their homomorphic equivalent. Concrete ML was designed with ease-of-use in mind, so that data scientists can use it without knowledge of cryptography. Notably, the Concrete ML model classes are similar to those in scikit-learn and it is also possible to convert PyTorch models to FHE.

## Main features.

Data scientists can use models with APIs which are close to the frameworks they use, with additional options to run inferences in FHE.
Data scientists can use models with APIs which are close to the frameworks they use, while additional options to those models allow them to run inference or training on encrypted data with FHE.

Concrete ML features:

Expand Down Expand Up @@ -154,6 +154,8 @@ Various tutorials are given for [built-in models](docs/built-in-models/ml_exampl
- [Health diagnosis](use_case_examples/disease_prediction/): based on a patient's symptoms, history and other health factors, give
a diagnosis using FHE to preserve the privacy of the patient.

- [Private inference for federated learned models](use_case_examples/federated_learning/): private training of a Logistic Regression model and then import the model into Concrete ML and perform encrypted prediction
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bcm-at-zama added this as you suggested


- [Titanic](use_case_examples/titanic/KaggleTitanic.ipynb): solve the [Kaggle Titanic competition](https://www.kaggle.com/c/titanic/). Implemented with XGBoost from Concrete ML, this example comes as a companion of the [Kaggle notebook](https://www.kaggle.com/code/concretemlteam/titanic-with-privacy-preserving-machine-learning), and was the subject of a blogpost in [KDnuggets](https://www.kdnuggets.com/2022/08/machine-learning-encrypted-data.html).

- [Sentiment analysis with transformers](use_case_examples/sentiment_analysis_with_transformer): predict if an encrypted tweet / short message is positive, negative or neutral, using FHE. The [live interactive](https://huggingface.co/spaces/zama-fhe/encrypted_sentiment_analysis) demo is available on Hugging Face. This [blog post](https://huggingface.co/blog/sentiment-analysis-fhe) explains how this demo works!
Expand Down
9 changes: 6 additions & 3 deletions docs/README.md
Original file line number Diff line number Diff line change
Expand Up @@ -4,10 +4,13 @@

![](.gitbook/assets/3.png)

Concrete ML is an open source, privacy-preserving, machine learning inference framework based on Fully Homomorphic Encryption (FHE). It enables data scientists without any prior knowledge of cryptography to automatically turn machine learning models into their FHE equivalent, using familiar APIs from scikit-learn and PyTorch (see how it looks for [linear models](built-in-models/linear.md), [tree-based models](built-in-models/tree.md), and [neural networks](built-in-models/neural-networks.md)).
Concrete ML is an open source, privacy-preserving, machine learning framework based on Fully Homomorphic Encryption (FHE). It enables data scientists without any prior knowledge of cryptography to automatically turn machine learning models into their FHE equivalent, using familiar APIs from scikit-learn and PyTorch (see how it looks for [linear models](built-in-models/linear.md), [tree-based models](built-in-models/tree.md), and [neural networks](built-in-models/neural-networks.md)). Concrete ML supports converting models for inference with FHE but can also [train some models](built-in-models/training.md) on encrypted data.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sentence Concrete ML supports converting models for inference with FHE but can also [train some models](built-in-models/training.md) on encrypted data. is grammatically correct, however, the use of "but" might be slightly misleading, as the two features (model conversion for inference and model training on encrypted data) are not in opposition but are complementary

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would suggest:
Concrete ML not only supports converting models for inference with FHE, it also enables training somes models ...


Fully Homomorphic Encryption is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. You can learn more about FHE in [this introduction](https://www.zama.ai/post/tfhe-deep-dive-part-1) or by joining the [FHE.org](https://fhe.org) community.

Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML
can import linear models, including logistic regression, that are trained using federated learning using the [`from_sklearn` function](./built-in-models/linear.md#pre-trained-models).

## Example usage

Here is a simple example of classification on encrypted data using logistic regression. More examples can be found [here](built-in-models/ml_examples.md).
Expand Down Expand Up @@ -86,11 +89,11 @@ This example shows the typical flow of a Concrete ML model:

To make a model work with FHE, the only constraint is to make it run within the supported precision limitations of Concrete ML (currently 16-bit integers). Thus, machine learning models must be quantized, which sometimes leads to a loss of accuracy versus the original model, which operates on plaintext.

Additionally, Concrete ML currently only supports FHE _inference_. Training has to be done on unencrypted data, producing a model which is then converted to an FHE equivalent that can perform encrypted inference (i.e., prediction over encrypted data).
Additionally, Concrete ML currently only supports training on encrypted data for some models, while it supports _inference_ for a large variety of models.

Finally, there is currently no support for pre-processing model inputs and post-processing model outputs. These processing stages may involve text-to-numerical feature transformation, dimensionality reduction, KNN or clustering, featurization, normalization, and the mixing of results of ensemble models.

These issues are currently being addressed, and significant improvements are expected to be released in the coming months.
These issues are currently being addressed, and significant improvements are expected to be released in the near future.

## Concrete stack

Expand Down
1 change: 1 addition & 0 deletions docs/SUMMARY.md
Original file line number Diff line number Diff line change
Expand Up @@ -16,6 +16,7 @@
- [Neural Networks](built-in-models/neural-networks.md)
- [Nearest Neighbors](built-in-models/nearest-neighbors.md)
- [Pandas](built-in-models/pandas.md)
- [Encrypted training](built-in-models/training.md)
- [Built-in Model Examples](built-in-models/ml_examples.md)

## Deep Learning
Expand Down
22 changes: 11 additions & 11 deletions docs/advanced-topics/advanced_features.md
Original file line number Diff line number Diff line change
Expand Up @@ -6,25 +6,25 @@ Concrete ML provides features for advanced users to adjust cryptographic paramet

Concrete ML makes use of table lookups (TLUs) to represent any non-linear operation (e.g., a sigmoid). TLUs are implemented through the Programmable Bootstrapping (PBS) operation, which applies a non-linear operation in the cryptographic realm.

The result of TLU operations is obtained with a specific error probability. Concrete ML offers the possibility to set this error probability, which influences the cryptographic parameters. The higher the success rate, the more restrictive the parameters become. This can affect both key generation and, more significantly, FHE execution time.
The result of TLU operations is obtained with a specific tolerance to off-by-one errors. Concrete ML offers the possibility to set the probability of such errors occurring, which influences the cryptographic parameters. The lower the tolerance, the more restrictive the parameters become, making both key generation and, more significantly, FHE execution time slower.

{% hint style="info" %}
Concrete ML has a _simulation_ mode where the impact of approximate computation of TLUs on the model accuracy can be determined. The simulation is much faster, speeding up model development significantly. The behavior in simulation mode is representative of the behavior of the model on encrypted data.
{% endhint %}

In Concrete ML, there are three different ways to define the error probability:
In Concrete ML, there are three different ways to define the tolerance to off-by-one errors for each TLU operation:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we comment on the CRT representation around here?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no.. way to complicated to explain


- setting `p_error`, the error probability of an individual TLU (see [here](advanced_features.md#an-error-probability-for-an-individual-tlu))
- setting `global_p_error`, the error probability of the full circuit (see [here](advanced_features.md#a-global-error-probability-for-the-entire-model))
- setting `p_error`, the error probability of an individual TLU (see [here](advanced_features.md#tolerance-to-off-by-one-error-for-an-individual-tlu))
- setting `global_p_error`, the error probability of the full circuit (see [here](advanced_features.md#a-global-tolerance-for-one-off-errors-for-the-entire-model))
- not setting `p_error` nor `global_p_error`, and using default parameters (see [here](advanced_features.md#using-default-error-probability))

{% hint style="warning" %}
`p_error` and `global_p_error` are somehow two concurrent parameters, in the sense they both have an impact on the choice of cryptographic parameters. It is forbidden in Concrete ML to set both `p_error` and `global_p_error` simultaneously.
`p_error` and `global_p_error` cannot be set at the same time, as they are incompatible with each other.
{% endhint %}

### An error probability for an individual TLU
### Tolerance to off-by-one error for an individual TLU

The first way to set error probabilities in Concrete ML is at the local level, by directly setting the probability of error of each individual TLU. This probability is referred to as `p_error`. A given PBS operation has a `1 - p_error` chance of being successful. The successful evaluation here means that the value decrypted after FHE evaluation is exactly the same as the one that would be computed in the clear.
The first way to set error probabilities in Concrete ML is at the local level, by directly setting the tolerance to error of each individual TLU operation (such as activation functions for a neuron output). This tolerance is referred to as `p_error`. A given PBS operation has a `1 - p_error` chance of being correct 100% of the time. The successful evaluation here means that the value decrypted after FHE evaluation is exactly the same as the one that would be computed in the clear. Otherwise, off-by-one errors might occur, but, in practice, these errors are not necessarily problematic if they are sufficiently rare.

For simplicity, it is best to use [default options](advanced_features.md#using-default-error-probability), irrespective of the type of model. Especially for deep neural networks, default values may be too pessimistic, reducing computation speed without any improvement in accuracy. For deep neural networks, some TLU errors might not affect the accuracy of the network, so `p_error` can be safely increased (e.g., see CIFAR classifications in [our showcase](../getting-started/showcase.md)).

Expand Down Expand Up @@ -63,9 +63,9 @@ clf.compile(X_train, p_error=0.1)

If the `p_error` value is specified and [simulation](compilation.md#fhe-simulation) is enabled, the run will take into account the randomness induced by the choice of `p_error`. This results in statistical similarity to the FHE evaluation.

### A global error probability for the entire model
### A global tolerance for one-off-errors for the entire model

A `global_p_error` is also available and defines the probability of success for the entire model. Here, the `p_error` for every PBS is computed internally in Concrete such that the `global_p_error` is reached.
A `global_p_error` is also available and defines the probability of 100% correctness for the entire model, compared to execution in the clear. In this case, the `p_error` for every TLU is determined internally in Concrete such that the `global_p_error` is reached for the whole model.

There might be cases where the user encounters a `No cryptography parameter found` error message. Increasing the `p_error` or the `global_p_error` in this case might help.

Expand All @@ -78,7 +78,7 @@ Usage is similar to the `p_error` parameter:
clf.compile(X_train, global_p_error=0.1)
```

In the above example, XGBoostClassifier in FHE has a 1/10 probability to have a shifted output value compared to the expected value. The shift is relative to the expected value, so even if the result is different, it should be **around** the expected value.
In the above example, XGBoostClassifier in FHE has a 1/10 probability to have a one-off output value compared to the expected value. The shift is relative to the expected value, so even if the result is different, it should be **close** to the expected value.

### Using default error probability

Expand Down Expand Up @@ -162,7 +162,7 @@ $$t = L - P$$

Then, the rounding operation can be computed as:

$$ \mathrm{round\_to\_t\_bits}(x, t) = \left\lfloor \frac{x}{2^t} \right\rceil \cdot 2^t $$
$$ \mathrm{round\_to\_P\_bits}(x, t) = \left\lfloor \frac{x}{2^t} \right\rceil \cdot 2^t $$

where $$x$$ is the input number, and $$\lfloor \cdot \rceil$$ denotes the operation that rounds to the nearest integer.

Expand Down
393 changes: 164 additions & 229 deletions docs/advanced_examples/LogisticRegressionTraining.ipynb

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions docs/built-in-models/linear.md
Original file line number Diff line number Diff line change
Expand Up @@ -20,6 +20,8 @@ Using these models in FHE is extremely similar to what can be done with scikit-l

Models are also compatible with some of scikit-learn's main workflows, such as `Pipeline()` and `GridSearch()`.

## Pre-trained models

It is possible to convert an already trained scikit-learn linear model to a Concrete ML one by using the [`from_sklearn_model`](../developer-guide/api/concrete.ml.sklearn.base.md#classmethod-from_sklearn_model) method. See [below for an example](#loading-a-pre-trained-model). This functionality is only available for linear models.

## Quantization parameters
Expand Down
47 changes: 47 additions & 0 deletions docs/built-in-models/training.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,47 @@
# Training on Encrypted Data

Concrete ML offers the possibility to train [SGD Logistic Regression](../developer-guide/api/concrete.ml.sklearn.linear_model.md#class-sgdclassifier) on encrypted data. The [logistic regression training](../advanced_examples/LogisticRegressionTraining.ipynb) example shows this feature in action.

This example shows how to instantiate a logistic regression model that trains on encrypted data:

```python
from concrete.ml.sklearn import SGDClassifier
parameters_range = (-1.0, 1.0)

model = SGDClassifier(
random_state=42,
max_iter=50,
fit_encrypted=True,
parameters_range=parameters_range,
)
```

To activate encrypted training simply set `fit_encrypted=True` in the constructor. If this value is not set, training is performed
on clear data using `scikit-learn` gradient descent.

Next, to perform the training on encrypted data, call the `fit` function with the `fhe="execute"` argument:

<!--pytest-codeblocks:skip-->

```python
model.fit(X_binary, y_binary, fhe="execute")
```

{% hint style="info" %}
Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML
can import linear models, including logistic regression, that are trained using federated learning using the [`from_sklearn` function](linear.md#pre-trained-models).

{% endhint %}

## Training configuration

The `max_iter` parameter controls the number of batches that are processed by the training algorithm.
Copy link
Collaborator

@RomanBredehoft RomanBredehoft Jan 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a parameter from sklearn, I think we can remove it no ? Unless there's a good reason to set it for CML and in that case we should tell it here

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd leave it as it is important to set it properly.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would maybe emphasize on that then ! saying that's it's important for whatever reasons


The `parameters_range` parameter determines the initialization of the coefficients and the bias of the logistic regression. It is recommended to give values that are close to the min/max of the training data. It is also possible to normalize the training data so that it lies in the range $$[-1, 1]$$.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe we should add that this isn't just for initialization but also for quantization?


## Capabilities and Limitations

The logistic model that can be trained uses Stochastic Gradient Descent (SGD) and quantizes for data, weights, gradients and the error measure. It currently supports training 6-bit models, training both the coefficients and the bias.

The `SGDClassifier` does not currently support training models with other values for the bit-widths. The execution time to train a model
is proportional to the number of features and the number of training examples in the batch. The `SGDClassifier` training does not currently support client/server deployment for training.
Loading
Loading