docs: new features in CML 1.6 #735

Merged
merged 14 commits into from
Jun 20, 2024
16 changes: 16 additions & 0 deletions docs/built-in-models/encrypted_dataframe.md
@@ -45,6 +45,22 @@ df_decrypted = client.decrypt_to_pandas(df_encrypted)
- **Quantized Float**: Floating-point numbers are quantized to integers within the supported range. This is achieved by computing a scale and zero point for each column, which are used to map the floating-point numbers to the quantized integer space.
- **String Enum**: String columns are mapped to integers starting from 1. This mapping is stored and later used for de-quantization. If the number of unique strings exceeds 15, a `ValueError` is raised.
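
These two steps can be sketched in plain Python. This is an illustrative simplification of the behavior described above, not Concrete ML's implementation; the number of float quantization levels is an assumption here:

```python
def make_string_mapping(values):
    """Map unique strings to integers starting from 1."""
    uniques = sorted(set(values))
    if len(uniques) > 15:
        raise ValueError("At most 15 unique strings are supported")
    return {value: index for index, value in enumerate(uniques, start=1)}


def quantize_float_column(values, n_levels=15):
    """Quantize floats to integers using a per-column scale and zero point."""
    zero_point = min(values)
    # Avoid a zero scale for constant columns
    scale = (max(values) - zero_point) / (n_levels - 1) or 1.0
    return [round((v - zero_point) / scale) for v in values]


mapping = make_string_mapping(["abc", "bcd", "abc"])  # {"abc": 1, "bcd": 2}
quantized = quantize_float_column([0.1, 0.3, 0.5])    # [0, 7, 14]
```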

### Using a user-defined schema

Before encryption, pre-processing is applied to the data: for example, **string enums** must first be mapped to integers, and floating-point values must be quantized. By default, this mapping is done automatically. However, when two different clients encrypt their data separately, the automatic mappings may differ, for example because some values are missing from one client's DataFrame. In that case, the column cannot be selected for merging encrypted DataFrames.

The Encrypted DataFrame supports user-defined mappings. These schemas are defined as a dictionary where keys represent column names and values contain metadata about the column. Supported column metadata are:

- string columns: mapping between string values and integers.
- float columns: the min/max range that the column values lie in.

```python
schema = {
    "string_column": {"abc": 1, "bcd": 2},
    "float_column": {"min": 0.1, "max": 0.5},
}
```
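
With a shared schema, two clients that encrypt separately encode the same string to the same integer, which is what makes their columns mergeable. A minimal sketch of this idea in plain Python (`apply_schema` is a hypothetical helper for illustration, not part of the Concrete ML API):

```python
def apply_schema(column_values, string_mapping):
    """Encode a string column with a user-defined mapping, rejecting
    values that the mapping does not cover."""
    unknown = set(column_values) - set(string_mapping)
    if unknown:
        raise ValueError(f"Values {unknown} are missing from the schema")
    return [string_mapping[value] for value in column_values]


mapping = {"abc": 1, "bcd": 2}
client_a = apply_schema(["abc", "abc"], mapping)  # [1, 1]
client_b = apply_schema(["bcd"], mapping)         # [2]
# Without the shared mapping, client B's automatic encoding would
# assign "bcd" to 1, and the merged columns would not line up.
```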

## Supported operations

The Encrypted DataFrame is designed to support a subset of the operations available for pandas DataFrames. For now, only the `merge` operation is supported. More operations will be added in future releases.
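
On clear data, the supported operation corresponds to an ordinary relational join on a key column. A minimal left-join sketch in plain Python, shown only to fix the semantics (`left_join` is a hypothetical helper, not part of the API):

```python
def left_join(left_rows, right_rows, on):
    """Join two lists of dict rows on a key column, keeping all left rows."""
    index = {row[on]: row for row in right_rows}
    joined = []
    for row in left_rows:
        match = index.get(row[on], {})
        joined.append({**row, **{k: v for k, v in match.items() if k != on}})
    return joined


left = [{"id": 1, "a": 10}, {"id": 2, "a": 20}]
right = [{"id": 1, "b": 100}]
result = left_join(left, right, on="id")
# [{"id": 1, "a": 10, "b": 100}, {"id": 2, "a": 20}]
```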
29 changes: 22 additions & 7 deletions docs/built-in-models/training.md
@@ -1,10 +1,25 @@
# Encrypted training

This document explains how to train [SGD Logistic Regression](../references/api/concrete.ml.sklearn.linear_model.md#class-sgdclassifier) on encrypted data.

Training on encrypted data is done through an FHE program that is generated by Concrete ML, based on the characteristics of the data that are given to the `fit` function. The FHE program associated with an `SGDClassifier` object, once it is fit on encrypted data, is specific to the distribution and dimensionality of that data.

When deploying encrypted training services, developers need to consider the type of data that future users of their services will train on:

- the distribution of the users' data should match the one the FHE program was generated for, to achieve good accuracy
- the dimensionality of the data needs to match, since the deployed FHE programs are compiled for a fixed number of dimensions

See the [deployment](#deployment) section for more details.

{% hint style="info" %}
Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML can import models trained through federated learning using 3rd party tools. All model types are supported - linear, tree-based and neural networks - through the [`from_sklearn` function](linear.md#pre-trained-models) and the [`compile_torch_model`](../deep-learning/torch_support.md) function.
{% endhint %}

## Example

The [logistic regression training](../advanced_examples/LogisticRegressionTraining.ipynb) notebook shows this feature in action.

The following snippet shows how to instantiate a logistic regression model that trains on encrypted data:

```python
from concrete.ml.sklearn import SGDClassifier
@@ -18,7 +33,7 @@ model = SGDClassifier(
)
```

To activate encrypted training, simply set `fit_encrypted=True` in the constructor. When this value is set, Concrete ML generates an FHE program which, when called through the `fit` function, processes encrypted training data, labels, and initial weights, and outputs trained model weights. If this value is not set, training is performed on clear data using `scikit-learn` gradient descent.

Next, to perform the training on encrypted data, call the `fit` function with the `fhe="execute"` argument:

@@ -28,10 +43,6 @@

```python
model.fit(X_binary, y_binary, fhe="execute")
```


## Training configuration

The `max_iter` parameter controls the number of batches that are processed by the training algorithm.
@@ -43,3 +54,7 @@ The `parameters_range` parameter determines the initialization of the coefficients
The trainable logistic model uses Stochastic Gradient Descent (SGD) and quantizes the data, weights, gradients and the error measure. It currently supports training 6-bit models, including both the coefficients and the bias.

The `SGDClassifier` does not currently support training models with other bit-width values. The execution time to train a model is proportional to the number of features and the number of training examples in the batch. The `SGDClassifier` training does not currently support client/server deployment for training.
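
To make the quantized training loop concrete, here is a plain-Python sketch of one SGD step for logistic regression on quantized inputs. This is a simplification for illustration only; the actual FHE program's quantization and arithmetic differ:

```python
import math


def quantize_symmetric(values, n_bits=6):
    """Symmetric quantization of floats to signed n-bit integers."""
    max_abs = max(abs(v) for v in values) or 1.0
    scale = max_abs / (2 ** (n_bits - 1) - 1)
    return [round(v / scale) for v in values], scale


def sgd_step(weights, batch_x, batch_y, learning_rate=0.1):
    """One SGD step on a batch: sigmoid prediction, error, gradient update."""
    for x, y in zip(batch_x, batch_y):
        z = sum(w * xi for w, xi in zip(weights, x))
        prediction = 1.0 / (1.0 + math.exp(-z))
        error = prediction - y
        weights = [w - learning_rate * error * xi for w, xi in zip(weights, x)]
    return weights


# Quantize one training example, then run a step on the de-quantized values
q_x, scale = quantize_symmetric([0.5, -1.0, 0.25])
weights = sgd_step([0.0, 0.0, 0.0], [[q * scale for q in q_x]], [1])
```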

## Deployment

Once you have tested an `SGDClassifier` that trains on encrypted data, it is possible to build an FHE training service by deploying the FHE training program of the `SGDClassifier`. See the [Production Deployment](../guides/client_server.md) page for more details on how to use the Concrete ML deployment utility classes. To deploy an FHE training program, the `mode='training'` parameter must be passed to the `FHEModelDev` class.
4 changes: 4 additions & 0 deletions docs/built-in-models/tree.md
@@ -26,6 +26,10 @@
Increasing the maximum depth parameter of decision trees and tree-ensemble models strongly increases the number of nodes in the trees. Therefore, we recommend using XGBoost models, which achieve better performance with lower depth.
{% endhint %}
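
The hint above can be quantified: a full binary tree of depth `d` has `2^(d+1) - 1` nodes, so node count grows exponentially with depth:

```python
def full_binary_tree_nodes(depth):
    """Total number of nodes in a full binary tree of the given depth."""
    return 2 ** (depth + 1) - 1


for depth in (3, 6, 10):
    print(depth, full_binary_tree_nodes(depth))
# 3 15
# 6 127
# 10 2047
```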

## Pre-trained models

You can convert an already trained scikit-learn tree-based model to a Concrete ML one by using the [`from_sklearn_model`](../references/api/concrete.ml.sklearn.base.md#classmethod-from_sklearn_model) method.

## Example

Here's an example of how to use this model in FHE on a popular dataset, using some of scikit-learn's pre-processing tools. You can find a more complete example in the [XGBClassifier notebook](../tutorials/ml_examples.md).
2 changes: 1 addition & 1 deletion docs/getting-started/README.md
@@ -12,7 +12,7 @@

- **Training on encrypted data**: FHE is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. Learn more about FHE in [this introduction](https://www.zama.ai/post/tfhe-deep-dive-part-1) or join the [FHE.org](https://fhe.org) community.

- **Federated learning**: Training on encrypted data provides the highest level of privacy but is slower than training on clear data. Federated learning is an alternative approach, where data privacy can be ensured by using a trusted gradient aggregator, coupled with optional _differential privacy_ instead of encryption. Concrete ML can import all model types (linear, tree-based, and neural networks) trained through federated learning, using the [`from_sklearn` function](../built-in-models/linear.md#pre-trained-models) and the [`compile_torch_model`](../deep-learning/torch_support.md) function.

## Example usage

2 changes: 1 addition & 1 deletion docs/guides/client_server.md
@@ -19,7 +19,7 @@

The `FHEModelDev`, `FHEModelClient`, and `FHEModelServer` classes in the `concrete.ml.deployment` module make it easy to deploy and interact between the client and server:

- **`FHEModelDev`**: This class is used during the development phase to prepare and save the model artifacts (`client.zip` and `server.zip`). It handles the serialization of the underlying FHE circuit as well as the crypto-parameters used for generating the keys.
- **`FHEModelDev`**: Use the `save` method of this class during the development phase to prepare and save the model artifacts (`client.zip` and `server.zip`). This class handles the serialization of the underlying FHE circuit as well as the crypto-parameters used for generating the keys. By changing the `mode` parameter of the `save` method, you can deploy a trained model or a [training FHE program](../built-in-models/training.md).

- **`FHEModelClient`**: This class is used on the client side to generate and serialize the cryptographic keys, encrypt the data before sending it to the server, and decrypt the results received from the server. It also handles the loading of quantization parameters and pre/post-processing from `serialized_processing.json`.
