docs: add dataframe documentation (#576)
andrei-stoian-zama authored Apr 3, 2024
1 parent 5306f6c commit d3bf5ac
Showing 20 changed files with 737 additions and 249 deletions.
2 changes: 1 addition & 1 deletion docs/README.md
@@ -27,7 +27,7 @@ Learn the basics of Concrete ML, set it up, and make it run with ease.

Start building with Concrete ML by exploring its core features, discovering essential guides, and learning more with user-friendly tutorials.

<table data-view="cards"><thead><tr><th></th><th></th><th></th><th data-hidden data-card-target data-type="content-ref"></th><th data-hidden data-card-cover data-type="files"></th></tr></thead><tbody><tr><td><strong>Fundamentals</strong></td><td>Explore core features and basics of Concrete ML.<br><br></td><td><ul><li><a href="tutorials/ml_examples.md">Build-in models</a></li><li><a href="tutorials/dl_examples.md">Deep learning</a></li></ul></td><td></td><td><a href=".gitbook/assets/orange1.png">orange1.png</a></td></tr><tr><td><strong>Guides</strong></td><td>Discover essential guides to work with Concrete ML.<br><br></td><td><ul><li><a href="guides/prediction_with_fhe.md">Prediction with FHE</a></li><li><a href="guides/client_server.md">Production deployment</a></li><li><a href="guides/hybrid-models.md">Hybrid models</a></li></ul></td><td></td><td><a href=".gitbook/assets/orange2.png">orange2.png</a></td></tr><tr><td><strong>Tutorials</strong></td><td>Learn more about Concrete ML with our tutorials.<br><br></td><td><ul><li><a href="tutorials/showcase.md#start-here">Start here</a></li><li><a href="tutorials/showcase.md#go-further">Go further</a></li><li><a href="tutorials/showcase.md">See all tutorials</a></li></ul></td><td></td><td><a href=".gitbook/assets/orange3.png">orange3.png</a></td></tr></tbody></table>
<table data-view="cards"><thead><tr><th></th><th></th><th></th><th data-hidden data-card-target data-type="content-ref"></th><th data-hidden data-card-cover data-type="files"></th></tr></thead><tbody><tr><td><strong>Fundamentals</strong></td><td>Explore core features and basics of Concrete ML.<br><br></td><td><ul><li><a href="tutorials/ml_examples.md">Built-in models</a></li><li><a href="built-in-models/encrypted_dataframe.md">Encrypted data-frames</a></li><li><a href="tutorials/dl_examples.md">Deep learning</a></li></ul></td><td></td><td><a href=".gitbook/assets/orange1.png">orange1.png</a></td></tr><tr><td><strong>Guides</strong></td><td>Discover essential guides to work with Concrete ML.<br><br></td><td><ul><li><a href="guides/prediction_with_fhe.md">Prediction with FHE</a></li><li><a href="guides/client_server.md">Production deployment</a></li><li><a href="guides/hybrid-models.md">Hybrid models</a></li></ul></td><td></td><td><a href=".gitbook/assets/orange2.png">orange2.png</a></td></tr><tr><td><strong>Tutorials</strong></td><td>Learn more about Concrete ML with our tutorials.<br><br></td><td><ul><li><a href="tutorials/showcase.md#start-here">Start here</a></li><li><a href="tutorials/showcase.md#go-further">Go further</a></li><li><a href="tutorials/showcase.md">See all tutorials</a></li></ul></td><td></td><td><a href=".gitbook/assets/orange3.png">orange3.png</a></td></tr></tbody></table>

## Explore more

58 changes: 2 additions & 56 deletions docs/SUMMARY.md
@@ -15,7 +15,7 @@
- [Tree-based models](built-in-models/tree.md)
- [Neural networks](built-in-models/neural-networks.md)
- [Nearest neighbors](built-in-models/nearest-neighbors.md)
- [Pandas](built-in-models/pandas.md)
- [Encrypted dataframe](built-in-models/encrypted_dataframe.md)
- [Encrypted training](built-in-models/training.md)

## Deep Learning
@@ -41,62 +41,8 @@

## References

<!-- auto-created, do not edit, begin -->

- [API](references/api/README.md)
- [concrete.ml.common.check_inputs.md](references/api/concrete.ml.common.check_inputs.md)
- [concrete.ml.common.debugging.custom_assert.md](references/api/concrete.ml.common.debugging.custom_assert.md)
- [concrete.ml.common.debugging.md](references/api/concrete.ml.common.debugging.md)
- [concrete.ml.common.md](references/api/concrete.ml.common.md)
- [concrete.ml.common.serialization.decoder.md](references/api/concrete.ml.common.serialization.decoder.md)
- [concrete.ml.common.serialization.dumpers.md](references/api/concrete.ml.common.serialization.dumpers.md)
- [concrete.ml.common.serialization.encoder.md](references/api/concrete.ml.common.serialization.encoder.md)
- [concrete.ml.common.serialization.loaders.md](references/api/concrete.ml.common.serialization.loaders.md)
- [concrete.ml.common.serialization.md](references/api/concrete.ml.common.serialization.md)
- [concrete.ml.common.utils.md](references/api/concrete.ml.common.utils.md)
- [concrete.ml.deployment.deploy_to_aws.md](references/api/concrete.ml.deployment.deploy_to_aws.md)
- [concrete.ml.deployment.deploy_to_docker.md](references/api/concrete.ml.deployment.deploy_to_docker.md)
- [concrete.ml.deployment.fhe_client_server.md](references/api/concrete.ml.deployment.fhe_client_server.md)
- [concrete.ml.deployment.md](references/api/concrete.ml.deployment.md)
- [concrete.ml.deployment.server.md](references/api/concrete.ml.deployment.server.md)
- [concrete.ml.deployment.utils.md](references/api/concrete.ml.deployment.utils.md)
- [concrete.ml.onnx.convert.md](references/api/concrete.ml.onnx.convert.md)
- [concrete.ml.onnx.md](references/api/concrete.ml.onnx.md)
- [concrete.ml.onnx.onnx_impl_utils.md](references/api/concrete.ml.onnx.onnx_impl_utils.md)
- [concrete.ml.onnx.onnx_model_manipulations.md](references/api/concrete.ml.onnx.onnx_model_manipulations.md)
- [concrete.ml.onnx.onnx_utils.md](references/api/concrete.ml.onnx.onnx_utils.md)
- [concrete.ml.onnx.ops_impl.md](references/api/concrete.ml.onnx.ops_impl.md)
- [concrete.ml.pytest.md](references/api/concrete.ml.pytest.md)
- [concrete.ml.pytest.torch_models.md](references/api/concrete.ml.pytest.torch_models.md)
- [concrete.ml.pytest.utils.md](references/api/concrete.ml.pytest.utils.md)
- [concrete.ml.quantization.base_quantized_op.md](references/api/concrete.ml.quantization.base_quantized_op.md)
- [concrete.ml.quantization.md](references/api/concrete.ml.quantization.md)
- [concrete.ml.quantization.post_training.md](references/api/concrete.ml.quantization.post_training.md)
- [concrete.ml.quantization.quantized_module.md](references/api/concrete.ml.quantization.quantized_module.md)
- [concrete.ml.quantization.quantized_module_passes.md](references/api/concrete.ml.quantization.quantized_module_passes.md)
- [concrete.ml.quantization.quantized_ops.md](references/api/concrete.ml.quantization.quantized_ops.md)
- [concrete.ml.quantization.quantizers.md](references/api/concrete.ml.quantization.quantizers.md)
- [concrete.ml.search_parameters.md](references/api/concrete.ml.search_parameters.md)
- [concrete.ml.search_parameters.p_error_search.md](references/api/concrete.ml.search_parameters.p_error_search.md)
- [concrete.ml.sklearn.base.md](references/api/concrete.ml.sklearn.base.md)
- [concrete.ml.sklearn.glm.md](references/api/concrete.ml.sklearn.glm.md)
- [concrete.ml.sklearn.linear_model.md](references/api/concrete.ml.sklearn.linear_model.md)
- [concrete.ml.sklearn.md](references/api/concrete.ml.sklearn.md)
- [concrete.ml.sklearn.neighbors.md](references/api/concrete.ml.sklearn.neighbors.md)
- [concrete.ml.sklearn.qnn.md](references/api/concrete.ml.sklearn.qnn.md)
- [concrete.ml.sklearn.qnn_module.md](references/api/concrete.ml.sklearn.qnn_module.md)
- [concrete.ml.sklearn.rf.md](references/api/concrete.ml.sklearn.rf.md)
- [concrete.ml.sklearn.svm.md](references/api/concrete.ml.sklearn.svm.md)
- [concrete.ml.sklearn.tree.md](references/api/concrete.ml.sklearn.tree.md)
- [concrete.ml.sklearn.tree_to_numpy.md](references/api/concrete.ml.sklearn.tree_to_numpy.md)
- [concrete.ml.sklearn.xgb.md](references/api/concrete.ml.sklearn.xgb.md)
- [concrete.ml.torch.compile.md](references/api/concrete.ml.torch.compile.md)
- [concrete.ml.torch.hybrid_model.md](references/api/concrete.ml.torch.hybrid_model.md)
- [concrete.ml.torch.md](references/api/concrete.ml.torch.md)
- [concrete.ml.torch.numpy_module.md](references/api/concrete.ml.torch.numpy_module.md)
- [concrete.ml.version.md](references/api/concrete.ml.version.md)

<!-- auto-created, do not edit, end -->
- [Pandas support](references/pandas.md)

## Explanations

116 changes: 116 additions & 0 deletions docs/built-in-models/encrypted_dataframe.md
@@ -0,0 +1,116 @@
# Working with Encrypted DataFrames

@yuxizama

yuxizama Apr 10, 2024

Contributor

Working with encrypted DataFrames


Concrete ML builds upon the pandas data-frame functionality by introducing the capability to construct and perform operations on encrypted data-frames using FHE. This API ensures data scientists can leverage well-known pandas-like operations while maintaining privacy throughout the whole process.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Concrete ML extends Pandas DataFrames to enable the construction and execution of operations on encrypted data using Fully Homomorphic Encryption (FHE). This API allows data scientists to leverage familiar pandas-like operations while ensuring privacy at every stage.


Encrypted data-frames are a storage format for encrypted tabular data and they can be exchanged with third-parties without security risks.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Encrypted DataFrame is a storage format to securely store encrypted tabular data, allowing safe exchanges with third parties without security risks.


Potential applications include:

- Encrypted storage of tabular datasets
- Joint data analysis efforts between multiple parties
- Data preparation steps before machine learning tasks, such as inference or training
- Secure outsourcing of data analysis to untrusted third parties

## Encrypt and Decrypt a DataFrame

@yuxizama

yuxizama Apr 10, 2024

Contributor

Encrypt and decrypt a DataFrame


To encrypt a pandas `DataFrame`, you must construct a `ClientEngine` which manages keys. Then call the `encrypt_from_pandas` function:

```python
from concrete.ml.pandas import ClientEngine
from io import StringIO
import pandas

data_left = """index,total_bill,tip,sex,smoker
1,12.54,2.5,Male,No
2,11.17,1.5,Female,No
3,20.29,2.75,Female,No
"""

# Load your pandas DataFrame
df = pandas.read_csv(StringIO(data_left))

# Obtain client object
client = ClientEngine(keys_path="my_keys")

# Encrypt the DataFrame
df_encrypted = client.encrypt_from_pandas(df)

# Decrypt the DataFrame to produce a pandas DataFrame
df_decrypted = client.decrypt_to_pandas(df_encrypted)
```

## Supported Data Types and Schema Definition

Concrete ML's encrypted `DataFrame` operations support a specific set of data types:

- **Integer**: Integers are supported within a specific range determined by the encryption scheme's quantization parameters. The default range is 1 to 15, with 0 reserved for `NaN`. Values outside this range raise a `ValueError` during the pre-processing stage.

@yuxizama

yuxizama Apr 10, 2024

Contributor

....0 represents NaN. Values outside this range will raise a ValueError during the pre-processing stage.

- **Quantized Float**: Floating-point numbers are quantized to integers within the supported range. This is achieved by computing a scale and zero point for each column, which are used to map the floating-point numbers to the quantized integer space.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Second sentence:
To achieve this, Concrete ML computes a scale and a zero point for each column to map the floating-point numbers to the quantized integer space.

- **String Enum**: String columns are mapped to integers starting from 1. This mapping is stored and later used for de-quantization. If the number of unique strings exceeds 15, a `ValueError` is raised.
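
The float quantization described above can be illustrated with a small affine quantizer. This is only a sketch under stated assumptions: the helper names `fit_quantizer`, `quantize`, and `dequantize` are hypothetical, the column minimum and maximum are assumed to map onto 1 and 15, and Concrete ML's actual quantizer may choose its range or rounding differently.

```python
def fit_quantizer(values, q_min=1, q_max=15):
    """Compute a scale and zero point mapping [min(values), max(values)] onto [q_min, q_max]."""
    lo, hi = min(values), max(values)
    # Guard against a constant column, where hi == lo would give a zero scale.
    scale = (hi - lo) / (q_max - q_min) if hi > lo else 1.0
    zero_point = q_min - lo / scale
    return scale, zero_point


def quantize(values, scale, zero_point, q_min=1, q_max=15):
    """Map floats to integers in [q_min, q_max], clipping anything that falls outside."""
    return [min(q_max, max(q_min, round(v / scale + zero_point))) for v in values]


def dequantize(q_values, scale, zero_point):
    """Approximate inverse of quantize; the rounding error is at most scale / 2."""
    return [(q - zero_point) * scale for q in q_values]


# The total_bill column from the example DataFrame above.
total_bill = [12.54, 11.17, 20.29]
scale, zero_point = fit_quantizer(total_bill)
q = quantize(total_bill, scale, zero_point)
approx = dequantize(q, scale, zero_point)
print(q, approx)
```

With only 15 levels available, the recovered values are approximations: each differs from the original by up to half a quantization step.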

## Supported Operations on Encrypted Data-frames

@yuxizama

yuxizama Apr 10, 2024

Contributor

Supported operations on encrypted DataFrames


> **Outsourced execution**: The merge operation on Encrypted DataFrames can be **securely** performed on a third-party server. This means that the server can execute the merge without ever having access to the unencrypted data. The server only requires the encrypted DataFrames.
Encrypted DataFrames support a subset of operations that are available for pandas DataFrames. The following operations are currently supported:

@yuxizama

yuxizama Apr 10, 2024

Contributor

Concrete ML Encrypted DataFrames support a subset of operations of pandas DataFrames, including:


- `merge`: left or right join two data-frames

@yuxizama

yuxizama Apr 10, 2024

Contributor

DataFrames


<!--pytest-codeblocks:cont-->

```python
df_right = """index,day,time,size
2,Thur,Lunch,2
5,Sat,Dinner,3
9,Sun,Dinner,2"""

# Encrypt the DataFrame
df_encrypted2 = client.encrypt_from_pandas(pandas.read_csv(StringIO(df_right)))

df_encrypted_merged = df_encrypted.merge(df_encrypted2, how="left", on="index")
```
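
For reference, the left-join semantics of the merge above can be reproduced on the plaintext data. The sketch below uses only the standard library so that it runs anywhere; it is not how Concrete ML implements the encrypted merge, and the variable names are illustrative. On plaintext pandas DataFrames, `pandas.merge(left, right, how="left", on="index")` expresses the same join.

```python
import csv
from io import StringIO

left_csv = """index,total_bill,tip,sex,smoker
1,12.54,2.5,Male,No
2,11.17,1.5,Female,No
3,20.29,2.75,Female,No
"""
right_csv = """index,day,time,size
2,Thur,Lunch,2
5,Sat,Dinner,3
9,Sun,Dinner,2"""

left_rows = list(csv.DictReader(StringIO(left_csv)))
right_rows = list(csv.DictReader(StringIO(right_csv)))

# Index the right table by its key; this relies on the key values being unique.
right_by_key = {row["index"]: row for row in right_rows}
right_columns = [c for c in right_rows[0] if c != "index"]

merged = []
for row in left_rows:
    match = right_by_key.get(row["index"])
    out = dict(row)
    for col in right_columns:
        # Left rows without a match keep their own data and get a missing value.
        out[col] = match[col] if match else None
    merged.append(out)

for row in merged:
    print(row)
```

Every row of the left table survives the join. Only the row with `index` 2 picks up values from the right table, and the right-hand rows with `index` 5 and 9 are dropped.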

## Serialization of Encrypted Data-frames

@yuxizama

yuxizama Apr 10, 2024

Contributor

Serialization of encrypted Data-frames


Encrypted `DataFrame` objects can be serialized to a file format for storage or transfer. When serialized, they contain the encrypted data and [evaluation keys](../getting-started/concepts.md#cryptography-concepts) necessary to perform computations.

@yuxizama

yuxizama Apr 10, 2024

Contributor

You can serialize encrypted 'DataFrame' objects to a file format for storage or transfer. When DataFrames are serialized, they contain the encrypted data and evaluation keys necessary to perform computations.


> **Security**: Serialized data-frames do not contain any secret keys. The data-frames can be exchanged with any third-party without any risk.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Serialized DataFrames do not contain any secret keys. You can exchange the encrypted DataFrames with any third party without any risk.

### Saving and loading Data-frames

To save or load an encrypted `DataFrame` from a file, use the following commands:

<!--pytest-codeblocks:cont-->

```python
from concrete.ml.pandas import load_encrypted_dataframe

# Save
df_encrypted_merged.save("df_encrypted_merged")

# Load
df_encrypted_merged = load_encrypted_dataframe("df_encrypted_merged")

# Decrypt the DataFrame
df_decrypted = client.decrypt_to_pandas(df_encrypted_merged)
```

## Error Handling

@yuxizama

yuxizama Apr 10, 2024

Contributor

Error handling


The library is designed to raise specific errors when encountering issues during the pre-processing and post-processing stages:

@yuxizama

yuxizama Apr 10, 2024

Contributor

The library will raise specific errors when encountering issues during the pre-processing and post-processing stages:


- `ValueError`: Raised when a column contains values outside the allowed range for integers, when there are too many unique strings, or when encountering an unsupported data type. Raised also when an operation is attempted on a data type that is not supported by the operation.

@yuxizama

yuxizama Apr 10, 2024

Contributor

ValueError is raised in any of the following cases:

  • A column contains values outside the allowed range of integers.
  • There are too many unique strings.
  • Unsupported data types are used.
  • An operation is attempted on a data type that it does not support.

@yuxizama

yuxizama Apr 10, 2024

Contributor

"too many unique strings" can you give a clear limit?

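On the limit for unique strings: the data types section of the new page states that string columns are mapped to integers starting from 1 and that more than 15 unique strings raise a `ValueError`. A minimal sketch of such a mapping follows. The function name `build_string_mapping`, the constant `MAX_CATEGORIES`, and the error message are hypothetical and not taken from Concrete ML.

```python
MAX_CATEGORIES = 15  # assumed from the documented 1 to 15 integer range


def build_string_mapping(column):
    """Map each distinct string to an integer starting from 1, in order of first appearance."""
    mapping = {}
    for value in column:
        if value not in mapping:
            mapping[value] = len(mapping) + 1
    if len(mapping) > MAX_CATEGORIES:
        raise ValueError(
            f"{len(mapping)} unique strings found, at most {MAX_CATEGORIES} are supported"
        )
    return mapping


# The sex column from the example DataFrame.
sex = ["Male", "Female", "Female"]
mapping = build_string_mapping(sex)
encoded = [mapping[s] for s in sex]

# The stored mapping is inverted to recover the strings after decryption.
inverse = {code: s for s, code in mapping.items()}
decoded = [inverse[c] for c in encoded]
print(mapping, encoded, decoded)
```

Because the mapping has to be kept for de-quantization, it travels with the DataFrame, which is why the limitations section lists it as unencrypted metadata.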

## Example Workflow

@yuxizama

yuxizama Apr 10, 2024

Contributor

Example workflow


An example workflow where two clients encrypt two `DataFrame` objects, perform a merge operation on the server side, and then decrypt the results is available in the notebook [encrypted_pandas.ipynb](../advanced_examples/EncryptedPandas.ipynb).

@yuxizama

yuxizama Apr 10, 2024

Contributor

Here is an example workflow in the notebook encrypted_pandas.ipynb showing two clients encrypt two DataFrame objects, perform a merge operation on the server side, and then decrypt the results.


## Current Limitations

@yuxizama

yuxizama Apr 10, 2024

Contributor

Current limitations


While this API offers a new secure way to work on remotely stored and encrypted data, it has some strong limitations at the moment:

@yuxizama

yuxizama Apr 10, 2024

Contributor

While this API offers a new secure way to work on remotely stored and encrypted data, the API has some strong limitations at the moment:


- **Precision of Values**: The precision for numerical values is limited to 4 bits.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Precision of values

- **Supported Operations**: The `merge` operation is the only one available.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Supported operation: Only the 'merge' operation is available

- **Index Handling**: Index values are not preserved; users should move any relevant data from the index to a dedicated new column before encrypting.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Index handling: "users" -> "you"

- **Integer Range**: The range of integers that can be encrypted is between 1 and 15.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Integer range

- **Uniqueness for `merge`**: The `merge` operation requires that the columns to merge on contain unique values. Currently this means that data-frames are limited to 15 rows.

@yuxizama

yuxizama Apr 10, 2024

Contributor

DataFrames

- **Metadata Security**: Column names and the mapping of strings to integers are not encrypted and are sent to the server in clear text.

@yuxizama

yuxizama Apr 10, 2024

Contributor

Metadata security
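
The 15-row limit follows from two of the limitations listed above: merge keys must be unique, and integers must lie between 1 and 15. A pre-encryption check along these lines makes the reasoning concrete. `check_merge_key` is a hypothetical helper written for illustration, not a Concrete ML function, and the library's own checks and error messages may differ.

```python
def check_merge_key(values, low=1, high=15):
    """Raise ValueError if a plaintext integer key column violates the documented limits."""
    out_of_range = [v for v in values if not (low <= v <= high)]
    if out_of_range:
        raise ValueError(f"key values outside [{low}, {high}]: {out_of_range}")
    if len(set(values)) != len(values):
        raise ValueError("key values must be unique")
    # Unique integers drawn from [low, high] number at most high - low + 1,
    # so a column that passes both checks has at most 15 rows by default.
    return True


print(check_merge_key([1, 2, 3]))
```

Running such a check on the plaintext key column before encryption surfaces problems early, when the offending values can still be inspected.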

6 changes: 5 additions & 1 deletion docs/getting-started/README.md
@@ -2,7 +2,11 @@

<figure><img src="../.gitbook/assets/doc_header_CML.png" alt=""><figcaption></figcaption></figure>

Concrete ML is an open source, privacy-preserving, machine learning framework based on Fully Homomorphic Encryption (FHE). It enables data scientists without any prior knowledge of cryptography to automatically turn machine learning models into their FHE equivalent, using familiar APIs from scikit-learn and PyTorch (see how it looks for [linear models](../built-in-models/linear.md), [tree-based models](../built-in-models/tree.md), and [neural networks](../built-in-models/neural-networks.md)). Concrete ML supports converting models for inference with FHE but can also [train some models](../built-in-models/training.md) on encrypted data.
Concrete ML is an open source, privacy-preserving, machine learning framework based on Fully Homomorphic Encryption (FHE). It enables data scientists without any prior knowledge of cryptography to:

@yuxizama

yuxizama Apr 10, 2024

Contributor

It enables data scientists to do the following things without any prior knowledge of cryptography:


- automatically turn machine learning models into their FHE equivalent, using familiar APIs from scikit-learn and PyTorch (see how this works for [linear models](../built-in-models/linear.md), [tree-based models](../built-in-models/tree.md), and [neural networks](../built-in-models/neural-networks.md)).

@yuxizama

yuxizama Apr 10, 2024

Contributor

Capitalised the initials of each list item.

- [train models](../built-in-models/training.md) on encrypted data.
- [pre-process encrypted data](../built-in-models/encrypted_dataframe.md) through a data-frame paradigm

@yuxizama

yuxizama Apr 10, 2024

Contributor

DataFrame


Fully Homomorphic Encryption is an encryption technique that allows computing directly on encrypted data, without needing to decrypt it. With FHE, you can build private-by-design applications without compromising on features. You can learn more about FHE in [this introduction](https://www.zama.ai/post/tfhe-deep-dive-part-1) or by joining the [FHE.org](https://fhe.org) community.

14 changes: 7 additions & 7 deletions docs/getting-started/cloud.md
@@ -1,8 +1,8 @@
# Inference in the cloud
# Working in the cloud

Concrete ML models can be easily deployed in a client/server setting, enabling the creation of privacy-preserving services in the cloud.
Concrete ML models and data-frames can be easily deployed in a client/server setting, enabling the creation of privacy-preserving services in the cloud.

@yuxizama

yuxizama Apr 10, 2024

Contributor

DataFrames


As seen in the [concepts section](concepts.md), once compiled to FHE, a Concrete ML model generates machine code that performs the inference on private data. _Secret_ encryption keys are needed so that the user can securely encrypt their data and decrypt the inference result. An _evaluation_ key is also needed for the server to securely process the user's encrypted data.
As seen in the [concepts section](concepts.md), once compiled to FHE, a Concrete ML model or data-frame generates machine code that executes prediction, training, or pre-processing on encrypted data. _Secret_ encryption keys are needed so that the user can securely encrypt their data and decrypt the execution result. An _evaluation_ key is also needed for the server to securely process the user's encrypted data.

@yuxizama

yuxizama Apr 10, 2024

Contributor

DataFrames


Keys are generated by the user _once_ for each service they use, based on the model the service provides and its cryptographic parameters.

@@ -12,10 +12,10 @@ The overall communications protocol to enable cloud deployment of machine learni

The steps detailed above are:

1. The model developer deploys the compiled machine learning model to the server. This model includes the cryptographic parameters. The server is now ready to provide private inference.
1. The model developer deploys the compiled machine learning model to the server. This model includes the cryptographic parameters. The server is now ready to provide private inference. Cryptographic parameters and compiled programs for data-frames are included directly in Concrete ML.

@yuxizama

yuxizama Apr 10, 2024

Contributor

DataFrames

1. The client requests the cryptographic parameters (also called "client specs"). Once it receives them from the server, the _secret_ and _evaluation_ keys are generated.
1. The client sends the _evaluation_ key to the server. The server is now ready to accept requests from this client. The client sends their encrypted data.
1. The server uses the _evaluation_ key to securely run inference on the user's data and sends back the encrypted result.
1. The client sends the _evaluation_ key to the server. The server is now ready to accept requests from this client. The client sends their encrypted data. Serialized data-frames include client evaluation keys.

@yuxizama

yuxizama Apr 10, 2024

Contributor

DataFrames

1. The server uses the _evaluation_ key to securely run prediction, training and pre-processing on the user's data and sends back the encrypted result.
1. The client now decrypts the result and can send back new requests.

For more information on how to implement this basic secure inference protocol, refer to the [Production Deployment section](../guides/client_server.md) and to the [client/server example](../advanced_examples/ClientServer.ipynb).
For more information on how to implement this basic secure inference protocol, refer to the [Production Deployment section](../guides/client_server.md) and to the [client/server example](../advanced_examples/ClientServer.ipynb). For information on training on encrypted data, see [the corresponding section](../built-in-models/training.md).