Add a blog about how to use HF endpoints to run Concrete-ML privacy-preserving ML models.
bcm-at-zama committed Feb 29, 2024
1 parent c939376 commit 8af2edb
Showing 8 changed files with 221 additions and 23 deletions.
15 changes: 13 additions & 2 deletions _blog.yml
@@ -3557,7 +3557,7 @@

- local: arena-tts
title: "TTS Arena: Benchmarking Text-to-Speech Models in the Wild"
thumbnail: /blog/assets/arenas-on-the-hub/thumbnail.png
author: mrfakename
guest: true
date: Feb 27, 2024
@@ -3575,4 +3575,15 @@
- nlp
- community
- research
- LLM

- local: concrete-ml-inference-on-endpoints-fhe
title: "Running Privacy-Preserving Inferences on Hugging Face EndPoints"
author: binoua
thumbnail: /blog/assets/concrete-ml-inference-on-endpoints-fhe/thumbnail.png
date: March 1, 2024
tags:
- guide
- privacy
- research
- FHE
(5 of the changed files are binary images, the blog thumbnail and screenshots under `assets/concrete-ml-inference-on-endpoints-fhe/`, and cannot be displayed in the diff view.)
199 changes: 199 additions & 0 deletions concrete-ml-inference-on-endpoints-fhe.md
@@ -0,0 +1,199 @@
title: "Running Privacy-Preserving Inferences on Hugging Face EndPoints"
thumbnail: /blog/assets/concrete-ml-inference-on-endpoints-fhe/thumbnail.png
authors:
- user: Benoit Chevallier-Mames
guest: true
---

# Running Privacy-Preserving Inferences on Hugging Face Endpoints

<!-- {blog_metadata} -->
<!-- {authors} -->

Eighteen months ago, Zama started [Concrete ML](https://github.com/zama-ai/concrete-ml), our privacy-preserving ML framework, with bindings to traditional ML frameworks such as scikit-learn, ONNX, PyTorch, and TensorFlow. To ensure privacy for users' data, we use Fully Homomorphic Encryption (FHE), a cryptographic tool that allows computations to be performed directly over encrypted data, without ever knowing the private key.

From the start, we wanted to pre-compile some FHE-friendly networks and make them available somewhere on the internet, so that users could use them trivially. We are ready today! And not in a random place on the internet, but directly on [Hugging Face](https://huggingface.co), the place to be for anything related to [open-source machine learning](https://github.com/huggingface)!

More precisely, we use Hugging Face [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/en/index) and [custom inference handlers](https://huggingface.co/docs/inference-endpoints/en/guides/custom_handler) to store our Concrete ML models and let users deploy them on HF machines in one click. By the end of this blog, readers should understand both how to use pre-compiled models and how to prepare their own. This blog can also be considered another tutorial for custom inference handlers.


## Deploying a pre-compiled model

Let's start with how to deploy an FHE-friendly model (prepared by Zama or by third parties; see the "Preparing your own pre-compiled model" section below to learn how to prepare yours).

First, look for the model you want to deploy: we have pre-compiled a [bunch of models](https://huggingface.co/zama-fhe?#models) on Zama's HF page. Let's suppose you have chosen [concrete-ml-encrypted-decisiontree](https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree): as explained in the description, this pre-compiled model allows you to detect spam without ever seeing the message content in the clear.

Like any other model available on the Hugging Face platform, select _Deploy_ and then _Inference Endpoint (dedicated)_:

![Alt text](assets/concrete-ml-inference-on-endpoints-fhe/one.png "Inference Endpoint (dedicated)")

Next, choose the endpoint name and the region, and most importantly the CPU (Concrete ML models do not use GPUs for now; we are [working](https://www.zama.ai/post/tfhe-rs-v0-5) on it) as well as the best machine available: in the example below, we chose an 8 vCPU machine. Now click on _Create Endpoint_ and wait for the initialization to finish.

![Alt text](assets/concrete-ml-inference-on-endpoints-fhe/two.png "Create Endpoint")

After just a few seconds, the endpoint is deployed and your privacy-preserving model is ready to operate.

![Alt text](assets/concrete-ml-inference-on-endpoints-fhe/three.png "Endpoint is created")

Note: Don’t forget to delete the endpoint (or at least pause it) when you are no longer using it, or else it will cost more than anticipated.
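If you prefer to manage this from code rather than from the UI, recent versions of `huggingface_hub` expose helpers for Inference Endpoints. Here is a small sketch (the endpoint name is a placeholder, and your `HF_TOKEN` must be available to the library):

```python
from huggingface_hub import get_inference_endpoint

# "concrete-ml-encrypted-decisiontree" is a placeholder: use the name you chose at creation time
endpoint = get_inference_endpoint("concrete-ml-encrypted-decisiontree")

endpoint.pause()     # stop paying for compute while keeping the configuration
# endpoint.resume()  # bring it back later
# endpoint.delete()  # remove the endpoint entirely
```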

## Using the endpoint

### Installing the client side

Now, obviously, the goal is not only to deploy your endpoint but also to let users play with it. For that, they need to clone the repository to their computer. This is done by selecting _Clone Repository_ in the dropdown menu:

![Alt text](assets/concrete-ml-inference-on-endpoints-fhe/four.png "Clone Repository")

They will be given a small command that they can run in their terminal:

```bash
git clone https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree
```

Once they have run the command in their terminal, they can go to the `concrete-ml-encrypted-decisiontree` directory and open `play_with_endpoint.py` in their editor. There, they will find the line with `API_URL = …` and should replace it with the URL of the endpoint created in the previous section. In our case, it would be:

```python
API_URL = "https://tcipez38maclnbm6.eu-west-1.aws.endpoints.huggingface.cloud"
```
In your case, fill it in with _your_ endpoint's URL. Also, define an [access token](https://huggingface.co/docs/hub/en/security-tokens) and store it in an environment variable:
```bash
export HF_TOKEN=[your token hf_XX..XX]
```
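This token is what `play_with_endpoint.py` uses to authenticate its calls to the endpoint. Conceptually, the `query` helper used later in the script is just an authenticated HTTP POST, along these lines (a sketch; the actual code in the repository may differ):

```python
import json
import os

import requests

API_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # replace with your endpoint URL
HEADERS = {
    "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
    "Content-Type": "application/json",
}


def query(payload: dict):
    # POST the JSON payload (encrypted inputs, method, uid) to the endpoint and return its answer
    response = requests.post(API_URL, headers=HEADERS, data=json.dumps(payload))
    response.raise_for_status()
    return response.json()
```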
Lastly, your machine needs to have Concrete ML installed locally: create a virtual environment, activate it, and install the necessary dependencies:
```bash
python3.9 -m venv .venv
source .venv/bin/activate
pip install -U setuptools pip wheel
pip install -r requirements.txt
```
### Running inferences
Now, you can run inferences on the endpoint by running:
```bash
python play_with_endpoint.py
```
It should generate some logs similar to the following:
```bash
Sending 0-th piece of the key (remaining size is 71984.14 kbytes)
Storing the key in the database under uid=3307376977
Sending 1-th piece of the key (remaining size is 0.02 kbytes)
Size of the payload: 0.23 kilobytes
for 0-th input, prediction=0 with expected 0 in 3.242 seconds
for 1-th input, prediction=0 with expected 0 in 3.612 seconds
for 2-th input, prediction=0 with expected 0 in 4.765 seconds
(...)
for 688-th input, prediction=0 with expected 1 in 3.176 seconds
for 689-th input, prediction=1 with expected 1 in 4.027 seconds
for 690-th input, prediction=0 with expected 0 in 4.329 seconds
Accuracy on 691 samples is 0.8958031837916064
Total time: 2873.860 seconds
Duration per inference: 4.123 seconds
```
### Adapting to your application or needs
If you edit `play_with_endpoint.py`, you’ll see that we iterate over different samples of the test dataset, and run encrypted inferences directly on the endpoint.
```python
for i in range(nb_samples):

    # Quantize the input and encrypt it
    encrypted_inputs = fhemodel_client.quantize_encrypt_serialize(X_test[i].reshape(1, -1))

    # Prepare the payload
    payload = {
        "inputs": "fake",
        "encrypted_inputs": to_json(encrypted_inputs),
        "method": "inference",
        "uid": uid,
    }

    if is_first:
        print(f"Size of the payload: {sys.getsizeof(payload) / 1024:.2f} kilobytes")
        is_first = False

    # Run the inference on HF servers
    duration -= time.time()
    duration_inference = -time.time()
    encrypted_prediction = query(payload)
    duration += time.time()
    duration_inference += time.time()

    encrypted_prediction = from_json(encrypted_prediction)

    # Decrypt the result and dequantize
    prediction_proba = fhemodel_client.deserialize_decrypt_dequantize(encrypted_prediction)[0]
    prediction = np.argmax(prediction_proba)

    if verbose:
        print(
            f"for {i}-th input, {prediction=} with expected {Y_test[i]} in {duration_inference:.3f} seconds"
        )

    # Measure accuracy
    nb_good += Y_test[i] == prediction
```
Of course, this is just an example of how the endpoint can be used. Developers are encouraged to adapt this example to their own use case or application.
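One detail worth noting: the `to_json` and `from_json` helpers used above simply make the encrypted byte strings JSON-friendly so they can travel in the request payload. A plausible implementation is base64 encoding (an assumption for illustration; the repository's actual helpers may differ):

```python
import base64


def to_json(python_object: bytes) -> str:
    # Encrypted inputs and keys are raw bytes: base64-encode them so they fit in a JSON payload
    return base64.b64encode(python_object).decode("utf-8")


def from_json(json_object: str) -> bytes:
    # Reverse operation, applied to the endpoint's answer
    return base64.b64decode(json_object)
```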
### Under the hood
Please note that all of this is made possible by the flexibility of [custom handlers](https://huggingface.co/docs/inference-endpoints/en/guides/custom_handler), and we are grateful to the Hugging Face developers for offering such flexibility. The mechanism is defined in `handler.py`. As explained in the Hugging Face documentation, you can define the `__call__` method of `EndpointHandler` pretty much as you want: in our case, we have defined a `method` parameter, which can be `save_key` (to save FHE evaluation keys), `append_key` (to save FHE evaluation keys piece by piece, if the key is too large to be sent in a single call), and finally `inference` (to run FHE inferences). These methods are used to set the evaluation key once and then run all the inferences, one by one, as seen in `play_with_endpoint.py`.
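To give an idea of the shape of such a handler, here is a strongly simplified sketch. The repository's actual `handler.py` differs; the in-memory key store, the `from_json`/`to_json` helpers, and the exact payload fields are illustrative assumptions:

```python
import base64
from typing import Any, Dict

from concrete.ml.deployment import FHEModelServer


def from_json(value: str) -> bytes:
    return base64.b64decode(value)


def to_json(value: bytes) -> str:
    return base64.b64encode(value).decode("utf-8")


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Load the compiled FHE circuit (server.zip) shipped with the model repository
        self.server = FHEModelServer(path)
        self.server.load()

        # Evaluation keys are kept in RAM, indexed by client uid (see "Limits" below)
        self.keys: Dict[str, bytes] = {}

    def __call__(self, data: Dict[str, Any]) -> Any:
        method = data["method"]

        if method == "save_key":
            # First (or only) piece of the FHE evaluation key
            self.keys[data["uid"]] = from_json(data["evaluation_key"])
            return {"uid": data["uid"]}

        if method == "append_key":
            # Subsequent pieces, when the key is too large for a single call
            self.keys[data["uid"]] += from_json(data["evaluation_key"])
            return {"uid": data["uid"]}

        if method == "inference":
            # Run the FHE inference on the encrypted input, using the stored evaluation key
            encrypted_result = self.server.run(
                from_json(data["encrypted_inputs"]),
                self.keys[data["uid"]],
            )
            return to_json(encrypted_result)

        raise ValueError(f"Unsupported method: {method}")
```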
### Limits
Note, however, that keys are stored in the RAM of the endpoint, which is not convenient for a production environment: at each restart, the keys are lost and need to be re-sent, and when several machines are used to handle heavy traffic, this RAM is not shared between them. Finally, the available CPU machines only provide up to 8 vCPUs for endpoints, which makes the execution time worse than if the model were deployed on AWS machines.
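One possible mitigation for the volatility, sketched below under the assumption that the endpoint has a writable (and ideally shared) volume mounted at an illustrative path, is to persist each evaluation key to disk instead of keeping it only in a Python dict:

```python
from pathlib import Path

# Hypothetical storage location: in a real deployment this should be a shared, persistent volume
KEY_DIR = Path("/data/fhe_keys")
KEY_DIR.mkdir(parents=True, exist_ok=True)


def save_key(uid: str, key_bytes: bytes) -> None:
    # Persist the FHE evaluation key so it survives endpoint restarts
    (KEY_DIR / f"{uid}.key").write_bytes(key_bytes)


def load_key(uid: str) -> bytes:
    return (KEY_DIR / f"{uid}.key").read_bytes()
```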
## Preparing your own pre-compiled model
Now that you know how easy it is to deploy a pre-compiled model, you may want to prepare your own. For this, you can fork [one of the repositories we have prepared](https://huggingface.co/zama-fhe?#models). All the model categories supported by Concrete ML ([linear](https://docs.zama.ai/concrete-ml/built-in-models/linear) models, [tree-based](https://docs.zama.ai/concrete-ml/built-in-models/tree) models, built-in [MLPs](https://docs.zama.ai/concrete-ml/built-in-models/neural-networks), and [torch](https://docs.zama.ai/concrete-ml/deep-learning/torch_support) models) have at least one example that can be used as a template for new pre-compiled models.
Then, edit `creating_models.py`, and change the ML task to be the one you want to tackle in your pre-compiled model: for example, if you started with [concrete-ml-encrypted-decisiontree](https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree), change the dataset and the model kind.
As explained earlier, you need to have Concrete ML installed to prepare your own pre-compiled model.
Now you can launch `python creating_models.py`. This will train a model and create the necessary development files (`client.zip`, `server.zip` and `versions.json`) in the `compiled_model` directory. As explained in the [documentation](https://docs.zama.ai/concrete-ml/deployment/client_server), these files contain your pre-compiled model. If you have any issue, you can get support on the [fhe.org discord](http://discord.fhe.org).
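To give an idea, the core of such a script boils down to training a Concrete ML model, compiling it to an FHE circuit, and saving the deployment files. Here is a minimal sketch; the dataset and model choice are placeholders, and the repository's `creating_models.py` contains the real version:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from concrete.ml.deployment import FHEModelDev
from concrete.ml.sklearn import DecisionTreeClassifier

# Placeholder dataset: replace it with the task your pre-compiled model should solve
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Train the FHE-friendly model and compile it to an FHE circuit
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
model.compile(X_train)

# Save the deployment files (client.zip, server.zip, ...) into the compiled_model directory
FHEModelDev("compiled_model", model).save()
```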
The last step is to modify `play_with_endpoint.py` so that it deals with the same ML task as `creating_models.py`: set the dataset accordingly.
Now, you can save this directory (with the `compiled_model` directory and files, and your modifications to `creating_models.py` and `play_with_endpoint.py`) as a model repository on Hugging Face. Certainly, you will need to run some tests and make slight adjustments for it to work.
## Available pre-compiled models today
For now, we have prepared a few pre-compiled models as examples, hoping the community will extend this soon.
| Model kind | Dataset | Execution time on HF endpoint |
|---|---|---|
| [Logistic Regression](https://huggingface.co/zama-fhe/concrete-ml-encrypted-logreg) | Synthetic | 0.4 sec |
| [DecisionTree](https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree) | Spam | 2.0 sec |
| [QNN](https://huggingface.co/zama-fhe/concrete-ml-encrypted-qnn) | Iris | 3.7 sec |
| [CNN](https://huggingface.co/zama-fhe/concrete-ml-encrypted-deeplearning) | MNIST | 24 sec |
Keep in mind that the CPU machines available as HF endpoints today are not as powerful as AWS machines (generally m6i or hpc7a instances), so the execution time of complex models is expected to be slower. Hopefully, more powerful machines will soon be available on Hugging Face endpoints to improve these timings.
## Conclusion and next steps
In this blog post, we have shown that custom endpoints are easy yet powerful to use. What we do in Concrete ML is quite different from the regular workflow of ML practitioners, but we were still able to accommodate custom endpoints to deal with most of our needs. Kudos to the Hugging Face engineers for developing such a generic solution.
We explained how:
- Developers can create their own pre-compiled models and make them available on Hugging Face models.
- Companies can deploy developers' pre-compiled models and make them available to their users via HF endpoints.
- Users can use these endpoints to run their ML tasks over encrypted data.
To go further, it would be useful to have more powerful machines available on Hugging Face endpoints to make inferences faster. Also, we could imagine Concrete ML becoming more integrated into Hugging Face's interface, with a _Privacy-Preserving Inference Endpoint_ button that would simplify developers' lives even more. Finally, for integration into more server machines, it could be useful to have a way to share state between machines and keep this state non-volatile (FHE evaluation keys would be stored there).
Zama's libraries [Concrete](https://github.com/zama-ai/concrete) and [Concrete ML](https://github.com/zama-ai/concrete-ml) (don't forget to star the repos on GitHub ⭐️💛) make it straightforward to build ML models and convert them to their FHE equivalents, so that you can compute and predict over encrypted data.
30 changes: 9 additions & 21 deletions zh/starcoder2.md
@@ -6,47 +6,35 @@ authors:
- user: loubnabnl
- user: anton-l
- user: nouamanetazi
translators:
translator:
- user: AdinaY
---

# StarCoder2 and The Stack v2 Officially Released

<div class="flex items-center justify-center">
<img src="https://huggingface.co/datasets/bigcode/admin/resolve/main/sc2-banner.png" alt="StarCoder2">
</div>

BigCode has officially released StarCoder2, a new generation of open-source large language models (LLMs) for code. All of these models were trained on [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2/), a brand-new, large-scale, high-quality code dataset. We are releasing not only all the models and datasets, but also detailed information about the data processing and training code; for details, see the [accompanying paper](https://drive.google.com/file/d/17iGn3c-sYNiLyRSY-A85QOzgzGnGiVI3/view?usp=sharing).

## What is StarCoder2?

StarCoder2 is a family of open, code-oriented large language models, available in three sizes: 3 billion (3B), 7 billion (7B), and 15 billion (15B) parameters. In particular, StarCoder2-15B was trained on over 4 trillion tokens and more than 600 programming languages, based on The Stack v2 dataset. All models use Grouped Query Attention, a context window of 16,384 tokens with sliding-window attention of 4,096 tokens, and were trained with the Fill-in-the-Middle technique.

StarCoder2 comprises models at three scales: a 3-billion-parameter model trained by ServiceNow, a 7-billion-parameter model trained by Hugging Face, and a 15-billion-parameter model trained by NVIDIA with NVIDIA NeMo on NVIDIA accelerated infrastructure:

- [StarCoder2-3B](https://huggingface.co/bigcode/starcoder2-3b) was trained on 17 programming languages from The Stack v2, on over 3 trillion tokens.
- [StarCoder2-7B](https://huggingface.co/bigcode/starcoder2-7b) was trained on 17 programming languages from The Stack v2, on over 3.5 trillion tokens.
- [StarCoder2-15B](https://huggingface.co/bigcode/starcoder2-15b) was trained on more than 600 programming languages from The Stack v2, on over 4 trillion tokens.

StarCoder2-15B is the best model in its size class and matches models with 33B+ parameters on many evaluations. StarCoder2-3B matches the performance of StarCoder1-15B.

## What is The Stack v2?

<div class="flex items-center justify-center">
<img src="https://huggingface.co/datasets/bigcode/admin/resolve/main/stackv2-banner.png" alt="The Stack v2">
</div>

The Stack v2 is the largest open code dataset to date and is well suited for pretraining large language models. Compared with The Stack v1, The Stack v2 is larger in scale, uses more advanced language and license detection pipelines and better filtering, and its training dataset is grouped by repository, which allows models to be trained with repository context.

| Dataset comparison | [The Stack v1](https://huggingface.co/datasets/bigcode/the-stack/) | [The Stack v2](https://huggingface.co/datasets/bigcode/the-stack-v2/) |
|--------|------|------|
| Total data volume | 6.4TB | 67.5TB |
| Deduplicated data volume | 2.9TB | 32.1TB |
| Training dataset size | ~200 billion tokens | ~900 billion tokens |

This dataset comes from the Software Heritage archive, a public archive containing a rich collection of software source code and its development history. Software Heritage, an open, non-profit initiative launched by Inria in collaboration with UNESCO, aims to collect, preserve, and share all publicly available software source code. We are grateful to Software Heritage for providing this invaluable resource.

@@ -73,6 +61,6 @@ BigCode is an open research collaboration jointly led by Hugging Face and ServiceNow
- [StarCoder2 membership test](https://stack-v2.dataportraits.org): quickly check whether code is included in the pretraining dataset.

### Other resources
- [VSCode extension](https://marketplace.visualstudio.com/items?itemName=HuggingFace.huggingface-vscode): a plugin for coding with StarCoder
- [Big Code Models Leaderboard](https://huggingface.co/spaces/bigcode/bigcode-models-leaderboard): compare the performance of different models.
All resources and links are available at [huggingface.co/bigcode](https://huggingface.co/bigcode)!
