From 38816ff3987b7e6f5aee2c0a291461213a656291 Mon Sep 17 00:00:00 2001
From: Robin Straub
Date: Tue, 21 Mar 2023 14:28:40 +0100
Subject: [PATCH] docs: add tutorial on svm using linearsvc

---
 concrete-ml/svm.ipynb | 450 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 450 insertions(+)
 create mode 100644 concrete-ml/svm.ipynb

diff --git a/concrete-ml/svm.ipynb b/concrete-ml/svm.ipynb
new file mode 100644
index 0000000..ac33a97
--- /dev/null
+++ b/concrete-ml/svm.ipynb
@@ -0,0 +1,450 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "d07c3896",
+   "metadata": {},
+   "source": [
+    "# Support Vector Machine (SVM) classification using Concrete-ML\n",
+    "\n",
+    "In this tutorial, we will show how to create, train, and evaluate a Support Vector Machine (SVM) model using the Concrete-ML library, an open-source privacy-preserving machine learning framework based on fully homomorphic encryption (FHE).\n",
+    "\n",
+    "This tutorial is split into two parts:\n",
+    "1. A quick setup of a LinearSVC model with Concrete-ML\n",
+    "2. A more in-depth approach, taking a closer look at the Concrete-ML specifics\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "404fb028",
+   "metadata": {},
+   "source": [
+    "## Introduction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d3654d52",
+   "metadata": {},
+   "source": [
+    "### Concrete-ML and useful links\n",
+    "\n",
+    "> Concrete-ML is an open-source, privacy-preserving, machine learning inference framework based on fully homomorphic encryption (FHE). It enables data scientists without any prior knowledge of cryptography to automatically turn machine learning models into their FHE equivalent, using familiar APIs from Scikit-learn and PyTorch.\n",
+    "> \n",
+    "> — [Zama documentation](https://docs.zama.ai/concrete-ml/)\n",
+    "\n",
+    "This tutorial does not require extensive knowledge of Concrete-ML. Newcomers might nonetheless be interested in reading some of the introductory sections of the official documentation, such as:\n",
+    "\n",
+    "- [What is Concrete-ML](https://docs.zama.ai/concrete-ml/)\n",
+    "- [Key Concepts](https://docs.zama.ai/concrete-ml/getting-started/concepts)\n",
+    "\n",
+    "### Support Vector Machine\n",
+    "\n",
+    "SVM is a machine learning algorithm for classification and regression. LinearSVC is an efficient implementation of SVM that works best when the data is linearly separable. In this tutorial, we will use the [iris dataset](https://scikit-learn.org/stable/auto_examples/datasets/plot_iris_dataset.html), a common example for demonstrating how to work with SVMs.\n",
+    "\n",
+    "Concrete-ML exposes a LinearSVC class which implements the [scikit-learn LinearSVC](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) interface, so you should feel right at home.\n",
+    "\n",
+    "### Setup code\n",
+    "\n",
+    "Just as in any machine learning project, let's start by importing some libraries and setting up the dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "ea5f1461",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# display visualizations and plots in the notebook itself\n",
+    "%matplotlib inline\n",
+    "\n",
+    "# import numpy and matplotlib\n",
+    "import numpy as np\n",
+    "import matplotlib.pyplot as plt\n",
+    "\n",
+    "# import the iris dataset from sklearn, as well as some utilities and the LinearSVC for reference\n",
+    "from sklearn import datasets\n",
+    "from sklearn.model_selection import train_test_split\n",
+    "from sklearn.svm import LinearSVC as SklearnLinearSVC\n",
+    "from sklearn.metrics import accuracy_score\n",
+    "\n",
+    "# import the concrete-ml LinearSVC implementation\n",
+    "from concrete.ml.sklearn.svm import LinearSVC as ConcreteLinearSVC\n",
+    "\n",
+    "# Load the iris dataset\n",
+    "iris = datasets.load_iris()\n",
+    "\n",
+    "# Split the dataset into a training and a testing set\n",
+    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=42)"
+   ]
+  },
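+  {
+   "cell_type": "markdown",
+   "id": "3f9c2b1a",
+   "metadata": {},
+   "source": [
+    "Before training anything, a quick sanity check on the split gives us an idea of what we are working with."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "7d41e0c5",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# quick sanity check: dataset dimensions and class names\n",
+    "print(f\"Training samples: {X_train.shape[0]}, test samples: {X_test.shape[0]}\")\n",
+    "print(f\"Features per sample: {X_train.shape[1]}\")\n",
+    "print(f\"Classes: {list(iris.target_names)}\")"
+   ]
+  },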
+  {
+   "cell_type": "markdown",
+   "id": "12e827d0",
+   "metadata": {},
+   "source": [
+    "## Part 1: Train a simple model with Concrete-ML\n",
+    "\n",
+    "Let's start by quickly scaffolding some Concrete-ML LinearSVC code, to see how easy and familiar it is.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "88e9b7fc",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Scikit-learn Accuracy: 0.9833\n",
+      "Concrete-ML Quantized Accuracy: 0.6833\n",
+      "Concrete-ML FHE Accuracy: 0.6833\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Train a model with scikit-learn LinearSVC, perform prediction and compute the accuracy\n",
+    "svm_sklearn = SklearnLinearSVC()\n",
+    "svm_sklearn.fit(X_train, y_train)\n",
+    "y_pred_sklearn = svm_sklearn.predict(X_test)\n",
+    "accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn)\n",
+    "\n",
+    "# Perform the same steps with the Concrete-ML LinearSVC implementation\n",
+    "svm_concrete = ConcreteLinearSVC()\n",
+    "svm_concrete.fit(X_train, y_train)\n",
+    "y_pred_concrete_clear = svm_concrete.predict(X_test)\n",
+    "accuracy_concrete_clear = accuracy_score(y_test, y_pred_concrete_clear)\n",
+    "\n",
+    "# A circuit needs to be compiled to enable FHE execution\n",
+    "circuit = svm_concrete.compile(X_train)\n",
+    "circuit.client.keygen(force=False)\n",
+    "# Now that a circuit is compiled, svm_concrete can predict values with FHE\n",
+    "y_pred_concrete_fhe = svm_concrete.predict(X_test, execute_in_fhe=True)\n",
+    "accuracy_concrete_fhe = accuracy_score(y_test, y_pred_concrete_fhe)\n",
+    "\n",
+    "print(f\"Scikit-learn Accuracy: {accuracy_sklearn:.4f}\")\n",
+    "print(f\"Concrete-ML Quantized Accuracy: {accuracy_concrete_clear:.4f}\")\n",
+    "print(f\"Concrete-ML FHE Accuracy: {accuracy_concrete_fhe:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cbf7e0dd",
+   "metadata": {},
+   "source": [
+    "### Code explanation\n",
+    "\n",
+    "Let's have a more in-depth look at the code.\n",
+    "\n",
+    "#### First, we have a regular scikit-learn LinearSVC example.\n",
+    "\n",
+    "```python\n",
+    "# Load the iris dataset\n",
+    "iris = datasets.load_iris()\n",
+    "\n",
+    "# Split the dataset into a training and a testing set\n",
+    "X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target, test_size=0.4, random_state=42)\n",
+    "\n",
+    "# Train a model with scikit-learn LinearSVC, perform prediction and compute the accuracy\n",
+    "svm_sklearn = SklearnLinearSVC()\n",
+    "svm_sklearn.fit(X_train, y_train)\n",
+    "y_pred_sklearn = svm_sklearn.predict(X_test)\n",
+    "accuracy_sklearn = accuracy_score(y_test, y_pred_sklearn)\n",
+    "```\n",
+    "\n",
+    "Hopefully nothing here is confusing; otherwise, you may want to read the [official scikit-learn SVM documentation](https://scikit-learn.org/stable/modules/svm.html#svm-classification).\n",
+    "\n",
+    "The algorithm can be tweaked with the parameters exposed by the LinearSVC class; refer to the [LinearSVC API documentation](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html) to see what can be customized.\n",
+    "\n",
+    "#### Second, we have a Concrete-ML implementation, which behaves just like the scikit-learn one.\n",
+    "\n",
+    "```python\n",
+    "# Perform the same steps with the Concrete-ML LinearSVC implementation\n",
+    "svm_concrete = ConcreteLinearSVC()\n",
+    "svm_concrete.fit(X_train, y_train)\n",
+    "y_pred_concrete_clear = svm_concrete.predict(X_test)\n",
+    "accuracy_concrete_clear = accuracy_score(y_test, y_pred_concrete_clear)\n",
+    "```\n",
+    "\n",
+    "One thing to note here: not only is the model trained on clear data, the predictions are also performed in a *plain environment*; there is no encryption at this stage.\n",
+    "\n",
+    "In order to perform predictions in an FHE environment, the model first has to be compiled into a circuit.\n",
+    "\n",
+    "#### Third, the model is compiled to enable FHE execution\n",
+    "\n",
+    "```python\n",
+    "# A circuit needs to be compiled to enable FHE execution\n",
+    "circuit = svm_concrete.compile(X_train)\n",
+    "circuit.client.keygen(force=False)\n",
+    "# Now that a circuit is compiled, svm_concrete can predict values with FHE\n",
+    "y_pred_concrete_fhe = svm_concrete.predict(X_test, execute_in_fhe=True)\n",
+    "accuracy_concrete_fhe = accuracy_score(y_test, y_pred_concrete_fhe)\n",
+    "```\n",
+    "\n",
+    "Now that the model is compiled, computing predictions with FHE is just a matter of calling `predict` with the `execute_in_fhe` parameter set to `True`.\n",
+    "\n",
+    "#### Accuracy\n",
+    "\n",
+    "Finally, we can measure the accuracy of our 3 different predictions:\n",
+    "\n",
+    "```python\n",
+    "print(f\"Scikit-learn Accuracy: {accuracy_sklearn:.4f}\")\n",
+    "print(f\"Concrete-ML Quantized Accuracy: {accuracy_concrete_clear:.4f}\")\n",
+    "print(f\"Concrete-ML FHE Accuracy: {accuracy_concrete_fhe:.4f}\")\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "51afbfe2",
+   "metadata": {},
+   "source": [
+    "### Key takeaways\n",
+    "\n",
+    "#### Simplicity of execution\n",
+    "\n",
+    "For a high-level use case, Concrete-ML offers a very similar interface to scikit-learn. The main difference is that *a model needs to be compiled to allow execution in FHE*.\n",
+    "\n",
+    "#### Model Accuracy\n",
+    "\n",
+    "The Concrete-ML prediction accuracy is below that of the scikit-learn implementation. This is because of [quantization](https://docs.zama.ai/concrete-ml/advanced-topics/quantization): number precision needs to be fixed-size for the model to be evaluated in FHE. As part 2 shows, tuning the quantization can reduce the accuracy difference until it is negligible or disappears entirely.\n",
+    "\n",
+    "#### Execution time\n",
+    "\n",
+    "The execution is slower with Concrete-ML, especially when compiling the model. Enabling encryption simply requires far more resources than training and using a model on plain data, as the rough timing sketch below illustrates. The speed can be increased by *reducing the precision of the data* (that is, lowering the fixed-size number precision). Depending on the project, you thus have to choose between:\n",
+    "\n",
+    "- a slower model that performs more accurate predictions\n",
+    "- a faster model that performs less accurate predictions"
+   ]
+  },
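+  {
+   "cell_type": "markdown",
+   "id": "b3d7a912",
+   "metadata": {},
+   "source": [
+    "To get a feel for the difference, here is a rough timing sketch reusing the model compiled above. The absolute numbers are machine-dependent and will vary widely with hardware; only the order-of-magnitude gap matters."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "f2c84e60",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import time\n",
+    "\n",
+    "# time quantized (clear) predictions over the whole test set\n",
+    "start = time.time()\n",
+    "svm_concrete.predict(X_test)\n",
+    "print(f\"Clear prediction over the test set: {time.time() - start:.4f}s\")\n",
+    "\n",
+    "# time an FHE prediction on a single sample: expect this to be much slower\n",
+    "start = time.time()\n",
+    "svm_concrete.predict(X_test[:1], execute_in_fhe=True)\n",
+    "print(f\"FHE prediction on one sample: {time.time() - start:.4f}s\")"
+   ]
+  },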
+  {
+   "cell_type": "markdown",
+   "id": "cdc4c4d2",
+   "metadata": {},
+   "source": [
+    "## Part 2: In-depth model development\n",
+    "\n",
+    "This part takes a more in-depth approach, showing how to develop effectively with Concrete-ML. It quotes and follows the steps of [model development](https://docs.zama.ai/concrete-ml/getting-started/concepts#i.-model-development).\n",
+    "\n",
+    "In particular:\n",
+    "- the effects of quantization and how to find a good bit-width\n",
+    "- setting up the virtual library to speed up the development workflow\n",
+    "- running inference on encrypted data\n",
+    "\n",
+    "---\n",
+    "\n",
+    "### Step a: training the model\n",
+    "\n",
+    "Nothing new under the sun here: we need to train a model relevant to our machine-learning problem. As we already relied on the iris example, and improving it is outside the scope of this tutorial, we can simply reuse what we did so far."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "80f701f3",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# set up and train a scikit-learn LinearSVC model, just as before\n",
+    "svm_sklearn = SklearnLinearSVC()\n",
+    "svm_sklearn.fit(X_train, y_train)\n",
+    "# predict some test data and measure the model accuracy\n",
+    "y_pred_sklearn = svm_sklearn.predict(X_test)\n",
+    "accuracy = accuracy_score(y_test, y_pred_sklearn)\n",
+    "\n",
+    "print(f\"Scikit-learn Accuracy: {accuracy:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ce2920d8",
+   "metadata": {},
+   "source": [
+    "Not too shabby.\n",
+    "\n",
+    "### Step b: quantize the model\n",
+    "\n",
+    "So far, we have conveniently avoided most Concrete-ML specifics for the sake of simplicity. The first Concrete-ML-specific step in developing a model is to quantize it, which simply means turning the model into an integer equivalent.\n",
+    "\n",
+    "Although you are strongly encouraged to read the [Zama introduction to quantization](https://docs.zama.ai/concrete-ml/advanced-topics/quantization), the key takeaway is that **a model's values need to be reduced to a smaller, *discrete* set in order for the encryption to happen**. Otherwise, the data becomes too large to be manipulated in FHE.\n",
+    "\n",
+    "As of v0.6.0, the maximum bit-width is 8. The smaller the bit-width, the more efficient the Concrete-ML model. The goal of the quantization step is thus to find the lowest bit-width that offers an acceptable accuracy, so that model efficiency is maximized."
+   ]
+  },
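+  {
+   "cell_type": "markdown",
+   "id": "5a0e3c77",
+   "metadata": {},
+   "source": [
+    "To build some intuition before scanning bit-widths on the actual model, here is a simplified illustration of uniform quantization using plain NumPy. This is only a sketch of the general idea, not Concrete-ML's actual quantization scheme."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9c15d2ab",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# simplified illustration: map floats onto 2**n_bits integer levels,\n",
+    "# map them back, and observe the rounding error this introduces\n",
+    "values = X_train[:, 0]  # first feature of the training set\n",
+    "n_bits = 3\n",
+    "scale = (values.max() - values.min()) / (2**n_bits - 1)\n",
+    "quantized = np.round((values - values.min()) / scale).astype(np.int64)\n",
+    "dequantized = quantized * scale + values.min()\n",
+    "print(f\"Max quantization error with {n_bits} bits: {np.abs(values - dequantized).max():.4f}\")"
+   ]
+  },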
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c7d66066",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# compute the accuracy of an n_bits-quantized LinearSVC, for bit-widths ranging from 2 to 8\n",
+    "for n_bits in range(2, 9):\n",
+    "    svm_concrete = ConcreteLinearSVC(n_bits)\n",
+    "    svm_concrete.fit(X_train, y_train)\n",
+    "    y_pred = svm_concrete.predict(X_test)\n",
+    "    accuracy = accuracy_score(y_test, y_pred)\n",
+    "    print(f\"{n_bits} Bits Quantized Accuracy: {accuracy:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "122bd347",
+   "metadata": {},
+   "source": [
+    "### Step c: simulate the model execution\n",
+    "\n",
+    "Executing models with FHE can prove to be a slow process, depending on:\n",
+    "- the data-set size\n",
+    "- the model itself\n",
+    "- the hardware executing the model\n",
+    "\n",
+    "Concrete-ML allows simulating FHE model execution using the [virtual library](https://docs.zama.ai/concrete-ml/advanced-topics/compilation#simulation-with-the-virtual-library). This speeds up the development process considerably by avoiding repeating the long-running compilation and execution every time.\n",
+    "\n",
+    "> Testing FHE models on very large data-sets can take a long time. Furthermore, not all models are compatible with FHE constraints out-of-the-box. Simulation using the Virtual Library allows you to execute a model that was quantized, to measure the accuracy it would have in FHE, but also to determine the modifications required to make it FHE compatible.\n",
+    ">\n",
+    "> — [Zama documentation](https://docs.zama.ai/concrete-ml/getting-started/concepts#i.-model-development)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "74f5bf79",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# import the Configuration class from concrete-numpy\n",
+    "from concrete.numpy import Configuration\n",
+    "\n",
+    "# define a configuration enabling virtual-library simulation\n",
+    "COMPIL_CONFIG_VL = Configuration(\n",
+    "    dump_artifacts_on_unexpected_failures=False,\n",
+    "    enable_unsafe_features=True,\n",
+    ")\n",
+    "\n",
+    "# arbitrarily set the bit-width to 8\n",
+    "n_bits = 8\n",
+    "svm_concrete = ConcreteLinearSVC(n_bits)\n",
+    "svm_concrete.fit(X_train, y_train)\n",
+    "\n",
+    "# compile the model with the virtual library enabled and with the defined configuration\n",
+    "circuit = svm_concrete.compile(X_train, use_virtual_lib=True, configuration=COMPIL_CONFIG_VL)\n",
+    "\n",
+    "# the model can now be executed in simulated FHE\n",
+    "y_pred = svm_concrete.predict(X_test, execute_in_fhe=True)\n",
+    "accuracy = accuracy_score(y_test, y_pred)\n",
+    "print(f\"{n_bits} Bits Simulated FHE Accuracy: {accuracy:.4f}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e1721c59",
+   "metadata": {},
+   "source": [
+    "*The virtual library enables some unsafe features, so it should understandably not be used in production.*\n",
+    "\n",
+    "So far so good: with the virtual library, the model is compiled and executed much more quickly, allowing us to run it with different configurations.\n",
+    "\n",
+    "In a more complex scenario, we would want to fine-tune many model parameters; here, for the sake of simplicity, we keep the model at its default configuration and only play with the bit-width.\n",
+    "\n",
+    "We can now put the two previous steps together and make sure our quantized model predictions stay accurate in (simulated) FHE."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8b4c9814",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for n_bits in range(2, 9):\n",
+    "    svm_concrete = ConcreteLinearSVC(n_bits)\n",
+    "    svm_concrete.fit(X_train, y_train)\n",
+    "    svm_concrete.compile(X_train, use_virtual_lib=True, configuration=COMPIL_CONFIG_VL)  # the model is now compiled\n",
+    "    y_pred = svm_concrete.predict(X_test, execute_in_fhe=True)  # the execution is simulated by the virtual library\n",
+    "    accuracy = accuracy_score(y_test, y_pred)\n",
+    "    print(f\"{n_bits} Bits Simulated FHE Accuracy: {accuracy:.4f}\")"
+   ]
+  },
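+  {
+   "cell_type": "markdown",
+   "id": "6e2f90d4",
+   "metadata": {},
+   "source": [
+    "Choosing a bit-width can also be automated. The sketch below is only an illustration: it reuses the same scan and keeps the smallest bit-width whose simulated accuracy stays within an arbitrary `tolerance` of the best accuracy observed; that threshold is an assumption you would tune for your own use case."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0b7a45fe",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# sketch: pick the smallest bit-width whose simulated accuracy is within\n",
+    "# `tolerance` of the best accuracy observed across the scan\n",
+    "tolerance = 0.01  # arbitrary threshold, to be tuned per use case\n",
+    "accuracies = {}\n",
+    "for n_bits in range(2, 9):\n",
+    "    svm = ConcreteLinearSVC(n_bits)\n",
+    "    svm.fit(X_train, y_train)\n",
+    "    svm.compile(X_train, use_virtual_lib=True, configuration=COMPIL_CONFIG_VL)\n",
+    "    accuracies[n_bits] = accuracy_score(y_test, svm.predict(X_test, execute_in_fhe=True))\n",
+    "\n",
+    "best_accuracy = max(accuracies.values())\n",
+    "chosen_n_bits = min(n for n, acc in accuracies.items() if acc >= best_accuracy - tolerance)\n",
+    "print(f\"Selected bit-width: {chosen_n_bits}\")"
+   ]
+  },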
+  {
+   "cell_type": "markdown",
+   "id": "1785f094",
+   "metadata": {},
+   "source": [
+    "The model predictions in (simulated) FHE are aligned with the predictions of the plain, quantized model.\n",
+    "\n",
+    "We now need to settle on a bit-width. Depending on the model use case, we might want to favor speed or accuracy. Consider two different use cases for our iris recognition model. One is a \"machine-learning as a service\" model, in which users request our model to analyze their data. In such a context, we might want to favor speed (reducing execution time and computation cost) and select a smaller bit-width, such as 4. Conversely, we can also envision our model being used by scientists to classify irises, where computation cost is not as much of an issue, but the best possible accuracy is required. This scenario would lead us to select a bit-width of 6, as it is the lowest bit-width that provides the best accuracy (0.9833).\n",
+    "\n",
+    "In both scenarios, by putting a little more time into testing and selecting our bit-width, we managed to reduce it by 25% to 50%, avoiding unnecessary computation effort and speeding up the production model.\n",
+    "\n",
+    "### Step d: compile the model\n",
+    "\n",
+    "Now that we have selected a relevant bit-width, we can compile the model without the virtual library, so it can be used in production."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "861502da",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# set up and train a 4-bit quantized LinearSVC model (the speed-oriented scenario above)\n",
+    "svm_concrete = ConcreteLinearSVC(4)\n",
+    "svm_concrete.fit(X_train, y_train)\n",
+    "\n",
+    "# compile the model and generate a key\n",
+    "circuit = svm_concrete.compile(X_train)\n",
+    "circuit.client.keygen(force=False)\n",
+    "\n",
+    "# predict the test set to verify the compiled model accuracy\n",
+    "y_pred = svm_concrete.predict(X_test, execute_in_fhe=True)\n",
+    "accuracy = accuracy_score(y_test, y_pred)\n",
+    "print(f\"4 Bits FHE Accuracy: {accuracy:.4f}\")"
+   ]
+  },
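+  {
+   "cell_type": "markdown",
+   "id": "4d8b21c9",
+   "metadata": {},
+   "source": [
+    "As an optional last step, the compiled model can be serialized for client/server deployment. The cell below is only a hedged sketch: it assumes the `FHEModelDev` helper from `concrete.ml.deployment` and an arbitrary output directory, and the exact API may differ between Concrete-ML versions. The deployment links in the conclusion are the authoritative reference."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2a6c7f38",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# hedged sketch: serialize the compiled model for client/server deployment\n",
+    "# (assumes the FHEModelDev helper; check the Client Server tutorial for the\n",
+    "# exact API of your Concrete-ML version)\n",
+    "from concrete.ml.deployment import FHEModelDev\n",
+    "\n",
+    "fhe_dev = FHEModelDev(path_dir=\"./svm_fhe_deployment\", model=svm_concrete)\n",
+    "fhe_dev.save()"
+   ]
+  },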
+  {
+   "cell_type": "markdown",
+   "id": "00a2461e",
+   "metadata": {},
+   "source": [
+    "## Conclusion\n",
+    "\n",
+    "Setting up FHE with Concrete-ML on a LinearSVC model is very simple, in that Concrete-ML provides an implementation of the [scikit-learn LinearSVC interface](https://scikit-learn.org/stable/modules/generated/sklearn.svm.LinearSVC.html). As a matter of fact, a working FHE model can be set up with just a few lines of code.\n",
+    "\n",
+    "Setting up a model with FHE nonetheless benefits from some additional work. For LinearSVC models, the main point is to select a relevant bit-width for [quantizing](https://docs.zama.ai/concrete-ml/advanced-topics/quantization) the model. Some additional tools can smooth the development workflow, such as the [virtual library](https://docs.zama.ai/concrete-ml/advanced-topics/compilation#simulation-with-the-virtual-library), which alleviates [compilation](https://docs.zama.ai/concrete-ml/advanced-topics/compilation) time.\n",
+    "\n",
+    "Once the model is carefully trained and quantized, it is ready to be deployed and used in production. Here are some useful links that cover this subject:\n",
+    "- [Inference in the Cloud](https://docs.zama.ai/concrete-ml/getting-started/cloud) summarizes the steps for cloud deployment\n",
+    "- [Production Deployment](https://docs.zama.ai/concrete-ml/advanced-topics/client_server) offers a high-level view of how to deploy a Concrete-ML model in a client/server setting\n",
+    "- [Client Server in Concrete ML](https://github.com/zama-ai/concrete-ml/blob/release/0.6.x/docs/advanced_examples/ClientServer.ipynb) is another tutorial, providing a more hands-on approach"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.8.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}