Edit security chapter f #516

Merged
merged 2 commits into from
Nov 12, 2024
16 changes: 8 additions & 8 deletions contents/core/privacy_security/privacy_security.qmd
@@ -993,7 +993,7 @@ For many real-time and embedded applications, fully homomorphic encryption remai

### Homomorphic Encryption

Ready to unlock the power of encrypted computation? Homomorphic encryption is like a magic trick for your data! In this Colab, we'll learn how to do calculations on secret numbers without ever revealing them. Imagine training a model on data you can't even see – that's the power of this mind-bending technology.
The power of encrypted computation is unlocked through homomorphic encryption – a transformative approach in which calculations are performed directly on encrypted data, ensuring privacy is preserved throughout the process. This Colab explores the principles of computing on encrypted numbers without exposing the underlying data. Imagine a scenario where a machine learning model is trained on data that cannot be directly accessed – such is the strength of homomorphic encryption.

[![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/drive/1GjKT5Lgh9Madjsyr9UiyeogUgVpTEBMp?usp=sharing)
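
To make the idea concrete, here is a minimal sketch of an additively homomorphic scheme in the style of Paillier. The tiny hardcoded primes and parameter choices are illustrative assumptions chosen for readability, not secure values, and this is not necessarily the scheme the Colab itself uses.

```python
# Minimal Paillier-style additively homomorphic encryption (toy parameters).
import math
import random

p, q = 293, 433              # toy primes; real deployments use 1024+ bit primes
n, n_sq = p * q, (p * q) ** 2
g = n + 1                    # standard generator choice for Paillier
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)         # valid because L(g^lam mod n^2) = lam mod n

def encrypt(m: int) -> int:
    r = random.randrange(1, n)              # fresh randomness per ciphertext
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq

def decrypt(c: int) -> int:
    L = (pow(c, lam, n_sq) - 1) // n        # Paillier's L function
    return (L * mu) % n

# Adding ciphertexts: multiplying them encrypts the sum of the plaintexts.
c1, c2 = encrypt(17), encrypt(25)
assert decrypt((c1 * c2) % n_sq) == 17 + 25  # computed without decrypting inputs
```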

@@ -1003,9 +1003,9 @@

#### Core Idea

The overarching goal of Multi-Party Communication (MPC) is to enable different parties to jointly compute a function over their inputs while keeping those inputs private. For example, two organizations may want to collaborate on training a machine learning model by combining their respective data sets. Still, they cannot directly reveal that data due to Privacy or confidentiality constraints. MPC provides protocols and techniques that allow them to achieve the benefits of pooled data for model accuracy without compromising the privacy of each organization's sensitive data.
Secure Multi-Party Computation (MPC) enables multiple parties to jointly compute a function over their inputs while ensuring that each party’s inputs remain confidential. For instance, two organizations can collaborate on training a machine learning model by combining datasets without revealing sensitive information to each other. MPC protocols are essential where privacy and confidentiality regulations restrict direct data sharing, such as in healthcare or financial sectors.

At a high level, MPC works by carefully splitting the computation into parts that each party can execute independently using their private input. The results are then combined to reveal only the final output of the function and nothing about the intermediate values. Cryptographic techniques are used to guarantee that the partial results remain private provably.
MPC divides computation into parts that each participant executes independently using their private data. These results are then combined to reveal only the final output, preserving the privacy of intermediate values. Cryptographic techniques provably guarantee that the partial results remain private.

Let's take a simple example of an MPC protocol. One of the most basic MPC protocols is the secure addition of two numbers. Each party splits its input into random shares that are secretly distributed. They exchange the shares and locally compute the sum of the shares, which reconstructs the final sum without revealing the individual inputs. For example, if Alice has input x and Bob has input y:

@@ -1019,7 +1019,7 @@ Let's take a simple example of an MPC protoc

5. $s_1 + s_2 = x + y$ is the final sum, without revealing $x$ or $y$.

Alice's and Bob's individual inputs ($x$ and $y$) remain private, and each party only reveals one number associated with their original inputs. The random outputs ensure that no information about the original numbers disclosed.
Alice's and Bob's individual inputs ($x$ and $y$) remain private, and each party reveals only one share derived from its original input. The randomness of the shares ensures that no information about the original numbers is disclosed.
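
Here is a minimal sketch of this secure-addition protocol in Python, using additive secret sharing over a prime field; the field size and the two-party setup are assumptions made for illustration.

```python
# Two-party secure addition via additive secret sharing (illustrative).
import random

P = 2**61 - 1                        # prime modulus; inputs must be smaller

def share(value: int) -> tuple[int, int]:
    """Split value into two random shares that sum to it mod P."""
    r = random.randrange(P)
    return r, (value - r) % P

x, y = 42, 99                        # Alice's and Bob's private inputs
x1, x2 = share(x)                    # Alice keeps x1 and sends x2 to Bob
y1, y2 = share(y)                    # Bob keeps y2 and sends y1 to Alice

s1 = (x1 + y1) % P                   # Alice's local sum of shares
s2 = (x2 + y2) % P                   # Bob's local sum of shares

assert (s1 + s2) % P == x + y        # only the final sum is revealed
```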

**Secure Comparison:** Another basic operation is the secure comparison of two numbers, determining which is greater without revealing either value. This can be done using techniques like Yao's Garbled Circuits, where the comparison circuit is encrypted to allow joint evaluation of the inputs without leaking them.
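
A full comparison circuit is involved, but the building block is easy to sketch: below is a single garbled AND gate, the kind of gate Yao's protocol composes into larger circuits. The 16-byte labels, SHA-256 keying, and the all-zeros validity tag are simplifying assumptions; a real implementation also needs oblivious transfer and standard optimizations such as point-and-permute.

```python
# Garbling and evaluating one AND gate (toy version of Yao's technique).
import hashlib
import os
import random

def H(a: bytes, b: bytes) -> bytes:
    return hashlib.sha256(a + b).digest()       # 32-byte one-time key

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(u ^ v for u, v in zip(a, b))

# Garbler: a random 16-byte label for each possible value of each wire.
wire = {w: (os.urandom(16), os.urandom(16)) for w in ("x", "y", "out")}

# Encrypt the correct output label under each input-label pair, then shuffle.
table = []
for a in (0, 1):
    for b in (0, 1):
        plaintext = wire["out"][a & b] + bytes(16)   # zero tag marks valid row
        table.append(xor(H(wire["x"][a], wire["y"][b]), plaintext))
random.shuffle(table)                                # hide which row is which

# Evaluator: holds exactly one label per input wire (obtained via oblivious
# transfer in the real protocol) and learns nothing about the other labels.
kx, ky = wire["x"][1], wire["y"][0]                  # evaluating AND(1, 0)
for row in table:
    candidate = xor(H(kx, ky), row)
    if candidate[16:] == bytes(16):                  # found the decryptable row
        result = candidate[:16]

assert result == wire["out"][0]                      # the label for output 0
```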

@@ -1049,7 +1049,7 @@ While MPC protocols provide strong privacy guarantees, they come at a high compu

* Oblivious transfer and garbled circuits add masking and encryption to hide data access patterns and execution flows.

* MPC systems require extensive communication and interaction between parties to compute on shares/ciphertexts jointly.
* MPC systems require extensive communication and interaction between parties to jointly compute on shares/ciphertexts.

As a result, MPC protocols can slow down computations by 3-4 orders of magnitude compared to plain implementations. This becomes prohibitively expensive for large datasets and models. Therefore, training machine learning models on encrypted data using MPC remains infeasible today for realistic dataset sizes due to the overhead. Clever optimizations and approximations are needed to make MPC practical.

@@ -1059,9 +1059,9 @@ Ongoing MPC research closes this efficiency gap through cryptographic advances,

#### Core Idea

Synthetic data generation has emerged as an important privacy-preserving machine learning approach that allows models to be developed and tested without exposing real user data. The key idea is to train generative models on real-world datasets and then sample from these models to synthesize artificial data that statistically match the original data distribution but does not contain actual user information. For example, a GAN could be trained on a dataset of sensitive medical records to learn the underlying patterns and then used to sample synthetic patient data.
Synthetic data generation has emerged as an important privacy-preserving machine learning approach that allows models to be developed and tested without exposing real user data. The key idea is to train generative models on real-world datasets and then sample from these models to synthesize artificial data that statistically matches the original data distribution but does not contain actual user information. For example, a GAN could be trained on a dataset of sensitive medical records to learn the underlying patterns and then used to sample synthetic patient data.

The primary challenge of synthesizing data is to ensure adversaries are unable to re-identify the original dataset. A simple approach to achieving synthetic data is adding noise to the original dataset, which still risks privacy leakage. When noise is added to data in the context of differential privacy, sophisticated mechanisms based on the data's sensitivity are used to calibrate the amount and distribution of noise. Through these mathematically rigorous frameworks, differential Privacy generally guarantees Privacy at some level, which is the primary goal of this privacy-preserving technique. Beyond preserving privacy, synthetic data combats multiple data availability issues such as imbalanced datasets, scarce datasets, and anomaly detection.
The primary challenge of synthesizing data is to ensure adversaries cannot re-identify the original dataset. A simple approach to achieving synthetic data is adding noise to the original dataset, which still risks privacy leakage. When noise is added to data in the context of differential privacy, sophisticated mechanisms based on the data's sensitivity are used to calibrate the amount and distribution of noise. Through these mathematically rigorous frameworks, differential privacy provides quantifiable privacy guarantees, which is the primary goal of this privacy-preserving technique. Beyond preserving privacy, synthetic data combats multiple data availability issues such as imbalanced datasets, scarce datasets, and anomaly detection.
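
As a concrete illustration of sensitivity-calibrated noise, here is a minimal sketch of the Laplace mechanism in Python; the dataset, the assumed bounds on ages, and the epsilon value are all assumptions chosen for the example.

```python
# Releasing a differentially private mean with the Laplace mechanism (sketch).
import numpy as np

def laplace_mechanism(value: float, sensitivity: float, epsilon: float) -> float:
    """Add Laplace noise scaled to sensitivity/epsilon for epsilon-DP."""
    return value + np.random.laplace(loc=0.0, scale=sensitivity / epsilon)

ages = np.array([34, 29, 41, 55, 62])     # toy "sensitive" records
# One person can shift the mean by at most (max - min) / n, assuming
# ages are clipped to the range [0, 90].
sensitivity = 90 / len(ages)
private_mean = laplace_mechanism(ages.mean(), sensitivity, epsilon=1.0)
print(private_mean)                        # noisy estimate of the true mean 44.2
```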

Researchers can freely share this synthetic data and collaborate on modeling without revealing private medical information. Well-constructed synthetic data protects privacy while providing utility for developing accurate models. Key techniques to prevent reconstructing the original data include adding differential privacy noise during training, enforcing plausibility constraints, and using multiple diverse generative models. Here are some common approaches for generating synthetic data:

@@ -1095,7 +1095,7 @@ While synthetic data tries to remove any evidence of the original dataset, priva

A core challenge with synthetic data is the potential gap between synthetic and real-world data distributions. Despite advancements in generative modeling techniques, synthetic data may only partially capture the complexity, diversity, and nuanced patterns of real data. This can limit the utility of synthetic data for robustly training machine learning models. Rigorously evaluating synthetic data quality through adversarial methods and comparing model performance to real data benchmarks helps assess and improve fidelity. However, synthetic data inherently remains an approximation.
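
One common way to run the benchmark comparison described above is "train on synthetic, test on real" (TSTR). The sketch below uses stand-in Gaussian data and a scikit-learn classifier; the data, the model choice, and what counts as an acceptable gap are all assumptions for illustration.

```python
# Train-on-synthetic, test-on-real (TSTR) fidelity check (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Stand-ins: "real" data, and "synthetic" data from an imperfect generator.
X_real = rng.normal(size=(1000, 5))
y_real = (X_real.sum(axis=1) > 0).astype(int)
X_synth = rng.normal(scale=1.2, size=(1000, 5))
y_synth = (X_synth.sum(axis=1) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X_real, y_real, random_state=0)

acc_real = accuracy_score(y_te, LogisticRegression().fit(X_tr, y_tr).predict(X_te))
acc_synth = accuracy_score(y_te, LogisticRegression().fit(X_synth, y_synth).predict(X_te))

# A large gap between the two scores signals poor synthetic-data fidelity.
print(f"trained on real: {acc_real:.3f}, trained on synthetic: {acc_synth:.3f}")
```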

Another critical concern is the privacy risks of synthetic data. Generative models may leak identifiable information about individuals in the training data, which could enable reconstruction of private information. Emerging adversarial attacks demonstrate the challenges in preventing identity leakage from synthetic data generation pipelines. Techniques like differential Privacy can help safeguard Privacy but come with tradeoffs in data utility. There is an inherent tension between producing useful synthetic data and fully protecting sensitive training data, which must be balanced.
Another critical concern is the privacy risks of synthetic data. Generative models may leak identifiable information about individuals in the training data, which could enable the reconstruction of private information. Emerging adversarial attacks demonstrate the challenges in preventing identity leakage from synthetic data generation pipelines. Techniques like differential privacy can help safeguard privacy, but they come with tradeoffs in data utility. There is an inherent tension between producing useful synthetic data and fully protecting sensitive training data, which must be balanced.

Additional pitfalls of synthetic data include amplified biases, mislabeling, the computational overhead of training generative models, storage costs, and failure to account for out-of-distribution novel data. While these are secondary to the core synthetic-real gap and privacy risks, they remain important considerations when evaluating the suitability of synthetic data for particular machine-learning tasks. As with any technique, the advantages of synthetic data come with inherent tradeoffs and limitations that require thoughtful mitigation strategies.
