From ac3caae1acff69e061ea65f85d649d4b60c83307 Mon Sep 17 00:00:00 2001
From: Sara Khosravi
Date: Sat, 9 Nov 2024 23:30:20 -0500
Subject: [PATCH 1/2] Trigger change for pull request

From e041e7e2395b6b737e24ac2626642e27b4527df0 Mon Sep 17 00:00:00 2001
From: Sara Khosravi
Date: Sat, 9 Nov 2024 23:36:23 -0500
Subject: [PATCH 2/2] Final edit to Security Chapter - Today's Changes Only

---
 .../core/privacy_security/privacy_security.qmd | 16 ++++++++--------
 1 file changed, 8 insertions(+), 8 deletions(-)

diff --git a/contents/core/privacy_security/privacy_security.qmd b/contents/core/privacy_security/privacy_security.qmd
index 7284990f..42a9dd82 100644
--- a/contents/core/privacy_security/privacy_security.qmd
+++ b/contents/core/privacy_security/privacy_security.qmd
@@ -993,7 +993,7 @@ For many real-time and embedded applications, fully homomorphic encryption remai

 ### Homomorphic Encryption

-Ready to unlock the power of encrypted computation? Homomorphic encryption is like a magic trick for your data! In this Colab, we'll learn how to do calculations on secret numbers without ever revealing them. Imagine training a model on data you can't even see – that's the power of this mind-bending technology.
+The power of encrypted computation is unlocked through homomorphic encryption – a transformative approach in which calculations are performed directly on encrypted data, ensuring privacy is preserved throughout the process. This Colab explores the principles of computing on encrypted numbers without exposing the underlying data. Imagine a scenario where a machine learning model is trained on data that cannot be directly accessed – such is the strength of homomorphic encryption.

 [![](https://colab.research.google.com/assets/colab-badge.png)](https://colab.research.google.com/drive/1GjKT5Lgh9Madjsyr9UiyeogUgVpTEBMp?usp=sharing)

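To make the idea concrete, here is a minimal sketch of additively homomorphic encryption in Python. It assumes the open-source `phe` (python-paillier) package; the package choice, key length, and example values are illustrative and are not taken from the linked Colab.

```python
# pip install phe  (python-paillier, an additively homomorphic cryptosystem)
from phe import paillier

# Generate a Paillier keypair; the key length here is only illustrative.
public_key, private_key = paillier.generate_paillier_keypair(n_length=2048)

# Encrypt two private values.
enc_a = public_key.encrypt(3.5)
enc_b = public_key.encrypt(2.5)

# Compute directly on the ciphertexts: adding two ciphertexts and scaling by a
# plaintext constant never exposes the underlying values.
enc_sum = enc_a + enc_b
enc_scaled = enc_a * 4

print(private_key.decrypt(enc_sum))     # 6.0
print(private_key.decrypt(enc_scaled))  # 14.0
```

Only the holder of the private key can decrypt the results, so an untrusted server could evaluate this arithmetic without ever seeing the inputs.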
@@ -1003,9 +1003,9 @@ Ready to unlock the power of encrypted computation? Homomorphic encryption is li

 #### Core Idea

-The overarching goal of Multi-Party Communication (MPC) is to enable different parties to jointly compute a function over their inputs while keeping those inputs private. For example, two organizations may want to collaborate on training a machine learning model by combining their respective data sets. Still, they cannot directly reveal that data due to Privacy or confidentiality constraints. MPC provides protocols and techniques that allow them to achieve the benefits of pooled data for model accuracy without compromising the privacy of each organization's sensitive data.
+Multi-Party Computation (MPC) enables multiple parties to jointly compute a function over their inputs while ensuring that each party's inputs remain confidential. For instance, two organizations can collaborate on training a machine learning model by combining datasets without revealing sensitive information to each other. MPC protocols are essential where privacy and confidentiality regulations restrict direct data sharing, such as in the healthcare or financial sectors.

-At a high level, MPC works by carefully splitting the computation into parts that each party can execute independently using their private input. The results are then combined to reveal only the final output of the function and nothing about the intermediate values. Cryptographic techniques are used to guarantee that the partial results remain private provably.
+MPC divides computation into parts that each participant executes independently using their private data. These results are then combined to reveal only the final output, preserving the privacy of intermediate values. Cryptographic techniques provably guarantee that the partial results remain private.

 Let's take a simple example of an MPC protocol. One of the most basic MPC protocols is the secure addition of two numbers. Each party splits its input into random shares that are secretly distributed. They exchange the shares and locally compute the sum of the shares, which reconstructs the final sum without revealing the individual inputs. For example, if Alice has input x and Bob has input y:

@@ -1019,7 +1019,7 @@ Let's take a simple example of an MPC protoc
 5. $s_1 + s_2 = x + y$ is the final sum, without revealing $x$ or $y$.

-Alice's and Bob's individual inputs ($x$ and $y$) remain private, and each party only reveals one number associated with their original inputs. The random outputs ensure that no information about the original numbers disclosed.
+Alice's and Bob's individual inputs ($x$ and $y$) remain private, and each party only reveals one number associated with their original inputs. The random outputs ensure that no information about the original numbers is disclosed.

 **Secure Comparison:** Another basic operation is a secure comparison of two numbers, determining which is greater than the other. This can be done using techniques like Yao's Garbled Circuits, where the comparison circuit is encrypted to allow joint evaluation of the inputs without leaking them.

@@ -1049,7 +1049,7 @@ While MPC protocols provide strong privacy guarantees, they come at a high compu

 * Oblivious transfer and garbled circuits add masking and encryption to hide data access patterns and execution flows.

-* MPC systems require extensive communication and interaction between parties to compute on shares/ciphertexts jointly.
+* MPC systems require extensive communication and interaction between parties to jointly compute on shares/ciphertexts.

 As a result, MPC protocols can slow down computations by 3-4 orders of magnitude compared to plain implementations. This becomes prohibitively expensive for large datasets and models. Therefore, training machine learning models on encrypted data using MPC remains infeasible today for realistic dataset sizes due to the overhead. Clever optimizations and approximations are needed to make MPC practical.

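To illustrate the secure-addition protocol described above, the following self-contained Python sketch implements two-party additive secret sharing. The modulus, inputs, and variable names are invented for the example; a production MPC framework would additionally handle networking and adversarial behavior.

```python
import secrets

Q = 2**61 - 1  # public modulus; all arithmetic is done mod Q

def share(value, n_parties=2):
    """Split `value` into additive shares that sum to `value` mod Q."""
    shares = [secrets.randbelow(Q) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares

def reconstruct(shares):
    return sum(shares) % Q

x, y = 42, 99              # Alice's and Bob's private inputs
x1, x2 = share(x)          # Alice keeps x1 and sends x2 to Bob
y1, y2 = share(y)          # Bob keeps y2 and sends y1 to Alice

s1 = (x1 + y1) % Q         # computed locally by Alice
s2 = (x2 + y2) % Q         # computed locally by Bob

assert reconstruct([s1, s2]) == x + y  # only the sum is ever revealed
```

Each party only ever sees a single uniformly random share of the other's input, yet combining the locally computed partial sums $s_1$ and $s_2$ yields exactly $x + y$.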
@@ -1059,9 +1059,9 @@ Ongoing MPC research closes this efficiency gap through cryptographic advances,

 #### Core Idea

-Synthetic data generation has emerged as an important privacy-preserving machine learning approach that allows models to be developed and tested without exposing real user data. The key idea is to train generative models on real-world datasets and then sample from these models to synthesize artificial data that statistically match the original data distribution but does not contain actual user information. For example, a GAN could be trained on a dataset of sensitive medical records to learn the underlying patterns and then used to sample synthetic patient data.
+Synthetic data generation has emerged as an important privacy-preserving machine learning approach that allows models to be developed and tested without exposing real user data. The key idea is to train generative models on real-world datasets and then sample from these models to synthesize artificial data that statistically matches the original data distribution but does not contain actual user information. For example, a GAN could be trained on a dataset of sensitive medical records to learn the underlying patterns and then used to sample synthetic patient data.

-The primary challenge of synthesizing data is to ensure adversaries are unable to re-identify the original dataset. A simple approach to achieving synthetic data is adding noise to the original dataset, which still risks privacy leakage. When noise is added to data in the context of differential privacy, sophisticated mechanisms based on the data's sensitivity are used to calibrate the amount and distribution of noise. Through these mathematically rigorous frameworks, differential Privacy generally guarantees Privacy at some level, which is the primary goal of this privacy-preserving technique. Beyond preserving privacy, synthetic data combats multiple data availability issues such as imbalanced datasets, scarce datasets, and anomaly detection.
+The primary challenge of synthesizing data is to ensure adversaries cannot re-identify the original dataset. A simple approach to achieving synthetic data is adding noise to the original dataset, which still risks privacy leakage. When noise is added to data in the context of differential privacy, sophisticated mechanisms based on the data's sensitivity are used to calibrate the amount and distribution of noise. Through these mathematically rigorous frameworks, differential privacy generally guarantees privacy at some level, which is the primary goal of this privacy-preserving technique. Beyond preserving privacy, synthetic data combats multiple data availability issues such as imbalanced datasets, scarce datasets, and anomaly detection.

 Researchers can freely share this synthetic data and collaborate on modeling without revealing private medical information. Well-constructed synthetic data protects Privacy while providing utility for developing accurate models. Key techniques to prevent reconstructing the original data include adding differential privacy noise during training, enforcing plausibility constraints, and using multiple diverse generative models. Here are some common approaches for generating synthetic data:

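As a toy illustration of these two ideas, the sketch below fits a very simple generative model (a Gaussian, standing in for a GAN) and calibrates Laplace noise to the data's sensitivity. The dataset, bounds, and `epsilon` value are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "sensitive" dataset: ages clipped to [0, 100]; the bounds determine sensitivity.
real_ages = rng.normal(loc=52, scale=12, size=1_000).clip(0, 100)

# Laplace mechanism: the mean of n values bounded in [0, 100] has sensitivity 100 / n,
# so the noise scale is calibrated to sensitivity / epsilon.
epsilon = 1.0
sensitivity = 100 / len(real_ages)
dp_mean = real_ages.mean() + rng.laplace(scale=sensitivity / epsilon)

# A deliberately simple generative model: sample synthetic records from a Gaussian
# fitted to the (noisy) summary statistics instead of training a GAN.
dp_std = real_ages.std()  # a complete pipeline would privatize this statistic too
synthetic_ages = rng.normal(loc=dp_mean, scale=dp_std, size=1_000).clip(0, 100)

print(f"real mean {real_ages.mean():.1f}, synthetic mean {synthetic_ages.mean():.1f}")
```

In practice, a deep generative model trained with a differentially private optimizer would replace the Gaussian fit, but the pattern is the same: only privatized information about the real records should influence the synthetic samples.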
@@ -1095,7 +1095,7 @@ While synthetic data tries to remove any evidence of the original dataset, priva

 A core challenge with synthetic data is the potential gap between synthetic and real-world data distributions. Despite advancements in generative modeling techniques, synthetic data may only partially capture real data's complexity, diversity, and nuanced patterns. This can limit the utility of synthetic data for robustly training machine learning models. Rigorously evaluating synthetic data quality through adversary methods and comparing model performance to real data benchmarks helps assess and improve fidelity. However, inherently, synthetic data remains an approximation.

-Another critical concern is the privacy risks of synthetic data. Generative models may leak identifiable information about individuals in the training data, which could enable reconstruction of private information. Emerging adversarial attacks demonstrate the challenges in preventing identity leakage from synthetic data generation pipelines. Techniques like differential Privacy can help safeguard Privacy but come with tradeoffs in data utility. There is an inherent tension between producing useful synthetic data and fully protecting sensitive training data, which must be balanced.
+Another critical concern is the privacy risks of synthetic data. Generative models may leak identifiable information about individuals in the training data, which could enable the reconstruction of private information. Emerging adversarial attacks demonstrate the challenges in preventing identity leakage from synthetic data generation pipelines. Techniques like differential privacy can help safeguard privacy, but they come with tradeoffs in data utility. There is an inherent tension between producing useful synthetic data and fully protecting sensitive training data, which must be balanced.

 Additional pitfalls of synthetic data include amplified biases, mislabeling, the computational overhead of training generative models, storage costs, and failure to account for out-of-distribution novel data. While these are secondary to the core synthetic-real gap and privacy risks, they remain important considerations when evaluating the suitability of synthetic data for particular machine-learning tasks. As with any technique, the advantages of synthetic data come with inherent tradeoffs and limitations that require thoughtful mitigation strategies.