From 72c79293613dbab31881db08c504efc52dd2026f Mon Sep 17 00:00:00 2001
From: Erik-Jan van Kesteren
Date: Mon, 26 Feb 2024 12:07:51 +0100
Subject: [PATCH] Update README.md

---
 README.md | 88 ++++++++++++++++++++++++++-----------------------------
 1 file changed, 42 insertions(+), 46 deletions(-)

diff --git a/README.md b/README.md
index c1e0900..0ad2f92 100644
--- a/README.md
+++ b/README.md
@@ -1,73 +1,70 @@
 # Metasyn disclosure control
+[![](https://img.shields.io/badge/metasyn-plugin-blue?logo=python&logoColor=white)](https://github.com/sodascience/metasyn)
+[![Python package](https://github.com/sodascience/metasyn-disclosure-control/actions/workflows/python-package.yml/badge.svg)](https://github.com/sodascience/metasyn-disclosure-control/actions/workflows/python-package.yml)
+[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)

-A privacy plugin for [metasyn](https://github.com/sodascience/metasyn), based on statistical disclosure control (SDC) rules as found in the following documents:
+A privacy plugin for [metasyn](https://github.com/sodascience/metasyn), based on statistical disclosure control (SDC) rules of thumb as found in the following documents:

 - The [SDC handbook](https://securedatagroup.org/guides-and-resources/sdc-handbook/) of the Secure Data group in the UK
 - The Data Without Boundaries document [Guidelines for output checking](https://wayback.archive-it.org/12090/*/https:/cros-legacy.ec.europa.eu/system/files/dwb_standalone-document_output-checking-guidelines.pdf) (pdf)
 - Statistics Netherlands' output guidelines
+While producing synthetic data with [metasyn](https://github.com/sodascience/metasyn) is already a great first step towards protecting privacy, it doesn't adhere to official standards.
For example, fitting a uniform distribution will disclose the lowest and highest values in the dataset, which may be a privacy issue in sensitive data. This plugin solves these kinds of problems.

-While the base metasyn package is generally good at protecting privacy, it doesn't adhere to any standard level of privacy. For example, the uniform distributions in the base package will simply find the lowest and highest values in the dataset, and use those as the boundaries for the uniform distribution. In some cases the minimum and maximum values can be disclosive. That is why we have built this plugin that implements the disclosure control standard.
+> Currently, the disclosure control plugin is work in progress. Especially in light of this, we disclaim
+any responsibility as a result of using this plugin.

 ## Usage

-The most basic usage is as follows:
+Basic usage for our built-in titanic dataset is as follows:

 ```py
-import polars as pl
-from metasyn import MetaFrame
+from metasyn import MetaFrame, demo_data
 from metasyncontrib.disclosure import DisclosurePrivacy
+from metasyncontrib.disclosure.faker import DisclosureFaker
+
+df = demo_data("titanic")
+
+spec = [
+    {"name": "PassengerId", "distribution": {"unique": True}},
+    {"name": "Name", "distribution": DisclosureFaker("name")},
+]

-df = pl.read_csv("your_data.csv")
 mf = MetaFrame.fit_dataframe(
     df=df,
     dist_providers=["builtin", "metasyn-disclosure"],
-    privacy=DisclosurePrivacy()
+    privacy=DisclosurePrivacy(),
+    var_specs=spec
 )
-mf.synthesize()
+mf.synthesize(5)
 ```
+```
+shape: (5, 13)
+┌─────────────┬────────────────────┬────────┬──────┬───┬────────────┬────────────┬─────────────────────┬────────┐
+│ PassengerId ┆ Name               ┆ Sex    ┆ Age  ┆ … ┆ Birthday   ┆ Board time ┆ Married since       ┆ all_NA │
+│ ---         ┆ ---                ┆ ---    ┆ ---  ┆   ┆ ---        ┆ ---        ┆ ---                 ┆ ---    │
+│ i64         ┆ str                ┆ cat    ┆ i64  ┆   ┆ date       ┆ time       ┆ datetime[μs]        ┆ f32    │
+╞═════════════╪════════════════════╪════════╪══════╪═══╪════════════╪════════════╪═════════════════════╪════════╡
+│ 0           ┆ Benjamin Cox       ┆ female ┆ 27   ┆ … ┆ 1931-12-01 ┆ 14:33:06   ┆ 2022-07-30 02:16:37 ┆ null   │
+│ 1           ┆ Mr. David Robinson ┆ female ┆ null ┆ … ┆ 1906-02-18 ┆ null       ┆ 2022-08-03 13:09:19 ┆ null   │
+│ 2           ┆ Randy Mosley       ┆ male   ┆ 24   ┆ … ┆ 1933-01-06 ┆ 15:52:54   ┆ 2022-07-18 18:52:05 ┆ null   │
+│ 3           ┆ Vincent Maddox     ┆ female ┆ 24   ┆ … ┆ 1937-02-10 ┆ 16:58:30   ┆ 2022-07-23 20:29:49 ┆ null   │
+│ 4           ┆ Kristin Holland    ┆ male   ┆ 17   ┆ … ┆ 1939-12-09 ┆ 18:07:45   ┆ 2022-08-05 02:41:51 ┆ null   │
+└─────────────┴────────────────────┴────────┴──────┴───┴────────────┴────────────┴─────────────────────┴────────┘
+```

-## Current status of the plugin
-
-Currently, there the disclosure plugin is work in progress. Especially in light of this, we disclaim
-any responsibility as a result of using this plugin. For most of the distributions
-the micro-aggregation technique is used. This technique pre-averages a sorted version of the data,
-which then supplied to the original fitting mechanism. The idea is that during this pre-averaging
-step, we ensure that the rule of thumb is followed, so that the fitting method doesn't need to do
-anything in particular. While, from a statistical point of view, we are losing more information than
-we probably need, it should ensure the safety of the data.
-
-Below we have summarized the status for each of the variable types:
-
-### Discrete
-
-It technically works, but a new micro-aggregation algorithm specifically for integers might yield
-better and more consistent results.
Currently are implemented:
-
-- DiscreteUniform, UniqueKey, Poisson
-
-### Continuous
-
-No current issues, following are implemented:
-
-- Uniform, TruncatedNormal, Normal, LogNormal, Exponential
-
-### Datetime
-
-Implemented:
-
-- UniformDate, UniformTime, UniformDateTime

-### String
+## Implementation details
+The rules of thumb, roughly, are:

-Currently only Faker distribution is implemented (which is the same as the metasyn base package,
-since the distribution is not fit to any data). The regex distribution is currently not implemented.
+- at least 10 units
+- at least 10 degrees of freedom
+- no group disclosure
+- no dominance

-### Categorical
+For most distributions, we implemented micro-aggregation. This technique pre-averages a sorted version of the data, which then supplied to the original fitting mechanism. The idea is that during this pre-averaging step, we ensure that the rules of thumb are followed, so that the fitting method doesn't need to do anything in particular. While from a statistical point of view, we are losing more information than we probably need, it should ensure the safety of the data.

-A safe version of the multinoulli distribution is implemented. There is still some discussion on what to do if the dominance
-rule is violated.
@@ -84,7 +81,6 @@ To create a pull request:

 ## Contact

-**Metasyn-disclosure** is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team.
-Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact [Raoul Schram](https://github.com/qubixes) or [Erik-Jan van Kesteren](https://github.com/vankesteren).
+This is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team. Do you have questions, suggestions, or remarks on the technical implementation?
File an issue in the issue tracker or feel free to contact [Raoul Schram](https://github.com/qubixes) or [Erik-Jan van Kesteren](https://github.com/vankesteren).
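
The micro-aggregation step described under "Implementation details" in the patch above can be illustrated with a minimal sketch. This is not the plugin's actual code: `micro_aggregate` and `min_bin_size` are hypothetical names, the bin size of 10 is taken from the "at least 10 units" rule of thumb, and returning one mean per bin is an assumption about what "pre-averaging" produces.

```python
import numpy as np


def micro_aggregate(values, min_bin_size=10):
    """Hypothetical helper: pre-average sorted data in bins of >= min_bin_size values."""
    sorted_values = np.sort(np.asarray(values, dtype=float))
    n_values = sorted_values.size
    if n_values < min_bin_size:
        raise ValueError("Need at least `min_bin_size` values to form one bin.")
    # Floor division guarantees that every bin holds at least min_bin_size values.
    n_bins = n_values // min_bin_size
    # array_split yields n_bins consecutive chunks of nearly equal size.
    bins = np.array_split(sorted_values, n_bins)
    return np.array([chunk.mean() for chunk in bins])


# Illustrative data only: 25 distinct values with raw extremes 0 and 24.
ages = np.arange(25)
aggregated = micro_aggregate(ages)

# Bounds fitted on the aggregated data are bin means, not individual records.
lower, upper = aggregated.min(), aggregated.max()
```

In this sketch, a uniform distribution fitted to `aggregated` would use `lower` and `upper` as its boundaries, so neither the smallest nor the largest individual value is disclosed, at the cost of a narrower range than the raw data.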