Update README.md
vankesteren committed Feb 26, 2024
1 parent fbfb8de commit 72c7929
Showing 1 changed file with 42 additions and 46 deletions.
88 changes: 42 additions & 46 deletions README.md
@@ -1,73 +1,70 @@
# Metasyn disclosure control
[![](https://img.shields.io/badge/metasyn-plugin-blue?logo=python&logoColor=white)](https://github.com/sodascience/metasyn)
[![Python package](https://github.com/sodascience/metasyn-disclosure-control/actions/workflows/python-package.yml/badge.svg)](https://github.com/sodascience/metasyn-disclosure-control/actions/workflows/python-package.yml)
[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip)

A privacy plugin for [metasyn](https://github.com/sodascience/metasyn), based on statistical disclosure control (SDC) rules as found in the following documents:
A privacy plugin for [metasyn](https://github.com/sodascience/metasyn), based on statistical disclosure control (SDC) rules of thumb as found in the following documents:

- The [SDC handbook](https://securedatagroup.org/guides-and-resources/sdc-handbook/) of the Secure Data group in the UK
- The Data Without Boundaries document [Guidelines for output checking](https://wayback.archive-it.org/12090/*/https:/cros-legacy.ec.europa.eu/system/files/dwb_standalone-document_output-checking-guidelines.pdf) (pdf)
- Statistics Netherlands' output guidelines

While producing synthetic data with [metasyn](https://github.com/sodascience/metasyn) is already a great first step towards protecting privacy, it doesn't adhere to official standards. For example, fitting a uniform distribution will disclose the lowest and highest values in the dataset, which may be a privacy issue in sensitive data. This plugin solves these kinds of problems.

While the base metasyn package is generally good at protecting privacy, it doesn't adhere to any standard level of privacy. For example, the uniform distributions in the base package will simply find the lowest and highest values in the dataset, and use those as the boundaries for the uniform distribution. In some cases the minimum and maximum values can be disclosive. That is why we have built this plugin that implements the disclosure control standard.
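As a minimal illustration of the uniform example above (a sketch, not metasyn's actual fitting code; `fit_uniform_naive` and the income values are hypothetical), a naive fit that takes the observed minimum and maximum as bounds exposes both extreme records directly in the fitted parameters:

```python
def fit_uniform_naive(values):
    """Naive uniform fit: the bounds are simply the observed extremes."""
    return min(values), max(values)


# Hypothetical sensitive column with one extreme record.
incomes = [21_000, 34_500, 29_000, 1_250_000, 41_200]

lower, upper = fit_uniform_naive(incomes)

# The fitted parameters are exact copies of two individual records,
# so publishing them publishes those records.
print(f"lower={lower}, upper={upper}")
```

Anyone who receives these parameters learns the exact lowest and highest values in the data, which is the kind of disclosure the plugin is meant to prevent.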
> Currently, the disclosure control plugin is a work in progress. Especially in light of this, we disclaim
any responsibility as a result of using this plugin.

## Usage

The most basic usage is as follows:
Basic usage for our built-in titanic dataset is as follows:

```py
import polars as pl
from metasyn import MetaFrame
from metasyn import MetaFrame, demo_data
from metasyncontrib.disclosure import DisclosurePrivacy
from metasyncontrib.disclosure.faker import DisclosureFaker

df = demo_data("titanic")

spec = [
{"name": "PassengerId", "distribution": {"unique": True}},
{"name": "Name", "distribution": DisclosureFaker("name")},
]

df = pl.read_csv("your_data.csv")
mf = MetaFrame.fit_dataframe(
df=df,
dist_providers=["builtin", "metasyn-disclosure"],
privacy=DisclosurePrivacy()
privacy=DisclosurePrivacy(),
var_specs=spec
)
mf.synthesize()
mf.synthesize(5)
```

```
shape: (5, 13)
┌─────────────┬────────────────────┬────────┬──────┬───┬────────────┬────────────┬─────────────────────┬────────┐
│ PassengerId ┆ Name ┆ Sex ┆ Age ┆ … ┆ Birthday ┆ Board time ┆ Married since ┆ all_NA │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ cat ┆ i64 ┆ ┆ date ┆ time ┆ datetime[μs] ┆ f32 │
╞═════════════╪════════════════════╪════════╪══════╪═══╪════════════╪════════════╪═════════════════════╪════════╡
│ 0 ┆ Benjamin Cox ┆ female ┆ 27 ┆ … ┆ 1931-12-01 ┆ 14:33:06 ┆ 2022-07-30 02:16:37 ┆ null │
│ 1 ┆ Mr. David Robinson ┆ female ┆ null ┆ … ┆ 1906-02-18 ┆ null ┆ 2022-08-03 13:09:19 ┆ null │
│ 2 ┆ Randy Mosley ┆ male ┆ 24 ┆ … ┆ 1933-01-06 ┆ 15:52:54 ┆ 2022-07-18 18:52:05 ┆ null │
│ 3 ┆ Vincent Maddox ┆ female ┆ 24 ┆ … ┆ 1937-02-10 ┆ 16:58:30 ┆ 2022-07-23 20:29:49 ┆ null │
│ 4 ┆ Kristin Holland ┆ male ┆ 17 ┆ … ┆ 1939-12-09 ┆ 18:07:45 ┆ 2022-08-05 02:41:51 ┆ null │
└─────────────┴────────────────────┴────────┴──────┴───┴────────────┴────────────┴─────────────────────┴────────┘
```

## Current status of the plugin

Currently, the disclosure plugin is a work in progress. Especially in light of this, we disclaim
any responsibility as a result of using this plugin. For most of the distributions
the micro-aggregation technique is used. This technique pre-averages a sorted version of the data,
which is then supplied to the original fitting mechanism. The idea is that during this pre-averaging
step, we ensure that the rule of thumb is followed, so that the fitting method doesn't need to do
anything in particular. While, from a statistical point of view, we are losing more information than
we probably need, it should ensure the safety of the data.

Below we have summarized the status for each of the variable types:

### Discrete

It technically works, but a new micro-aggregation algorithm specifically for integers might yield
better and more consistent results. Currently implemented:

- DiscreteUniform, UniqueKey, Poisson

### Continuous

No current issues. The following are implemented:

- Uniform, TruncatedNormal, Normal, LogNormal, Exponential

### Datetime

Implemented:

- UniformDate, UniformTime, UniformDateTime

### String
## Implementation details
The rules of thumb, roughly, are:

Currently only the Faker distribution is implemented (which is the same as the metasyn base package,
since the distribution is not fit to any data). The regex distribution is currently not implemented.
- at least 10 units
- at least 10 degrees of freedom
- no group disclosure
- no dominance

### Categorical
For most distributions, we implemented micro-aggregation. This technique pre-averages a sorted version of the data, which is then supplied to the original fitting mechanism. The idea is that during this pre-averaging step, we ensure that the rules of thumb are followed, so that the fitting method doesn't need to do anything in particular. While from a statistical point of view we are losing more information than we probably need, it should ensure the safety of the data.
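The micro-aggregation idea described above can be sketched roughly as follows. This is an illustrative assumption about how such pre-averaging might look, not the plugin's actual implementation: the function name `micro_aggregate`, the `min_group_size` parameter, and the choice to let the last group absorb the remainder are all hypothetical.

```python
from statistics import fmean


def micro_aggregate(values, min_group_size=10):
    """Replace each value by the mean of its group of sorted neighbours.

    Every group holds at least `min_group_size` units, so no single
    record (such as the minimum or maximum) survives unchanged.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n < min_group_size:
        raise ValueError("not enough units to form a single group")

    n_groups = n // min_group_size
    aggregated = []
    for g in range(n_groups):
        start = g * min_group_size
        # The last group absorbs the remainder, so it is never too small.
        end = (g + 1) * min_group_size if g < n_groups - 1 else n
        group = ordered[start:end]
        aggregated.extend([fmean(group)] * len(group))
    return aggregated


# Hypothetical data: 25 units with values 1..25.
raw = list(range(1, 26))
safe = micro_aggregate(raw)

# A min/max-based fit on `safe` now sees group means, not real records.
print(min(safe), max(safe))
```

In this sketch the aggregated values, rather than the raw ones, would be handed to the ordinary fitting routine, so the bounds of a fitted uniform distribution become group means instead of individual observations.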

A safe version of the multinoulli distribution is implemented. There is still some discussion on what to do if the dominance
rule is violated.


<!-- CONTRIBUTING -->
@@ -84,7 +81,6 @@ To create a pull request:

<!-- CONTACT -->
## Contact
**Metasyn-disclosure** is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team.
Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact [Raoul Schram](https://github.com/qubixes) or [Erik-Jan van Kesteren](https://github.com/vankesteren).
This is a project by the [ODISSEI Social Data Science (SoDa)](https://odissei-data.nl/nl/soda/) team. Do you have questions, suggestions, or remarks on the technical implementation? File an issue in the issue tracker or feel free to contact [Raoul Schram](https://github.com/qubixes) or [Erik-Jan van Kesteren](https://github.com/vankesteren).

<img src="soda.png" alt="SoDa logo" width="250px"/>
