Commit

update readme, pyproject.toml
vankesteren committed Feb 25, 2024
1 parent 4a7f979 commit fbfb8de
Showing 3 changed files with 44 additions and 19 deletions.
43 changes: 27 additions & 16 deletions README.md
@@ -1,21 +1,32 @@
# Metasyn disclosure control

This is a plugin for the [metasyn](https://github.com/sodascience/metasyn) Python library. Metasyn
is a package to create synthetic data for tabular datasets automatically.
While the base metasyn package is generally good at protecting privacy, it doesn't adhere to any
standard level of privacy. For example, the uniform distributions in the base package will simply find
the lowest and highest values in the dataset, and use those as the boundaries for the uniform
distribution. In some cases the minimum and maximum values can be disclosive. That is why we have
built this plugin that implements the disclosure control standard.

## Rule of Thumb

In this package we have implemented the "rule of thumb" as described in the
[European guidelines](https://ec.europa.eu/eurostat/cros/system/files/dwb_standalone-document_output-checking-guidelines.pdf)
for output checking. The main idea behind the rule of thumb is that it is on the safe side
of what you are allowed to disclose. If you follow the rule of thumb then the idea is that
the output should be considered privacy conserving, without the need for a specialist that
looks at the specific context.
A privacy plugin for [metasyn](https://github.com/sodascience/metasyn), based on statistical disclosure control (SDC) rules as found in the following documents:

- The [SDC handbook](https://securedatagroup.org/guides-and-resources/sdc-handbook/) of the Secure Data group in the UK
- The Data Without Boundaries document [Guidelines for output checking](https://wayback.archive-it.org/12090/*/https:/cros-legacy.ec.europa.eu/system/files/dwb_standalone-document_output-checking-guidelines.pdf) (pdf)
- Statistics Netherlands' output guidelines


While the base metasyn package is generally good at protecting privacy, it doesn't adhere to any standard level of privacy. For example, the uniform distributions in the base package will simply find the lowest and highest values in the dataset, and use those as the boundaries for the uniform distribution. In some cases the minimum and maximum values can be disclosive. That is why we have built this plugin that implements the disclosure control standard.
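
To make the min/max concern concrete, here is a minimal, dependency-free sketch. It is not the plugin's implementation: the function names, the sample data, and the choice to average the `k` most extreme values are assumptions made purely for illustration.

```python
from statistics import mean


def naive_uniform_bounds(values):
    """Bounds taken directly from the data: the raw minimum and maximum."""
    return min(values), max(values)


def averaged_uniform_bounds(values, k=3):
    """Hypothetical safer bounds: average the k smallest and k largest values.

    Illustrative assumption only, not the plugin's actual rule.
    """
    if len(values) < 2 * k:
        raise ValueError("need at least 2*k values to average both tails")
    ordered = sorted(values)
    return mean(ordered[:k]), mean(ordered[-k:])


# Hypothetical salaries; the last record is an outlier that could identify one person.
salaries = [31_000, 32_000, 34_500, 36_000, 38_500, 41_000, 250_000]

print(naive_uniform_bounds(salaries))     # upper bound is the outlier's exact salary
print(averaged_uniform_bounds(salaries))  # neither bound equals a single record
```

With the naive bounds, anyone reading the fitted distribution learns the exact value of the most extreme record. Averaging over several records is one simple way to avoid exposing any single value.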

## Usage

The most basic usage is as follows:

```py
import polars as pl
from metasyn import MetaFrame
from metasyncontrib.disclosure import DisclosurePrivacy

df = pl.read_csv("your_data.csv")
mf = MetaFrame.fit_dataframe(
    df=df,
    dist_providers=["builtin", "metasyn-disclosure"],
    privacy=DisclosurePrivacy()
)
mf.synthesize()
```


## Current status of the plugin

18 changes: 16 additions & 2 deletions pyproject.toml
@@ -11,10 +11,17 @@ authors = [
description = "Plugin package for metasyn that applies the disclosure control."
readme = "README.md"
requires-python = ">=3.8"
keywords = ["metasyn", "disclosure control"]
license = {text = "MIT"}
keywords = ["metasyn", "disclosure control", "metadata", "open-data", "privacy", "synthetic-data"]
license = { file = "LICENSE" }
classifiers = [
    "Programming Language :: Python :: 3",
    "Programming Language :: Python :: 3.8",
    "Programming Language :: Python :: 3.9",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3.11",
    "Programming Language :: Python :: 3.12",
    "Development Status :: 3 - Alpha",
    "License :: OSI Approved :: MIT License",
]
dependencies = [
    "metasyn",
@@ -24,6 +31,13 @@ dependencies = [
]
dynamic = ["version"]

[project.optional-dependencies]
dev = [
    "ruff",
    "mypy",
    "pytest"
]

[tool.setuptools]
packages = ["metasyncontrib"]

2 changes: 1 addition & 1 deletion tests/test_other_dist.py
@@ -33,7 +33,7 @@ def test_datetime(class_norm, class_disc):


def test_categorical():
    np.random.seed()
    np.random.seed(45)
    dist_norm = MultinoulliDistribution.default_distribution()
    series = pl.Series([dist_norm.draw() for _ in range(40)], dtype=pl.Categorical)
    dist_norm = MultinoulliDistribution.fit(series)
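
The change from `np.random.seed()` to `np.random.seed(45)` pins the random state, so the test draws the same samples on every run. The sketch below shows that effect using the standard-library `random` module as a stand-in so it runs without NumPy. `draw_categories` is a hypothetical helper and is not part of the test suite.

```python
import random


def draw_categories(seed, n=40):
    """Draw n labels after seeding; seed=None reseeds from OS entropy or the clock."""
    random.seed(seed)
    return [random.choice("abc") for _ in range(n)]


# A fixed seed reproduces the same sample on every call.
assert draw_categories(45) == draw_categories(45)

# With seed=None the sample generally differs from run to run,
# which is what can make a statistical test pass or fail intermittently.
unseeded = draw_categories(None)
print(len(unseeded))
```

Fixing the seed trades coverage of different random samples for a deterministic test outcome, which is usually the right choice for a check that runs in CI.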
