Merge pull request #45 from microsoft/mmendonca/rebalance_cohort
Cohort managing classes
mrfmendonca authored Feb 9, 2023
2 parents 775830f + 3c5542c commit 3f38fb7
Showing 187 changed files with 38,375 additions and 6,384 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -28,7 +28,7 @@ jobs:
- name: Install package
run: |
pip install --upgrade pip
pip install -e .
pip install -e .[dev]
- name: Run Tests
run: |
2 changes: 1 addition & 1 deletion .github/workflows/format_check.yml
@@ -22,6 +22,6 @@ jobs:
# Setup environment in remote runner
- uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.9

- uses: pre-commit/[email protected]
7 changes: 3 additions & 4 deletions .github/workflows/github-pages.yml
@@ -13,21 +13,20 @@ jobs:
steps:
- uses: actions/checkout@v2

- name: Setup Python 3.8
- name: Setup Python 3.9
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.9

- name: Install package
run: |
pip install --upgrade pip
pip install -e .
pip install -e .[dev]
- name: Build
run: |
sudo apt install pandoc
sudo apt install graphviz
pip install seaborn
make html
working-directory: docs

15 changes: 4 additions & 11 deletions .github/workflows/release-pypi.yml
@@ -44,15 +44,7 @@ jobs:
# Setup environment in remote runner
- uses: actions/setup-python@v2
with:
python-version: 3.8

# -------------------
# Install basic packages (not associated with the repo)
- name: update and upgrade pip, setuptools, wheel and twine
run: |
pip install --upgrade pip
pip install --upgrade setuptools wheel twine pip-tools
pip install configparser semver
python-version: 3.9

# -------------------
# Update the package version by bumping the version to the next
@@ -67,7 +59,8 @@
# package.__version__ will point to the previous version
- name: Install current package
run: |
pip install -e .
pip install --upgrade pip
pip install -e .[dev]
pip list
# -------------------
@@ -78,7 +71,7 @@
# -------------------
# Build wheel
- name: build wheel
run: python setup.py sdist bdist_wheel
run: python -m build

# -------------------
# Publish package to PyPi as ...
3 changes: 1 addition & 2 deletions .readthedocs.yaml
@@ -10,10 +10,9 @@ build:
jobs:
pre_build:
- cp -r notebooks/ docs/notebooks/
- pip install seaborn

python:
# Install our python package before building the docs
install:
- method: pip
path: .
path: .[dev]
60 changes: 53 additions & 7 deletions README.md
@@ -9,25 +9,40 @@ This repo is a part of the [Responsible AI Toolbox](https://github.com/microsoft
<p align="center">
<img src="./docs/imgs/responsible-ai-toolbox-mitigations.png" alt="ResponsibleAIToolboxMitigationsOverview" width="750"/>

There are two main functions covered in this library:
- **Data Balance Analysis** (Exploratory Data Analysis): covering metrics that help to determine how balanced is your dataset.
- **Data Processing** (Data Enhancements): covering several transformer classes that aim to change or mitigate certain aspects of a dataset.
There are three main modules covered in this library:
- **Data Processing** (Data Enhancements): includes several transformer classes that aim to change or mitigate certain aspects of a dataset.
The goal of this module is to provide a unified interface for different mitigation methods scattered across
multiple machine learning libraries, such as scikit-learn, mlxtend, and sdv, among others.
- **Data Balance Analysis** (Exploratory Data Analysis): includes metrics that help determine how balanced your dataset is.
- **Cohort** (managing cohorts): includes a set of classes that allow users to handle cohorts of data by creating customized pipelines
for each cohort.
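
As a rough illustration of the fit/transform pattern that the **Data Processing** transformer classes follow, here is a generic sketch in plain Python (illustrative only, not the library's own code):

```python
# Generic sketch of the fit/transform pattern used by data-processing
# transformers (illustrative only; not a raimitigations class).

class MeanImputer:
    """Replace missing values (None) with the column mean seen at fit time."""

    def fit(self, X):
        # X: list of rows, each a list of numbers or None
        cols = list(zip(*X))
        self.means_ = [
            sum(v for v in col if v is not None) /
            max(1, sum(v is not None for v in col))
            for col in cols
        ]
        return self

    def transform(self, X):
        # fill each missing cell with the mean of its column
        return [
            [self.means_[j] if v is None else v for j, v in enumerate(row)]
            for row in X
        ]

X = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(MeanImputer().fit(X).transform(X))
# → [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

The library's actual transformers add many conveniences on top of this pattern (column selection, categorical handling, etc.), but the fit-then-transform lifecycle is the shared interface.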


In this library, we take a **targeted approach to mitigating errors** in Machine Learning models. This is complementary to and different from traditional blanket approaches, which aim at maximizing a single-score performance number, such as overall accuracy, by merely increasing the size of the training data or the model architecture. Since blanket approaches are often costly and also ineffective at improving the model in its areas of poorest performance, targeted approaches focus the improvement efforts on areas previously identified to have more errors and on the underlying diagnoses of those errors. For example, if a practitioner has identified that the model is underperforming for a cohort of interest by using Error Analysis in the Responsible AI Dashboard, they may continue the debugging process by using Data Balance Analysis and find out that there is class imbalance for this particular cohort. To mitigate the issue, they can then focus on improving the class imbalance for the cohort of interest by using the Responsible AI Mitigations library. This and several other examples in the documentation of each mitigation function illustrate how targeted approaches can help practitioners mitigate errors more effectively by giving them more control over the model improvement process.


## Installation

Use the following pip command to install the Responsible AI Toolbox. Make sure you are using Python 3.7, 3.8, or 3.9.
Use one of the following pip commands to install the Responsible AI Toolbox. Make sure you are using Python 3.7, 3.8, or 3.9. If running in Jupyter, please make sure to restart the Jupyter kernel after installing. There are three installation options for the ``raimitigations`` package:

If running in jupyter, please make sure to restart the jupyter kernel after installing.
* To install the minimum dependencies, use:

```
pip install raimitigations
```

* To install the minimum dependencies plus the packages required to run all of the notebooks in the ``notebooks/`` folder, use:

```
pip install raimitigations[all]
```

* To install all the dependencies used for development (such as ``pytest``, for example), use:

```
pip install raimitigations[dev]
```

## Documentation

To learn more about the supported dataset measurements and mitigation techniques covered in the **raimitigations** package, [please check out this documentation.](https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/)
@@ -53,6 +68,7 @@ methods offered in the **dataprocessing** module.
- [Identifying correlated features: tutorial](notebooks/dataprocessing/module_tests/feat_sel_corr_tutorial.ipynb)
- [Data Rebalance using imblearn](notebooks/dataprocessing/module_tests/rebalance_imbl.ipynb)
- [Data Rebalance using SDV](notebooks/dataprocessing/module_tests/rebalance_sdv.ipynb)
- [Using scikit-learn's Pipeline](notebooks/dataprocessing/module_tests/pipeline_test.ipynb)

Here is a set of case study scenarios where we use the transformations available in the **dataprocessing**
module in order to train a model for a real-world dataset.
@@ -62,6 +78,27 @@ module in order to train a model for a real-world dataset.
- [Case Study 2](notebooks/dataprocessing/case_study/case2.ipynb)
- [Case Study 3](notebooks/dataprocessing/case_study/case3.ipynb)

## Handling Cohorts

Here is a set of tutorial notebooks that aim to explain how to manage cohorts.

- [Creating Single Cohorts](notebooks/cohort/cohort_definition.ipynb)
- [Creating Different Pipelines for each Cohort](notebooks/cohort/cohort_manager.ipynb)
- [Different Pre-processing Scenarios using cohorts](notebooks/cohort/cohort_manager_scenarios.ipynb)
- [Using Decoupled Classifiers](notebooks/cohort/decoupled.ipynb)
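
Conceptually, these cohort tools revolve around a single pattern: slice the dataset into cohorts, process each slice with its own pipeline, and merge the results back. A minimal plain-Python sketch of that pattern (illustrative only; this is not the ``raimitigations`` API, and all names below are made up):

```python
# Conceptual sketch of per-cohort processing: slice, transform, merge.
# Not the raimitigations API -- purely illustrative.

def apply_per_cohort(rows, cohorts):
    """Apply a different transform to each cohort and merge the results.

    rows:    list of dicts (one per sample)
    cohorts: list of (condition, transform) pairs; the first matching
             condition decides which transform a row receives.
    """
    processed = []
    for row in rows:
        for condition, transform in cohorts:
            if condition(row):
                processed.append(transform(row))
                break
        else:
            processed.append(row)  # rows outside every cohort pass through
    return processed


data = [
    {"age": 17, "income": 0},
    {"age": 45, "income": 72000},
]
cohorts = [
    (lambda r: r["age"] < 18, lambda r: {**r, "group": "minor"}),
    (lambda r: r["age"] >= 18, lambda r: {**r, "group": "adult"}),
]
print(apply_per_cohort(data, cohorts))  # each row got its cohort's transform
```

The classes in the ``cohort`` module wrap this idea with real pipelines (imputation, rebalancing, encoding, etc.) applied per cohort instead of the toy transforms above.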

Here is a set of case study notebooks showing how creating customized dataprocessing pipelines for each
cohort can help in some scenarios.

- [Cohort Case Study 1](notebooks/cohort/case_study/case_1.ipynb)
- [Cohort Case Study 1 - Rebalancing only specific cohorts](notebooks/cohort/case_study/case_1_rebalance.ipynb)
- [Cohort Case Study 1 - Using RAI Toolbox](notebooks/cohort/case_study/case_1_dashboard.ipynb)
- [Cohort Case Study 2](notebooks/cohort/case_study/case_2.ipynb)
- [Cohort Case Study 3](notebooks/cohort/case_study/case_3.ipynb)
- [Decoupled Classifier Case 1](notebooks/cohort/case_study/decoupled_class/case_1)
- [Decoupled Classifier Case 2](notebooks/cohort/case_study/decoupled_class/case_2)
- [Decoupled Classifier Case 3](notebooks/cohort/case_study/decoupled_class/case_3)



## Dependencies
@@ -95,14 +132,23 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [[email protected]](mailto:[email protected]) with any additional questions or comments.

### Installing Using ``dev`` Mode

After cloning this repo and moving to its root folder, install the package in editable mode with the development dependencies using:

```console
> pip install -e .[dev]
```

### Pre-Commit

This repository uses pre-commit hooks to guarantee that the code format is kept consistent. For development, make sure to
activate pre-commit before creating a pull request. Any code pushed to this repository is checked for code consistency using
GitHub Actions, so if pre-commit is not used when committing, the push may fail the format check workflow.
Using pre-commit will avoid this.

To use pre-commit with this repository, first install pre-commit:
To use pre-commit with this repository, first install pre-commit (**NOTE:** when installing the package with the ``[dev]`` extra, the
``pre-commit`` package will already be installed):

```console
> pip install pre-commit
Expand Down Expand Up @@ -169,7 +215,7 @@ Any use of third-party trademarks or logos are subject to those third-party's po

## Research and Acknowledgements

**Current Maintainers:** [Matheus Mendonça](https://github.com/mrfmendonca), [Dany Rouhana](https://github.com/danyrouh), [Mark Encarnación](https://github.com/markenc)
**Current Maintainers:** [Marah Abdin](https://github.com/marah-abdin), [Matheus Mendonça](https://github.com/mrfmendonca), [Dany Rouhana](https://github.com/danyrouh), [Mark Encarnación](https://github.com/markenc)

**Past Maintainers:** [Akshara Ramakrishnan](https://github.com/akshara-msft), [Irina Spiridonova](https://github.com/irinasp)

4 changes: 3 additions & 1 deletion docs/api.rst
@@ -6,6 +6,8 @@ API reference
.. toctree::
:maxdepth: 2

databalanceanalysis/databalanceanalysis
dataprocessing/dataprocessing
databalanceanalysis/databalanceanalysis
cohort/cohort
utils/utils

18 changes: 18 additions & 0 deletions docs/cohort/case_studies.rst
@@ -0,0 +1,18 @@
.. _cohort_examples:

Cohort Case Studies
===================

Here is a set of case study scenarios where we use the **CohortManager** class and the **DecoupledClass** class.


.. nbgallery::
../notebooks/cohort/case_study/case_1
../notebooks/cohort/case_study/case_1_rebalance
../notebooks/cohort/case_study/case_1_dashboard
../notebooks/cohort/case_study/case_2
../notebooks/cohort/case_study/case_3
../notebooks/cohort/case_study/integration_raiwidgets
../notebooks/cohort/case_study/decoupled_class/case_1
../notebooks/cohort/case_study/decoupled_class/case_2
../notebooks/cohort/case_study/decoupled_class/case_3
16 changes: 16 additions & 0 deletions docs/cohort/cohort.rst
@@ -0,0 +1,16 @@
Cohort
======

The ``cohort`` module enables creating and handling multiple cohorts using an intuitive interface. This module allows
the application of different data processing pipelines over each cohort, as well as computing multiple metrics separately
for each of the existing cohorts.

.. toctree::
:maxdepth: 1

cohort_definition
cohort_handler
cohort_manager
decoupled_class
utils

13 changes: 13 additions & 0 deletions docs/cohort/cohort_definition.rst
@@ -0,0 +1,13 @@
.. _cohort_def:

CohortDefinition
================

.. autoclass:: raimitigations.cohort.CohortDefinition
:members:

Examples
--------

.. nbgallery::
../notebooks/cohort/cohort_definition
13 changes: 13 additions & 0 deletions docs/cohort/cohort_handler.rst
@@ -0,0 +1,13 @@
.. _cohort_handler:

CohortHandler
=============

.. autoclass:: raimitigations.cohort.CohortHandler
:members:
:show-inheritance:

.. rubric:: Class Diagram

.. inheritance-diagram:: raimitigations.cohort.CohortManager raimitigations.cohort.DecoupledClass
:parts: 1
46 changes: 46 additions & 0 deletions docs/cohort/cohort_manager.rst
@@ -0,0 +1,46 @@
.. _cohort_manager:

CohortManager
=============

The **CohortManager** allows the application of different data processing pipelines over each cohort. It also allows the creation and
filtering of multiple cohorts using a simple interface. Finally, it allows the creation of different estimators for each cohort through
the ``.predict()`` and ``.predict_proba()`` interfaces. This class uses the :ref:`cohort.CohortDefinition<cohort_def>` class
internally in order to create, filter, and manipulate multiple cohorts. There are multiple ways of using the
:ref:`cohort.CohortManager<cohort_manager>` class when building a pipeline, and these different scenarios are summarized in the
following figure.

.. figure:: ../imgs/scenarios.jpg
:scale: 20
:alt: Balancing over cohorts

*Figure 1 - The CohortManager class can be used in different ways to target mitigations to different cohorts. The main differences
between these scenarios consist in whether the same or a different type of data mitigation is applied to each cohort's data, and whether
a single model or separate models will be trained for the different cohorts. Depending on these choices, CohortManager will take care of
slicing the data accordingly, applying the specified data mitigation strategy, merging the data back, and retraining the model(s).*

The **Cohort Manager - Scenarios and Examples** notebook, located in ``notebooks/cohort/cohort_manager_scenarios.ipynb`` and listed in
the **Examples** section below, shows how each of these scenarios can be implemented through simple code snippets.
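
As a rough mental model of the "separate estimator per cohort" scenario, the sketch below dispatches predictions by cohort membership (plain Python, illustrative only; this is not the actual **CohortManager** implementation, and all names are made up):

```python
# Illustrative sketch of the "separate model per cohort" scenario
# (not the actual CohortManager implementation; all names are made up).

class PerCohortModel:
    """Query one simple estimator per cohort through a single interface."""

    def __init__(self, cohorts):
        # cohorts: list of (condition, model) pairs, where condition is a
        # predicate over a sample and model maps a sample to a prediction
        self.cohorts = cohorts

    def predict(self, samples):
        preds = []
        for s in samples:
            for condition, model in self.cohorts:
                if condition(s):  # first matching cohort wins
                    preds.append(model(s))
                    break
            else:
                raise ValueError(f"sample {s} belongs to no cohort")
        return preds


model = PerCohortModel([
    (lambda s: s["region"] == "A", lambda s: int(s["score"] > 0.3)),
    (lambda s: s["region"] == "B", lambda s: int(s["score"] > 0.7)),
])
# The same score is classified differently per cohort:
print(model.predict([{"region": "A", "score": 0.5},
                     {"region": "B", "score": 0.5}]))  # → [1, 0]
```

The real class additionally handles fitting the per-cohort pipelines and estimators, and exposes the familiar ``.predict()``/``.predict_proba()`` interfaces described above.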

.. autoclass:: raimitigations.cohort.CohortManager
:members:

.. rubric:: Class Diagram

.. inheritance-diagram:: raimitigations.cohort.CohortManager
:parts: 1

.. _cohort_manager_ex:

Examples
--------

.. nbgallery::
../notebooks/cohort/cohort_manager
../notebooks/cohort/cohort_manager_scenarios
../notebooks/cohort/case_study/case_1
../notebooks/cohort/case_study/case_1_rebalance
../notebooks/cohort/case_study/case_1_dashboard
../notebooks/cohort/case_study/case_2
../notebooks/cohort/case_study/case_3
../notebooks/cohort/case_study/integration_raiwidgets
56 changes: 56 additions & 0 deletions docs/cohort/decoupled_class.rst
@@ -0,0 +1,56 @@
.. _decoupled_class:

DecoupledClass
==============

This class implements techniques for learning different estimators (models) for different cohorts based on the approach
presented in `"Decoupled classifiers for group-fair and efficient machine learning." <https://www.microsoft.com/en-us/research/publication/decoupled-classifiers-for-group-fair-and-efficient-machine-learning/>`_
by Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson, Conference on Fairness, Accountability and Transparency, PMLR, 2018. The approach
searches for and combines cohort-specific classifiers to optimize for different definitions of group fairness, and can be used
as a post-processing step on top of any model class. The current implementation in this library supports only binary
classification, and we welcome contributions that extend these ideas to multi-class and regression problems.

The basic decoupling algorithm can be summarized in two steps:

* A different family of classifiers is trained on each cohort of interest. The algorithm partitions the training data
for each cohort and learns a classifier for each cohort. Each cohort-specific trained classifier results in a family
of potential classifiers to be used after the classifier output is adjusted based on different thresholds on the model
output. For example, depending on which errors are most important to the application (e.g. false positives vs. false
negatives for binary classification), thresholding the model prediction at different values of the model output (e.g.
likelihood, softmax) will result in different classifiers. This step generates a whole family of classifiers based on
different thresholds.
* Among the cohort-specific classifiers, the algorithm searches for one representative classifier for each cohort such that a joint loss
is optimized. This step searches through all combinations of classifiers from the previous step to find the combination
that best optimizes a definition of a joint loss across all cohorts. While there are different definitions of such a joint
loss, this implementation currently supports the Balanced Loss, the L1 loss, and Demographic Parity as examples
of losses that focus on group fairness. More definitions of losses are described in the longer version of the paper.
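
The two steps above can be sketched as a toy search (illustrative only; not the actual **DecoupledClass** implementation — here the classifier family is a set of thresholds and the joint loss is simply the worst per-cohort error rate):

```python
# Toy sketch of the two-step decoupling search described above
# (illustrative only -- not the DecoupledClass implementation).
from itertools import product

def decoupled_search(cohort_data, thresholds):
    """Pick one threshold per cohort so that a joint loss is minimized.

    cohort_data: list of cohorts, each a list of (score, label) pairs
    thresholds:  candidate thresholds defining the classifier family (step 1)
    """
    def cohort_error(samples, t):
        # error rate of the classifier "predict 1 iff score >= t"
        wrong = sum((score >= t) != bool(label) for score, label in samples)
        return wrong / len(samples)

    best, best_loss = None, float("inf")
    # Step 2: search all combinations of per-cohort classifiers...
    for combo in product(thresholds, repeat=len(cohort_data)):
        # ...scoring each with a joint loss (here: worst cohort error,
        # a simple stand-in for a group-fairness-aware loss)
        loss = max(cohort_error(c, t) for c, t in zip(cohort_data, combo))
        if loss < best_loss:
            best, best_loss = combo, loss
    return best, best_loss

cohorts = [
    [(0.2, 0), (0.4, 1), (0.9, 1)],   # cohort A: low scores can be positive
    [(0.3, 0), (0.6, 0), (0.8, 1)],   # cohort B: needs a higher threshold
]
best, loss = decoupled_search(cohorts, thresholds=[0.3, 0.5, 0.7])
print(best, loss)  # → (0.3, 0.7) 0.0
```

A single shared threshold could not achieve zero error for both cohorts here, which is exactly the situation decoupled classifiers are designed for; the real implementation supports the fairness-aware joint losses listed above rather than this worst-cohort-error stand-in.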

One issue that arises commonly in cohort-specific learning is that some cohorts may have very little data in the training set,
which may hinder the capability of a decoupled classifier to learn a better estimator for that cohort. To mitigate this problem,
the DecoupledClass class also allows using transfer learning from the overall data for these cohorts.

The figure below shows the types of scenarios that the DecoupledClass class can implement and how it compares to the CohortManager class. First, while the CohortManager class offers a general way to customize pipelines or train custom classifiers for cohorts, it does not offer any post-training capabilities for selecting classifiers such that they optimize a joint loss function for group fairness. In addition, transfer learning for minority cohorts is only available in the DecoupledClass class. To implement a scenario where the same type of data processing mitigation is applied to different cohorts separately, one can use DecoupledClass with a transform pipeline (including the estimator).

.. figure:: ../imgs/decoupled_class_figure_1.png
:scale: 50

*Figure 1 - The DecoupledClass class can currently implement the highlighted scenarios in this figure, with additional functionalities in comparison to the CohortManager being i) joint optimization of a loss function for group fairness, and ii) transfer learning for minority cohorts.*

The tutorial notebook, together with the decoupled classifier case study notebooks, demonstrates different scenarios where one can use this class.

.. autoclass:: raimitigations.cohort.DecoupledClass
:members:

.. rubric:: Class Diagram

.. inheritance-diagram:: raimitigations.cohort.DecoupledClass
:parts: 1

Examples
--------

.. nbgallery::
../notebooks/cohort/decoupled
../notebooks/cohort/case_study/decoupled_class/case_1
../notebooks/cohort/case_study/decoupled_class/case_2
../notebooks/cohort/case_study/decoupled_class/case_3