This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Merge pull request #45 from microsoft/mmendonca/rebalance_cohort
Cohort managing classes
Showing 187 changed files with 38,375 additions and 6,384 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -22,6 +22,6 @@ jobs: | |
# Setup environment in remote runner | ||
- uses: actions/setup-python@v2 | ||
with: | ||
python-version: 3.8 | ||
python-version: 3.9 | ||
|
||
- uses: pre-commit/[email protected] |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,25 +9,40 @@ This repo is a part of the [Responsible AI Toolbox](https://github.com/microsoft | |
<p align="center"> | ||
<img src="./docs/imgs/responsible-ai-toolbox-mitigations.png" alt="ResponsibleAIToolboxMitigationsOverview" width="750"/> | ||
|
||
There are two main functions covered in this library: | ||
- **Data Balance Analysis** (Exploratory Data Analysis): covering metrics that help to determine how balanced your dataset is. | ||
- **Data Processing** (Data Enhancements): covering several transformer classes that aim to change or mitigate certain aspects of a dataset. | ||
There are three main modules covered in this library: | ||
- **Data Processing** (Data Enhancements): includes several transformer classes that aim to change or mitigate certain aspects of a dataset. | ||
The goal of this module is to provide a unified interface for different mitigation methods scattered across | ||
multiple machine learning libraries, such as scikit-learn, mlxtend, and sdv, among others. | ||
- **Data Balance Analysis** (Exploratory Data Analysis): includes metrics that help to determine how balanced your dataset is. | ||
- **Cohort** (managing cohorts): includes a set of classes that allow users to handle cohorts of data by creating customized pipelines | ||
for each cohort. | ||
|
||
|
||
In this library, we take a **targeted approach to mitigating errors** in Machine Learning models. This approach differs from, and is complementary to, the traditional blanket approaches, which aim at maximizing a single-score performance number, such as overall accuracy, by merely increasing the size of the training data or the model architecture. Since blanket approaches are often costly and also ineffective at improving the model in its areas of poorest performance, targeted approaches focus the improvement efforts on areas previously identified as having more errors and on the underlying diagnoses of those errors. For example, if a practitioner has identified that the model is underperforming for a cohort of interest by using Error Analysis in the Responsible AI Dashboard, they may continue the debugging process with Data Balance Analysis and find out that there is class imbalance for this particular cohort. To mitigate the issue, they then focus on improving class imbalance for the cohort of interest by using the Responsible AI Mitigations library. This and several other examples in the documentation of each mitigation function illustrate how targeted approaches may help practitioners mitigate errors while giving them more control over the model improvement process. | ||
|
||
|
||
## Installation | ||
|
||
Use the following pip command to install the Responsible AI Toolbox. Make sure you are using Python 3.7, 3.8, or 3.9. | ||
Use the following pip command to install the Responsible AI Toolbox. Make sure you are using Python 3.7, 3.8, or 3.9. If running in jupyter, please make sure to restart the jupyter kernel after installing. There are three installation options for the ``raimitigations`` package: | ||
|
||
If running in jupyter, please make sure to restart the jupyter kernel after installing. | ||
* To install the minimum dependencies, use: | ||
|
||
``` | ||
pip install raimitigations | ||
``` | ||
|
||
* To install the minimum dependencies + the packages required to run all of the notebooks in the ``notebooks/`` folder: | ||
|
||
``` | ||
pip install raimitigations[all] | ||
``` | ||
|
||
* To install all the dependencies used for development (such as ``pytest``, for example), use: | ||
|
||
``` | ||
pip install raimitigations[dev] | ||
``` | ||
|
||
## Documentation | ||
|
||
To learn more about the supported dataset measurements and mitigation techniques covered in the **raimitigations** package, [please check out this documentation.](https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/) | ||
|
@@ -53,6 +68,7 @@ methods offered in the **dataprocessing** module. | |
- [Identifying correlated features: tutorial](notebooks/dataprocessing/module_tests/feat_sel_corr_tutorial.ipynb) | ||
- [Data Rebalance using imblearn](notebooks/dataprocessing/module_tests/rebalance_imbl.ipynb) | ||
- [Data Rebalance using SDV](notebooks/dataprocessing/module_tests/rebalance_sdv.ipynb) | ||
- [Using scikit-learn's Pipeline](notebooks/dataprocessing/module_tests/pipeline_test.ipynb) | ||
|
||
Here is a set of case study scenarios where we use the transformations available in the **dataprocessing** | ||
module in order to train a model for a real-world dataset. | ||
|
@@ -62,6 +78,27 @@ module in order to train a model for a real-world dataset. | |
- [Case Study 2](notebooks/dataprocessing/case_study/case2.ipynb) | ||
- [Case Study 3](notebooks/dataprocessing/case_study/case3.ipynb) | ||
|
||
## Handling Cohorts | ||
|
||
Here is a set of tutorial notebooks that aim to explain how to manage cohorts. | ||
|
||
- [Creating Single Cohorts](notebooks/cohort/cohort_definition.ipynb) | ||
- [Creating Different Pipelines for each Cohort](notebooks/cohort/cohort_manager.ipynb) | ||
- [Different Pre-processing Scenarios using cohorts](notebooks/cohort/cohort_manager_scenarios.ipynb) | ||
- [Using Decoupled Classifiers](notebooks/cohort/decoupled.ipynb) | ||
|
||
Here is a set of case study notebooks showing how creating customized dataprocessing pipelines for each | ||
cohort can help in some scenarios. | ||
|
||
- [Cohort Case Study 1](notebooks/cohort/case_study/case_1.ipynb) | ||
- [Cohort Case Study 1 - Rebalancing only specific cohorts](notebooks/cohort/case_study/case_1_rebalance.ipynb) | ||
- [Cohort Case Study 1 - Using RAI Toolbox](notebooks/cohort/case_study/case_1_dashboard.ipynb) | ||
- [Cohort Case Study 2](notebooks/cohort/case_study/case_2.ipynb) | ||
- [Cohort Case Study 3](notebooks/cohort/case_study/case_3.ipynb) | ||
- [Decoupled Classifier Case 1](notebooks/cohort/case_study/decoupled_class/case_1) | ||
- [Decoupled Classifier Case 2](notebooks/cohort/case_study/decoupled_class/case_2) | ||
- [Decoupled Classifier Case 3](notebooks/cohort/case_study/decoupled_class/case_3) | ||
|
||
|
||
|
||
## Dependencies | ||
|
@@ -95,14 +132,23 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope | |
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or | ||
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments. | ||
|
||
### Installing Using ``dev`` Mode | ||
|
||
After cloning this repo and moving to its root folder, install the package in editable mode with the development dependencies using: | ||
|
||
```console | ||
> pip install -e .[dev] | ||
``` | ||
|
||
### Pre-Commit | ||
|
||
This repository uses pre-commit hooks to guarantee that the code format is kept consistent. For development, make sure to | ||
activate pre-commit before creating a pull request. Any code pushed to this repository is checked for code consistency using | ||
GitHub Actions, so if pre-commit is not used when committing, the commit may fail the format check workflow. | ||
Using pre-commit will avoid this. | ||
|
||
To use pre-commit with this repository, first install pre-commit: | ||
To use pre-commit with this repository, first install pre-commit (**NOTE:** when installing the package with the ``[dev]`` tag, the | ||
``pre-commit`` package will already be installed): | ||
|
||
```console | ||
> pip install pre-commit | ||
|
@@ -169,7 +215,7 @@ Any use of third-party trademarks or logos are subject to those third-party's po | |
|
||
## Research and Acknowledgements | ||
|
||
**Current Maintainers:** [Matheus Mendonça](https://github.com/mrfmendonca), [Dany Rouhana](https://github.com/danyrouh), [Mark Encarnación](https://github.com/markenc) | ||
**Current Maintainers:** [Marah Abdin](https://github.com/marah-abdin), [Matheus Mendonça](https://github.com/mrfmendonca), [Dany Rouhana](https://github.com/danyrouh), [Mark Encarnación](https://github.com/markenc) | ||
|
||
**Past Maintainers:** [Akshara Ramakrishnan](https://github.com/akshara-msft), [Irina Spiridonova](https://github.com/irinasp) | ||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,18 @@ | ||
.. _cohort_examples: | ||
|
||
Cohort Case Studies | ||
=================== | ||
|
||
Here is a set of case study scenarios where we use the **CohortManager** class and the **DecoupledClass** class. | ||
|
||
|
||
.. nbgallery:: | ||
../notebooks/cohort/case_study/case_1 | ||
../notebooks/cohort/case_study/case_1_rebalance | ||
../notebooks/cohort/case_study/case_1_dashboard | ||
../notebooks/cohort/case_study/case_2 | ||
../notebooks/cohort/case_study/case_3 | ||
../notebooks/cohort/case_study/integration_raiwidgets | ||
../notebooks/cohort/case_study/decoupled_class/case_1 | ||
../notebooks/cohort/case_study/decoupled_class/case_2 | ||
../notebooks/cohort/case_study/decoupled_class/case_3 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
Cohort | ||
====== | ||
|
||
The ``cohort`` module enables creating and handling multiple cohorts using an intuitive interface. This module allows | ||
the application of different data processing pipelines over each cohort, as well as computing multiple metrics separately | ||
for each of the existing cohorts. | ||
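
As a rough illustration of what "computing metrics separately for each cohort" means, the sketch below uses plain Python only. It does not use the ``raimitigations`` API; the function name ``accuracy_by_cohort`` and the column names ``sector``, ``label``, and ``pred`` are hypothetical and chosen just for this example.

```python
from collections import defaultdict


def accuracy_by_cohort(rows, cohort_key, label_key="label", pred_key="pred"):
    """Return {cohort value: accuracy}, computed separately for each cohort.

    Hypothetical helper for illustration; not part of raimitigations.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for row in rows:
        cohort = row[cohort_key]
        totals[cohort] += 1
        hits[cohort] += int(row[label_key] == row[pred_key])
    return {cohort: hits[cohort] / totals[cohort] for cohort in totals}


# Toy data: two cohorts defined by the value of the "sector" column.
rows = [
    {"sector": "s1", "label": 1, "pred": 1},
    {"sector": "s1", "label": 0, "pred": 1},
    {"sector": "s2", "label": 1, "pred": 1},
    {"sector": "s2", "label": 0, "pred": 0},
]

per_cohort = accuracy_by_cohort(rows, "sector")
print(per_cohort)  # {'s1': 0.5, 's2': 1.0}
```

The overall accuracy on this toy data is 0.75, which hides the fact that one cohort is served noticeably worse than the other. This is the kind of gap that per-cohort metrics are meant to surface.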
|
||
.. toctree:: | ||
:maxdepth: 1 | ||
|
||
cohort_definition | ||
cohort_handler | ||
cohort_manager | ||
decoupled_class | ||
utils | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
.. _cohort_def: | ||
|
||
CohortDefinition | ||
================ | ||
|
||
.. autoclass:: raimitigations.cohort.CohortDefinition | ||
:members: | ||
|
||
Examples | ||
-------- | ||
|
||
.. nbgallery:: | ||
../notebooks/cohort/cohort_definition |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,13 @@ | ||
.. _cohort_handler: | ||
|
||
CohortHandler | ||
============= | ||
|
||
.. autoclass:: raimitigations.cohort.CohortHandler | ||
:members: | ||
:show-inheritance: | ||
|
||
.. rubric:: Class Diagram | ||
|
||
.. inheritance-diagram:: raimitigations.cohort.CohortManager raimitigations.cohort.DecoupledClass | ||
:parts: 1 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,46 @@ | ||
.. _cohort_manager: | ||
|
||
CohortManager | ||
============= | ||
|
||
The **CohortManager** allows the application of different data processing pipelines over each cohort. It also allows the creation and | ||
filtering of multiple cohorts using a simple interface. Finally, it allows the creation of different estimators for each cohort using | ||
the ``.predict()`` and ``.predict_proba()`` interfaces. This class uses the :ref:`cohort.CohortDefinition<cohort_def>` | ||
internally in order to create, filter, and manipulate multiple cohorts. There are multiple ways of using the | ||
:ref:`cohort.CohortManager<cohort_manager>` class when building a pipeline, and these different scenarios are summarized in the following | ||
figure. | ||
|
||
.. figure:: ../imgs/scenarios.jpg | ||
:scale: 20 | ||
:alt: Balancing over cohorts | ||
|
||
*Figure 1 - The CohortManager class can be used in different ways to target mitigations to different cohorts. The main differences | ||
between these scenarios lie in whether the same or a different type of data mitigation is applied to the cohort data, and whether | ||
a single model or separate models will be trained for the different cohorts. Depending on these choices, CohortManager will take care of | ||
slicing the data accordingly, applying the specified data mitigation strategy, merging the data back, and retraining the model(s).* | ||
|
||
The **Cohort Manager - Scenarios and Examples** notebook, located in ``notebooks/cohort/cohort_manager_scenarios.ipynb`` and listed in | ||
the **Examples** section below, shows how each of these scenarios can be implemented through simple code snippets. | ||
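
The slice, mitigate, and merge-back flow described in the figure caption can be sketched in plain Python. This is only a conceptual illustration under stated assumptions: it does not call the ``CohortManager`` API, and the helper names ``split_by_cohort`` and ``impute_with_cohort_mean``, as well as the ``country`` and ``income`` columns, are hypothetical. The mitigation used here (filling missing values with the mean of the row's own cohort) is just one example of a per-cohort transformation.

```python
from statistics import mean


def split_by_cohort(rows, cohort_key):
    """Slice: group row indices by the value of the cohort column."""
    groups = {}
    for idx, row in enumerate(rows):
        groups.setdefault(row[cohort_key], []).append(idx)
    return groups


def impute_with_cohort_mean(rows, cohort_key, feature):
    """Fill missing values of `feature` with the mean of the row's own cohort.

    Hypothetical helper for illustration; not part of raimitigations.
    """
    result = [dict(row) for row in rows]  # copy so the input is left untouched
    for indices in split_by_cohort(rows, cohort_key).values():
        # Mitigate: compute the statistic on this cohort's data only.
        observed = [rows[i][feature] for i in indices if rows[i][feature] is not None]
        fill = mean(observed) if observed else None
        # Merge back: write into the original positions, preserving row order.
        for i in indices:
            if result[i][feature] is None:
                result[i][feature] = fill
    return result


rows = [
    {"country": "A", "income": 10.0},
    {"country": "B", "income": 100.0},
    {"country": "A", "income": None},
    {"country": "B", "income": None},
    {"country": "A", "income": 20.0},
]

imputed = impute_with_cohort_mean(rows, "country", "income")
print([row["income"] for row in imputed])  # [10.0, 100.0, 15.0, 100.0, 20.0]
```

A single imputer fitted on the whole dataset would fill both gaps with the same global mean, which fits neither cohort well in this toy data. Handling each cohort separately avoids that.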
|
||
.. autoclass:: raimitigations.cohort.CohortManager | ||
:members: | ||
|
||
.. rubric:: Class Diagram | ||
|
||
.. inheritance-diagram:: raimitigations.cohort.CohortManager | ||
:parts: 1 | ||
|
||
.. _cohort_manager_ex: | ||
|
||
Examples | ||
-------- | ||
|
||
.. nbgallery:: | ||
../notebooks/cohort/cohort_manager | ||
../notebooks/cohort/cohort_manager_scenarios | ||
../notebooks/cohort/case_study/case_1 | ||
../notebooks/cohort/case_study/case_1_rebalance | ||
../notebooks/cohort/case_study/case_1_dashboard | ||
../notebooks/cohort/case_study/case_2 | ||
../notebooks/cohort/case_study/case_3 | ||
../notebooks/cohort/case_study/integration_raiwidgets |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,56 @@ | ||
.. _decoupled_class: | ||
|
||
DecoupledClass | ||
============== | ||
|
||
This class implements techniques for learning different estimators (models) for different cohorts based on the approach | ||
presented in `"Decoupled classifiers for group-fair and efficient machine learning." <https://www.microsoft.com/en-us/research/publication/decoupled-classifiers-for-group-fair-and-efficient-machine-learning/>`_ | ||
Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson. Conference on fairness, accountability and transparency. PMLR, 2018. The approach | ||
searches for and combines cohort-specific classifiers to optimize for different definitions of group fairness and can be used | ||
as a post-processing step on top of any model class. The current implementation in this library supports only binary | ||
classification, and we welcome contributions that extend these ideas to multi-class and regression problems. | ||
|
||
The basic decoupling algorithm can be summarized in two steps: | ||
|
||
* A different family of classifiers is trained on each cohort of interest. The algorithm partitions the training data | ||
for each cohort and learns a classifier for each cohort. Each cohort-specific trained classifier results in a family | ||
of potential classifiers to be used after the classifier output is adjusted based on different thresholds on the model | ||
output. For example, depending on which errors are most important to the application (e.g. false positives vs. false | ||
negatives for binary classification), thresholding the model prediction at different values of the model output (e.g. | ||
likelihood, softmax) will result in different classifiers. This step generates a whole family of classifiers based on | ||
different thresholds. | ||
* Among the cohort-specific classifiers, search for one representative classifier for each cohort such that a joint loss | ||
is optimized. This step searches through all combinations of classifiers from the previous step to find the combination | ||
that best optimizes a definition of a joint loss across all cohorts. While there are different definitions of such a joint | ||
loss, this implementation currently supports definitions of the Balanced Loss, L1 loss, and Demographic Parity as examples | ||
of losses that focus on group fairness. More definitions of losses are described in the longer version of the paper. | ||
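
The two steps above can be sketched in a few lines of plain Python. This is a minimal sketch under stated assumptions, not the implementation used by this library: the function names ``evaluate`` and ``decouple`` are hypothetical, each cohort is assumed to already have model scores, and the joint loss used here (a mix of the overall error rate and the gap in positive prediction rates between cohorts, weighted by a hypothetical ``lam`` parameter) is only one illustrative choice in the spirit of a demographic parity loss.

```python
from itertools import product


def evaluate(scores, labels, threshold):
    """Return (number of errors, number of positive predictions) at a threshold."""
    preds = [int(s >= threshold) for s in scores]
    errors = sum(p != y for p, y in zip(preds, labels))
    return errors, sum(preds)


def decouple(cohorts, thresholds, lam=0.5):
    """Pick one threshold per cohort so that a joint loss is minimized.

    cohorts: {cohort name: (scores, labels)}
    Step 1: each threshold turns a cohort's scores into one candidate classifier.
    Step 2: every combination of candidates is scored with the joint loss.
    The joint loss below is an illustrative assumption, not the library's.
    """
    names = list(cohorts)
    total = sum(len(cohorts[name][1]) for name in names)
    best_loss, best_choice = None, None
    for combo in product(thresholds, repeat=len(names)):
        errors, pos_rates = 0, []
        for name, thr in zip(names, combo):
            scores, labels = cohorts[name]
            err, pos = evaluate(scores, labels, thr)
            errors += err
            pos_rates.append(pos / len(labels))
        parity_gap = max(pos_rates) - min(pos_rates)
        loss = lam * errors / total + (1 - lam) * parity_gap
        if best_loss is None or loss < best_loss:
            best_loss, best_choice = loss, dict(zip(names, combo))
    return best_choice, best_loss


# Toy data: cohort B's scores are systematically lower than cohort A's.
cohorts = {
    "A": ([0.2, 0.4, 0.6, 0.8], [0, 0, 1, 1]),
    "B": ([0.1, 0.2, 0.3, 0.4], [0, 0, 1, 1]),
}

choice, loss = decouple(cohorts, thresholds=[0.25, 0.5])
print(choice, loss)  # {'A': 0.5, 'B': 0.25} 0.0
```

In this toy data, no single shared threshold reaches a joint loss of zero, while choosing a different threshold per cohort does. The exhaustive search grows exponentially with the number of cohorts, so this sketch is only practical for a handful of cohorts and thresholds.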
|
||
One issue that commonly arises in cohort-specific learning is that some cohorts may have little data in the training set, | ||
which may hinder the capability of a decoupled classifier to learn a better estimator for that cohort. To mitigate the problem, | ||
the DecoupledClass class also allows using transfer learning from the overall data for these cohorts. | ||
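
One common way to borrow information from the overall data is instance weighting: rows from outside the small cohort are kept in its training set, but with a reduced weight. The sketch below illustrates that idea in plain Python. It is an assumption made for illustration, since the text above does not say which transfer learning mechanism the class uses; the names ``transfer_weights``, ``weighted_threshold``, and ``theta`` are hypothetical here.

```python
def transfer_weights(cohort_labels, target, theta):
    """Weight 1.0 for rows in the target cohort and `theta` for all other rows."""
    if not 0.0 <= theta <= 1.0:
        raise ValueError("theta must be in [0, 1]")
    return [1.0 if cohort == target else theta for cohort in cohort_labels]


def weighted_threshold(scores, labels, weights, candidates):
    """Return the candidate threshold with the lowest weighted error."""
    def weighted_error(thr):
        return sum(
            w for s, y, w in zip(scores, labels, weights) if int(s >= thr) != y
        )
    return min(candidates, key=weighted_error)


# Toy data: the "small" cohort has two rows, the "big" cohort has five.
cohort_labels = ["small", "small", "big", "big", "big", "big", "big"]
scores = [0.30, 0.45, 0.20, 0.40, 0.42, 0.60, 0.80]
labels = [0, 1, 0, 0, 0, 1, 1]
candidates = [0.35, 0.5]

# With a small theta, the small cohort's own data dominates the choice.
low = weighted_threshold(
    scores, labels, transfer_weights(cohort_labels, "small", 0.25), candidates
)
# With theta = 1.0, all rows count equally and the big cohort dominates.
high = weighted_threshold(
    scores, labels, transfer_weights(cohort_labels, "small", 1.0), candidates
)
print(low, high)  # 0.35 0.5
```

The weight acts as a dial between training on the cohort alone (weight 0) and training on the pooled data (weight 1). Intermediate values let a small cohort benefit from the larger dataset without being overwhelmed by it.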
|
||
The figure below shows the types of scenarios that the DecoupledClass class can implement and how it compares to the CohortManager class. First, while the CohortManager class offers a general way to customize pipelines or train custom classifiers for cohorts, it does not offer any post-training capabilities for selecting classifiers such that they optimize a joint loss function for group fairness. In addition, transfer learning for minority cohorts is only available in the DecoupledClass class. To implement a scenario where the same type of data processing mitigation is applied to different cohorts separately, one can use the DecoupledClass with a transform pipeline (including the estimator). | ||
|
||
.. figure:: ../imgs/decoupled_class_figure_1.png | ||
:scale: 50 | ||
|
||
*Figure 1 - The DecoupledClass class can currently implement the scenarios highlighted in this figure. Its additional functionalities in comparison to the CohortManager are i) joint optimization of a loss function for group fairness, and ii) transfer learning for minority cohorts.* | ||
|
||
The tutorial notebook and the decoupled classifier case study notebooks demonstrate different scenarios where one can use this class. | ||
|
||
.. autoclass:: raimitigations.cohort.DecoupledClass | ||
:members: | ||
|
||
.. rubric:: Class Diagram | ||
|
||
.. inheritance-diagram:: raimitigations.cohort.DecoupledClass | ||
:parts: 1 | ||
|
||
Examples | ||
-------- | ||
|
||
.. nbgallery:: | ||
../notebooks/cohort/decoupled | ||
../notebooks/cohort/case_study/decoupled_class/case_1 | ||
../notebooks/cohort/case_study/decoupled_class/case_2 | ||
../notebooks/cohort/case_study/decoupled_class/case_3 |