Merge pull request #45 from microsoft/mmendonca/rebalance_cohort
Cohort managing classes
mrfmendonca authored Feb 9, 2023
2 parents 775830f + 3c5542c commit 3f38fb7
Showing 187 changed files with 38,375 additions and 6,384 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/ci.yml
@@ -28,7 +28,7 @@ jobs:
- name: Install package
run: |
pip install --upgrade pip
pip install -e .
pip install -e .[dev]
- name: Run Tests
run: |
2 changes: 1 addition & 1 deletion .github/workflows/format_check.yml
@@ -22,6 +22,6 @@ jobs:
# Setup environment in remote runner
- uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.9

- uses: pre-commit/[email protected]
7 changes: 3 additions & 4 deletions .github/workflows/github-pages.yml
@@ -13,21 +13,20 @@ jobs:
steps:
- uses: actions/checkout@v2

- name: Setup Python 3.8
- name: Setup Python 3.9
uses: actions/setup-python@v2
with:
python-version: 3.8
python-version: 3.9

- name: Install package
run: |
pip install --upgrade pip
pip install -e .
pip install -e .[dev]
- name: Build
run: |
sudo apt install pandoc
sudo apt install graphviz
pip install seaborn
make html
working-directory: docs

15 changes: 4 additions & 11 deletions .github/workflows/release-pypi.yml
@@ -44,15 +44,7 @@ jobs:
# Setup environment in remote runner
- uses: actions/setup-python@v2
with:
python-version: 3.8

# -------------------
# Install basic packages (not associated with the repo)
- name: update and upgrade pip, setuptools, wheel and twine
run: |
pip install --upgrade pip
pip install --upgrade setuptools wheel twine pip-tools
pip install configparser semver
python-version: 3.9

# -------------------
# Update the package version by bumping the version to the next
@@ -67,7 +59,8 @@
# package.__version__ will point to the previous version
- name: Install current package
run: |
pip install -e .
pip install --upgrade pip
pip install -e .[dev]
pip list
# -------------------
@@ -78,7 +71,7 @@
# -------------------
# Build wheel
- name: build wheel
run: python setup.py sdist bdist_wheel
run: python -m build

# -------------------
# Publish package to PyPi as ...
3 changes: 1 addition & 2 deletions .readthedocs.yaml
@@ -10,10 +10,9 @@ build:
jobs:
pre_build:
- cp -r notebooks/ docs/notebooks/
- pip install seaborn

python:
# Install our python package before building the docs
install:
- method: pip
path: .
path: .[dev]
60 changes: 53 additions & 7 deletions README.md
@@ -9,25 +9,40 @@ This repo is a part of the [Responsible AI Toolbox](https://github.com/microsoft
<p align="center">
<img src="./docs/imgs/responsible-ai-toolbox-mitigations.png" alt="ResponsibleAIToolboxMitigationsOverview" width="750"/>

There are two main functions covered in this library:
- **Data Balance Analysis** (Exploratory Data Analysis): covering metrics that help to determine how balanced is your dataset.
- **Data Processing** (Data Enhancements): covering several transformer classes that aim to change or mitigate certain aspects of a dataset.
There are three main modules covered in this library:
- **Data Processing** (Data Enhancements): includes several transformer classes that aim to change or mitigate certain aspects of a dataset.
The goal of this module is to provide a unified interface for different mitigation methods scattered across
multiple machine learning libraries, such as scikit-learn, mlxtend, and sdv, among others.
- **Data Balance Analysis** (Exploratory Data Analysis): includes metrics that help determine how balanced your dataset is.
- **Cohort** (managing cohorts): includes a set of classes that allow users to handle cohorts of data by creating customized pipelines
for each cohort.
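
As a rough illustration of the fit/transform pattern that the **Data Processing** transformer classes follow, here is a generic sketch in plain Python (illustrative only, not the library's own code):

```python
# Generic sketch of the fit/transform pattern used by data-processing
# transformers (illustrative only; not a raimitigations class).

class MeanImputer:
    """Replace missing values (None) with the column mean seen at fit time."""

    def fit(self, X):
        # X: list of rows, each a list of numbers or None
        cols = list(zip(*X))
        self.means_ = [
            sum(v for v in col if v is not None) /
            max(1, sum(v is not None for v in col))
            for col in cols
        ]
        return self

    def transform(self, X):
        # fill each missing cell with the mean of its column
        return [
            [self.means_[j] if v is None else v for j, v in enumerate(row)]
            for row in X
        ]

X = [[1.0, None], [3.0, 4.0], [None, 8.0]]
print(MeanImputer().fit(X).transform(X))
# → [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

The library's actual transformers add many conveniences on top of this pattern (column selection, categorical handling, etc.), but the fit-then-transform lifecycle is the shared interface.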


In this library, we take a **targeted approach to mitigating errors** in Machine Learning models. This is complementary to and different from traditional blanket approaches, which aim at maximizing a single-score performance number, such as overall accuracy, by merely increasing the size of the training data or the model architecture. Since blanket approaches are often costly and also ineffective at improving the model in its areas of poorest performance, targeted approaches focus the improvement efforts on areas previously identified to have more errors and on the underlying diagnoses of those errors. For example, if a practitioner has identified that the model is underperforming for a cohort of interest by using Error Analysis in the Responsible AI Dashboard, they may continue the debugging process by using Data Balance Analysis and find out that there is class imbalance for this particular cohort. To mitigate the issue, they can then focus on improving the class imbalance for the cohort of interest by using the Responsible AI Mitigations library. This and several other examples in the documentation of each mitigation function illustrate how targeted approaches can help practitioners mitigate errors more effectively by giving them more control over the model improvement process.


## Installation

Use the following pip command to install the Responsible AI Toolbox. Make sure you are using Python 3.7, 3.8, or 3.9.
Use one of the following pip commands to install the Responsible AI Toolbox. Make sure you are using Python 3.7, 3.8, or 3.9. If running in Jupyter, please make sure to restart the Jupyter kernel after installing. There are three installation options for the ``raimitigations`` package:

If running in jupyter, please make sure to restart the jupyter kernel after installing.
* To install the minimum dependencies, use:

```
pip install raimitigations
```

* To install the minimum dependencies plus the packages required to run all of the notebooks in the ``notebooks/`` folder, use:

```
pip install raimitigations[all]
```

* To install all the dependencies used for development (such as ``pytest``, for example), use:

```
pip install raimitigations[dev]
```

## Documentation

To learn more about the supported dataset measurements and mitigation techniques covered in the **raimitigations** package, [please check out this documentation.](https://responsible-ai-toolbox-mitigations.readthedocs.io/en/latest/)
@@ -53,6 +68,7 @@ methods offered in the **dataprocessing** module.
- [Identifying correlated features: tutorial](notebooks/dataprocessing/module_tests/feat_sel_corr_tutorial.ipynb)
- [Data Rebalance using imblearn](notebooks/dataprocessing/module_tests/rebalance_imbl.ipynb)
- [Data Rebalance using SDV](notebooks/dataprocessing/module_tests/rebalance_sdv.ipynb)
- [Using scikit-learn's Pipeline](notebooks/dataprocessing/module_tests/pipeline_test.ipynb)

Here is a set of case study scenarios where we use the transformations available in the **dataprocessing**
module in order to train a model for a real-world dataset.
@@ -62,6 +78,27 @@ module in order to train a model for a real-world dataset.
- [Case Study 2](notebooks/dataprocessing/case_study/case2.ipynb)
- [Case Study 3](notebooks/dataprocessing/case_study/case3.ipynb)

## Handling Cohorts

Here is a set of tutorial notebooks that aim to explain how to manage cohorts.

- [Creating Single Cohorts](notebooks/cohort/cohort_definition.ipynb)
- [Creating Different Pipelines for each Cohort](notebooks/cohort/cohort_manager.ipynb)
- [Different Pre-processing Scenarios using cohorts](notebooks/cohort/cohort_manager_scenarios.ipynb)
- [Using Decoupled Classifiers](notebooks/cohort/decoupled.ipynb)
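
Conceptually, these cohort tools revolve around a single pattern: slice the dataset into cohorts, process each slice with its own pipeline, and merge the results back. A minimal plain-Python sketch of that pattern (illustrative only; this is not the ``raimitigations`` API, and all names below are made up):

```python
# Conceptual sketch of per-cohort processing: slice, transform, merge.
# Not the raimitigations API -- purely illustrative.

def apply_per_cohort(rows, cohorts):
    """Apply a different transform to each cohort and merge the results.

    rows:    list of dicts (one per sample)
    cohorts: list of (condition, transform) pairs; the first matching
             condition decides which transform a row receives.
    """
    processed = []
    for row in rows:
        for condition, transform in cohorts:
            if condition(row):
                processed.append(transform(row))
                break
        else:
            processed.append(row)  # rows outside every cohort pass through
    return processed


data = [
    {"age": 17, "income": 0},
    {"age": 45, "income": 72000},
]
cohorts = [
    (lambda r: r["age"] < 18, lambda r: {**r, "group": "minor"}),
    (lambda r: r["age"] >= 18, lambda r: {**r, "group": "adult"}),
]
print(apply_per_cohort(data, cohorts))  # each row got its cohort's transform
```

The classes in the ``cohort`` module wrap this idea with real pipelines (imputation, rebalancing, encoding, etc.) applied per cohort instead of the toy transforms above.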

Here is a set of case study notebooks showing how creating customized dataprocessing pipelines for each
cohort can help in some scenarios.

- [Cohort Case Study 1](notebooks/cohort/case_study/case_1.ipynb)
- [Cohort Case Study 1 - Rebalancing only specific cohorts](notebooks/cohort/case_study/case_1_rebalance.ipynb)
- [Cohort Case Study 1 - Using RAI Toolbox](notebooks/cohort/case_study/case_1_dashboard.ipynb)
- [Cohort Case Study 2](notebooks/cohort/case_study/case_2.ipynb)
- [Cohort Case Study 3](notebooks/cohort/case_study/case_3.ipynb)
- [Decoupled Classifier Case 1](notebooks/cohort/case_study/decoupled_class/case_1)
- [Decoupled Classifier Case 2](notebooks/cohort/case_study/decoupled_class/case_2)
- [Decoupled Classifier Case 3](notebooks/cohort/case_study/decoupled_class/case_3)



## Dependencies
@@ -95,14 +132,23 @@ This project has adopted the [Microsoft Open Source Code of Conduct](https://ope
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [[email protected]](mailto:[email protected]) with any additional questions or comments.

### Installing Using ``dev`` Mode

After cloning this repo and moving to its root folder, install the package in editable mode with the development dependencies using:

```console
> pip install -e .[dev]
```

### Pre-Commit

This repository uses pre-commit hooks to guarantee that the code format is kept consistent. For development, make sure to
activate pre-commit before creating a pull request. Any code pushed to this repository is checked for code consistency using
GitHub Actions, so if pre-commit is not used when committing, the push may fail the format check workflow.
Using pre-commit will avoid this.

To use pre-commit with this repository, first install pre-commit:
To use pre-commit with this repository, first install pre-commit (**NOTE:** when installing the package with the ``[dev]`` extra, the
``pre-commit`` package will already be installed):

```console
> pip install pre-commit
Expand Down Expand Up @@ -169,7 +215,7 @@ Any use of third-party trademarks or logos are subject to those third-party's po

## Research and Acknowledgements

**Current Maintainers:** [Matheus Mendonça](https://github.com/mrfmendonca), [Dany Rouhana](https://github.com/danyrouh), [Mark Encarnación](https://github.com/markenc)
**Current Maintainers:** [Marah Abdin](https://github.com/marah-abdin), [Matheus Mendonça](https://github.com/mrfmendonca), [Dany Rouhana](https://github.com/danyrouh), [Mark Encarnación](https://github.com/markenc)

**Past Maintainers:** [Akshara Ramakrishnan](https://github.com/akshara-msft), [Irina Spiridonova](https://github.com/irinasp)

4 changes: 3 additions & 1 deletion docs/api.rst
@@ -6,6 +6,8 @@ API reference
.. toctree::
:maxdepth: 2

databalanceanalysis/databalanceanalysis
dataprocessing/dataprocessing
databalanceanalysis/databalanceanalysis
cohort/cohort
utils/utils

18 changes: 18 additions & 0 deletions docs/cohort/case_studies.rst
@@ -0,0 +1,18 @@
.. _cohort_examples:

Cohort Case Studies
===================

Here is a set of case study scenarios where we use the **CohortManager** class and the **DecoupledClass** class.


.. nbgallery::
../notebooks/cohort/case_study/case_1
../notebooks/cohort/case_study/case_1_rebalance
../notebooks/cohort/case_study/case_1_dashboard
../notebooks/cohort/case_study/case_2
../notebooks/cohort/case_study/case_3
../notebooks/cohort/case_study/integration_raiwidgets
../notebooks/cohort/case_study/decoupled_class/case_1
../notebooks/cohort/case_study/decoupled_class/case_2
../notebooks/cohort/case_study/decoupled_class/case_3
16 changes: 16 additions & 0 deletions docs/cohort/cohort.rst
@@ -0,0 +1,16 @@
Cohort
======

The ``cohort`` module enables creating and handling multiple cohorts using an intuitive interface. This module allows
the application of different data processing pipelines over each cohort, as well as computing multiple metrics separately
for each of the existing cohorts.

.. toctree::
:maxdepth: 1

cohort_definition
cohort_handler
cohort_manager
decoupled_class
utils

13 changes: 13 additions & 0 deletions docs/cohort/cohort_definition.rst
@@ -0,0 +1,13 @@
.. _cohort_def:

CohortDefinition
================

.. autoclass:: raimitigations.cohort.CohortDefinition
:members:

Examples
--------

.. nbgallery::
../notebooks/cohort/cohort_definition
13 changes: 13 additions & 0 deletions docs/cohort/cohort_handler.rst
@@ -0,0 +1,13 @@
.. _cohort_handler:

CohortHandler
=============

.. autoclass:: raimitigations.cohort.CohortHandler
:members:
:show-inheritance:

.. rubric:: Class Diagram

.. inheritance-diagram:: raimitigations.cohort.CohortManager raimitigations.cohort.DecoupledClass
:parts: 1
46 changes: 46 additions & 0 deletions docs/cohort/cohort_manager.rst
@@ -0,0 +1,46 @@
.. _cohort_manager:

CohortManager
=============

The **CohortManager** allows the application of different data processing pipelines over each cohort. It also allows the creation and
filtering of multiple cohorts using a simple interface. Finally, it allows the creation of different estimators for each cohort through
the ``.predict()`` and ``.predict_proba()`` interfaces. This class uses the :ref:`cohort.CohortDefinition<cohort_def>` class
internally in order to create, filter, and manipulate multiple cohorts. There are multiple ways of using the
:ref:`cohort.CohortManager<cohort_manager>` class when building a pipeline, and these different scenarios are summarized in the
following figure.

.. figure:: ../imgs/scenarios.jpg
:scale: 20
:alt: Balancing over cohorts

*Figure 1 - The CohortManager class can be used in different ways to target mitigations to different cohorts. The main differences
between these scenarios consist in whether the same or a different type of data mitigation is applied to each cohort's data, and whether
a single model or separate models will be trained for the different cohorts. Depending on these choices, CohortManager will take care of
slicing the data accordingly, applying the specified data mitigation strategy, merging the data back, and retraining the model(s).*

The **Cohort Manager - Scenarios and Examples** notebook, located in ``notebooks/cohort/cohort_manager_scenarios.ipynb`` and listed in
the **Examples** section below, shows how each of these scenarios can be implemented through simple code snippets.
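
As a rough mental model of the "separate estimator per cohort" scenario, the sketch below dispatches predictions by cohort membership (plain Python, illustrative only; this is not the actual **CohortManager** implementation, and all names are made up):

```python
# Illustrative sketch of the "separate model per cohort" scenario
# (not the actual CohortManager implementation; all names are made up).

class PerCohortModel:
    """Query one simple estimator per cohort through a single interface."""

    def __init__(self, cohorts):
        # cohorts: list of (condition, model) pairs, where condition is a
        # predicate over a sample and model maps a sample to a prediction
        self.cohorts = cohorts

    def predict(self, samples):
        preds = []
        for s in samples:
            for condition, model in self.cohorts:
                if condition(s):  # first matching cohort wins
                    preds.append(model(s))
                    break
            else:
                raise ValueError(f"sample {s} belongs to no cohort")
        return preds


model = PerCohortModel([
    (lambda s: s["region"] == "A", lambda s: int(s["score"] > 0.3)),
    (lambda s: s["region"] == "B", lambda s: int(s["score"] > 0.7)),
])
# The same score is classified differently per cohort:
print(model.predict([{"region": "A", "score": 0.5},
                     {"region": "B", "score": 0.5}]))  # → [1, 0]
```

The real class additionally handles fitting the per-cohort pipelines and estimators, and exposes the familiar ``.predict()``/``.predict_proba()`` interfaces described above.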

.. autoclass:: raimitigations.cohort.CohortManager
:members:

.. rubric:: Class Diagram

.. inheritance-diagram:: raimitigations.cohort.CohortManager
:parts: 1

.. _cohort_manager_ex:

Examples
--------

.. nbgallery::
../notebooks/cohort/cohort_manager
../notebooks/cohort/cohort_manager_scenarios
../notebooks/cohort/case_study/case_1
../notebooks/cohort/case_study/case_1_rebalance
../notebooks/cohort/case_study/case_1_dashboard
../notebooks/cohort/case_study/case_2
../notebooks/cohort/case_study/case_3
../notebooks/cohort/case_study/integration_raiwidgets
56 changes: 56 additions & 0 deletions docs/cohort/decoupled_class.rst
@@ -0,0 +1,56 @@
.. _decoupled_class:

DecoupledClass
==============

This class implements techniques for learning different estimators (models) for different cohorts based on the approach
presented in `"Decoupled classifiers for group-fair and efficient machine learning." <https://www.microsoft.com/en-us/research/publication/decoupled-classifiers-for-group-fair-and-efficient-machine-learning/>`_
by Cynthia Dwork, Nicole Immorlica, Adam Tauman Kalai, and Max Leiserson, Conference on Fairness, Accountability and Transparency, PMLR, 2018. The approach
searches for and combines cohort-specific classifiers to optimize for different definitions of group fairness, and can be used
as a post-processing step on top of any model class. The current implementation in this library supports only binary
classification, and we welcome contributions that extend these ideas to multi-class and regression problems.

The basic decoupling algorithm can be summarized in two steps:

* A different family of classifiers is trained on each cohort of interest. The algorithm partitions the training data
for each cohort and learns a classifier for each cohort. Each cohort-specific trained classifier results in a family
of potential classifiers to be used after the classifier output is adjusted based on different thresholds on the model
output. For example, depending on which errors are most important to the application (e.g. false positives vs. false
negatives for binary classification), thresholding the model prediction at different values of the model output (e.g.
likelihood, softmax) will result in different classifiers. This step generates a whole family of classifiers based on
different thresholds.
* Among the cohort-specific classifiers, the algorithm searches for one representative classifier for each cohort such that a joint loss
is optimized. This step searches through all combinations of classifiers from the previous step to find the combination
that best optimizes a definition of a joint loss across all cohorts. While there are different definitions of such a joint
loss, this implementation currently supports the Balanced Loss, the L1 loss, and Demographic Parity as examples
of losses that focus on group fairness. More definitions of losses are described in the longer version of the paper.
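
The two steps above can be sketched as a toy search (illustrative only; not the actual **DecoupledClass** implementation — here the classifier family is a set of thresholds and the joint loss is simply the worst per-cohort error rate):

```python
# Toy sketch of the two-step decoupling search described above
# (illustrative only -- not the DecoupledClass implementation).
from itertools import product

def decoupled_search(cohort_data, thresholds):
    """Pick one threshold per cohort so that a joint loss is minimized.

    cohort_data: list of cohorts, each a list of (score, label) pairs
    thresholds:  candidate thresholds defining the classifier family (step 1)
    """
    def cohort_error(samples, t):
        # error rate of the classifier "predict 1 iff score >= t"
        wrong = sum((score >= t) != bool(label) for score, label in samples)
        return wrong / len(samples)

    best, best_loss = None, float("inf")
    # Step 2: search all combinations of per-cohort classifiers...
    for combo in product(thresholds, repeat=len(cohort_data)):
        # ...scoring each with a joint loss (here: worst cohort error,
        # a simple stand-in for a group-fairness-aware loss)
        loss = max(cohort_error(c, t) for c, t in zip(cohort_data, combo))
        if loss < best_loss:
            best, best_loss = combo, loss
    return best, best_loss

cohorts = [
    [(0.2, 0), (0.4, 1), (0.9, 1)],   # cohort A: low scores can be positive
    [(0.3, 0), (0.6, 0), (0.8, 1)],   # cohort B: needs a higher threshold
]
best, loss = decoupled_search(cohorts, thresholds=[0.3, 0.5, 0.7])
print(best, loss)  # → (0.3, 0.7) 0.0
```

A single shared threshold could not achieve zero error for both cohorts here, which is exactly the situation decoupled classifiers are designed for; the real implementation supports the fairness-aware joint losses listed above rather than this worst-cohort-error stand-in.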

One issue that arises commonly in cohort-specific learning is that some cohorts may have very little data in the training set,
which may hinder the capability of a decoupled classifier to learn a better estimator for that cohort. To mitigate this problem,
the DecoupledClass class also allows using transfer learning from the overall data for these cohorts.

The figure below shows the types of scenarios that the DecoupledClass class can implement and how it compares to the CohortManager class. First, while the CohortManager class offers a general way to customize pipelines or train custom classifiers for cohorts, it does not offer any post-training capabilities for selecting classifiers such that they optimize a joint loss function for group fairness. In addition, transfer learning for minority cohorts is only available in the DecoupledClass class. To implement a scenario where the same type of data processing mitigation is applied to different cohorts separately, one can use DecoupledClass with a transform pipeline (including the estimator).

.. figure:: ../imgs/decoupled_class_figure_1.png
:scale: 50

*Figure 1 - The DecoupledClass class can currently implement the highlighted scenarios in this figure, with additional functionalities in comparison to the CohortManager being i) joint optimization of a loss function for group fairness, and ii) transfer learning for minority cohorts.*

The tutorial notebook, together with the decoupled classifier case study notebooks, demonstrates different scenarios where one can use this class.

.. autoclass:: raimitigations.cohort.DecoupledClass
:members:

.. rubric:: Class Diagram

.. inheritance-diagram:: raimitigations.cohort.DecoupledClass
:parts: 1

Examples
--------

.. nbgallery::
../notebooks/cohort/decoupled
../notebooks/cohort/case_study/decoupled_class/case_1
../notebooks/cohort/case_study/decoupled_class/case_2
../notebooks/cohort/case_study/decoupled_class/case_3