Fixed propensity example (#70)

* Fixed propensity example
* Fix import
* Fix import
* Update docs/examples/example_propensity.ipynb
  Co-authored-by: Kevin Klein <[email protected]>
* Update docs/examples/example_propensity.ipynb
  Co-authored-by: Kevin Klein <[email protected]>
* Fix formating
* Update docs/examples/example_propensity.ipynb
  Co-authored-by: Kevin Klein <[email protected]>
* Update docs/examples/example_propensity.ipynb
  Co-authored-by: Kevin Klein <[email protected]>
* Fix formating
* Modify note
* Update docs/examples/example_propensity.ipynb
  Co-authored-by: Kevin Klein <[email protected]>

---------

Co-authored-by: Kevin Klein <[email protected]>
Quantco · Jul 24, 2024 · 2f428fb
1 parent a1ee27c
commit 2f428fb
Showing 2 changed files with 215 additions and 0 deletions.
diff --git a/docs/examples/example_propensity.ipynb b/docs/examples/example_propensity.ipynb
@@ -0,0 +1,214 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# What if I know the propensity score?\n",
+    "\n",
+    "In some experiment settings we may know beforehand the probabilities of treatment assignments,\n",
+    "e.g. if we have data from a {term}`RCT<Randomized Control Trial (RCT)>` with known treatment\n",
+    "probabilities.\n",
+    "\n",
+    "In that case we may not want to learn a propensity model but rather just use the known probabilities.\n",
+    "\n",
+    "Loading the data\n",
+    "----------------\n",
+    "\n",
+    "Just like in our {ref}`example on estimating CATEs with a MetaLearner\n",
+    "<example-basic>`, we will first load some experiment data:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "import pandas as pd\n",
+    "from pathlib import Path\n",
+    "from git_root import git_root\n",
+    "\n",
+    "df = pd.read_csv(git_root(\"data/learning_mindset.zip\"))\n",
+    "outcome_column = \"achievement_score\"\n",
+    "treatment_column = \"intervention\"\n",
+    "feature_columns = [\n",
+    "    column for column in df.columns if column not in [outcome_column, treatment_column]\n",
+    "]\n",
+    "categorical_feature_columns = [\n",
+    "    \"ethnicity\",\n",
+    "    \"gender\",\n",
+    "    \"frst_in_family\",\n",
+    "    \"school_urbanicity\",\n",
+    "    \"schoolid\",\n",
+    "]\n",
+    "# Note that explicitly setting the dtype of these features to category\n",
+    "# allows both lightgbm as well as shap plots to\n",
+    "# 1. Operate on features which are not of type int, bool or float\n",
+    "# 2. Interpret categoricals with int values as categoricals\n",
+    "#    rather than as ordinals/numericals.\n",
+    "for categorical_feature_column in categorical_feature_columns:\n",
+    "    df[categorical_feature_column] = df[categorical_feature_column].astype(\"category\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Creating our own estimator\n",
+    "--------------------------\n",
+    "\n",
+    "In this tutorial we will assume that we know that all observations were assigned to the\n",
+    "treatment with a fixed probability of 0.3, which is close to the fraction of the observations\n",
+    "assigned to the treatment group:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "df[treatment_column].mean()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "```{note}\n",
+    "The assumption of a fixed propensity score for all observations does not hold for this\n",
+    "dataset; we only use it for illustrative purposes.\n",
+    "```\n",
+    "\n",
+    "Now we can define our custom ``sklearn``-like classifier. We recommend inheriting from\n",
+    "the ``sklearn`` base classes and following the rules explained in the\n",
+    "[sklearn documentation](https://scikit-learn.org/stable/developers/develop.html) to avoid\n",
+    "having to define helper functions and to ensure that the ``metalearners`` library works\n",
+    "correctly."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from sklearn.base import BaseEstimator, ClassifierMixin\n",
+    "from typing import Any\n",
+    "from typing_extensions import Self\n",
+    "import numpy as np\n",
+    "import pandas as pd\n",
+    "\n",
+    "\n",
+    "class FixedPropensityModel(ClassifierMixin, BaseEstimator):\n",
+    "    def __init__(self, propensity_score: float) -> None:\n",
+    "        self.propensity_score = propensity_score\n",
+    "\n",
+    "    def fit(self, X: pd.DataFrame, y: pd.Series) -> Self:\n",
+    "        self.classes_ = np.unique(y.to_numpy())  # sklearn requires this\n",
+    "        return self\n",
+    "\n",
+    "    def predict(self, X: pd.DataFrame) -> np.ndarray[Any, Any]:\n",
+    "        return np.argmax(self.predict_proba(X), axis=1)\n",
+    "\n",
+    "    def predict_proba(self, X: pd.DataFrame) -> np.ndarray[Any, Any]:\n",
+    "        return np.full((len(X), 2), [1 - self.propensity_score, self.propensity_score])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Fitting the MetaLearner\n",
+    "-----------------------\n",
+    "\n",
+    "Finally, we can instantiate and fit our MetaLearner using our custom propensity model:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "from metalearners import RLearner\n",
+    "from lightgbm import LGBMRegressor\n",
+    "\n",
+    "rlearner = RLearner(\n",
+    "    nuisance_model_factory=LGBMRegressor,\n",
+    "    propensity_model_factory=FixedPropensityModel,\n",
+    "    treatment_model_factory=LGBMRegressor,\n",
+    "    nuisance_model_params={\"verbose\": -1},\n",
+    "    propensity_model_params={\"propensity_score\": 0.3},\n",
+    "    treatment_model_params={\"verbose\": -1},\n",
+    "    is_classification=False,\n",
+    "    n_variants=2,\n",
+    ")\n",
+    "rlearner.fit(\n",
+    "    X=df[feature_columns],\n",
+    "    y=df[outcome_column],\n",
+    "    w=df[treatment_column],\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can check that the propensity estimates correspond to our expectation:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "vscode": {
+     "languageId": "plaintext"
+    }
+   },
+   "outputs": [],
+   "source": [
+    "rlearner.predict_nuisance(\n",
+    "    X=df[feature_columns], model_kind=\"propensity_model\", model_ord=0, is_oos=False\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Further comments\n",
+    "----------------\n",
+    "\n",
+    "* This example shows how we can use the same propensity score for all observations in the\n",
+    "  binary treatment setting. The class could easily be extended to multiple treatment\n",
+    "  variants. Moreover, customizing the propensity score according to simple information\n",
+    "  extracted from the input features could be accommodated analogously."
+   ]
+  }
+ ],
+ "metadata": {
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/docs/examples/index.rst b/docs/examples/index.rst
@@ -13,3 +13,4 @@ Examples
    Tuning hyperparameters of a MetaLearner with MetaLearnerGridSearch <example_gridsearch.ipynb>
    Generating data <example_data_generation.ipynb>
    Estimating CATEs for survival analysis <example_survival.ipynb>
+   What if I know the propensity score? <example_propensity.ipynb>
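
Note on the "Further comments" cell of example_propensity.ipynb: the remark that the class could be extended to multiple treatment variants might look roughly like the sketch below. The class name `FixedMultiVariantPropensityModel` and its `propensity_scores` parameter are hypothetical and not part of this commit. The sketch assumes one known probability per variant, listed in the order of the sorted treatment labels. Whether it can be passed unchanged to `RLearner` with a larger `n_variants` has not been verified here.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin


class FixedMultiVariantPropensityModel(ClassifierMixin, BaseEstimator):
    """Hypothetical sketch: the same known probabilities for every observation."""

    def __init__(self, propensity_scores):
        # One probability per treatment variant, ordered like the sorted labels.
        self.propensity_scores = propensity_scores

    def fit(self, X, y):
        self.classes_ = np.unique(np.asarray(y))  # sklearn requires this
        if len(self.classes_) != len(self.propensity_scores):
            raise ValueError("Expected one propensity score per treatment variant.")
        if not np.isclose(np.sum(self.propensity_scores), 1.0):
            raise ValueError("Propensity scores must sum to 1.")
        return self

    def predict_proba(self, X):
        # Repeat the fixed probability vector once per row of X.
        scores = np.asarray(self.propensity_scores, dtype=float)
        return np.tile(scores, (len(X), 1))

    def predict(self, X):
        # Map the most likely column index back to the class label.
        return self.classes_[np.argmax(self.predict_proba(X), axis=1)]


# Toy data with three variants (0 = control, 1 and 2 = treatments).
X_toy = np.zeros((4, 2))
w_toy = np.array([0, 1, 2, 1])
model = FixedMultiVariantPropensityModel([0.5, 0.3, 0.2]).fit(X_toy, w_toy)
proba = model.predict_proba(X_toy)
```

Each row of `proba` is the same vector of known assignment probabilities. A feature-dependent variant could compute that vector from `X` inside `predict_proba` instead of tiling a constant.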