diff --git a/AC2/python-debiased-ml-for-partially-linear-iv-model.ipynb b/AC2/python-debiased-ml-for-partially-linear-iv-model.ipynb index 9dd4dce8..b882ddb6 100644 --- a/AC2/python-debiased-ml-for-partially-linear-iv-model.ipynb +++ b/AC2/python-debiased-ml-for-partially-linear-iv-model.ipynb @@ -1,625 +1,637 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.008106, - "end_time": "2021-04-23T10:41:33.911944", - "exception": false, - "start_time": "2021-04-23T10:41:33.903838", - "status": "completed" - }, - "tags": [], - "id": "qjaFmAyEzcLz" - }, - "source": [ - "# Double/Debiased ML for Partially Linear IV Model\n", - "\n", - "This is a simple implementation of Debiased Machine Learning for the Partially Linear\n", - "IV Regression Model, which provides an application of DML IV inference.\n", - "\n", - "\n", - "Reference:\n", - "\n", - "- https://arxiv.org/abs/1608.00060\n", - "- https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778\n", - "\n", - "The code is based on the book.\n" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.006963, - "end_time": "2021-04-23T10:41:33.926085", - "exception": false, - "start_time": "2021-04-23T10:41:33.919122", - "status": "completed" - }, - "tags": [], - "id": "G3YNyr-ezcL4" - }, - "source": [ - "\n", - "# Partially Linear IV Model\n", - "\n", - "We consider the partially linear structural equation model:\n", - "\\begin{eqnarray}\n", - " & Y - D\\theta_0 = g_0(X) + \\zeta, & E[\\zeta \\mid Z,X]= 0,\\\\\n", - " & Z = m_0(X) + V, & E[V \\mid X] = 0.\n", - "\\end{eqnarray}\n", - "\n", - "\n", - "Note that this model is not a regression model unless $Z=D$. The model is a canonical\n", - "model in causal inference, going back to P. Wright's work on IV methods for estimating demand/supply equations, with the modern difference being that $g_0$ and $m_0$ are nonlinear, potentially complicated functions of high-dimensional $X$. \n", - "\n", - "\n", - "The idea of this model is that there is a structural or causal relation between $Y$ and $D$, captured by $\\theta_0$, and $g_0(X) + \\zeta$ is the stochastic error, partly explained by covariates $X$. $V$ and $\\zeta$ are stochastic errors that are not explained by $X$. Since $Y$ and $D$ are jointly determined, we need an external factor, commonly referred to as an instrument, $Z$ to create exogenous variation\n", - "in $D$. Note that $Z$ should affect $D$. The $X$ here serve again as confounding factors, so we can think of variation in $Z$ as being exogenous only conditional on $X$.\n", - "\n", - "\n", - "The causal DAG this model corresponds to is given by:\n", - "$$\n", - "Z \\to D, X \\to (Y, Z, D), L \\to (Y,D),\n", - "$$\n", - "where $L$ is the latent confounder affecting both $Y$ and $D$, but not $Z$.\n", - "\n", - "\n", - "\n", - "---\n", - "\n", - "# Example\n", - "\n", - "A simple contextual example is from biostatistics, where $Y$ is a health outcome and $D$ is indicator of smoking. Thus, $\\theta_0$ is captures the effect of smoking on health. Health outcome $Y$ and smoking behavior $D$ are treated as being jointly determined. $X$ represents patient characteristics, and $Z$ could be a doctor's advice not to smoke (or another behavioral treatment) that may affect the outcome $Y$ only through shifting the behavior $D$, conditional on characteristics $X$. 
\n", - "\n", - "----\n", - "\n", - "\n", - "\n", - "# PLIVM in Residualized Form\n", - "\n", - "The PLIV model above can be rewritten in the following residualized form:\n", - "$$\n", - " \\tilde Y = \\tilde D \\theta_0 + \\zeta, \\quad E[\\zeta \\mid V,X]= 0,\n", - "$$\n", - "where\n", - "$$\n", - " \\tilde Y = (Y- \\ell_0(X)), \\quad \\ell_0(X) = E[Y \\mid X] \\\\\n", - " \\tilde D = (D - r_0(X)), \\quad r_0(X) = E[D \\mid X] \\\\\n", - " \\tilde Z = (Z- m_0(X)), \\quad m_0(X) = E[Z \\mid X].\n", - "$$\n", - " The \"tilde\" variables (e.g. $\\tilde Y$) above represent original variables after taking out or \"partialling out\"\n", - " the effect of $X$. Note that $\\theta_0$ is identified from this equation if $V$\n", - " and $U$ have non-zero correlation, which automatically means that $U$ and $V$\n", - " must have non-zero variation.\n", - "\n", - " \n", - "\n", - "-----\n", - "\n", - "# DML for PLIV Model\n", - "\n", - "Given identification, DML proceeds as follows\n", - "\n", - "Compute the estimates $\\hat \\ell_0$, $\\hat r_0$, and $\\hat m_0$ , which amounts\n", - "to solving the three problems of predicting $Y$, $D$, and $Z$ using\n", - "$X$, using any generic ML method, giving us estimated residuals\n", - "$$\n", - "\\tilde Y = Y - \\hat \\ell_0(X), \\\\ \\tilde D= D - \\hat r_0(X), \\\\ \\tilde Z = Z- \\hat m_0(X).\n", - "$$\n", - "The estimates should be of a cross-validated form, as detailed in the algorithm below.\n", - "\n", - "Estimate $\\theta_0$ by the the intstrumental\n", - "variable regression of $\\tilde Y$ on $\\tilde D$ using $\\tilde Z$ as an instrument.\n", - "Use the conventional inference for the IV regression estimator, ignoring\n", - "the estimation error in these residuals.\n", - "\n", - "The reason we work with this residualized form is that it eliminates the bias\n", - "arising when solving the prediction problem in stage 1. 
The role of cross-validation\n", - "is to avoid another source of bias due to potential overfitting.\n", - "\n", - "The estimator is adaptive,\n", - "in the sense that the first stage estimation errors do not affect the second\n", - "stage errors.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "_kg_hide-output": true, - "execution": { - "iopub.execute_input": "2021-04-23T10:41:33.973149Z", - "iopub.status.busy": "2021-04-23T10:41:33.971090Z", - "iopub.status.idle": "2021-04-23T10:42:08.602961Z", - "shell.execute_reply": "2021-04-23T10:42:08.601638Z" - }, - "papermill": { - "duration": 34.670095, - "end_time": "2021-04-23T10:42:08.603197", - "exception": false, - "start_time": "2021-04-23T10:41:33.933102", - "status": "completed" - }, - "tags": [], - "id": "yGW6JhG5zcL5" - }, - "outputs": [], - "source": [ - "# Import relevant packages\n", - "import pandas as pd\n", - "import numpy as np\n", - "import matplotlib.pyplot as plt\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.model_selection import cross_val_score, cross_val_predict, KFold\n", - "from sklearn.pipeline import make_pipeline\n", - "from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge, Lasso, LogisticRegressionCV\n", - "from sklearn.preprocessing import StandardScaler\n", - "from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier\n", - "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier\n", - "from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier\n", - "import patsy\n", - "import warnings\n", - "from sklearn.base import BaseEstimator, clone\n", - "import statsmodels.api as sm\n", - "from statsmodels.tools import add_constant\n", - "from IPython.display import Markdown\n", - "import os\n", - "import seaborn as sns\n", - "warnings.simplefilter('ignore')\n", - "np.random.seed(1234)" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "execution": { - "iopub.execute_input": "2021-04-23T10:42:08.664371Z", - "iopub.status.busy": "2021-04-23T10:42:08.629661Z", - "iopub.status.idle": "2021-04-23T10:42:10.458175Z", - "shell.execute_reply": "2021-04-23T10:42:10.456976Z" - }, - "papermill": { - "duration": 1.846109, - "end_time": "2021-04-23T10:42:10.458406", - "exception": false, - "start_time": "2021-04-23T10:42:08.612297", - "status": "completed" - }, - "tags": [], - "id": "j2WVUbBDzcL-" - }, - "outputs": [], - "source": [ - "def dml(X, Z, D, y, modely, modeld, modelz, *, nfolds, classifier=False):\n", - " '''\n", - " DML for the Partially Linear Model setting with cross-fitting\n", - "\n", - " Input\n", - " -----\n", - " X: the controls\n", - " Z: the instrument\n", - " D: the treatment\n", - " y: the outcome\n", - " modely: the ML model for predicting the outcome y\n", - " modeld: the ML model for predicting the treatment D\n", - " modelz: the ML model for predicting the instrument Z\n", - " nfolds: the number of folds in cross-fitting\n", - " classifier: bool, whether the modeld is a classifier or a regressor\n", - "\n", - " Output\n", - " ------\n", - " point: the point estimate of the treatment effect of D on y\n", - " stderr: the standard error of the treatment effect\n", - " yhat: the cross-fitted predictions for the outcome y\n", - " Dhat: the cross-fitted predictions for the treatment D\n", - " Zhat: the cross-fitted predictions for the instrument Z\n", - " resy: the outcome residuals\n", - " resD: the treatment residuals\n", - " 
resZ: the instrument residuals\n", - " epsilon: the final residual-on-residual OLS regression residual\n", - " '''\n", - " cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds\n", - " yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y\n", - " # out-of-fold predictions for D\n", - " # use predict or predict_proba dependent on classifier or regressor for D\n", - " if classifier:\n", - " Dhat = cross_val_predict(modeld, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]\n", - " Zhat = cross_val_predict(modelz, X, Z, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]\n", - " else:\n", - " Dhat = cross_val_predict(modeld, X, D, cv=cv, n_jobs=-1)\n", - " Zhat = cross_val_predict(modelz, X, Z, cv=cv, n_jobs=-1)\n", - " # calculate outcome and treatment residuals\n", - " resy = y - yhat\n", - " resD = D - Dhat\n", - " resZ = Z - Zhat\n", - " # rename for downstream tasks\n", - " resy = resy.rename(\"resy\")\n", - " resD = resD.rename(\"resD\")\n", - " resZ = resZ.rename(\"resZ\")\n", - "\n", - " # final stage ols based point estimate and standard error\n", - " point = np.mean(resy * resZ) / np.mean(resD*resZ)\n", - " epsilon = resy - point * resD\n", - " var = np.mean(epsilon**2 * resZ**2) / np.mean(resD*resZ)**2\n", - " stderr = np.sqrt(var / X.shape[0])\n", - " return point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon" - ] - }, - { - "cell_type": "code", - "source": [ - "def summary(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon, X, D, Z, y, *, name):\n", - " '''\n", - " Convenience summary function that takes the results of the DML IV function\n", - " and summarizes several estimation quantities and performance metrics.\n", - " '''\n", - " return pd.DataFrame({'estimate': point, # point estimate\n", - " 'stderr': stderr, # standard error\n", - " 'lower': point - 1.96*stderr, # lower end of 95% confidence interval\n", - " 'upper': point + 1.96*stderr, # upper end of 95% confidence interval\n", - " 'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y\n", - " 'rmse Z': np.sqrt(np.mean(resZ**2)), # RMSE of model that predicts instrument Z\n", - " 'rmse D': np.sqrt(np.mean(resD**2)) # RMSE of model that predicts treatment D\n", - " }, index=[name])" - ], - "metadata": { - "id": "F6U6fnzne-77" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.011698, - "end_time": "2021-04-23T10:42:10.482689", - "exception": false, - "start_time": "2021-04-23T10:42:10.470991", - "status": "completed" - }, - "tags": [], - "id": "x1g3XjsIzcL_" - }, - "source": [ - "-----\n", - "\n", - "\n", - "# Emprical Example: Acemoglu, Johnson, Robinson (AER).\n", - "\n", - "* Y is log GDP;\n", - "* D is a measure of Protection from Expropriation, a proxy for\n", - "quality of insitutions;\n", - "* Z is the log of Settler's mortality;\n", - "* W are geographical variables (latitude, latitude squared, continent dummies as well as interactions)\n", - "\n" - ] - }, - { - "cell_type": "code", - "source": [ - "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/AJR.csv\"\n", - "data = pd.read_csv(file)\n", - "data.shape" - ], - "metadata": { - "id": "0Pc6OCp24rji" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "Run the following command to install hdmpy for rigorous lasso:\n", - "\n", - "``` !pip install multiprocess ```\n", - "\n", - "\n", - "```!git clone 
https://github.com/maxhuppertz/hdmpy.git ```" - ], - "metadata": { - "id": "3rGw2hAqSx2Y" - } - }, - { - "cell_type": "code", - "source": [ - "!pip install multiprocess\n", - "!git clone https://github.com/maxhuppertz/hdmpy.git" - ], - "metadata": { - "id": "O5Rknnz2Rh9O" - }, - "execution_count": null, - "outputs": [] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "qjaFmAyEzcLz", + "papermill": { + "duration": 0.008106, + "end_time": "2021-04-23T10:41:33.911944", + "exception": false, + "start_time": "2021-04-23T10:41:33.903838", + "status": "completed" }, - { - "cell_type": "code", - "source": [ - "import hdmpy\n", - "class RLasso(BaseEstimator):\n", - "\n", - " def __init__(self, *, post=True):\n", - " self.post = post\n", - "\n", - " def fit(self, X, y):\n", - " self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)\n", - " return self\n", - "\n", - " def predict(self, X):\n", - " return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])" - ], - "metadata": { - "id": "Kb9N0c5uS47n" - }, - "execution_count": null, - "outputs": [] + "tags": [] + }, + "source": [ + "# Double/Debiased ML for Partially Linear IV Model\n", + "\n", + "This is a simple implementation of Debiased Machine Learning for the Partially Linear\n", + "IV Regression Model, which provides an application of DML IV inference.\n", + "\n", + "\n", + "Reference:\n", + "\n", + "- https://arxiv.org/abs/1608.00060\n", + "- https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778\n", + "\n", + "The code is based on the book.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "G3YNyr-ezcL4", + "papermill": { + "duration": 0.006963, + "end_time": "2021-04-23T10:41:33.926085", + "exception": false, + "start_time": "2021-04-23T10:41:33.919122", + "status": "completed" }, - { - "cell_type": "code", - "source": [ - "y = data['GDP']\n", - "d = data['Exprop']\n", - "z = data['logMort']\n", - "\n", - "controls = data.drop(['GDP', 'Exprop', 'logMort'], axis=1)\n", - "x = patsy.dmatrix('-1 + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)**2', data=controls, return_type='dataframe')" - ], - "metadata": { - "id": "nO6a6tJWS-FO" - }, - "execution_count": null, - "outputs": [] + "tags": [] + }, + "source": [ + "\n", + "# Partially Linear IV Model\n", + "\n", + "We consider the partially linear structural equation model:\n", + "\\begin{eqnarray}\n", + " & Y - D\\theta_0 = g_0(X) + \\zeta, & E[\\zeta \\mid Z,X]= 0,\\\\\n", + " & Z = m_0(X) + V, & E[V \\mid X] = 0.\n", + "\\end{eqnarray}\n", + "\n", + "\n", + "Note that this model is not a regression model unless $Z=D$. The model is a canonical\n", + "model in causal inference, going back to P. Wright's work on IV methods for estimating demand/supply equations, with the modern difference being that $g_0$ and $m_0$ are nonlinear, potentially complicated functions of high-dimensional $X$. \n", + "\n", + "\n", + "The idea of this model is that there is a structural or causal relation between $Y$ and $D$, captured by $\\theta_0$, and $g_0(X) + \\zeta$ is the stochastic error, partly explained by covariates $X$. $V$ and $\\zeta$ are stochastic errors that are not explained by $X$. Since $Y$ and $D$ are jointly determined, we need an external factor, commonly referred to as an instrument, $Z$ to create exogenous variation\n", + "in $D$. Note that $Z$ should affect $D$. 
The $X$ here serve again as confounding factors, so we can think of variation in $Z$ as being exogenous only conditional on $X$.\n",
+    "\n",
+    "\n",
+    "The causal DAG this model corresponds to is given by:\n",
+    "$$\n",
+    "Z \\to D, X \\to (Y, Z, D), L \\to (Y,D),\n",
+    "$$\n",
+    "where $L$ is the latent confounder affecting both $Y$ and $D$, but not $Z$.\n",
+    "\n",
+    "\n",
+    "\n",
+    "---\n",
+    "\n",
+    "# Example\n",
+    "\n",
+    "A simple contextual example is from biostatistics, where $Y$ is a health outcome and $D$ is an indicator of smoking. Thus, $\\theta_0$ captures the effect of smoking on health. Health outcome $Y$ and smoking behavior $D$ are treated as being jointly determined. $X$ represents patient characteristics, and $Z$ could be a doctor's advice not to smoke (or another behavioral treatment) that may affect the outcome $Y$ only through shifting the behavior $D$, conditional on characteristics $X$. \n",
+    "\n",
+    "----\n",
+    "\n",
+    "\n",
+    "\n",
+    "# PLIVM in Residualized Form\n",
+    "\n",
+    "The PLIV model above can be rewritten in the following residualized form:\n",
+    "$$\n",
+    "  \\tilde Y = \\tilde D \\theta_0 + \\zeta,  \\quad E[\\zeta \\mid V,X] = 0,\n",
+    "$$\n",
+    "where\n",
+    "$$\n",
+    " \\tilde Y = (Y - \\ell_0(X)),  \\quad \\ell_0(X) = E[Y \\mid X] \\\\\n",
+    " \\tilde D = (D - r_0(X)), \\quad r_0(X) = E[D \\mid X] \\\\\n",
+    " \\tilde Z = (Z - m_0(X)), \\quad m_0(X) = E[Z \\mid X].\n",
+    "$$\n",
+    " The \"tilde\" variables (e.g. $\\tilde Y$) above represent original variables after taking out or \"partialling out\"\n",
+    "  the effect of $X$. Note that $\\theta_0$ is identified from this equation if the instrument residual $V = \\tilde Z$\n",
+    "  and the treatment residual $\\tilde D$ have non-zero correlation, which automatically means that both\n",
+    "  must have non-zero variation.\n",
+    "\n",
+    "  \n",
+    "\n",
+    "-----\n",
+    "\n",
+    "# DML for PLIV Model\n",
+    "\n",
+    "Given identification, DML proceeds in two stages.\n",
+    "\n",
+    "First, compute the estimates $\\hat \\ell_0$, $\\hat r_0$, and $\\hat m_0$, which amounts\n",
+    "to solving the three problems of predicting $Y$, $D$, and $Z$ using\n",
+    "$X$, using any generic ML method, giving us estimated residuals\n",
+    "$$\n",
+    "\\tilde Y = Y - \\hat \\ell_0(X), \\\\ \\tilde D = D - \\hat r_0(X), \\\\ \\tilde Z = Z - \\hat m_0(X).\n",
+    "$$\n",
+    "The estimates should be of a cross-validated form, as detailed in the algorithm below.\n",
+    "\n",
+    "Second, estimate $\\theta_0$ by the instrumental\n",
+    "variable regression of $\\tilde Y$ on $\\tilde D$ using $\\tilde Z$ as an instrument.\n",
+    "Use the conventional inference for the IV regression estimator, ignoring\n",
+    "the estimation error in these residuals.\n",
+    "\n",
+    "The reason we work with this residualized form is that it eliminates the bias\n",
+    "arising when solving the prediction problem in stage 1. 
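\n",
+    "\n",
+    "Residualization also makes the IV moment condition insensitive, to first order, to errors in the first-stage estimates. For concreteness, the final stage reduces to a one-line computation on the residuals; a minimal sketch (assuming residual vectors `resy`, `resD`, `resZ`, as produced by the `dml` function defined below):\n",
+    "\n",
+    "```python\n",
+    "import numpy as np\n",
+    "# sample analogue of the IV moment: theta_hat = E_n[resy * resZ] / E_n[resD * resZ]\n",
+    "theta_hat = np.mean(resy * resZ) / np.mean(resD * resZ)\n",
+    "```\n",
+    "\n",
+    "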
The role of cross-validation\n", + "is to avoid another source of bias due to potential overfitting.\n", + "\n", + "The estimator is adaptive,\n", + "in the sense that the first stage estimation errors do not affect the second\n", + "stage errors.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "_kg_hide-output": true, + "execution": { + "iopub.execute_input": "2021-04-23T10:41:33.973149Z", + "iopub.status.busy": "2021-04-23T10:41:33.971090Z", + "iopub.status.idle": "2021-04-23T10:42:08.602961Z", + "shell.execute_reply": "2021-04-23T10:42:08.601638Z" }, - { - "cell_type": "code", - "source": [ - "# DML with Post-Lasso:\n", - "modely = make_pipeline(StandardScaler(), RLasso(post=True))\n", - "modeld = make_pipeline(StandardScaler(), RLasso(post=True))\n", - "modelz = make_pipeline(StandardScaler(), RLasso(post=True))\n", - "\n", - "# Run DML model with nfolds folds of cross-fitting\n", - "res_rlasso = dml(x, z, d, y, modely, modeld, modelz, nfolds = 5)\n", - "table_rlasso = summary(*res_rlasso, x, d, z, y, name = 'RLasso')" - ], - "metadata": { - "id": "kpPe250uelE7" - }, - "execution_count": null, - "outputs": [] + "id": "yGW6JhG5zcL5", + "papermill": { + "duration": 34.670095, + "end_time": "2021-04-23T10:42:08.603197", + "exception": false, + "start_time": "2021-04-23T10:41:33.933102", + "status": "completed" }, - { - "cell_type": "code", - "source": [ - "# DML with Random Forests. RFs don't require scaling but we do it for consistency\n", - "modely = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=1234))\n", - "modeld = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=1234))\n", - "modelz = make_pipeline(StandardScaler(), RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=1234))\n", - "\n", - "# Run DML model with nfolds folds of cross-fitting (computationally intensive)\n", - "res_RF = dml(x, z, d, y, modely, modeld, modelz, nfolds = 5)\n", - "table_RF = summary(*res_RF, x, d, z, y, name = 'RF')" - ], - "metadata": { - "id": "pdKuGFpXglp8" - }, - "execution_count": null, - "outputs": [] + "tags": [] + }, + "outputs": [], + "source": [ + "# Import relevant packages\n", + "import pandas as pd\n", + "import numpy as np\n", + "import matplotlib.pyplot as plt\n", + "from sklearn.model_selection import cross_val_predict, KFold\n", + "from sklearn.pipeline import make_pipeline\n", + "from sklearn.preprocessing import StandardScaler\n", + "from sklearn.ensemble import RandomForestRegressor\n", + "import patsy\n", + "import warnings\n", + "from sklearn.base import BaseEstimator\n", + "import statsmodels.api as sm\n", + "from statsmodels.tools import add_constant\n", + "warnings.simplefilter('ignore')\n", + "np.random.seed(1234)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-04-23T10:42:08.664371Z", + "iopub.status.busy": "2021-04-23T10:42:08.629661Z", + "iopub.status.idle": "2021-04-23T10:42:10.458175Z", + "shell.execute_reply": "2021-04-23T10:42:10.456976Z" }, - { - "cell_type": "code", - "source": [ - "results = pd.concat([table_rlasso, table_RF], axis=0)\n", - "results" - ], - "metadata": { - "id": "ijo8p0Moig_F" - }, - "execution_count": null, - "outputs": [] + "id": "j2WVUbBDzcL-", + "papermill": { + "duration": 1.846109, + "end_time": "2021-04-23T10:42:10.458406", + "exception": false, + "start_time": "2021-04-23T10:42:08.612297", + 
"status": "completed" }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.013703, - "end_time": "2021-04-23T10:42:16.269251", - "exception": false, - "start_time": "2021-04-23T10:42:16.255548", - "status": "completed" - }, - "tags": [], - "id": "r7l6cFi2zcMA" - }, - "source": [ - "# Weak Instruments?" - ] + "tags": [] + }, + "outputs": [], + "source": [ + "def dml(X, Z, D, y, modely, modeld, modelz, *, nfolds, classifier=False):\n", + " '''\n", + " DML for the Partially Linear Model setting with cross-fitting\n", + "\n", + " Input\n", + " -----\n", + " X: the controls\n", + " Z: the instrument\n", + " D: the treatment\n", + " y: the outcome\n", + " modely: the ML model for predicting the outcome y\n", + " modeld: the ML model for predicting the treatment D\n", + " modelz: the ML model for predicting the instrument Z\n", + " nfolds: the number of folds in cross-fitting\n", + " classifier: bool, whether the modeld is a classifier or a regressor\n", + "\n", + " Output\n", + " ------\n", + " point: the point estimate of the treatment effect of D on y\n", + " stderr: the standard error of the treatment effect\n", + " yhat: the cross-fitted predictions for the outcome y\n", + " Dhat: the cross-fitted predictions for the treatment D\n", + " Zhat: the cross-fitted predictions for the instrument Z\n", + " resy: the outcome residuals\n", + " resD: the treatment residuals\n", + " resZ: the instrument residuals\n", + " epsilon: the final residual-on-residual OLS regression residual\n", + " '''\n", + " cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds\n", + " yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y\n", + " # out-of-fold predictions for D\n", + " # use predict or predict_proba dependent on classifier or regressor for D\n", + " if classifier:\n", + " Dhat = cross_val_predict(modeld, X, D, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]\n", + " Zhat = cross_val_predict(modelz, X, Z, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]\n", + " else:\n", + " Dhat = cross_val_predict(modeld, X, D, cv=cv, n_jobs=-1)\n", + " Zhat = cross_val_predict(modelz, X, Z, cv=cv, n_jobs=-1)\n", + " # calculate outcome and treatment residuals\n", + " resy = y - yhat\n", + " resD = D - Dhat\n", + " resZ = Z - Zhat\n", + " # rename for downstream tasks\n", + " resy = resy.rename(\"resy\")\n", + " resD = resD.rename(\"resD\")\n", + " resZ = resZ.rename(\"resZ\")\n", + "\n", + " # final stage ols based point estimate and standard error\n", + " point = np.mean(resy * resZ) / np.mean(resD * resZ)\n", + " epsilon = resy - point * resD\n", + " var = np.mean(epsilon**2 * resZ**2) / np.mean(resD * resZ)**2\n", + " stderr = np.sqrt(var / X.shape[0])\n", + " return point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "F6U6fnzne-77" + }, + "outputs": [], + "source": [ + "def summary(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon, X, D, Z, y, *, name):\n", + " '''\n", + " Convenience summary function that takes the results of the DML IV function\n", + " and summarizes several estimation quantities and performance metrics.\n", + " '''\n", + " return pd.DataFrame({'estimate': point, # point estimate\n", + " 'stderr': stderr, # standard error\n", + " 'lower': point - 1.96 * stderr, # lower end of 95% confidence interval\n", + " 'upper': point + 1.96 * stderr, # upper end of 95% confidence interval\n", + " 'rmse y': 
np.sqrt(np.mean(resy**2)),  # RMSE of model that predicts outcome y\n",
+    "                         'rmse Z': np.sqrt(np.mean(resZ**2)),  # RMSE of model that predicts instrument Z\n",
+    "                         'rmse D': np.sqrt(np.mean(resD**2))  # RMSE of model that predicts treatment D\n",
+    "                         }, index=[name])"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "x1g3XjsIzcL_",
+    "papermill": {
+     "duration": 0.011698,
+     "end_time": "2021-04-23T10:42:10.482689",
+     "exception": false,
+     "start_time": "2021-04-23T10:42:10.470991",
+     "status": "completed"
+    },
+    "tags": []
+   },
+   "source": [
+    "-----\n",
+    "\n",
+    "\n",
+    "# Empirical Example: Acemoglu, Johnson, Robinson (AER)\n",
+    "\n",
+    "* Y is log GDP;\n",
+    "* D is a measure of Protection from Expropriation, a proxy for\n",
+    "quality of institutions;\n",
+    "* Z is the log of settler mortality;\n",
+    "* W are geographical variables (latitude, latitude squared, continent dummies, as well as their interactions)\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "0Pc6OCp24rji"
+   },
+   "outputs": [],
+   "source": [
+    "file = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/AJR.csv\"\n",
+    "data = pd.read_csv(file)\n",
+    "data.shape"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "3rGw2hAqSx2Y"
+   },
+   "source": [
+    "Run the following commands to install hdmpy for rigorous lasso:\n",
+    "\n",
+    "``` !pip install multiprocess ```\n",
+    "\n",
+    "\n",
+    "```!git clone https://github.com/maxhuppertz/hdmpy.git ```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "O5Rknnz2Rh9O"
+   },
+   "outputs": [],
+   "source": [
+    "!pip install multiprocess\n",
+    "!git clone https://github.com/maxhuppertz/hdmpy.git"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "Kb9N0c5uS47n"
+   },
+   "outputs": [],
+   "source": [
+    "import hdmpy\n",
+    "\n",
+    "\n",
+    "class RLasso(BaseEstimator):\n",
+    "\n",
+    "    def __init__(self, *, post=True):\n",
+    "        self.post = post\n",
+    "\n",
+    "    def fit(self, X, y):\n",
+    "        self.rlasso_ = hdmpy.rlasso(X, y, post=self.post)\n",
+    "        return self\n",
+    "\n",
+    "    def predict(self, X):\n",
+    "        return np.array(X) @ np.array(self.rlasso_.est['beta']).flatten() + np.array(self.rlasso_.est['intercept'])"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "id": "nO6a6tJWS-FO"
+   },
+   "outputs": [],
+
"source": [ + "y = data['GDP']\n", + "d = data['Exprop']\n", + "z = data['logMort']\n", + "\n", + "controls = data.drop(['GDP', 'Exprop', 'logMort'], axis=1)\n", + "x = patsy.dmatrix('-1 + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)**2',\n", + " data=controls, return_type='dataframe')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "kpPe250uelE7" + }, + "outputs": [], + "source": [ + "# DML with Post-Lasso:\n", + "modely = make_pipeline(StandardScaler(), RLasso(post=True))\n", + "modeld = make_pipeline(StandardScaler(), RLasso(post=True))\n", + "modelz = make_pipeline(StandardScaler(), RLasso(post=True))\n", + "\n", + "# Run DML model with nfolds folds of cross-fitting\n", + "res_rlasso = dml(x, z, d, y, modely, modeld, modelz, nfolds=5)\n", + "table_rlasso = summary(*res_rlasso, x, d, z, y, name='RLasso')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "pdKuGFpXglp8" + }, + "outputs": [], + "source": [ + "# DML with Random Forests. RFs don't require scaling but we do it for consistency\n", + "modely = make_pipeline(StandardScaler(),\n", + " RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=1234))\n", + "modeld = make_pipeline(StandardScaler(),\n", + " RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=1234))\n", + "modelz = make_pipeline(StandardScaler(),\n", + " RandomForestRegressor(n_estimators=100, min_samples_leaf=5, random_state=1234))\n", + "\n", + "# Run DML model with nfolds folds of cross-fitting (computationally intensive)\n", + "res_RF = dml(x, z, d, y, modely, modeld, modelz, nfolds=5)\n", + "table_RF = summary(*res_RF, x, d, z, y, name='RF')" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "ijo8p0Moig_F" + }, + "outputs": [], + "source": [ + "results = pd.concat([table_rlasso, table_RF], axis=0)\n", + "results" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "r7l6cFi2zcMA", + "papermill": { + "duration": 0.013703, + "end_time": "2021-04-23T10:42:16.269251", + "exception": false, + "start_time": "2021-04-23T10:42:16.255548", + "status": "completed" }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.015919, - "end_time": "2021-04-23T10:42:32.423865", - "exception": false, - "start_time": "2021-04-23T10:42:32.407946", - "status": "completed" - }, - "tags": [], - "id": "L3fDDOmfzcMD" - }, - "source": [ - "## Anderson-Rubin Idea for Inference with Weak Instruments\n", - "\n", - "As shown above, we may have weak instruments because the t-stat in the regression $\\tilde D \\sim \\tilde Z$ is small relative to standard rules of thumb -- for example Stock and Yogo (2005) suggest accounting for weak instruments if the first stage F-statistic is less than 10 (and more recent work suggests even larger cutoffs).\n", - "\n", - "\n", - " Here, we consider one specific approach (from Anderson and Rubin (1949)) to doing valid inference under weak identification based upon the statistic:\n", - "$$\n", - "C(\\theta) = \\frac{ |E_n [(\\tilde Y - \\theta \\tilde D) \\tilde Z]|^2}{ \\mathbb{V}_n [(\\tilde Y - \\theta \\tilde D) \\tilde Z ]/n}.\n", - "$$\n", - "The empirical variance $\\mathbb{V}_n$ is defined as:\n", - "\\begin{align*}\n", - "\\mathbb{V}_{n}[ g(W_i)] &:= \\mathbb{E}_{n}g(W_i)g(W_i)' - \\mathbb{E}_{n}[ g(W_i)]\\mathbb{E}_{n}[ g(W_i)]'.\n", - "\\end{align*}\n", - "\n", - "If $\\theta_0 = \\theta$, then $C(\\theta) \\overset{a}{\\sim} N(0,1)^2 = \\chi^2(1)$. 
Therefore, we can reject the hypothesis $\\theta_0 = \\theta$ at level $a$ (for example $a = .05$ for a 5\\% level test) if $C(\\theta)> c(1-a)$ where $c(1-a)$ is the $(1-a)$- quantile of a $\\chi^2(1)$ variable. The probability of falsely rejecting the true hypothesis is approximately $a \\times 100\\%$. \n", - "To construct a $(1-a) \\times 100$\\% confidence region for $\\theta$, we can then simply invert the test by collecting all parameter values that are not rejected at the $a$ level:\n", - "$$\n", - "CR(\\theta) = \\{ \\theta \\in \\Theta: C(\\theta) \\leq c(1-a)\\}.\n", - "$$\n", - "\n", - "\n", - "In more complex settings with many controls or controls that enter with unknown functional form, we can simply replace the residuals $\\tilde Y$, $\\tilde D$, and $\\tilde Z$ by machine\n", - "learned cross-fitted residuals $\\check Y$, $\\check D$, and $ \\check Z$. Thanks to the orthogonality of the IV moment condition underlying the formulation outlined above, we can formally assert that the properties of $C(\\theta)$ and the subsequent testing procedure and confidence region for $\\theta$ continue to hold when using cross-fitted residuals. We will further be able to apply the general procedure to cases where $D$\n", - "is a vector, with a suitable adjustment of the statistic $C(\\theta)$.\n" - ] + "tags": [] + }, + "source": [ + "# Weak Instruments?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-04-23T10:42:16.301968Z", + "iopub.status.busy": "2021-04-23T10:42:16.300698Z", + "iopub.status.idle": "2021-04-23T10:42:32.390351Z", + "shell.execute_reply": "2021-04-23T10:42:32.388488Z" }, - { - "cell_type": "markdown", - "metadata": { - "papermill": { - "duration": 0.015943, - "end_time": "2021-04-23T10:42:32.455719", - "exception": false, - "start_time": "2021-04-23T10:42:32.439776", - "status": "completed" - }, - "tags": [], - "id": "PG2tPuFzzcMD" - }, - "source": [ - "So let's carry out DML inference combined with Anderson-Rubin Idea" - ] + "id": "rsUnPDfpzcMB", + "papermill": { + "duration": 16.107321, + "end_time": "2021-04-23T10:42:32.390552", + "exception": false, + "start_time": "2021-04-23T10:42:16.283231", + "status": "completed" }, - { - "cell_type": "code", - "source": [ - "from scipy.stats import chi2\n", - "\n", - "def DML_AR_PLIV(rY, rD, rZ, grid, alpha=0.05):\n", - " n = len(rY)\n", - " Cstat = np.zeros(len(grid))\n", - "\n", - " for i in range(len(grid)):\n", - " term1 = (rY - grid[i] * rD) * rZ\n", - " Cstat[i] = n * (np.mean(term1))**2 / np.var(term1)\n", - "\n", - " Cstat = np.nan_to_num(Cstat) # Replace any NaNs with zeros\n", - "\n", - " lower_bound = min(grid[Cstat < chi2.ppf(1 - alpha, 1)])\n", - " upper_bound = max(grid[Cstat < chi2.ppf(1 - alpha, 1)])\n", - "\n", - " plt.plot(grid, Cstat, linestyle='-', color='black')\n", - " plt.axhline(y=chi2.ppf(1 - alpha, 1), linestyle='--', color='blue')\n", - " plt.axvline(x=lower_bound, linestyle='--', color='red')\n", - " plt.axvline(x=upper_bound, linestyle='--', color='red')\n", - " plt.xlabel('Effect of institutions')\n", - " plt.ylabel('Statistic')\n", - " plt.title('')\n", - " plt.show()\n", - "\n", - " return {'UB': upper_bound, 'LB': lower_bound}" - ], - "metadata": { - "id": "6weyaIdJ3pWs" - }, - "execution_count": null, - "outputs": [] + "tags": [] + }, + "outputs": [], + "source": [ + "# Summary for RLasso\n", + "resy_rlasso, resD_rlasso, resZ_rlasso = res_rlasso[5], res_rlasso[6], res_rlasso[7]\n", + "# Using HC3 robust standard 
errors\n",
+    "regDZ_rlasso = sm.OLS(resD_rlasso, add_constant(resZ_rlasso)).fit(cov_type='HC3', use_t=True)\n",
+    "print(regDZ_rlasso.summary())\n",
+    "\n",
+    "# Summary for RF\n",
+    "resy_RF, resD_RF, resZ_RF = res_RF[5], res_RF[6], res_RF[7]\n",
+    "# Using HC3 robust standard errors\n",
+    "regDZ_RF = sm.OLS(resD_RF, add_constant(resZ_RF)).fit(cov_type='HC3', use_t=True)\n",
+    "print(regDZ_RF.summary())"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "id": "L3fDDOmfzcMD",
+    "papermill": {
+     "duration": 0.015919,
+     "end_time": "2021-04-23T10:42:32.423865",
+     "exception": false,
+     "start_time": "2021-04-23T10:42:32.407946",
+     "status": "completed"
+    },
+    "tags": []
+   },
+   "source": [
+    "## Anderson-Rubin Idea for Inference with Weak Instruments\n",
+    "\n",
+    "As shown above, we may have weak instruments because the t-stat in the regression $\\tilde D \\sim \\tilde Z$ is small relative to standard rules of thumb -- for example Stock and Yogo (2005) suggest accounting for weak instruments if the first-stage F-statistic is less than 10 (and more recent work suggests even larger cutoffs).\n",
+    "\n",
+    "\n",
+    " Here, we consider one specific approach (from Anderson and Rubin (1949)) to doing valid inference under weak identification based on the statistic:\n",
+    "$$\n",
+    "C(\\theta) = \\frac{ |E_n [(\\tilde Y - \\theta \\tilde D) \\tilde Z]|^2}{ \\mathbb{V}_n [(\\tilde Y - \\theta \\tilde D) \\tilde Z ]/n}.\n",
+    "$$\n",
+    "The empirical variance $\\mathbb{V}_n$ is defined as:\n",
+    "\\begin{align*}\n",
+    "\\mathbb{V}_{n}[ g(W_i)] &:= \\mathbb{E}_{n}g(W_i)g(W_i)' - \\mathbb{E}_{n}[ g(W_i)]\\mathbb{E}_{n}[ g(W_i)]'.\n",
+    "\\end{align*}\n",
+    "\n",
+    "If $\\theta_0 = \\theta$, then $C(\\theta) \\overset{a}{\\sim} N(0,1)^2 = \\chi^2(1)$. Therefore, we can reject the hypothesis $\\theta_0 = \\theta$ at level $a$ (for example $a = .05$ for a 5\\% level test) if $C(\\theta) > c(1-a)$ where $c(1-a)$ is the $(1-a)$-quantile of a $\\chi^2(1)$ variable. The probability of falsely rejecting the true hypothesis is approximately $a \\times 100\\%$. 
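For example, with $a = .05$ the critical value is $c(.95) \\approx 3.84$, the $.95$ quantile of a $\\chi^2(1)$ variable. 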
\n", + "To construct a $(1-a) \\times 100$\\% confidence region for $\\theta$, we can then simply invert the test by collecting all parameter values that are not rejected at the $a$ level:\n", + "$$\n", + "CR(\\theta) = \\{ \\theta \\in \\Theta: C(\\theta) \\leq c(1-a)\\}.\n", + "$$\n", + "\n", + "\n", + "In more complex settings with many controls or controls that enter with unknown functional form, we can simply replace the residuals $\\tilde Y$, $\\tilde D$, and $\\tilde Z$ by machine\n", + "learned cross-fitted residuals $\\check Y$, $\\check D$, and $ \\check Z$. Thanks to the orthogonality of the IV moment condition underlying the formulation outlined above, we can formally assert that the properties of $C(\\theta)$ and the subsequent testing procedure and confidence region for $\\theta$ continue to hold when using cross-fitted residuals. We will further be able to apply the general procedure to cases where $D$\n", + "is a vector, with a suitable adjustment of the statistic $C(\\theta)$.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "PG2tPuFzzcMD", + "papermill": { + "duration": 0.015943, + "end_time": "2021-04-23T10:42:32.455719", + "exception": false, + "start_time": "2021-04-23T10:42:32.439776", + "status": "completed" }, - "language_info": { - "name": "python" + "tags": [] + }, + "source": [ + "So let's carry out DML inference combined with Anderson-Rubin Idea" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "6weyaIdJ3pWs" + }, + "outputs": [], + "source": [ + "from scipy.stats import chi2\n", + "\n", + "\n", + "def DML_AR_PLIV(rY, rD, rZ, grid, alpha=0.05):\n", + " n = len(rY)\n", + " Cstat = np.zeros(len(grid))\n", + "\n", + " for i in range(len(grid)):\n", + " term1 = (rY - grid[i] * rD) * rZ\n", + " Cstat[i] = n * (np.mean(term1))**2 / np.var(term1)\n", + "\n", + " Cstat = np.nan_to_num(Cstat) # Replace any NaNs with zeros\n", + "\n", + " lower_bound = min(grid[Cstat < chi2.ppf(1 - alpha, 1)])\n", + " upper_bound = max(grid[Cstat < chi2.ppf(1 - alpha, 1)])\n", + "\n", + " plt.plot(grid, Cstat, linestyle='-', color='black')\n", + " plt.axhline(y=chi2.ppf(1 - alpha, 1), linestyle='--', color='blue')\n", + " plt.axvline(x=lower_bound, linestyle='--', color='red')\n", + " plt.axvline(x=upper_bound, linestyle='--', color='red')\n", + " plt.xlabel('Effect of institutions')\n", + " plt.ylabel('Statistic')\n", + " plt.title('')\n", + " plt.show()\n", + "\n", + " return {'UB': upper_bound, 'LB': lower_bound}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "execution": { + "iopub.execute_input": "2021-04-23T10:42:32.545479Z", + "iopub.status.busy": "2021-04-23T10:42:32.543976Z", + "iopub.status.idle": "2021-04-23T10:42:33.002321Z", + "shell.execute_reply": "2021-04-23T10:42:33.001039Z" }, + "id": "k9bB2O13zcME", "papermill": { - "duration": 62.79319, - "end_time": "2021-04-23T10:42:34.038582", - "environment_variables": {}, - "exception": null, - "input_path": "__notebook__.ipynb", - "output_path": "__notebook__.ipynb", - "parameters": {}, - "start_time": "2021-04-23T10:41:31.245392", - "version": "2.1.3" + "duration": 0.478528, + "end_time": "2021-04-23T10:42:33.002469", + "exception": false, + "start_time": "2021-04-23T10:42:32.523941", + "status": "completed" }, - "colab": { - "provenance": [] - } + "tags": [] + }, + "outputs": [], + "source": [ + "# Using RLasso\n", + "dml_ar_rlasso = DML_AR_PLIV(rY=resy_rlasso, rD=resD_rlasso, rZ=resZ_rlasso, grid=np.arange(-2, 2, 0.01))\n", + 
"print(\"RLasso UB and LB: \", dml_ar_rlasso)\n", + "# Using RF\n", + "dml_ar_RF = DML_AR_PLIV(rY=resy_RF, rD=resD_RF, rZ=resZ_RF, grid=np.arange(-2, 2, 0.01))\n", + "print(\"RF UB and LB: \", dml_ar_RF)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" }, - "nbformat": 4, - "nbformat_minor": 0 -} \ No newline at end of file + "papermill": { + "duration": 62.79319, + "end_time": "2021-04-23T10:42:34.038582", + "environment_variables": {}, + "exception": null, + "input_path": "__notebook__.ipynb", + "output_path": "__notebook__.ipynb", + "parameters": {}, + "start_time": "2021-04-23T10:41:31.245392", + "version": "2.1.3" + } + }, + "nbformat": 4, + "nbformat_minor": 1 +} diff --git a/AC2/python-dml-401k-IV.ipynb b/AC2/python-dml-401k-IV.ipynb index b102302b..a4bb8b41 100644 --- a/AC2/python-dml-401k-IV.ipynb +++ b/AC2/python-dml-401k-IV.ipynb @@ -86,20 +86,17 @@ "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", - "from sklearn.model_selection import train_test_split\n", - "from sklearn.model_selection import cross_val_score, cross_val_predict, KFold\n", + "from sklearn.model_selection import cross_val_predict, KFold\n", "from sklearn.pipeline import make_pipeline\n", - "from sklearn.linear_model import LassoCV, RidgeCV, ElasticNetCV, LinearRegression, Ridge, Lasso, LogisticRegressionCV\n", + "from sklearn.linear_model import LassoCV, LogisticRegressionCV\n", "from sklearn.preprocessing import StandardScaler\n", "from sklearn.ensemble import GradientBoostingRegressor, GradientBoostingClassifier\n", "from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier\n", "from sklearn.tree import DecisionTreeRegressor, DecisionTreeClassifier\n", - "import patsy\n", + "import scipy.stats\n", "import warnings\n", "from sklearn.base import BaseEstimator, clone\n", - "import statsmodels.api as sm\n", "from IPython.display import Markdown\n", - "import os\n", "import wget\n", "import seaborn as sns\n", "warnings.simplefilter('ignore')\n", @@ -154,7 +151,7 @@ "source": [ "readme = \"https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.md\"\n", "filename = wget.download(readme)\n", - "display(Markdown(open(filename, 'r').read()))" + "Markdown(open(filename, 'r').read())" ] }, { @@ -371,9 +368,10 @@ }, "outputs": [], "source": [ - "from sklearn.base import TransformerMixin, BaseEstimator\n", + "from sklearn.base import TransformerMixin\n", "from formulaic import Formula\n", "\n", + "\n", "class FormulaTransformer(TransformerMixin, BaseEstimator):\n", "\n", " def __init__(self, formula, array=False):\n", @@ -502,8 +500,8 @@ "outputs": [], "source": [ "resy = y - modely.fit(X, y).predict(X)\n", - "resZ = Z - modelz.fit(X, Z).predict(X) # instrument is e401k (eligibility)\n", - "resD = D - modeld.fit(X, D).predict(X) # treatment is p401k (participation)" + "resZ = Z - modelz.fit(X, Z).predict(X) # instrument is e401k (eligibility)\n", + "resD = D - modeld.fit(X, D).predict(X) # treatment is p401k (participation)" ] }, { @@ -575,8 +573,8 @@ " resZ: the instrument residuals\n", " epsilon: the final residual-on-residual OLS regression residual\n", " '''\n", - " cv = 
KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds\n", - " yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y\n", + " cv = KFold(n_splits=nfolds, shuffle=True, random_state=123) # shuffled k-folds\n", + " yhat = cross_val_predict(modely, X, y, cv=cv, n_jobs=-1) # out-of-fold predictions for y\n", " # out-of-fold predictions for D\n", " # use predict or predict_proba dependent on classifier or regressor for D\n", " if classifier:\n", @@ -590,9 +588,9 @@ " resD = D - Dhat\n", " resZ = Z - Zhat\n", " # final stage ols based point estimate and standard error\n", - " point = np.mean(resy * resZ) / np.mean(resD*resZ)\n", + " point = np.mean(resy * resZ) / np.mean(resD * resZ)\n", " epsilon = resy - point * resD\n", - " var = np.mean(epsilon**2 * resZ**2) / np.mean(resD*resZ)**2\n", + " var = np.mean(epsilon**2 * resZ**2) / np.mean(resD * resZ)**2\n", " stderr = np.sqrt(var / X.shape[0])\n", " return point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon" ] @@ -611,15 +609,15 @@ " Convenience summary function that takes the results of the DML function\n", " and summarizes several estimation quantities and performance metrics.\n", " '''\n", - " return pd.DataFrame({'estimate': point, # point estimate\n", - " 'stderr': stderr, # standard error\n", - " 'lower': point - 1.96*stderr, # lower end of 95% confidence interval\n", - " 'upper': point + 1.96*stderr, # upper end of 95% confidence interval\n", - " 'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y\n", - " 'rmse D': np.sqrt(np.mean(resD**2)), # RMSE of model that predicts treatment D\n", - " 'rmse Z': np.sqrt(np.mean(resZ**2)), # RMSE of model that predicts treatment D\n", - " 'accuracy D': np.mean(np.abs(resD) < .5), # binary classification accuracy of model for D\n", - " 'accuracy Z': np.mean(np.abs(resZ) < .5), # binary classification accuracy of model for Z\n", + " return pd.DataFrame({'estimate': point, # point estimate\n", + " 'stderr': stderr, # standard error\n", + " 'lower': point - 1.96 * stderr, # lower end of 95% confidence interval\n", + " 'upper': point + 1.96 * stderr, # upper end of 95% confidence interval\n", + " 'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y\n", + " 'rmse D': np.sqrt(np.mean(resD**2)), # RMSE of model that predicts treatment D\n", + " 'rmse Z': np.sqrt(np.mean(resZ**2)), # RMSE of model that predicts treatment D\n", + " 'accuracy D': np.mean(np.abs(resD) < .5), # binary classification accuracy of model for D\n", + " 'accuracy Z': np.mean(np.abs(resZ) < .5), # binary classification accuracy of model for Z\n", " }, index=[name])" ] }, @@ -707,7 +705,7 @@ }, "outputs": [], "source": [ - "table = pd.concat([table , summary(*result, X, Z, D, y, name='lasso/logistic')])\n", + "table = pd.concat([table, summary(*result, X, Z, D, y, name='lasso/logistic')])\n", "table" ] }, @@ -745,7 +743,7 @@ }, "outputs": [], "source": [ - "table = pd.concat([table , summary(*result, X, Z, D, y, name='random forest')])\n", + "table = pd.concat([table, summary(*result, X, Z, D, y, name='random forest')])\n", "table" ] }, @@ -783,7 +781,7 @@ }, "outputs": [], "source": [ - "table = pd.concat([table , summary(*result, X, Z, D, y, name='decision tree')])\n", + "table = pd.concat([table, summary(*result, X, Z, D, y, name='decision tree')])\n", "table" ] }, @@ -821,7 +819,7 @@ }, "outputs": [], "source": [ - "table = pd.concat([table , summary(*result, X, Z, D, y, name='boosted forest')])\n", + "table = 
pd.concat([table, summary(*result, X, Z, D, y, name='boosted forest')])\n", "table" ] }, @@ -859,7 +857,7 @@ "from flaml import AutoML\n", "\n", "flamly = make_pipeline(transformer, AutoML(time_budget=100, task='regression', early_stop=True,\n", - " eval_method='cv', n_splits=3, metric='r2', verbose=0))\n", + " eval_method='cv', n_splits=3, metric='r2', verbose=0))\n", "flamld = make_pipeline(transformer, AutoML(time_budget=100, task='classification', early_stop=True,\n", " eval_method='cv', n_splits=3, metric='r2', verbose=0))\n", "flamlz = make_pipeline(transformer, AutoML(time_budget=100, task='classification', early_stop=True,\n", @@ -926,7 +924,7 @@ }, "outputs": [], "source": [ - "table = pd.concat([table , summary(*result, X, Z, D, y, name='automl (semi-cfit)')])\n", + "table = pd.concat([table, summary(*result, X, Z, D, y, name='automl (semi-cfit)')])\n", "table" ] }, @@ -949,8 +947,6 @@ }, "outputs": [], "source": [ - "import scipy.stats\n", - "\n", "def robust_inference(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, epsilon, X, Z, D, y, *, grid, alpha=0.05):\n", " '''\n", " Inference in the partially linear IV model that is robust to weak identification.\n", @@ -962,7 +958,6 @@ " accept = []\n", " for theta in grid:\n", " moment = (resy - theta * resD) * resZ\n", - " accept.append(np.sum(moment ** 2) < thr)\n", " test = n * np.mean(moment)**2 / np.var(moment)\n", " if test <= thr:\n", " accept.append(theta)\n", @@ -990,8 +985,7 @@ }, "outputs": [], "source": [ - "filtered_region = [x for x in region if x != False]\n", - "np.min(filtered_region), np.max(filtered_region)" + "np.min(region), np.max(region)" ] }, { @@ -1134,16 +1128,16 @@ " # and a separate model for D==1.\n", " for train, test in cv.split(X, y):\n", " # train an outcome model on training data that received Z=0 and predict outcome on all data in test set\n", - " yhat0[test] = clone(modely0).fit(X.iloc[train][Z[train]==0], y[train][Z[train]==0]).predict(X.iloc[test])\n", + " yhat0[test] = clone(modely0).fit(X.iloc[train][Z[train] == 0], y[train][Z[train] == 0]).predict(X.iloc[test])\n", " # train an outcome model on training data that received Z=1 and predict outcome on all data in test set\n", - " yhat1[test] = clone(modely1).fit(X.iloc[train][Z[train]==1], y[train][Z[train]==1]).predict(X.iloc[test])\n", + " yhat1[test] = clone(modely1).fit(X.iloc[train][Z[train] == 1], y[train][Z[train] == 1]).predict(X.iloc[test])\n", " # train a treatment model on training data that received Z=0 and predict treatment on all data in test set\n", - " if np.mean(D[train][Z[train]==0]) > 0: # it could be that D=0, whenever Z=0 deterministically\n", - " modeld0_ = clone(modeld0).fit(X.iloc[train][Z[train]==0], D[train][Z[train]==0])\n", + " if np.mean(D[train][Z[train] == 0]) > 0: # it could be that D=0, whenever Z=0 deterministically\n", + " modeld0_ = clone(modeld0).fit(X.iloc[train][Z[train] == 0], D[train][Z[train] == 0])\n", " Dhat0[test] = modeld0_.predict_proba(X.iloc[test])[:, 1]\n", " # train a treamtent model on training data that received Z=1 and predict treatment on all data in test set\n", - " if np.mean(D[train][Z[train]==1]) < 1: # it could be that D=1, whenever Z=1 deterministically\n", - " modeld1_ = clone(modeld1).fit(X.iloc[train][Z[train]==1], D[train][Z[train]==1])\n", + " if np.mean(D[train][Z[train] == 1]) < 1: # it could be that D=1, whenever Z=1 deterministically\n", + " modeld1_ = clone(modeld1).fit(X.iloc[train][Z[train] == 1], D[train][Z[train] == 1])\n", " Dhat1[test] = 
modeld1_.predict_proba(X.iloc[test])[:, 1]\n", " else:\n", " Dhat1[test] = 1\n", @@ -1155,7 +1149,7 @@ " Zhat = cross_val_predict(modelz, X, Z, cv=cv, method='predict_proba', n_jobs=-1)[:, 1]\n", " Zhat = np.clip(Zhat, trimming, 1 - trimming)\n", " # doubly robust quantity for every sample\n", - " HZ = Z/Zhat - (1 - Z)/(1 - Zhat)\n", + " HZ = Z / Zhat - (1 - Z) / (1 - Zhat)\n", " drZ = yhat1 - yhat0 + (y - yhat) * HZ\n", " drD = Dhat1 - Dhat0 + (D - Dhat) * HZ\n", " point = np.mean(drZ) / np.mean(drD)\n", @@ -1180,15 +1174,15 @@ " Convenience summary function that takes the results of the DML function\n", " and summarizes several estimation quantities and performance metrics.\n", " '''\n", - " return pd.DataFrame({'estimate': point, # point estimate\n", - " 'stderr': stderr, # standard error\n", - " 'lower': point - 1.96*stderr, # lower end of 95% confidence interval\n", - " 'upper': point + 1.96*stderr, # upper end of 95% confidence interval\n", - " 'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y\n", - " 'rmse D': np.sqrt(np.mean(resD**2)), # RMSE of model that predicts treatment D\n", - " 'rmse Z': np.sqrt(np.mean(resZ**2)), # RMSE of model that predicts treatment D\n", - " 'accuracy D': np.mean(np.abs(resD) < .5), # binary classification accuracy of model for D\n", - " 'accuracy Z': np.mean(np.abs(resZ) < .5), # binary classification accuracy of model for Z\n", + " return pd.DataFrame({'estimate': point, # point estimate\n", + " 'stderr': stderr, # standard error\n", + " 'lower': point - 1.96 * stderr, # lower end of 95% confidence interval\n", + " 'upper': point + 1.96 * stderr, # upper end of 95% confidence interval\n", + " 'rmse y': np.sqrt(np.mean(resy**2)), # RMSE of model that predicts outcome y\n", + " 'rmse D': np.sqrt(np.mean(resD**2)), # RMSE of model that predicts treatment D\n", + " 'rmse Z': np.sqrt(np.mean(resZ**2)), # RMSE of model that predicts treatment D\n", + " 'accuracy D': np.mean(np.abs(resD) < .5), # binary classification accuracy of model for D\n", + " 'accuracy Z': np.mean(np.abs(resZ) < .5), # binary classification accuracy of model for Z\n", " }, index=[name])" ] }, @@ -1245,7 +1239,7 @@ }, "outputs": [], "source": [ - "tableiiv = pd.concat([tableiiv , summary(*result, X, Z, D, y, name='random forest')])\n", + "tableiiv = pd.concat([tableiiv, summary(*result, X, Z, D, y, name='random forest')])\n", "tableiiv" ] }, @@ -1273,7 +1267,7 @@ }, "outputs": [], "source": [ - "tableiiv = pd.concat([tableiiv , summary(*result, X, Z, D, y, name='decision trees')])\n", + "tableiiv = pd.concat([tableiiv, summary(*result, X, Z, D, y, name='decision trees')])\n", "tableiiv" ] }, @@ -1301,7 +1295,7 @@ }, "outputs": [], "source": [ - "tableiiv = pd.concat([tableiiv , summary(*result, X, Z, D, y, name='boosted trees')])\n", + "tableiiv = pd.concat([tableiiv, summary(*result, X, Z, D, y, name='boosted trees')])\n", "tableiiv" ] }, @@ -1326,11 +1320,11 @@ "source": [ "from flaml import AutoML\n", "flamly0 = make_pipeline(transformer, AutoML(time_budget=60, task='regression', early_stop=True,\n", - " eval_method='cv', n_splits=3, metric='r2', verbose=0))\n", + " eval_method='cv', n_splits=3, metric='r2', verbose=0))\n", "flamly1 = make_pipeline(transformer, AutoML(time_budget=60, task='regression', early_stop=True,\n", - " eval_method='cv', n_splits=3, metric='r2', verbose=0))\n", + " eval_method='cv', n_splits=3, metric='r2', verbose=0))\n", "flamld1 = make_pipeline(transformer, AutoML(time_budget=60, task='classification', early_stop=True,\n", - " 
+    "                                            eval_method='cv', n_splits=3, metric='r2', verbose=0))\n",
     "flamlz = make_pipeline(transformer, AutoML(time_budget=60, task='classification', early_stop=True,\n",
     "                                            eval_method='cv', n_splits=3, metric='r2', verbose=0))"
    ]
   },
@@ -1344,7 +1338,7 @@
    },
    "outputs": [],
    "source": [
-    "flamly0.fit(X[Z==0], y[Z==0])\n",
+    "flamly0.fit(X[Z == 0], y[Z == 0])\n",
     "besty0 = make_pipeline(transformer, clone(flamly0[-1].best_model_for_estimator(flamly0[-1].best_estimator)))"
    ]
   },
@@ -1357,7 +1351,7 @@
    },
    "outputs": [],
    "source": [
-    "flamly1.fit(X[Z==1], y[Z==1])\n",
+    "flamly1.fit(X[Z == 1], y[Z == 1])\n",
     "besty1 = make_pipeline(transformer, clone(flamly1[-1].best_model_for_estimator(flamly1[-1].best_estimator)))"
    ]
   },
@@ -1371,7 +1365,7 @@
    "outputs": [],
    "source": [
     "from sklearn.dummy import DummyClassifier\n",
-    "bestd0 = DummyClassifier() # since D=0 whenever Z=0"
+    "bestd0 = DummyClassifier()  # since D=0 whenever Z=0"
    ]
   },
   {
@@ -1383,7 +1377,7 @@
    },
    "outputs": [],
    "source": [
-    "flamld1.fit(X[Z==1], D[Z==1])\n",
+    "flamld1.fit(X[Z == 1], D[Z == 1])\n",
     "bestd1 = make_pipeline(transformer, clone(flamld1[-1].best_model_for_estimator(flamld1[-1].best_estimator)))"
    ]
   },
@@ -1421,7 +1415,7 @@
    },
    "outputs": [],
    "source": [
-    "tableiiv = pd.concat([tableiiv , summary(*result, X, Z, D, y, name='automl (semi-cfit)')])\n",
+    "tableiiv = pd.concat([tableiiv, summary(*result, X, Z, D, y, name='automl (semi-cfit)')])\n",
     "tableiiv"
    ]
   },
@@ -1466,9 +1460,8 @@
    },
    "outputs": [],
    "source": [
-    "import scipy.stats\n",
-    "\n",
-    "def iivm_robust_inference(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, drZ, drD, X, Z, D, y, *, grid, alpha=0.05):\n",
+    "def iivm_robust_inference(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, drZ, drD, X, Z, D, y, *,\n",
+    "                          grid, alpha=0.05):\n",
     "    '''\n",
     "    Inference in the interactive IV model that is robust to weak identification.\n",
     "    grid: grid of theta values to search over when trying to identify the confidence region\n",
@@ -1543,7 +1536,7 @@
     "# code verified with this version of econml\n",
     "# If you want to try out the latest econml version, remove the version number;\n",
     "# albeit errors could arise due to library updates\n",
-    "!pip install econml==0.14.1 #"
+    "!pip install econml==0.14.1"
    ]
   },
@@ -1721,6 +1714,7 @@
     "import doubleml as dml\n",
     "\n",
+    "\n",
     "class RegWrapper(BaseEstimator):\n",
     "\n",
     "    def __init__(self, clf):\n",
@@ -1733,6 +1727,7 @@
     "    def predict(self, X):\n",
     "        return self.clf_.predict_proba(X)[:, 1]\n",
     "\n",
+    "\n",
     "dml_plr_obj = dml.DoubleMLPLIV(dml_data,\n",
     "                               LassoCV(cv=cv),\n",
     "                               RegWrapper(LogisticRegressionCV(cv=cv)),\n",
@@ -1752,6 +1747,7 @@
     "import doubleml as dml\n",
     "\n",
+    "\n",
     "class Wrapper(BaseEstimator):\n",
     "\n",
     "    def __init__(self, clf):\n",
@@ -1778,6 +1774,7 @@
     "    def predict(self, X):\n",
     "        return self.predict_proba(X)[:, 1] >= .5\n",
     "\n",
+    "\n",
     "dml_plr_obj = dml.DoubleMLIIVM(dml_data,\n",
     "                               LassoCV(cv=cv),\n",
     "                               Wrapper(LogisticRegressionCV(cv=cv)),\n",
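The `robust_inference` and `iivm_robust_inference` helpers in the diff above both build their confidence region by test inversion: every candidate `theta` on the grid is kept whenever the score-type statistic `n * mean(moment)**2 / var(moment)` stays below a critical value `thr`. The hunks never show how `thr` is set; the sketch below assumes it is the chi-squared(1) critical value (consistent with the `scipy.stats` imports being consolidated out of these cells) and runs on synthetic residuals, so the variable names mirror the notebook but the numbers are purely illustrative.

# Minimal sketch of the test-inversion logic; treating `thr` as the chi2(1)
# critical value is an assumption, and the residuals here are synthetic.
import numpy as np
import scipy.stats

rng = np.random.default_rng(0)
n = 1000
resZ = rng.normal(size=n)                # residualized instrument, Z - m(X)
resD = 0.5 * resZ + rng.normal(size=n)   # residualized treatment, D - r(X)
resy = 1.0 * resD + rng.normal(size=n)   # residualized outcome; true theta is 1

alpha = 0.05
thr = scipy.stats.chi2.ppf(1 - alpha, df=1)  # assumed critical value

accept = []
for theta in np.linspace(-1, 3, 401):
    moment = (resy - theta * resD) * resZ           # identifying moment at theta
    test = n * np.mean(moment)**2 / np.var(moment)  # ~ chi2(1) under H0: theta_0 = theta
    if test <= thr:
        accept.append(theta)                        # theta is not rejected at level alpha

print(f"robust 95% confidence region: [{min(accept):.2f}, {max(accept):.2f}]")

Inverting a pointwise test in this way stays valid even when the instrument is weak, which is why the notebook reports the region's endpoints via `np.min(region), np.max(region)` instead of the usual point estimate plus or minus 1.96 standard errors.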
"cell_type": "markdown", + "metadata": { + "id": "nBty5iAb_VyE" + }, + "source": [ + "# A Simple Example of Properties of IV estimator when Instruments are Weak" + ] }, - "cells": [ - { - "cell_type": "markdown", - "source": [ - "# A Simple Example of Properties of IV estimator when Instruments are Weak" - ], - "metadata": { - "id": "nBty5iAb_VyE" - } - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "id": "atRvUXGn_VFW" - }, - "outputs": [], - "source": [ - "import numpy as np\n", - "from scipy.stats import norm\n", - "from scipy.stats import gaussian_kde\n", - "import statsmodels.api as sm\n", - "from statsmodels.sandbox.regression.gmm import IV2SLS\n", - "import matplotlib.pyplot as plt" - ] - }, - { - "cell_type": "markdown", - "source": [ - "Simulation Design" - ], - "metadata": { - "id": "wLevLetuBzrs" - } - }, - { - "cell_type": "code", - "source": [ - "np.random.seed(123)\n", - "\n", - "n = 100\n", - "beta = 0.1 # 0.1 weak IV\n", - "# beta = 1 # 1 strong IV\n", - "\n", - "# One realization\n", - "U = norm.rvs(size=n)\n", - "Z = norm.rvs(size=n) # generate instrument\n", - "D = beta * Z + U # generate endogenous variable\n", - "Y = D + U # the true causal effect is 1\n", - "\n", - "# First stage regression\n", - "Z1 = sm.add_constant(Z)\n", - "model_first_stage = sm.OLS(D, Z1)\n", - "results_first_stage = model_first_stage.fit()\n", - "print(results_first_stage.summary()) # first stage is very weak here when we set beta = .1\n", - "\n", - "# IV regression\n", - "D1 = sm.add_constant(D)\n", - "model_iv = IV2SLS(Y, D1, Z1)\n", - "results_iv = model_iv.fit()\n", - "print(results_iv.summary())" - ], - "metadata": { - "id": "7oacBDrQB1AU" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "Note that the instrument is weak here (strength of the instrument is controlled by setting $\\beta$) -- the t-stat is smaller than any rule-of-thumb suggested in the literature (e.g. $\\sqrt{10}$) ." 
- ], - "metadata": { - "id": "-Y7yS_mvGyt9" - } - }, - { - "cell_type": "markdown", - "source": [ - "# Run 10000 trials to evaluate distribution of the IV estimator" - ], - "metadata": { - "id": "f4wBKUdTG0O2" - } - }, - { - "cell_type": "code", - "source": [ - "np.random.seed(123)\n", - "\n", - "B= 10000 # trials\n", - "IVEst = np.zeros(B)\n", - "\n", - "for i in range(B):\n", - " U = norm.rvs(size=n)\n", - " Z = norm.rvs(size=n) # generate instrument\n", - " D = beta * Z + U # generate endogenous variable\n", - " Y = D + U # the true causal effect is 1\n", - "\n", - " # IV regression\n", - " model_iv = IV2SLS(Y, D, Z)\n", - " results_iv = model_iv.fit()\n", - " IVEst[i] = results_iv.params[0] # Coefficient of D\n", - "\n", - "\n" - ], - "metadata": { - "id": "el9qoJqbG258" - }, - "execution_count": null, - "outputs": [] - }, - { - "cell_type": "markdown", - "source": [ - "# Plot the Actual Distribution against the Normal Approximation (based on Strong Instrument Assumption)" - ], - "metadata": { - "id": "aXLgc8TkIzYf" - } - }, - { - "cell_type": "code", - "source": [ - "# Plotting the density of IVEst\n", - "plt.figure(figsize=(8, 6))\n", - "plt.xlim(-5, 5)\n", - "plt.xlabel(\"IV Estimator - True Effect\")\n", - "plt.title(\"Actual Distribution vs Gaussian\")\n", - "\n", - "# Plotting density estimate of simulated IV coefficients\n", - "val = np.linspace(-5, 5, 200)\n", - "kde = gaussian_kde(IVEst - 1, bw_method = .001) # Need to play with bandwidth depending on problem features.\n", - "density = kde(np.linspace(-5, 5, 1000))\n", - "plt.plot(np.linspace(-5, 5, 1000), density, color='blue', label='Actual Distribution')\n", - "\n", - "# Plotting Gaussian distribution\n", - "var = (1 / (beta ** 2)) * (1 / n) # theoretical variance of IV\n", - "sd = np.sqrt(var)\n", - "gaussian = norm.pdf(val, scale=sd)\n", - "plt.plot(val, gaussian, color='red', linestyle='--', label='Gaussian Distribution')\n", - "\n", - "plt.legend()\n", - "\n", - "# Calculating rejection frequency\n", - "rejection_frequency = np.sum(np.abs(IVEst - 1) / sd > 1.96) / B\n", - "print(\"Rejection Frequency is\", rejection_frequency, \"while we expect it to be around 0.05\")\n", - "\n", - "plt.show()" - ], - "metadata": { - "id": "DDr-glMeI2Fk" - }, - "execution_count": null, - "outputs": [] - } - ] + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "atRvUXGn_VFW" + }, + "outputs": [], + "source": [ + "import numpy as np\n", + "from scipy.stats import norm\n", + "from scipy.stats import gaussian_kde\n", + "import statsmodels.api as sm\n", + "from statsmodels.sandbox.regression.gmm import IV2SLS\n", + "import matplotlib.pyplot as plt" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "wLevLetuBzrs" + }, + "source": [ + "Simulation Design" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "7oacBDrQB1AU" + }, + "outputs": [], + "source": [ + "np.random.seed(123)\n", + "\n", + "n = 100\n", + "beta = 0.1 # 0.1 weak IV\n", + "# beta = 1 # 1 strong IV\n", + "\n", + "# One realization\n", + "U = norm.rvs(size=n)\n", + "Z = norm.rvs(size=n) # generate instrument\n", + "D = beta * Z + U # generate endogenous variable\n", + "Y = D + U # the true causal effect is 1\n", + "\n", + "# First stage regression\n", + "Z1 = sm.add_constant(Z)\n", + "model_first_stage = sm.OLS(D, Z1)\n", + "results_first_stage = model_first_stage.fit()\n", + "print(results_first_stage.summary()) # first stage is very weak here when we set beta = .1\n", + "\n", + "# IV regression\n", + 
"D1 = sm.add_constant(D)\n", + "model_iv = IV2SLS(Y, D1, Z1)\n", + "results_iv = model_iv.fit()\n", + "print(results_iv.summary())" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-Y7yS_mvGyt9" + }, + "source": [ + "Note that the instrument is weak here (strength of the instrument is controlled by setting $\\beta$) -- the t-stat is smaller than any rule-of-thumb suggested in the literature (e.g. $\\sqrt{10}$) ." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "f4wBKUdTG0O2" + }, + "source": [ + "# Run 10000 trials to evaluate distribution of the IV estimator" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "el9qoJqbG258" + }, + "outputs": [], + "source": [ + "np.random.seed(123)\n", + "\n", + "B = 10000 # trials\n", + "IVEst = np.zeros(B)\n", + "\n", + "for i in range(B):\n", + " U = norm.rvs(size=n)\n", + " Z = norm.rvs(size=n) # generate instrument\n", + " D = beta * Z + U # generate endogenous variable\n", + " Y = D + U # the true causal effect is 1\n", + "\n", + " # IV regression\n", + " model_iv = IV2SLS(Y, D, Z)\n", + " results_iv = model_iv.fit()\n", + " IVEst[i] = results_iv.params[0] # Coefficient of D" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "aXLgc8TkIzYf" + }, + "source": [ + "# Plot the Actual Distribution against the Normal Approximation (based on Strong Instrument Assumption)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "id": "DDr-glMeI2Fk" + }, + "outputs": [], + "source": [ + "# Plotting the density of IVEst\n", + "plt.figure(figsize=(8, 6))\n", + "plt.xlim(-5, 5)\n", + "plt.xlabel(\"IV Estimator - True Effect\")\n", + "plt.title(\"Actual Distribution vs Gaussian\")\n", + "\n", + "# Plotting density estimate of simulated IV coefficients\n", + "val = np.linspace(-5, 5, 200)\n", + "kde = gaussian_kde(IVEst - 1, bw_method=.001) # Need to play with bandwidth depending on problem features.\n", + "density = kde(np.linspace(-5, 5, 1000))\n", + "plt.plot(np.linspace(-5, 5, 1000), density, color='blue', label='Actual Distribution')\n", + "\n", + "# Plotting Gaussian distribution\n", + "var = (1 / (beta ** 2)) * (1 / n) # theoretical variance of IV\n", + "sd = np.sqrt(var)\n", + "gaussian = norm.pdf(val, scale=sd)\n", + "plt.plot(val, gaussian, color='red', linestyle='--', label='Gaussian Distribution')\n", + "\n", + "plt.legend()\n", + "\n", + "# Calculating rejection frequency\n", + "rejection_frequency = np.sum(np.abs(IVEst - 1) / sd > 1.96) / B\n", + "print(\"Rejection Frequency is\", rejection_frequency, \"while we expect it to be around 0.05\")\n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.5" + } + }, + "nbformat": 4, + "nbformat_minor": 1 }