diff --git a/AC1/r-proxy-controls.Rmd b/AC1/r-proxy-controls.Rmd new file mode 100644 index 00000000..21be891b --- /dev/null +++ b/AC1/r-proxy-controls.Rmd @@ -0,0 +1,200 @@ +--- +title: An R Markdown document converted from "AC1/r-proxy-controls.irnb" +output: html_document +--- + +# Negative (Proxy) Controls for Unobserved Confounding + +Consider the following SEM, where $Y$ is the outcome, $D$ is the treatment, $A$ is some unobserved confounding, and $Q$, $X$, $S$ are the observed covariates. In particular, $Q$ is considered to be the proxy control treatment as it a priori has no effect on the actual outcome $Y$, and $S$ is considered to be the proxy control outcome as it a priori is not affected by the actual treatment $D$. See also [An Introduction to Proximal Causal Learning](https://arxiv.org/pdf/2009.10982.pdf), for more information on this setting. + +![proxy_dag.png](https://raw.githubusercontent.com/stanford-msande228/winter23/main/proxy_dag.png) + +Under linearity assumptions, the average treatment effect can be estimated by solving the vector of moment equations: +\begin{align} +E\left[(\tilde{Y} - \alpha \tilde{D} - \delta \tilde{S}) \left(\begin{aligned}\tilde{D}\\ \tilde{Q}\end{aligned}\right) \right] = 0 +\end{align} +where for every variable $V$ we denote with $\tilde{V} = V - E[V|X]$. + +When the dimension of the proxy treatment variables $Q$ is larger than the dimension of proxy outcome variables $S$, then the above system of equations is over-identified. In these settings, we first project the "technical instrument" variables $\tilde{V}=(\tilde{D}, \tilde{Q})$ onto the space of "technical treatment" variables $\tilde{W}=(\tilde{D}, \tilde{S})$ and use the projected $\tilde{V}$ as a new "technical instrument". In particular, we run an OLS regression of $\tilde{W}$ on $\tilde{V},$ and define $\tilde{Z} = E[\tilde{W}\mid \tilde{V}] = B \tilde{V}$, where the $t$-th row $\beta_t$ of the matrix $B$ is the OLS coefficient in the regression of $\tilde{W}_t$ on $\tilde{V}$. These new variables $\tilde{Z}$, can also be viewed as engineered technical instrumental variables. Then we have the exactly identified system of equations: +\begin{align} +E\left[(\tilde{Y} - \alpha \tilde{D} - \delta \tilde{S}) \tilde{Z} \right] := E\left[(\tilde{Y} - \alpha \tilde{D} - \delta \tilde{S}) B \left(\begin{aligned}\tilde{D}\\ \tilde{Q}\end{aligned}\right) \right] = 0 +\end{align} + +The solution to this system of equations is numerically equivalent to the following two stage algorithm: +- Run OLS of $\tilde{W}=(\tilde{D}, \tilde{S})$ on $\tilde{V}=(\tilde{D}, \tilde{Q})$ +- Define $\tilde{Z}$ as the predictions of the OLS model +- Run OLS of $\tilde{Y}$ on $\tilde{Z}$. +This is the well-known Two-Stage-Least-Squares (2SLS) algorithm for instrumental variable regression. + +Since we're considering only linear models and in a low-dimensional setting, we'll focus on just using linear IV methods. + +```{r} +install.packages("hdm") +``` + +```{r} +library(hdm) + +set.seed(1) +``` + +# Analyzing Simulated Data + +First, let's evaluate the methods on simulated data generated from a linear SEM characterized by the above DAG. For this simulation, we'll set the ATE to 2. + +```{r} +gen_data <- function(n, ate) { + X <- matrix(rnorm(n * 10), ncol = 10) + A <- 2 * X[, 1] + rnorm(n) + Q <- 10 * A + 2 * X[, 1] + rnorm(n) + S <- 5 * A + X[, 1] + rnorm(n) + D <- Q - A + 2 * X[, 1] + rnorm(n) + Y <- ate * D + 5 * A + 2 * S + 0.5 * X[, 1] + rnorm(n) + return(list(X, A, Q, S, D, Y)) +} +``` + +```{r} +data_list <- gen_data(5000, 2) +X <- data_list[[1]] +A <- data_list[[2]] +Q <- data_list[[3]] +S <- data_list[[4]] +D <- data_list[[5]] +Y <- data_list[[6]] +``` + +We define the technical instrument $V=(D, Q)$ and technical treatment $W=(D, S)$. Estimating the treatement effect is then just a matter of solving an instrument variable regression problem with instruments $V$ and treatments $W$ and looking at the first coefficient associated with $D$. + +```{r} +W <- cbind(D, S) +V <- cbind(D, Q) +``` + +```{r} +piv <- tsls(X, W, Y, V, homoscedastic = FALSE) +cat("Estimated coefficient:", piv$coefficients["D", 1], "\n") +cat("Standard error:", piv$se["D"], "\n") +``` + +# With Cross-Fitting + +We can also consider partialling out the controls using DML with cross-fitting + +```{r} +lm_dml_for_proxyiv <- function(x, d, q, s, y, dreg, qreg, yreg, sreg, nfold = 5) { + # this implements DML for a partially linear IV model + nobs <- nrow(x) + foldid <- rep.int(1:nfold, times = ceiling(nobs / nfold))[sample.int(nobs)] + I <- split(1:nobs, foldid) + # create residualized objects to fill + ytil <- dtil <- qtil <- stil <- rep(NA, nobs) + # obtain cross-fitted residuals + cat("fold: ") + for (b in seq_along(I)) { + dfit <- dreg(x[-I[[b]], ], d[-I[[b]]]) # take a fold out + qfit <- qreg(x[-I[[b]], ], q[-I[[b]]]) # take a fold out + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) # take a fold out + sfit <- sreg(x[-I[[b]], ], s[-I[[b]]]) # take a fold out + dtil[I[[b]]] <- (d[I[[b]]] - x[I[[b]], ] %*% as.matrix(dfit$coefficients)) # record residual + qtil[I[[b]]] <- (q[I[[b]]] - x[I[[b]], ] %*% as.matrix(qfit$coefficients)) # record residual + ytil[I[[b]]] <- (y[I[[b]]] - x[I[[b]], ] %*% as.matrix(yfit$coefficients)) # record residial + stil[I[[b]]] <- (s[I[[b]]] - x[I[[b]], ] %*% as.matrix(sfit$coefficients)) # record residual + cat(b, " ") + } + ivfit <- tsls(y = ytil, d = cbind(dtil, stil), x = NULL, z = cbind(dtil, qtil), + intercept = FALSE, homoscedastic = FALSE) + coef_est <- ivfit$coef[1] # extract coefficient + se <- ivfit$se[1] # record standard error + cat(sprintf("\ncoef (se) = %g (%g)\n", coef_est, se)) + return(list(coef_est = coef_est, se = se, dtil = dtil, qtil = qtil, + ytil = ytil, stil = stil, foldid = foldid, spI = I)) +} +``` + +We'll just use OLS for partialling out again. We could of course try something more elaborate if we wanted. + +```{r} +dreg <- function(x, d) { + lm.fit(x, d) +} # ML method=ols +qreg <- function(x, q) { + lm.fit(x, q) +} # ML method=ols +yreg <- function(x, y) { + lm.fit(x, y) +} # ML method=ols +sreg <- function(x, s) { + lm.fit(x, s) +} # ML method=ols + +dml_piv <- lm_dml_for_proxyiv(X, D, Q, S, Y, dreg, qreg, yreg, sreg, nfold = 5) +dml_piv +``` + +## Real Data - Effects of Smoking on Birth Weight + +In this study, we will be studying the effects of smoking on baby weight. We will consider the following stylized setup: + +Outcome ($Y$): baby weight + +Treatment ($D$): smoking + +Unobserved confounding ($A$): family income + +The observed covariates are put in to 3 groups: + + +* Proxy treatment control ($Q$): mother's education +* Proxy outcome control ($S$): parity (total number of previous pregnancies) +* Other observed covariates ($X$): mother's race and age + + +Education serves as a proxy treatment control $Q$ because it reflects unobserved confounding due to household income $A$ but has no direct medical effect on birth weight $Y$. Parity serves as a proxy outcome control $S$ because family size reflects household income $A$ but is not directly caused by smoking $D$ or education $Q$. + +A description of the data used can be found [here](https://www.stat.berkeley.edu/users/statlabs/data/babies.readme). + +```{r} +data <- read.table("https://www.stat.berkeley.edu/users/statlabs/data/babies23.data", header = TRUE) +summary(data) +``` + +```{r} +# Filter data to exclude entries where income, number of cigarettes smoked, +# race, age are not asked or not known +data <- data[data$race != 99, ] +data <- data[!(data$number %in% c(98, 99)), ] +data <- data[!(data$inc %in% c(98, 99)), ] +data <- data[data$age != 99, ] +dim(data) +``` + +```{r} +# Create matrices for X, D, Q, S, A, Y +X <- model.matrix(~ 0 + C(race) + age, data) +D <- model.matrix(~ 0 + number, data) +Q <- model.matrix(~ 0 + ed, data) +S <- model.matrix(~ 0 + parity, data) +A <- model.matrix(~ 0 + inc, data) +Y <- model.matrix(~ 0 + wt, data) +``` + +```{r} +# Use cross-fitting with OLS to estimate treatment effect within linear model context +dreg <- function(x, d) { + lm.fit(x, d) +} # ML method=ols +qreg <- function(x, q) { + lm.fit(x, q) +} # ML method=ols +yreg <- function(x, y) { + lm.fit(x, y) +} # ML method=ols +sreg <- function(x, s) { + lm.fit(x, s) +} # ML method=ols + +dml_bw_piv <- lm_dml_for_proxyiv(X, D, Q, S, Y, dreg, qreg, yreg, sreg, nfold = 5) +dml_bw_piv +``` + diff --git a/AC1/r-sensitivity-analysis-with-sensmakr-and-debiased-ml.Rmd b/AC1/r-sensitivity-analysis-with-sensmakr-and-debiased-ml.Rmd new file mode 100644 index 00000000..96f7e7d7 --- /dev/null +++ b/AC1/r-sensitivity-analysis-with-sensmakr-and-debiased-ml.Rmd @@ -0,0 +1,290 @@ +--- +title: An R Markdown document converted from "AC1/r-sensitivity-analysis-with-sensmakr-and-debiased-ml.irnb" +output: html_document +--- + +# Sensitivity Analysis for Unobserved Confounder with DML and Sensmakr + +Here we experiment with using package "sensemakr" in conjunction with debiased ML. + +## Partially Linear SEM + +Consider the SEM +\begin{align} +Y :=~& \alpha D + \delta A + f_Y(X) + \epsilon_Y, \\ +D :=~& \gamma A + f_D(X) + \epsilon_D, \\ +A :=~& f_A(X) + \epsilon_A, \\ +X :=~& \epsilon_X, +\end{align} +where, conditional on $X$, $\epsilon_Y, \epsilon_D, \epsilon_A$ are mean zero +and mutually uncorrelated. We further normalize +$$ +E[\epsilon_A^2] =1. +$$ +The key structural +parameter is $\alpha$: $$\alpha = \partial_d Y(d)$$ +where $$Y(d) := (Y: do (D=d)).$$ + +To give context to our example, we can interpret $Y$ as earnings, +$D$ as education, $A$ as ability, and $X$ as a set of observed background variables. In this example, we can interpret $\alpha$ as the returns to schooling. + +We start by applying the partialling out operator to get rid of the $X$'s in all of the equations. Define the partialling out operation of any random vector $V$ with respect to another random vector $X$ as the residual that is left after subtracting the best predictor of $V$ given $X$: +$$\tilde V = V - E [V \mid X].$$ +If $f$'s are linear, we can replace $E [V \mid X]$ +by linear projection. After partialling out, we have a simplified system: +\begin{align} +\tilde Y :=~& \alpha \tilde D + \delta \tilde A + \epsilon_Y, \\ +\tilde D :=~& \gamma \tilde A + \epsilon_D, \\ +\tilde A :=~& \epsilon_A, +\end{align} +where $\epsilon_Y$, $\epsilon_D$, and $\epsilon_A$ are uncorrelated. + +Then the projection of $\tilde Y$ on $\tilde D$ recovers +$$ +\beta = E [\tilde Y \tilde D]/ E [\tilde D^2] = \alpha + \phi, +$$ +where +$$ +\phi = \delta \gamma/ E \left[(\gamma^2 + \epsilon^2_D)\right], +$$ +is the omitted confounder bias or omitted variable bias. + +The formula follows from inserting the expression for $\tilde D$ into the definition of $\beta$ and then simplifying the resulting expression using the assumptions on the $\epsilon$'s. + +We can use this formula to bound $\phi$ directly by making assumptions on the size of $\delta$ +and $\gamma$. An alternative approach can be based on the following characterization, +based on partial $R^2$'s. This characterization essentially follows +from Cinelli and Hazlett, with the slight difference that we have adapted +the result to the partially linear model. + +*Theorem* [Omitted Confounder Bias in Terms of Partial $R^2$'s] + +In the partially linear SEM setting above, +$$ +\phi^2 = \frac{R^2_{\tilde Y \sim \tilde A \mid \tilde D} R^2_{\tilde D \sim \tilde A} }{ (1 - R^2_{\tilde D \sim \tilde A}) } \ +\frac{E \left[ (\tilde Y - \beta \tilde D)^2 \right] }{E \left[ ( \tilde D )^2 \right]}, +$$ +where $R^2_{V \sim W \mid X}$ denotes the population $R^2$ in the linear regression of $V$ on $W$, after partialling out $X$ from $V$ and $W$ linearly. + + +Therefore, if we place bounds on how much of the variation in $\tilde Y$ and in $\tilde D$ +the unobserved confounder $\tilde A$ is able to explain, we can bound the omitted confounder bias by $$\sqrt{\phi^2}.$$ + +# Empirical Example + +We consider an empirical example based on data surrounding the Darfur war. Specifically, we are interested in the effect of having experienced direct war violence on attitudes towards peace. Data is described here +https://cran.r-project.org/web/packages/sensemakr/vignettes/sensemakr.html + +The main outcome is attitude towards peace -- ``peacefactor``. +The key variable of interest is whether the responders were directly harmed (``directlyharmed``). +We want to know if being directly harmed in the conflict causes people to support peace-enforcing measures. +The measured confounders include dummy variables ``female``, ``farmer``, ``herder``, ``pastvoted``, along with ``age`` and household size (``hhsize``). +There is also a village indicator. We deal with village by and partialling out a full set of village dummy variables before conducting +the analysis. The standard errors will be clustered at the village level. + + +## Outline + +We will: +- mimic the partialling out procedure with machine learning tools; +- invoke Sensmakr to compute $\phi^2$ and plot sensitivity results. + +```{r} +# load package +install.packages("sensemakr") +install.packages("lfe") +install.packages("hdm") +install.packages("randomForest") +``` + +```{r} +library(sensemakr) +library(lfe) +library(hdm) +library(randomForest) + +set.seed(1) +``` + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/darfur.csv" +data <- read.csv(file) +dim(data) +attach(darfur) +``` + +## Preprocessing +Take out village fixed effects and run basic linear analysis + +```{r} +# get rid of village fixed effects +peacefactorR <- lm(peacefactor ~ village)$res +directlyharmedR <- lm(directlyharmed ~ village)$res +femaleR <- lm(female ~ village)$res +ageR <- lm(age ~ village)$res +farmerR <- lm(farmer_dar ~ village)$res +herderR <- lm(herder_dar ~ village)$res +pastvotedR <- lm(pastvoted ~ village)$res +hhsizeR <- lm(hhsize_darfur ~ village)$res + + +# Preliminary linear model analysis +summary(felm(peacefactorR ~ directlyharmedR + femaleR + + ageR + farmerR + herderR + pastvotedR + hhsizeR | 0 | 0 | village)) + +# here we are clustering standard errors at the village level +summary(felm(peacefactorR ~ femaleR + + ageR + farmerR + herderR + pastvotedR + hhsizeR | 0 | 0 | village)) + +summary(felm(directlyharmedR ~ femaleR + + ageR + farmerR + herderR + pastvotedR + hhsizeR | 0 | 0 | village)) +``` + +For benchmarking, we want to understand the "partial $R^2$" of the controls on the outcome after partialling out the variable of interest. + +```{r} +# partial out variable of interest +peacefactorR2 <- lm(peacefactorR ~ directlyharmedR)$res +femaleR2 <- lm(femaleR ~ directlyharmedR)$res +ageR2 <- lm(ageR ~ directlyharmedR)$res +farmerR2 <- lm(farmerR ~ directlyharmedR)$res +herderR2 <- lm(herderR ~ directlyharmedR)$res +pastvotedR2 <- lm(pastvotedR ~ directlyharmedR)$res +hhsizeR2 <- lm(hhsizeR ~ directlyharmedR)$res + + +# R^2 of controls after partialling out variable of interest +summary(lm(peacefactorR2 ~ femaleR2 + + ageR2 + farmerR2 + herderR2 + pastvotedR2 + hhsizeR2)) +``` + +## Lasso for partialling out controls + +```{r} +resY <- rlasso(peacefactorR ~ (femaleR + ageR + farmerR + herderR + pastvotedR + hhsizeR)^3, + post = FALSE)$res + +resD <- rlasso(directlyharmedR ~ (femaleR + ageR + farmerR + herderR + pastvotedR + hhsizeR)^3, + post = FALSE)$res + +print(c("Controls explain the following fraction of variance of Outcome", 1 - var(resY) / var(peacefactorR))) +print(c("Controls explain the following fraction of variance of Treatment", 1 - var(resD) / var(directlyharmedR))) + +dml_darfur_model <- felm(resY ~ resD | 0 | 0 | village) # cluster SEs by village + +summary(dml_darfur_model, robust = TRUE) # cluster SE by village + +dml_darfur_model <- lm(resY ~ resD) # linear model to use as input in sensemakr +``` + +## Manual Bias Analysis + +```{r} +# Main estimate +beta <- dml_darfur_model$coef[2] + +# Hypothetical values of partial R2s +r2_yc <- .13 +r2_dc <- .01 + +# Elements of the bias equation +kappa <- (r2_yc * r2_dc) / (1 - r2_dc) +varianceRatio <- mean(dml_darfur_model$res^2) / mean(resD^2) + +# Compute square bias +BiasSq <- kappa * varianceRatio + +# Compute absolute value of the bias +print(cat("absolute value of the bias:", sqrt(BiasSq)[1])) + +# plotting +gridr2_dc <- seq(0, .3, by = .001) +gridr2_yc <- kappa * (1 - gridr2_dc) / gridr2_dc +gridr2_yc <- ifelse(gridr2_yc > 1, 1, gridr2_yc) +plot(gridr2_dc, gridr2_yc, + type = "l", col = 4, xlab = "Partial R2 of Treatment with Confounder", + ylab = "Partial R2 of Outcome with Confounder", + main = paste("Combination of R2 such that |Bias| < ", round(sqrt(BiasSq), digits = 4)) +) +``` + +## Bias Analysis with Sensemakr + +```{r} +dml_darfur_sensitivity <- sensemakr( + model = dml_darfur_model, + treatment = "resD" +) +summary(dml_darfur_sensitivity) +plot(dml_darfur_sensitivity, nlevels = 15) +``` + +## Random Forest for partialling out + +The following code does DML with clustered standard errors by village + +```{r} +dml2_for_plm <- function(x, d, y, dreg, yreg, nfold = 2, clusterID) { + nobs <- nrow(x) # number of observations + foldid <- rep.int(1:nfold, times = ceiling(nobs / nfold))[sample.int(nobs)] # define folds indices + I <- split(1:nobs, foldid) # split observation indices into folds + ytil <- dtil <- rep(NA, nobs) + cat("fold: ") + for (b in seq_along(I)) { + dfit <- dreg(x[-I[[b]], ], d[-I[[b]]]) # take a fold out + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) # take a foldt out + dhat <- predict(dfit, x[I[[b]], ], type = "response") # predict the left-out fold + yhat <- predict(yfit, x[I[[b]], ], type = "response") # predict the left-out fold + dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual for the left-out fold + ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residial for the left-out fold + cat(b, " ") + } + rfit <- felm(ytil ~ dtil | 0 | 0 | clusterID) # get clustered standard errors using felm + rfitSummary <- summary(rfit) + coef_est <- rfitSummary$coef[2] # extract coefficient + se <- rfitSummary$coef[2, 2] # record robust standard error + cat(sprintf("\ncoef (se) = %g (%g)\n", coef_est, se)) # printing output + return(list(coef_est = coef_est, se = se, dtil = dtil, ytil = ytil)) # save output and residuals +} +``` + +```{r} +x <- model.matrix(~ femaleR + ageR + farmerR + herderR + pastvotedR + hhsizeR) +dim(x) + +d <- directlyharmedR +y <- peacefactorR +# DML with Random Forest: +dreg <- function(x, d) { + randomForest(x, d) +} # ML method=Forest +yreg <- function(x, y) { + randomForest(x, y) +} # ML method=Forest + +dml2_rf <- dml2_for_plm(x, d, y, dreg, yreg, nfold = 10, clusterID = village) + +resY <- dml2_rf$ytil +resD <- dml2_rf$dtil + +print(c("Controls explain the following fraction of variance of Outcome", + max(1 - var(resY) / var(peacefactorR), 0))) +print(c("Controls explain the following fraction of variance of Treatment", + max(1 - var(resD) / var(directlyharmedR), 0))) + +dml_darfur_model <- lm(resY ~ resD) +``` + +## Bias Analysis with Sensemakr + + +```{r} +dml_darfur_sensitivity <- sensemakr( + model = dml_darfur_model, + treatment = "resD" +) +summary(dml_darfur_sensitivity) + +plot(dml_darfur_sensitivity, nlevels = 15) +``` + diff --git a/AC2/r-debiased-ml-for-partially-linear-iv-model.Rmd b/AC2/r-debiased-ml-for-partially-linear-iv-model.Rmd new file mode 100644 index 00000000..ce73b72a --- /dev/null +++ b/AC2/r-debiased-ml-for-partially-linear-iv-model.Rmd @@ -0,0 +1,286 @@ +--- +title: An R Markdown document converted from "AC2/r-debiased-ml-for-partially-linear-iv-model.irnb" +output: html_document +--- + +# Double/Debiased ML for Partially Linear IV Model + +This is a simple implementation of Debiased Machine Learning for the Partially Linear +IV Regression Model, which provides an application of DML IV inference. + + +Reference: + +- https://arxiv.org/abs/1608.00060 +- https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778 + +The code is based on the book. + + +# Partially Linear IV Model + +We consider the partially linear structural equation model: +\begin{align} + Y :=~& D\theta_0 + g_0(X) + \zeta, & E[\zeta \mid Z,X]= 0,\\ + Z :=~& m_0(X) + V, & E[V \mid X] = 0. +\end{align} + + +Note that this model is not a regression model unless $Z=D$. The model is a canonical +model in causal inference, going back to P. Wright's work on IV methods for estimating demand/supply equations, with the modern difference being that $g_0$ and $m_0$ are nonlinear, potentially complicated functions of high-dimensional $X$. + + +The idea of this model is that there is a structural or causal relation between $Y$ and $D$, captured by $\theta_0$, and $g_0(X) + \zeta$ is the stochastic error, partly explained by covariates $X$. $V$ and $\zeta$ are stochastic errors that are not explained by $X$. Since $Y$ and $D$ are jointly determined, we need an external factor, commonly referred to as an instrument, $Z$ to create exogenous variation +in $D$. Note that $Z$ should affect $D$. The $X$ here serve again as confounding factors, so we can think of variation in $Z$ as being exogenous only conditional on $X$. + + +The causal DAG this model corresponds to is given by: +$$ +Z \to D, X \to (Y, Z, D), L \to (Y,D), +$$ +where $L$ is the latent confounder affecting both $Y$ and $D$, but not $Z$. + + + +--- + +# Example + +A simple contextual example is from biostatistics, where $Y$ is a health outcome and $D$ is indicator of smoking. Thus, $\theta_0$ is captures the effect of smoking on health. Health outcome $Y$ and smoking behavior $D$ are treated as being jointly determined. $X$ represents patient characteristics, and $Z$ could be a doctor's advice not to smoke (or another behavioral treatment) that may affect the outcome $Y$ only through shifting the behavior $D$, conditional on characteristics $X$. + +---- + + + +# PLIVM in Residualized Form + +The PLIV model above can be rewritten in the following residualized form: +$$ + \tilde Y = \tilde D \theta_0 + \zeta, \quad E[\zeta \mid V,X]= 0, +$$ +where +$$ + \tilde Y = (Y- \ell_0(X)), \quad \ell_0(X) = E[Y \mid X] \\ + \tilde D = (D - r_0(X)), \quad r_0(X) = E[D \mid X] \\ + \tilde Z = (Z- m_0(X)), \quad m_0(X) = E[Z \mid X]. +$$ + The "tilde" variables (e.g. $\tilde Y$) above represent original variables after taking out or "partialling out" + the effect of $X$. Note that $\theta_0$ is identified from this equation if $V$ + and $U$ have non-zero correlation, which automatically means that $U$ and $V$ + must have non-zero variation. + + + +----- + +# DML for PLIV Model + +Given identification, DML proceeds as follows + +Compute the estimates $\hat \ell_0$, $\hat r_0$, and $\hat m_0$ , which amounts +to solving the three problems of predicting $Y$, $D$, and $Z$ using +$X$, using any generic ML method, giving us estimated residuals +$$ +\tilde Y = Y - \hat \ell_0(X), \\ \tilde D= D - \hat r_0(X), \\ \tilde Z = Z- \hat m_0(X). +$$ +The estimates should be of a cross-validated form, as detailed in the algorithm below. + +Estimate $\theta_0$ by the the intstrumental +variable regression of $\tilde Y$ on $\tilde D$ using $\tilde Z$ as an instrument. +Use the conventional inference for the IV regression estimator, ignoring +the estimation error in these residuals. + +The reason we work with this residualized form is that it eliminates the bias +arising when solving the prediction problem in stage 1. The role of cross-validation +is to avoid another source of bias due to potential overfitting. + +The estimator is adaptive, +in the sense that the first stage estimation errors do not affect the second +stage errors. + +```{r} +install.packages("hdm") +install.packages("AER") +install.packages("randomForest") +install.packages("lfe") +install.packages("glmnet") +``` + +```{r} +library(hdm) +library(AER) # applied econometrics library +library(randomForest) # random Forest library +library(lfe) # high-dimensional econometrics library +library(glmnet) # glm net + +set.seed(1) +``` + +```{r} +# DML for PLIVM + +dml2_for_plivm <- function(x, d, z, y, dreg, yreg, zreg, nfold = 5) { + # this implements DML2 algorithm, where there moments are estimated via DML, before constructing + # the pooled estimate of theta randomly split data into folds + nobs <- nrow(x) + foldid <- rep.int(1:nfold, times = ceiling(nobs / nfold))[sample.int(nobs)] + I <- split(1:nobs, foldid) + # create residualized objects to fill + ytil <- dtil <- ztil <- rep(NA, nobs) + # obtain cross-fitted residuals + cat("fold: ") + for (b in seq_along(I)) { + dfit <- dreg(x[-I[[b]], ], d[-I[[b]]]) # take a fold out + zfit <- zreg(x[-I[[b]], ], z[-I[[b]]]) # take a fold out + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) # take a folot out + dhat <- predict(dfit, x[I[[b]], ], type = "response") # predict the fold out + zhat <- predict(zfit, x[I[[b]], ], type = "response") # predict the fold out + yhat <- predict(yfit, x[I[[b]], ], type = "response") # predict the fold out + dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual + ztil[I[[b]]] <- (z[I[[b]]] - zhat) # record residual + ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residial + cat(b, " ") + } + ivfit <- tsls(y = ytil, d = dtil, x = NULL, z = ztil, intercept = FALSE) + coef_est <- ivfit$coef # extract coefficient + se <- ivfit$se # record standard error + cat(sprintf("\ncoef (se) = %g (%g)\n", coef_est, se)) + return(list(coef_est = coef_est, se = se, dtil = dtil, ytil = ytil, ztil = ztil)) +} +``` + +----- + +# Emprical Example: Acemoglu, Johnson, Robinson (AER). + + +* Y is log GDP; +* D is a measure of Protection from Expropriation, a proxy for +quality of insitutions; +* Z is the log of Settler's mortality; +* W are geographical variables (latitude, latitude squared, continent dummies as well as interactions) + + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/AJR.csv" +AJR <- read.csv(file) +dim(AJR) +``` + +```{r} +y <- AJR$GDP +d <- AJR$Exprop +z <- AJR$logMort +xraw <- model.matrix(~ Latitude + Africa + Asia + Namer + Samer, data = AJR) +x <- model.matrix(~ -1 + (Latitude + Latitude2 + Africa + Asia + Namer + Samer)^2, data = AJR) +dim(x) +``` + +```{r} +set.seed(1) +# DML with PostLasso +cat(sprintf("\n DML with Post-Lasso \n")) + +dreg <- function(x, d) { + hdm::rlasso(x, d) +} # ML method=lasso +yreg <- function(x, y) { + hdm::rlasso(x, y) +} # ML method=lasso +zreg <- function(x, z) { + hdm::rlasso(x, z) +} # ML method=lasso + +dml2_lasso <- dml2_for_plivm(x, d, z, y, dreg, yreg, zreg, nfold = 5) + + +# DML with Random Forest +cat(sprintf("\n DML with Random Forest \n")) + +dreg <- function(x, d) { + randomForest(x, d) +} # ML method=Forest +yreg <- function(x, y) { + randomForest(x, y) +} # ML method=Forest +zreg <- function(x, z) { + randomForest(x, z) +} # ML method=Forest + +dml2_rf <- dml2_for_plivm(xraw, d, z, y, dreg, yreg, zreg, nfold = 5) + +# Compare Forest vs Lasso +comp_tab <- matrix(NA, 3, 2) +comp_tab[1, ] <- c(sqrt(mean((dml2_rf$ytil)^2)), sqrt(mean((dml2_lasso$ytil)^2))) +comp_tab[2, ] <- c(sqrt(mean((dml2_rf$dtil)^2)), sqrt(mean((dml2_lasso$dtil)^2))) +comp_tab[3, ] <- c(sqrt(mean((dml2_rf$ztil)^2)), sqrt(mean((dml2_lasso$ztil)^2))) +rownames(comp_tab) <- c("RMSE for Y:", "RMSE for D:", "RMSE for Z:") +colnames(comp_tab) <- c("RF", "LASSO") +print(comp_tab, digits = 3) +``` + +# Weak Instruments? + +```{r} +# using lfe package +summary(felm(dml2_lasso$dtil ~ dml2_lasso$ztil), robust = TRUE) +summary(felm(dml2_rf$dtil ~ dml2_rf$ztil), robust = TRUE) +``` + +## Anderson-Rubin Idea for Inference with Weak Instruments + +As shown above, we may have weak instruments because the t-stat in the regression $\tilde D \sim \tilde Z$ is small relative to standard rules of thumb -- for example Stock and Yogo (2005) suggest accounting for weak instruments if the first stage F-statistic is less than 10 (and more recent work suggests even larger cutoffs). + + + Here, we consider one specific approach (from Anderson and Rubin (1949)) to doing valid inference under weak identification based upon the statistic: +$$ +C(\theta) = \frac{ |E_n [(\tilde Y - \theta \tilde D) \tilde Z]|^2}{ \mathbb{V}_n [(\tilde Y - \theta \tilde D) \tilde Z ]/n}. +$$ +The empirical variance $\mathbb{V}_n$ is defined as: +\begin{align*} +\mathbb{V}_{n}[ g(W_i)] &:= \mathbb{E}_{n}g(W_i)g(W_i)' - \mathbb{E}_{n}[ g(W_i)]\mathbb{E}_{n}[ g(W_i)]'. +\end{align*} + +If $\theta_0 = \theta$, then $C(\theta) \overset{a}{\sim} N(0,1)^2 = \chi^2(1)$. Therefore, we can reject the hypothesis $\theta_0 = \theta$ at level $a$ (for example $a = .05$ for a 5\% level test) if $C(\theta)> c(1-a)$ where $c(1-a)$ is the $(1-a)$- quantile of a $\chi^2(1)$ variable. The probability of falsely rejecting the true hypothesis is approximately $a \times 100\%$. +To construct a $(1-a) \times 100$\% confidence region for $\theta$, we can then simply invert the test by collecting all parameter values that are not rejected at the $a$ level: +$$ +CR(\theta) = \{ \theta \in \Theta: C(\theta) \leq c(1-a)\}. +$$ + + +In more complex settings with many controls or controls that enter with unknown functional form, we can simply replace the residuals $\tilde Y$, $\tilde D$, and $\tilde Z$ by machine +learned cross-fitted residuals $\check Y$, $\check D$, and $ \check Z$. Thanks to the orthogonality of the IV moment condition underlying the formulation outlined above, we can formally assert that the properties of $C(\theta)$ and the subsequent testing procedure and confidence region for $\theta$ continue to hold when using cross-fitted residuals. We will further be able to apply the general procedure to cases where $D$ +is a vector, with a suitable adjustment of the statistic $C(\theta)$. + +```{r} +# DML-AR (DML with Anderson-Rubin) + +dml_ar_pliv <- function(rY, rD, rZ, grid, alpha = .05) { + n <- length(rY) + Cstat <- rep(0, length(grid)) + for (i in seq_along(grid)) { + Cstat[i] <- n * (mean((rY - grid[i] * rD) * rZ))^2 / var((rY - grid[i] * rD) * rZ) + } + LB <- min(grid[Cstat < qchisq(1 - alpha, 1)]) + UB <- max(grid[Cstat < qchisq(1 - alpha, 1)]) + plot(range(grid), range(c(Cstat)), type = "n", xlab = "Effect of institutions", ylab = "Statistic", main = " ") + lines(grid, Cstat, lty = 1, col = 1) + abline(h = qchisq(1 - alpha, 1), lty = 3, col = 4) + abline(v = LB, lty = 3, col = 2) + abline(v = UB, lty = 3, col = 2) + return(list(UB = UB, LB = LB)) +} +``` + +```{r} +dml_ar_pliv( + rY = dml2_lasso$ytil, rD = dml2_lasso$dtil, rZ = dml2_lasso$ztil, + grid = seq(-2, 2, by = .01) +) + +dml_ar_pliv( + rY = dml2_rf$ytil, rD = dml2_rf$dtil, rZ = dml2_rf$ztil, + grid = seq(-2, 2, by = .01) +) +``` + diff --git a/AC2/r-dml-401k-IV.Rmd b/AC2/r-dml-401k-IV.Rmd new file mode 100644 index 00000000..602944aa --- /dev/null +++ b/AC2/r-dml-401k-IV.Rmd @@ -0,0 +1,1309 @@ +--- +title: An R Markdown document converted from "AC2/r-dml-401k-IV.irnb" +output: html_document +--- + +# Inference on Predictive and Causal Effects in High-Dimensional Nonlinear Models + +## Impact of 401(k) on Financial Wealth + +We consider estimation of the effect of 401(k) participation +on accumulated assets. 401(k) plans are pension accounts sponsored by employers. The key problem in determining the effect of participation in 401(k) plans on accumulated assets is saver heterogeneity coupled with the fact that the decision to enroll in a 401(k) is non-random. It is generally recognized that some people have a higher preference for saving than others. It also seems likely that those individuals with high unobserved preference for saving would be most likely to choose to participate in tax-advantaged retirement savings plans and would tend to have otherwise high amounts of accumulated assets. The presence of unobserved savings preferences with these properties then implies that conventional estimates that do not account for saver heterogeneity and endogeneity of participation will be biased upward, tending to overstate the savings effects of 401(k) participation. + +One can argue that eligibility for enrolling in a 401(k) plan in this data can be taken as exogenous after conditioning on a few observables of which the most important for their argument is income. The basic idea is that, at least around the time 401(k)’s initially became available, people were unlikely to be basing their employment decisions on whether an employer offered a 401(k) but would instead focus on income and other aspects of the job. + +```{r} +install.packages("xtable") +install.packages("hdm") +install.packages("sandwich") +install.packages("ggplot2") +install.packages("randomForest") +install.packages("glmnet") +install.packages("rpart") +install.packages("data.table") +install.packages("gbm") +``` + +```{r} +library(xtable) +library(hdm) +library(sandwich) +library(ggplot2) +library(randomForest) +library(data.table) +library(glmnet) +library(rpart) +library(gbm) + +set.seed(123) +``` + +### Data + +The raw dataset can be found [here](https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.csv). +The data set can be loaded from the `hdm` package for R directly by typing: + + +```{r} +data(pension) +data <- pension +dim(data) +``` + +See the "Details" section on the description of the data set, which can be accessed by + +```{r} +help(pension) +``` + +The data consist of 9,915 observations at the household level drawn from the 1991 Survey of Income and Program Participation (SIPP). All the variables are referred to 1990. We use net financial assets (*net\_tfa*) as the outcome variable, $Y$, in our analysis. The net financial assets are computed as the sum of IRA balances, 401(k) balances, checking accounts, saving bonds, other interest-earning accounts, other interest-earning assets, stocks, and mutual funds less non mortgage debts. + +Among the $9915$ individuals, $3682$ are eligible to participate in the program. The variable *e401* indicates eligibility and *p401* indicates participation, respectively. + +```{r} +hist_e401 <- ggplot(data, aes(x = e401, fill = factor(e401))) + + geom_bar() +hist_e401 +``` + +Eligibility is highly associated with financial wealth: + +```{r} +dens_net_tfa <- ggplot(data, aes(x = net_tfa, color = factor(e401), fill = factor(e401))) + + geom_density() + + xlim(c(-20000, 150000)) + + facet_wrap(. ~ e401) + +dens_net_tfa +``` + +The unconditional APE of e401 is about $19559$: + +```{r} +e1 <- data[data$e401 == 1, ] +e0 <- data[data$e401 == 0, ] +round(mean(e1$net_tfa) - mean(e0$net_tfa), 0) +``` + +Among the $3682$ individuals that are eligible, $2594$ decided to participate in the program. The unconditional APE of p401 is about $27372$: + +```{r} +p1 <- data[data$p401 == 1, ] +p0 <- data[data$p401 == 0, ] +round(mean(p1$net_tfa) - mean(p0$net_tfa), 0) +``` + +As discussed, these estimates are biased since they do not account for saver heterogeneity and endogeneity of participation. + +```{r} +# instrument variable +Z <- data[, "e401"] +# treatment variable +D <- data[, "p401"] +# outcome variable +y <- data[, "net_tfa"] +``` + +### We construct the engineered features for controls + +```{r} +# Constructing the controls +x_formula <- paste("~ poly(age, 6, raw=TRUE) + poly(inc, 8, raw=TRUE) + poly(educ, 4, raw=TRUE) + ", + "poly(fsize, 2, raw=TRUE) + male + marr + twoearn + db + pira + hown") +X <- as.data.table(model.frame(x_formula, pension)) +head(X) +``` + +# Instrumental Variables: Effect of 401k Participation on Financial Assets + +## Double ML IV under Partial Linearity + +Now, we consider estimation of average treatment effects of participation in 401k, i.e. `p401`, with the binary instrument being eligibility in 401k, i.e. `e401`. As before, $Y$ denotes the outcome `net_tfa`, and $X$ is the vector of covariates. We consider a partially linear structural equation model: +\begin{align} +Y :=~& g_Y(\epsilon_Y) D + f_Y(A, X, \epsilon_Y), \\ +D :=~& f_D(Z, X, A, \epsilon_D), \\ +Z :=~& f_Z(X, \epsilon_Z),\\ +A :=~& f_A(X, \epsilon_A), \\ +X :=~& \epsilon_X, +\end{align} +where $A$ is a vector of un-observed confounders. + +Under this structural equation model, the average treatment effect: +\begin{align} +\alpha = E[Y(1) - Y(0)] +\end{align} +can be identified by the moment restriction: +\begin{align} +E[(\tilde{Y} - \alpha \tilde{D}) \tilde{Z}] = 0 +\end{align} +where for any variable $V$, we denote with $\tilde{V} = V - E[V|X]$. + +```{r} +set.seed(1) +# family gaussian means that we'll be using square loss +yfit_lasso_cv <- cv.glmnet(as.matrix(X), y, family = "gaussian", alpha = 1) +# family gaussian means that we'll be using square loss +dfit_lasso_cv <- cv.glmnet(as.matrix(X), D, family = "gaussian", alpha = 1) +# family gaussian means that we'll be using square loss +zfit_lasso_cv <- cv.glmnet(as.matrix(X), Z, family = "gaussian", alpha = 1) + + +yhat_lasso_cv <- predict(yfit_lasso_cv, newx = as.matrix(X)) # predictions +dhat_lasso_cv <- predict(dfit_lasso_cv, newx = as.matrix(X)) # predictions +zhat_lasso_cv <- predict(zfit_lasso_cv, newx = as.matrix(X)) # predictions + +resy <- y - yhat_lasso_cv +resD <- D - dhat_lasso_cv +resZ <- Z - zhat_lasso_cv + +# Estimate +mean(resy * resZ) / mean(resZ * resD) +``` + +Recall if we want to do inference, we need to either use the theoretically driven penalty paramter for Lasso or perform cross-fitting. + +### DML with Non-Linear ML Models and Cross-fitting + +```{r} +# DML for PLIVM with D and Z as classifiers or regressors +dml2_for_plivm <- function(x, d, z, y, dreg, yreg, zreg, nfold = 5, method = "regression") { + nobs <- nrow(x) + foldid <- rep.int(1:nfold, times = ceiling(nobs / nfold))[sample.int(nobs)] + I <- split(1:nobs, foldid) + # create residualized objects to fill + ytil <- dtil <- ztil <- rep(NA, nobs) + # obtain cross-fitted residuals + cat("fold: ") + for (b in seq_along(I)) { + if (method == "randomforest") { + # take a fold out + dfit <- dreg(x[-I[[b]], ], d[-I[[b]]]) + zfit <- zreg(x[-I[[b]], ], z[-I[[b]]]) + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) + # predict the fold out + dhat <- predict(dfit, x[I[[b]], ], type = "prob")[, 2] # type = "prob" is like predict_proba in scikitlearn + zhat <- predict(zfit, x[I[[b]], ], type = "prob")[, 2] + # default type = "response" for regression for RF, type = "vector" for regression for Decision Trees + yhat <- predict(yfit, x[I[[b]], ]) + # record residual + # as.numeric will turn d = as.factor(d) from 0,1 to 1,2 so subtract 1! + dtil[I[[b]]] <- (as.numeric(d[I[[b]]]) - 1 - dhat) + ztil[I[[b]]] <- (as.numeric(z[I[[b]]]) - 1 - zhat) + ytil[I[[b]]] <- (y[I[[b]]] - yhat) + } else if (method == "regression") { # works for both boosted trees and glmnet + # take a fold out + dfit <- dreg(x[-I[[b]], ], d[-I[[b]]]) + zfit <- zreg(x[-I[[b]], ], z[-I[[b]]]) + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) + # predict the fold out + dhat <- predict(dfit, x[I[[b]], ], type = "response") + zhat <- predict(zfit, x[I[[b]], ], type = "response") + yhat <- predict(yfit, x[I[[b]], ], type = "response") + # record residual + dtil[I[[b]]] <- (d[I[[b]]] - dhat) + ztil[I[[b]]] <- (z[I[[b]]] - zhat) + ytil[I[[b]]] <- (y[I[[b]]] - yhat) + } else if (method == "decisiontrees") { + # take a fold out + dfit <- dreg(x[-I[[b]], ], as.factor(d)[-I[[b]]]) + zfit <- zreg(x[-I[[b]], ], as.factor(z)[-I[[b]]]) + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) + # predict the fold out + dhat <- predict(dfit, x[I[[b]], ])[, 2] + zhat <- predict(zfit, x[I[[b]], ])[, 2] + yhat <- predict(yfit, x[I[[b]], ]) + # record residual + dtil[I[[b]]] <- (d[I[[b]]] - dhat) + ztil[I[[b]]] <- (z[I[[b]]] - zhat) + ytil[I[[b]]] <- (y[I[[b]]] - yhat) + } + + cat(b, " ") + } + ivfit <- tsls(y = ytil, d = dtil, x = NULL, z = ztil, intercept = FALSE) + coef_est <- ivfit$coef # extract coefficient + se <- ivfit$se # record standard error + cat(sprintf("\ncoef (se) = %g (%g)\n", coef_est, se)) + + return(list(coef_est = coef_est, se = se, dtil = dtil, ytil = ytil, ztil = ztil)) +} +``` + +```{r} +summary_for_plivm <- function(point, stderr, resy, resD, resZ, name) { + data <- data.frame( + estimate = point, # point estimate + stderr = stderr, # standard error + lower = point - 1.96 * stderr, # lower end of 95% confidence interval + upper = point + 1.96 * stderr, # upper end of 95% confidence interval + `rmse y` = sqrt(mean(resy^2)), # RMSE of model that predicts outcome y + `rmse D` = sqrt(mean(resD^2)), # RMSE of model that predicts treatment D + `rmse Z` = sqrt(mean(resZ^2)), # RMSE of model that predicts treatment D + `accuracy D` = mean(abs(resD) < 0.5), # binary classification accuracy of model for D + `accuracy Z` = mean(abs(resZ) < 0.5) # binary classification accuracy of model for Z + ) + rownames(data) <- name + return(data) +} +``` + +#### Double Lasso with Cross-Fitting + +```{r} +# DML with LassoCV +set.seed(123) +cat(sprintf("\nDML with Lasso CV \n")) + +dreg_lasso_cv <- function(x, d) { + cv.glmnet(x, d, family = "gaussian", alpha = 1, nfolds = 5) +} +yreg_lasso_cv <- function(x, y) { + cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 5) +} +zreg_lasso_cv <- function(x, z) { + cv.glmnet(x, z, family = "gaussian", alpha = 1, nfolds = 5) +} + +dml2_results <- dml2_for_plivm(as.matrix(X), D, Z, y, dreg_lasso_cv, yreg_lasso_cv, zreg_lasso_cv, + nfold = 5, method = "regression") +sum_lasso_cv <- summary_for_plivm(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$ztil, name = "LassoCV") +tableplr <- data.frame() +tableplr <- rbind(sum_lasso_cv) +tableplr + +ytil_lasso <- dml2_results$ytil +dtil_lasso <- dml2_results$dtil +ztil_lasso <- dml2_results$ztil +``` + +#### Using a $\ell_2$ Penalized Logistic Regression for D and Z + +```{r} +# DML with Lasso/Ridge +set.seed(123) +cat(sprintf("\nDML with Lasso/Logistic \n")) + +dreg_ridge_cv <- function(x, d) { + cv.glmnet(x, d, family = "binomial", alpha = 0, nfolds = 5) +} +yreg_ridge_cv <- function(x, y) { + cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 5) +} +zreg_ridge_cv <- function(x, z) { + cv.glmnet(x, z, family = "binomial", alpha = 0, nfolds = 5) +} + +dml2_results <- dml2_for_plivm(as.matrix(X), D, Z, y, dreg_ridge_cv, yreg_ridge_cv, zreg_ridge_cv, + nfold = 5, method = "regression") +sum_lasso_ridge_cv <- summary_for_plivm(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$ztil, name = "LassoCV/LogisticCV") +tableplr <- rbind(tableplr, sum_lasso_ridge_cv) +tableplr + +ytil_ridge <- dml2_results$ytil +dtil_ridge <- dml2_results$dtil +ztil_ridge <- dml2_results$ztil +``` + +### Random Forests + +```{r} +# DML with Random Forest +set.seed(123) +cat(sprintf("\nDML with Random Forest \n")) + +dreg_rf <- function(x, d) { + randomForest(x, d, ntree = 1000, nodesize = 10) +} # ML method=Forest +yreg_rf <- function(x, y) { + randomForest(x, y, ntree = 1000, nodesize = 10) +} # ML method=Forest +zreg_rf <- function(x, z) { + randomForest(x, z, ntree = 1000, nodesize = 10) +} # ML method=Forest + +dml2_results <- dml2_for_plivm(as.matrix(X), as.factor(D), as.factor(Z), y, dreg_rf, yreg_rf, zreg_rf, + nfold = 5, method = "randomforest") +sum_rf <- summary_for_plivm(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$ztil, name = "RF") +tableplr <- rbind(tableplr, sum_rf) +tableplr + +ytil_rf <- dml2_results$ytil +dtil_rf <- dml2_results$dtil +ztil_rf <- dml2_results$ztil +``` + +### Decision Trees + +```{r} +# DML with Decision Trees +set.seed(123) +cat(sprintf("\nDML with Decision Trees \n")) + +# decision tree takes in X as dataframe, not matrix/array +dreg_tr <- function(x, d) { + rpart(as.formula("D~."), cbind(data.frame(D = d), x), method = "class", minbucket = 10, cp = 0.001) +} +yreg_tr <- function(x, y) { + rpart(as.formula("y~."), cbind(data.frame(y = y), x), minbucket = 10, cp = 0.001) +} +zreg_tr <- function(x, z) { + rpart(as.formula("Z~."), cbind(data.frame(Z = z), x), method = "class", minbucket = 10, cp = 0.001) +} + +dml2_results <- dml2_for_plivm(X, D, Z, y, dreg_tr, yreg_tr, zreg_tr, + nfold = 5, method = "decisiontrees") +sum_tr <- summary_for_plivm(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$ztil, name = "Decision Trees") +tableplr <- rbind(tableplr, sum_tr) +tableplr + +ytil_tr <- dml2_results$ytil +dtil_tr <- dml2_results$dtil +ztil_tr <- dml2_results$ztil +``` + +### Boosted Trees + +```{r} +# DML with Boosted Trees +set.seed(123) +cat(sprintf("\nDML with Boosted Trees \n")) + +# NB: early stopping cannot easily be implemented with gbm +## set n.trees = best, where best <- gbm.perf(dreg_boost, plot.it = FALSE) +dreg_boost <- function(x, d) { + gbm(as.formula("D~."), cbind(data.frame(D = d), x), distribution = "bernoulli", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +yreg_boost <- function(x, y) { + gbm(as.formula("y~."), cbind(data.frame(y = y), x), distribution = "gaussian", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +zreg_boost <- function(x, z) { + gbm(as.formula("Z~."), cbind(data.frame(Z = z), x), distribution = "bernoulli", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} + +# passing these through regression as type="response", and D and Z should not be factors! +dml2_results <- dml2_for_plivm(X, D, Z, y, dreg_boost, yreg_boost, zreg_boost, + nfold = 5, method = "regression") +sum_boost <- summary_for_plivm(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$ztil, name = "Boosted Trees") +tableplr <- rbind(tableplr, sum_boost) +tableplr + +ytil_boost <- dml2_results$ytil +dtil_boost <- dml2_results$dtil +ztil_boost <- dml2_results$ztil +``` + +## Ensembles + +Boosted trees give the best RMSE for Y, D, and Z, so the ensemble based on choosing the best performing prediction rule is identical to boosting in this case. + +```{r} +# Best fit is boosted trees for D, Z, Y + +sum_best <- summary_for_plivm(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$ztil, name = "Best") +tableplr <- rbind(tableplr, sum_best) +tableplr +``` + +We'll form a model average with unconstrained least squares weights. + +```{r} +# Least squares model average + +dhat_lasso <- D - dtil_lasso +dhat_ridge <- D - dtil_ridge +dhat_rf <- D - dtil_rf +dhat_tr <- D - dtil_tr +dhat_boost <- D - dtil_boost + +yhat_lasso <- y - ytil_lasso +yhat_ridge <- y - ytil_ridge +yhat_rf <- y - ytil_rf +yhat_tr <- y - ytil_tr +yhat_boost <- y - ytil_boost + +zhat_lasso <- Z - ztil_lasso +zhat_ridge <- Z - ztil_ridge +zhat_rf <- Z - ztil_rf +zhat_tr <- Z - ztil_tr +zhat_boost <- Z - ztil_boost + +ma_dtil <- lm(D ~ dhat_lasso + dhat_ridge + dhat_rf + dhat_tr + dhat_boost)$residuals +ma_ytil <- lm(y ~ yhat_lasso + yhat_ridge + yhat_rf + yhat_tr + yhat_boost)$residuals +ma_ztil <- lm(Z ~ zhat_lasso + zhat_ridge + zhat_rf + zhat_tr + zhat_boost)$residuals + +ivfit <- tsls(y = ma_ytil, d = ma_dtil, x = NULL, z = ma_ztil, intercept = FALSE) +coef_est <- ivfit$coef # extract coefficient +se <- ivfit$se # record standard error + +sum_ma <- summary_for_plivm(coef_est, se, ma_ytil, ma_dtil, ma_ztil, name = "Model Average") +tableplr <- rbind(tableplr, sum_ma) +tableplr +``` + +## Inference Robust to Weak Identification + +Now we turn toward robustness when the instrument is weak. + +Ideally, we would do (semi) cross-fitting with AutoML in order to find good first-stage models and re-run DML with these models. Unfortunately this is not easy to do in R. In the case of semi cross-fitting, we can use R's H20 AutoML trained on the entire training set $y\sim X$, $D \sim X$, $Z\sim X$ to determine the best model (eg ensemble), but H20 does not allow you to extract the best model so we can re-use that in DML. + +Thus, in the below analysis of robust inference, we choose Boosted Trees as they perform well. + +```{r} +robust_inference <- function(point, stderr, resD, resy, resZ, grid, alpha = 0.05) { + # Inference in the partially linear IV model that is robust to weak identification. + # grid: grid of theta values to search over when trying to identify the confidence region + # alpha: confidence level + + n <- dim(X)[1] + thr <- qchisq(1 - alpha, df = 1) + accept <- c() + + for (theta in grid) { + moment <- (resy - theta * resD) * resZ + test <- n * mean(moment)^2 / var(moment) + if (test <= thr) { + accept <- c(accept, theta) + } + } + + return(accept) +} +``` + +```{r} +grid <- seq(0, 20000, length.out = 10000) +region <- robust_inference(dml2_results$coef_est, dml2_results$stderr, dml2_results$dtil, dml2_results$ytil, + dml2_results$ztil, grid = grid) +``` + +```{r} +grid <- seq(0, 20000, length.out = 10000) +region <- robust_inference(dml2_results$coef_est, dml2_results$stderr, dml2_results$dtil, dml2_results$ytil, + dml2_results$ztil, grid = grid) +# Calculate min and max +min_region <- min(region) +max_region <- max(region) + +print(min_region) +print(max_region) +``` + +# Interactive IV Model and LATE + +Now, we consider estimation of local average treatment effects (LATE) of participation `p401`, with the binary instrument `e401`. As before, $Y$ denotes the outcome `net_tfa`, and $X$ is the vector of covariates. Here the structural equation model is: +\begin{align} +Y :=~& f_Y (D, X, A, \epsilon_Y) \\ +D :=~& f_D(Z, X, A, \epsilon_D) \in \{0,1\}, \\ +Z :=~& f_Z(X,\epsilon_Z) \in \{0,1\}, \\ +X :=~& \epsilon_X, \quad A = \epsilon_A, +\end{align} +where $\epsilon$'s are all exogenous and independent, +and +$$ +z \mapsto f_D(z , A, X, \epsilon_D) \text{ is weakly increasing (weakly monotone)}. +$$ +and $A$ is a vector of unobserved confounders. Note that in our setting monotonicity is satisfied, since participation is only feasible when it is eligible. Thus we have that $D=0$ whenever $Z=0$. Thus it can only be that $f_D(1, A, X, \epsilon_D) \geq 0 = f_D(0, A, X, \epsilon_D)$. + +In this case, we can estimate the local average treatment effect (LATE): +$$ +\alpha = E[Y(1) - Y(0) | D(1) > D(0)] +$$ +This can be identified using the Neyman orthogonal moment equation: +\begin{align} +E\left[g(1, X) - g(0, X) + H(Z) (Y - g(Z, X)) - \alpha \cdot (m(1, X) - m(0, X) + H(Z) (D - m(Z, X))\right] = 0 +\end{align} +where +\begin{align} +g(Z,X) =~& E[Y|Z,X],\\ +m(Z,X) =~& E[D|Z,X],\\ +H(Z) =~& \frac{Z}{Pr(Z=1|X)} - \frac{1 - Z}{1 - Pr(Z=1|X)} +\end{align} + +```{r} +get_dhat0 <- function(XZ0, DZ0, Xb, dreg0, type = NULL, DZ0factor = NULL) { + # train a treatment model on training data that received Z=0 and predict treatment on all data in test set + if (mean(DZ0) > 0) { # it could be that D=0, whenever Z=0 deterministically + dreg0_ <- dreg0 + if (is.null(DZ0factor)) { + dfit0 <- dreg0_((XZ0), DZ0) + } else { + dfit0 <- dreg0_((XZ0), DZ0factor) + } + if (is.null(type)) { + return(predict(dfit0, (Xb))[, 2]) + } else if (type == "prob") { + return(predict(dfit0, (Xb), type = "prob")[, 2]) + } else if (type == "reponse") { + return(predict(dfit0, (Xb), type = "response")) + } else { + stop("Invalid argument `type`.") + } + } else { + return(0) + } +} + +get_dhat1 <- function(XZ1, DZ1, Xb, dreg1, type = NULL, DZ1factor = NULL) { + # train a treamtent model on training data that received Z=1 and predict treatment on all data in test set + if (mean(DZ1) < 1) { # it could be that D=1, whenever Z=1 deterministically + dreg1_ <- dreg1 + if (is.null(DZ1factor)) { + dfit1 <- dreg1_((XZ1), DZ1) + } else { + dfit1 <- dreg1_((XZ1), DZ1factor) + } + if (is.null(type)) { + return(predict(dfit1, (Xb))[, 2]) + } else if (type == "prob") { + return(predict(dfit1, (Xb), type = "prob")[, 2]) + } else if (type == "response") { + return(predict(dfit1, (Xb), type = "response")) + } else { + stop("Invalid argument `type`.") + } + } else { + return(1) + } +} + +# DML for IIVM with D and Z as classifiers or regressors +dml2_for_iivm <- function(x, d, z, y, dreg0, dreg1, yreg0, yreg1, zreg, + trimming = 0.01, nfold = 5, method = "classification", dt = 0, bt = 0) { + # this implements DML2 algorithm, where there moments are estimated via DML, before constructing + # the pooled estimate of theta randomly split data into folds + + ## NB This method has many if statements to accommodate the various estimators we will use. + ## Unlike Python's sklearn, all methods have idfferent default arguments in their predict functions. + ## See official R documentation for details. + + yhat0 <- rep(0, length(y)) + yhat1 <- rep(0, length(y)) + dhat0 <- rep(0, length(d)) + dhat1 <- rep(0, length(d)) + zhat <- rep(0, length(Z)) + + nobs <- nrow(X) + foldid <- rep.int(1:nfold, times = ceiling(nobs / nfold))[sample.int(nobs)] + I <- split(1:nobs, foldid) + # create residualized objects to fill + ytil <- dtil <- ztil <- rep(NA, nobs) + + # obtain cross-fitted residuals + cat("fold: ") + for (b in seq_along(I)) { + # define helpful variables + Xb <- X[I[[b]], ] + Xnotb <- X[-I[[b]], ] + Znotb <- Z[-I[[b]]] + + # training dfs subsetted on the -I[[b]] fold + XZ0 <- X[-I[[b]], ][Z[-I[[b]]] == 0] + yZ0 <- y[-I[[b]]][Z[-I[[b]]] == 0] + XZ1 <- X[-I[[b]], ][Z[-I[[b]]] == 1] + yZ1 <- y[-I[[b]]][Z[-I[[b]]] == 1] + DZ0 <- d[-I[[b]]][Z[-I[[b]]] == 0] + DZ1 <- d[-I[[b]]][Z[-I[[b]]] == 1] + + + if (method == "regression") { + XZ0 <- as.matrix(XZ0) + XZ1 <- as.matrix(XZ1) + Xb <- as.matrix(Xb) + Xnotb <- as.matrix(Xnotb) + + # Train an outcome model on training data that received Z=0 and predict outcome on all data in the test set + yfit0 <- yreg0((XZ0), yZ0) + yhat0[I[[b]]] <- predict(yfit0, (Xb)) # default type = "response" + + # train an outcome model on training data that received Z=1 and predict outcome on all data in test set + yfit1 <- yreg1((XZ1), yZ1) + yhat1[I[[b]]] <- predict(yfit1, (Xb)) + + # train a treatment model on training data that received Z=0 and predict treatment on all data in test set + # default type = "response", but for family binomial it's logg odds + dhat0[I[[b]]] <- get_dhat0(XZ0, DZ0, Xb, dreg0, type = "response") + dhat1[I[[b]]] <- get_dhat1(XZ1, DZ1, Xb, dreg1, type = "response") + + } else if (method == "randomforest") { + DZ0factor <- as.factor(D)[-I[[b]]][Z[-I[[b]]] == 0] + DZ1factor <- as.factor(D)[-I[[b]]][Z[-I[[b]]] == 1] + Znotb <- as.factor(Znotb) + + yfit0 <- yreg0((XZ0), yZ0) + yhat0[I[[b]]] <- predict(yfit0, (Xb), type = "response") + yfit1 <- yreg1((XZ1), yZ1) + yhat1[I[[b]]] <- predict(yfit1, (Xb), type = "response") + + dhat0[I[[b]]] <- get_dhat0(XZ0, DZ0, Xb, dreg0, type = "prob", DZ0factor = DZ0factor) + dhat1[I[[b]]] <- get_dhat1(XZ1, DZ1, Xb, dreg1, type = "prob", DZ1factor = DZ1factor) + + } else if (method == "decisiontrees") { + XZ0 <- as.data.frame(XZ0) + XZ1 <- as.data.frame(XZ1) + Xb <- as.data.frame(Xb) + Xnotb <- as.data.frame(Xnotb) + + yfit0 <- yreg0((XZ0), yZ0) + # default type = "response" for decision trees for continuous response + yhat0[I[[b]]] <- predict(yfit0, (Xb)) + + yfit1 <- yreg1((XZ1), yZ1) + yhat1[I[[b]]] <- predict(yfit1, (Xb)) + + dhat0[I[[b]]] <- get_dhat0(XZ0, DZ0, Xb, dreg0, DZ0factor = as.factor(DZ0)) + dhat1[I[[b]]] <- get_dhat1(XZ1, DZ1, Xb, dreg1, DZ1factor = as.factor(DZ1)) + + } else if (method == "boostedtrees") { + XZ0 <- as.data.frame(XZ0) + XZ1 <- as.data.frame(XZ1) + Xb <- as.data.frame(Xb) + Xnotb <- as.data.frame(Xnotb) + + yfit0 <- yreg0((XZ0), yZ0) + yhat0[I[[b]]] <- predict(yfit0, (Xb)) # default type = "response" for boosted trees + yfit1 <- yreg1((XZ1), yZ1) + yhat1[I[[b]]] <- predict(yfit1, (Xb)) + + dhat0[I[[b]]] <- get_dhat0(XZ0, DZ0, Xb, dreg0, type = "response") + dhat1[I[[b]]] <- get_dhat1(XZ1, DZ1, Xb, dreg1, type = "response") + + } + + # propensity scores: + if (method == "regression") { + zfit_b <- zreg((Xnotb), Znotb) + zhat_b <- predict(zfit_b, (Xb), type = "response") + } else if (method == "randomforest") { + zfit_b <- zreg((Xnotb), Znotb) + zhat_b <- predict(zfit_b, (Xb), type = "prob")[, 2] + } else if (method == "decisiontrees") { + zfit_b <- zreg((Xnotb), as.factor(Znotb)) + zhat_b <- predict(zfit_b, (Xb)) # default is prob, so get second column + zhat_b <- zhat_b[, 2] + } else if (method == "boostedtrees") { + zfit_b <- zreg((Xnotb), Znotb) + zhat_b <- predict(zfit_b, (Xb), type = "response") + } + zhat_b <- pmax(pmin(zhat_b, 1 - trimming), trimming) # trimming so scores are between [trimming, (1-trimming)] + zhat[I[[b]]] <- zhat_b + + cat(b, " ") + } + + + # Prediction of treatment and outcome for observed instrument + yhat <- yhat0 * (1 - Z) + yhat1 * Z + dhat <- dhat0 * (1 - Z) + dhat1 * Z + + # residuals + ytil <- y - yhat + dtil <- D - dhat + ztil <- Z - zhat + + # doubly robust quantity for every sample + HZ <- Z / zhat - (1 - Z) / (1 - zhat) + drZ <- yhat1 - yhat0 + (y - yhat) * HZ + drD <- dhat1 - dhat0 + (D - dhat) * HZ + coef_est <- mean(drZ) / mean(drD) + cat("point", coef_est) + psi <- drZ - coef_est * drD + Jhat <- mean(drD) + variance <- mean(psi^2) / Jhat^2 + se <- sqrt(variance / nrow(X)) + cat("se", se) + + return(list(coef_est = coef_est, se = se, yhat = yhat, dhat = dhat, zhat = zhat, ytil = ytil, + dtil = dtil, ztil = ztil, drZ = drZ, drD = drD, + yhat0 = yhat0, yhat1 = yhat1, dhat0 = dhat0, dhat1 = dhat1)) +} +``` + +```{r} +summary_for_iivm <- function(coef_est, se, yhat, dhat, zhat, ytil, dtil, ztil, drZ, drD, X, Z, D, y, name) { + summary_data <- data.frame( + estimate = coef_est, # point estimate + se = se, # standard error + lower = coef_est - 1.96 * se, # lower end of 95% confidence interval + upper = coef_est + 1.96 * se, # upper end of 95% confidence interval + rmse_y = sqrt(mean(ytil^2)), # res of model that predicts outcome y + rmse_D = sqrt(mean(dtil^2)), # res of model that predicts treatment D + rmse_Z = sqrt(mean(ztil^2)), # res of model that predicts instrument Z + accuracy_D = mean(abs(dtil) < 0.5), # binary classification accuracy of model for D + accuracy_Z = mean(abs(ztil) < 0.5) # binary classification accuracy of model for Z + ) + row.names(summary_data) <- name + return(summary_data) +} +``` + +```{r} +# DML with Lasso/Ridge +set.seed(123) +cat(sprintf("\nDML with Lasso/Logistic \n")) +# DML with Lasso/Ridge +dreg0 <- function(x, d) { + cv.glmnet(x, d, family = "binomial", alpha = 0, nfolds = 5) +} +dreg1 <- function(x, d) { + cv.glmnet(x, d, family = "binomial", alpha = 0, nfolds = 5) +} +yreg0 <- function(x, y) { + cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 5) +} +yreg1 <- function(x, y) { + cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 5) +} +zreg <- function(x, z) { + cv.glmnet(x, z, family = "binomial", alpha = 0, nfolds = 5) +} + +dml2_results <- dml2_for_iivm(as.matrix(X), D, Z, y, dreg0, dreg1, yreg0, yreg1, zreg, + trimming = 0.01, nfold = 5, method = "regression") +sum_lasso_ridge_cv <- summary_for_iivm(dml2_results$coef_est, dml2_results$se, dml2_results$yhat, dml2_results$dhat, + dml2_results$zhat, dml2_results$ytil, dml2_results$dtil, dml2_results$ztil, + dml2_results$drZ, dml2_results$drD, name = "LassoCV/LogisticCV") +table <- data.frame() +table <- rbind(table, sum_lasso_ridge_cv) +table + +yhat_lasso <- dml2_results$yhat +dhat_lasso <- dml2_results$dhat +yhat0_lasso <- dml2_results$yhat0 +yhat1_lasso <- dml2_results$yhat1 +dhat0_lasso <- dml2_results$dhat0 +dhat1_lasso <- dml2_results$dhat1 +zhat_lasso <- dml2_results$zhat +``` + +```{r} +# DML with Random Forest +set.seed(123) +cat(sprintf("\nDML with Random Forest \n")) + +dreg0 <- function(x, d) { + randomForest(x, d, ntree = 1000, nodesize = 10) +} # ML method=Forest +dreg1 <- function(x, d) { + randomForest(x, d, ntree = 1000, nodesize = 10) +} # ML method=Forest +yreg0 <- function(x, y) { + randomForest(x, y, ntree = 1000, nodesize = 10) +} # ML method=Forest +yreg1 <- function(x, y) { + randomForest(x, y, ntree = 1000, nodesize = 10) +} # ML method=Forest +zreg <- function(x, z) { + randomForest(x, z, ntree = 1000, nodesize = 10) +} # ML method=Forest + +dml2_results <- dml2_for_iivm(X, D, Z, y, dreg0, dreg1, yreg0, yreg1, zreg, + trimming = 0.01, nfold = 5, method = "randomforest") +sum_rf <- summary_for_iivm(dml2_results$coef_est, dml2_results$se, dml2_results$yhat, dml2_results$dhat, + dml2_results$zhat, dml2_results$ytil, dml2_results$dtil, dml2_results$ztil, + dml2_results$drZ, dml2_results$drD, name = "RF") +table <- rbind(table, sum_rf) +table + +yhat_rf <- dml2_results$yhat +dhat_rf <- dml2_results$dhat +yhat0_rf <- dml2_results$yhat0 +yhat1_rf <- dml2_results$yhat1 +dhat0_rf <- dml2_results$dhat0 +dhat1_rf <- dml2_results$dhat1 +zhat_rf <- dml2_results$zhat +``` + +```{r} +# DML with Decision Trees +set.seed(123) +cat(sprintf("\nDML with Decision Trees \n")) + +dreg0 <- function(x, d) { + rpart(as.formula("D ~ ."), cbind(data.frame(D = d), x), method = "class", minbucket = 10, cp = 0.001) +} +dreg1 <- function(x, d) { + rpart(as.formula("D ~ ."), cbind(data.frame(D = d), x), method = "class", minbucket = 10, cp = 0.001) +} +yreg0 <- function(x, y) { + rpart(as.formula("y ~ ."), cbind(data.frame(y = y), x), minbucket = 10, cp = 0.001) +} +yreg1 <- function(x, y) { + rpart(as.formula("y ~ ."), cbind(data.frame(y = y), x), minbucket = 10, cp = 0.001) +} +zreg <- function(x, z) { + rpart(as.formula("Z ~ ."), cbind(data.frame(Z = z), x), method = "class", minbucket = 10, cp = 0.001) +} + +dml2_results <- dml2_for_iivm(X, D, Z, y, dreg0, dreg1, yreg0, yreg1, zreg, + trimming = 0.01, nfold = 5, method = "decisiontrees") +sum_tr <- summary_for_iivm(dml2_results$coef_est, dml2_results$se, dml2_results$yhat, dml2_results$dhat, + dml2_results$zhat, dml2_results$ytil, dml2_results$dtil, dml2_results$ztil, + dml2_results$drZ, dml2_results$drD, name = "Decision Trees") +table <- rbind(table, sum_tr) +table + +yhat_tr <- dml2_results$yhat +dhat_tr <- dml2_results$dhat +yhat0_tr <- dml2_results$yhat0 +yhat1_tr <- dml2_results$yhat1 +dhat0_tr <- dml2_results$dhat0 +dhat1_tr <- dml2_results$dhat1 +zhat_tr <- dml2_results$zhat +``` + +```{r} +# DML with Boosted Trees +set.seed(123) +cat(sprintf("\nDML with Boosted Trees \n")) + +# NB: early stopping cannot easily be implemented with gbm +## set n.trees = best, where best <- gbm.perf(dreg_boost, plot.it = FALSE) +dreg0 <- function(x, d) { + gbm(as.formula("D ~ ."), cbind(data.frame(D = d), x), distribution = "bernoulli", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +dreg1 <- function(x, d) { + gbm(as.formula("D ~ ."), cbind(data.frame(D = d), x), distribution = "bernoulli", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +yreg0 <- function(x, y) { + gbm(as.formula("y ~ ."), cbind(data.frame(y = y), x), distribution = "gaussian", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +yreg1 <- function(x, y) { + gbm(as.formula("y ~ ."), cbind(data.frame(y = y), x), distribution = "gaussian", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +zreg <- function(x, z) { + gbm(as.formula("Z ~ ."), cbind(data.frame(Z = z), x), distribution = "bernoulli", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} + +# passing these through regression as type="response", and D and Z should not be factors! +dml2_results <- dml2_for_iivm(X, D, Z, y, dreg0, dreg1, yreg0, yreg1, zreg, + trimming = 0.01, nfold = 5, method = "boostedtrees") +sum_boost <- summary_for_iivm(dml2_results$coef_est, dml2_results$se, dml2_results$yhat, dml2_results$dhat, + dml2_results$zhat, dml2_results$ytil, dml2_results$dtil, dml2_results$ztil, + dml2_results$drZ, dml2_results$drD, name = "Boosted Trees") +table <- rbind(table, sum_boost) +table + +yhat_boost <- dml2_results$yhat +dhat_boost <- dml2_results$dhat +yhat0_boost <- dml2_results$yhat0 +yhat1_boost <- dml2_results$yhat1 +dhat0_boost <- dml2_results$dhat0 +dhat1_boost <- dml2_results$dhat1 +zhat_boost <- dml2_results$zhat +``` + +## Ensembles + +Boosted trees give the best RMSE for D and Z and random forests give the best RMSE for Y. + +```{r} +# Best fit is boosted trees for D, Z and random forests for Y + +best_yhat0 <- yhat0_rf +best_yhat1 <- yhat1_rf +best_yhat <- yhat_rf + +best_dhat0 <- dhat0_boost +best_dhat1 <- dhat1_boost +best_dhat <- dhat_boost + +best_zhat <- zhat_boost + +ytil_best <- y - best_yhat +dtil_best <- D - best_dhat +ztil_best <- Z - best_zhat + +# doubly robust quantity for every sample +HZ <- Z / best_zhat - (1 - Z) / (1 - best_zhat) +drZ <- best_yhat1 - best_yhat0 + (y - best_yhat) * HZ +drD <- best_dhat1 - best_dhat0 + (D - best_dhat) * HZ +coef_est <- mean(drZ) / mean(drD) +psi <- drZ - coef_est * drD +Jhat <- mean(drD) +variance <- mean(psi^2) / Jhat^2 +se <- sqrt(variance / nrow(X)) + +sum_best <- summary_for_iivm(coef_est, se, best_yhat, best_dhat, best_zhat, + ytil_best, dtil_best, ztil_best, drZ, drD, name = "Best") +table <- rbind(table, sum_best) +table +``` + +We'll form a model average with unconstrained least squares weights. + +```{r} +# Least squares model average +ma_dcoef <- lm(D ~ dhat_lasso + dhat_rf + dhat_tr + dhat_boost - 1)$coef +ma_ycoef <- lm(y ~ yhat_lasso + yhat_rf + yhat_tr + yhat_boost - 1)$coef +ma_zcoef <- lm(Z ~ zhat_lasso + zhat_rf + zhat_tr + zhat_boost - 1)$coef + +ma_yhat0 <- cbind(yhat0_lasso, yhat0_rf, yhat0_tr, yhat0_boost) %*% as.matrix(ma_ycoef) +ma_yhat1 <- cbind(yhat1_lasso, yhat1_rf, yhat1_tr, yhat1_boost) %*% as.matrix(ma_ycoef) +ma_dhat0 <- cbind(dhat0_lasso, dhat0_rf, dhat0_tr, dhat0_boost) %*% as.matrix(ma_dcoef) +ma_dhat1 <- cbind(dhat1_lasso, dhat1_rf, dhat1_tr, dhat1_boost) %*% as.matrix(ma_dcoef) +ma_zhat <- cbind(zhat_lasso, zhat_rf, zhat_tr, zhat_boost) %*% as.matrix(ma_zcoef) + +# Prediction of treatment and outcome for observed instrument +ma_yhat <- ma_yhat0 * (1 - Z) + ma_yhat1 * Z +ma_dhat <- ma_dhat0 * (1 - Z) + ma_dhat1 * Z + +# residuals +ma_ytil <- y - ma_yhat +ma_dtil <- D - ma_dhat +ma_ztil <- Z - ma_zhat + +# doubly robust quantity for every sample +HZ <- Z / ma_zhat - (1 - Z) / (1 - ma_zhat) +drZ <- ma_yhat1 - ma_yhat0 + (y - ma_yhat) * HZ +drD <- ma_dhat1 - ma_dhat0 + (D - ma_dhat) * HZ +coef_est <- mean(drZ) / mean(drD) +psi <- drZ - coef_est * drD +Jhat <- mean(drD) +variance <- mean(psi^2) / Jhat^2 +se <- sqrt(variance / nrow(X)) + +sum_ma <- summary_for_iivm(coef_est, se, ma_yhat, ma_dhat, ma_zhat, + ma_ytil, ma_dtil, ma_ztil, drZ, drD, name = "Model Average") +table <- rbind(table, sum_ma) +table +``` + +Comparing with the PLR model + +```{r} +tableplr +``` + +We find that the PLR model overestimates the effect by around 1k; though both sets of results have overlapping confidence intervals. + + +Again as before, ideally we would do (semi) cross-fitting with AutoML in order to find good first-stage models and re-run DML with these models. Unfortunately this is not easy to do in R. + +As before, in the below analysis of robust inference, we choose Boosted Trees as they perform well in RMSE and accuracy on first-stage models. + +```{r} +iivm_robust_inference <- function(point, stderr, yhat, Dhat, Zhat, resy, resD, resZ, + drZ, drD, X, Z, D, y, grid, alpha = 0.05) { + # Inference in the partially linear IV model that is robust to weak identification. + # grid: grid of theta values to search over when trying to identify the confidence region + # alpha: confidence level + + n <- dim(X)[1] + thr <- qchisq(1 - alpha, df = 1) + accept <- c() + + for (theta in grid) { + moment <- drZ - theta * drD + test <- n * mean(moment)^2 / var(moment) + if (test <= thr) { + accept <- c(accept, theta) + } + } + + return(accept) +} +``` + +```{r} +grid <- seq(0, 20000, length.out = 10000) +region <- iivm_robust_inference(point = dml2_results$coef_est, stderr = dml2_results$se, + yhat = dml2_results$yhat, Dhat = dml2_results$dhat, Zhat = dml2_results$zhat, + resy = dml2_results$ytil, resD = dml2_results$dtil, resZ = dml2_results$ztil, + drZ = dml2_results$drZ, drD = dml2_results$drD, + X = X, Z = Z, D = D, y = y, grid = grid) + +# Calculate min and max +min_region <- min(region) +max_region <- max(region) + +print(min_region) +print(max_region) +``` + +We find again that the robust inference confidence region is almost identical to the normal based inference. We are most probably in the strong instrument regime. We can check the t-statistic for the effect of the instrument on the treatment to verify this. + +# DoubleML package + +There exist nice packages out there that can help us do our estimation with the simple call of a function. Such packages include `EconML` (Python) and `DoubleML` (Python and R). + +We run through IIVM using `DoubleML` below to illustrate. The `DoubleML` package internally builds on `mlr3`. We use the meta package `mlr3` to generate predictions with machine learning methods. A comprehensive introduction and description of the `mlr3` package is provided in the [mlr3book](https://mlr3book.mlr-org.com/). A list of all learners that you can use in `mlr3` can be found [here](https://mlr3extralearners.mlr-org.com/articles/learners/list_learners.html). The entry in the columns *mlr3 Package* and *Packages* indicate which packages must be installed/loaded in your R session. + +You find additional information about `DoubleML` on the package on the package website https://docs.doubleml.org/ and the R documentation page https://docs.doubleml.org/r/stable/. + +```{r} +install.packages("DoubleML") +install.packages("mlr3learners") +install.packages("mlr3") +install.packages("data.table") +install.packages("ranger") +``` + +```{r} +library(DoubleML) +library(mlr3learners) +library(mlr3) +library(data.table) +library(ranger) +``` + +## Local Average Treatment Effects of 401(k) Participation on Net Financial Assets + +## Interactive IV Model (IIVM) + +Now, we consider estimation of local average treatment effects (LATE) of participation with the binary instrument `e401`. As before, $Y$ denotes the outcome `net_tfa`, and $X$ is the vector of covariates. Here the structural equation model is: + +\begin{align} +Y :=~& g_0(Z,X) + U, &\quad E[U\mid Z,X] = 0,\\ +D :=~& r_0(Z,X) + V, &\quad E[V\mid Z, X] = 0,\\ +Z :=~& m_0(X) + \zeta, &\quad E[\zeta \mid X] = 0. +\end{align} + +```{r} +# Constructing the data (as DoubleMLData) +formula_flex2 <- paste("net_tfa ~ p401+ e401 + poly(age, 6, raw=TRUE) + poly(inc, 8, raw=TRUE) + ", + "poly(educ, 4, raw=TRUE) + poly(fsize, 2, raw=TRUE) + male + marr + twoearn + db + pira + hown") +model_flex2 <- as.data.table(model.frame(formula_flex2, data)) +x_cols <- colnames(model_flex2)[-c(1, 2, 3)] +data_iv <- DoubleMLData$new(model_flex2, y_col = "net_tfa", d_cols = "p401", z_cols = "e401", x_cols = x_cols) +``` + +```{r} +lgr::get_logger("mlr3")$set_threshold("warn") +lasso <- lrn("regr.cv_glmnet", nfolds = 5, s = "lambda.min") +lasso_class <- lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min") +dml_mliivm <- DoubleMLIIVM$new(data_iv, + ml_g = lasso, + ml_m = lasso_class, ml_r = lasso_class, n_folds = 5, subgroups = list( + always_takers = FALSE, + never_takers = TRUE + ) +) +dml_mliivm$fit(store_predictions = TRUE) +dml_mliivm$summary() +lasso_mliivm <- dml_mliivm$coef +lasso_std_mliivm <- dml_mliivm$se +``` + +The confidence interval for the local average treatment effect of participation is given by + +```{r} +dml_mliivm$confint(level = 0.95) +``` + +Here we can also check the accuracy of the model: + +```{r} +# variables +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$p401) +z <- as.matrix(pension$e401) + +# predictions +dml_mliivm$params_names() +g0_hat <- as.matrix(dml_mliivm$predictions$ml_g0) # predictions of g_0(z=0, X) +g1_hat <- as.matrix(dml_mliivm$predictions$ml_g1) # predictions of g_0(z=1, X) +g_hat <- z * g1_hat + (1 - z) * g0_hat # predictions of g_0 +r0_hat <- as.matrix(dml_mliivm$predictions$ml_r0) # predictions of r_0(z=0, X) +r1_hat <- as.matrix(dml_mliivm$predictions$ml_r1) # predictions of r_0(z=1, X) +r_hat <- z * r1_hat + (1 - z) * r0_hat # predictions of r_0 +m_hat <- as.matrix(dml_mliivm$predictions$ml_m) # predictions of m_o +``` + +```{r} +# cross-fitted RMSE: outcome +lasso_y_mliivm <- sqrt(mean((y - g_hat)^2)) +lasso_y_mliivm + +# cross-fitted RMSE: treatment +lasso_d_mliivm <- sqrt(mean((d - r_hat)^2)) +lasso_d_mliivm + +# cross-fitted RMSE: instrument +lasso_z_mliivm <- sqrt(mean((z - m_hat)^2)) +lasso_z_mliivm +``` + +Again, we repeat the procedure for the other machine learning methods: + +```{r} +# needed to run boosting +remotes::install_github("mlr-org/mlr3extralearners") +install.packages("mlr3extralearners") +install.packages("mboost") +``` + +```{r} +library(mlr3extralearners) +library(mboost) +``` + +```{r} +# Forest +randomForest <- lrn("regr.ranger") +random_forest_class <- lrn("classif.ranger") + +# Trees +trees <- lrn("regr.rpart") +trees_class <- lrn("classif.rpart") + +# Boosting +boost <- lrn("regr.glmboost") +boost_class <- lrn("classif.glmboost") +``` + +```{r} +### random forest ### + +lgr::get_logger("mlr3")$set_threshold("warn") +dml_mliivm <- DoubleMLIIVM$new(data_iv, + ml_g = randomForest, + ml_m = random_forest_class, ml_r = random_forest_class, n_folds = 3, subgroups = list( + always_takers = FALSE, + never_takers = TRUE + ) +) +dml_mliivm$fit(store_predictions = TRUE) +dml_mliivm$summary() +forest_mliivm <- dml_mliivm$coef +forest_std_mliivm <- dml_mliivm$se + +# predictions +g0_hat <- as.matrix(dml_mliivm$predictions$ml_g0) # predictions of g_0(Z=0, X) +g1_hat <- as.matrix(dml_mliivm$predictions$ml_g1) # predictions of g_0(Z=1, X) +g_hat <- z * g1_hat + (1 - z) * g0_hat # predictions of g_0 +r0_hat <- as.matrix(dml_mliivm$predictions$ml_r0) # predictions of r_0(Z=0, X) +r1_hat <- as.matrix(dml_mliivm$predictions$ml_r1) # predictions of r_0(Z=1, X) +r_hat <- z * r1_hat + (1 - z) * r0_hat # predictions of r_0 +m_hat <- as.matrix(dml_mliivm$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +forest_y_mliivm <- sqrt(mean((y - g_hat)^2)) +forest_y_mliivm + +# cross-fitted RMSE: treatment +forest_d_mliivm <- sqrt(mean((d - r_hat)^2)) +forest_d_mliivm + +# cross-fitted RMSE: instrument +forest_z_mliivm <- sqrt(mean((z - m_hat)^2)) +forest_z_mliivm + +### trees ### + +dml_mliivm <- DoubleMLIIVM$new(data_iv, + ml_g = trees, + ml_m = trees_class, ml_r = trees_class, n_folds = 3, subgroups = list( + always_takers = FALSE, + never_takers = TRUE + ) +) +dml_mliivm$fit(store_predictions = TRUE) +dml_mliivm$summary() +tree_mliivm <- dml_mliivm$coef +tree_std_mliivm <- dml_mliivm$se + +# predictions +g0_hat <- as.matrix(dml_mliivm$predictions$ml_g0) # predictions of g_0(Z=0, X) +g1_hat <- as.matrix(dml_mliivm$predictions$ml_g1) # predictions of g_0(Z=1, X) +g_hat <- z * g1_hat + (1 - z) * g0_hat # predictions of g_0 +r0_hat <- as.matrix(dml_mliivm$predictions$ml_r0) # predictions of r_0(Z=0, X) +r1_hat <- as.matrix(dml_mliivm$predictions$ml_r1) # predictions of r_0(Z=1, X) +r_hat <- z * r1_hat + (1 - z) * r0_hat # predictions of r_0 +m_hat <- as.matrix(dml_mliivm$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +tree_y_mliivm <- sqrt(mean((y - g_hat)^2)) +tree_y_mliivm + +# cross-fitted RMSE: treatment +tree_d_mliivm <- sqrt(mean((d - r_hat)^2)) +tree_d_mliivm + +# cross-fitted RMSE: instrument +tree_z_mliivm <- sqrt(mean((z - m_hat)^2)) +tree_z_mliivm + + +### boosting ### +dml_mliivm <- DoubleMLIIVM$new(data_iv, + ml_g = boost, + ml_m = boost_class, ml_r = boost_class, n_folds = 3, subgroups = list( + always_takers = FALSE, + never_takers = TRUE + ) +) +dml_mliivm$fit(store_predictions = TRUE) +dml_mliivm$summary() +boost_mliivm <- dml_mliivm$coef +boost_std_mliivm <- dml_mliivm$se + +# predictions +g0_hat <- as.matrix(dml_mliivm$predictions$ml_g0) # predictions of g_0(Z=0, X) +g1_hat <- as.matrix(dml_mliivm$predictions$ml_g1) # predictions of g_0(Z=1, X) +g_hat <- z * g1_hat + (1 - z) * g0_hat # predictions of g_0 +r0_hat <- as.matrix(dml_mliivm$predictions$ml_r0) # predictions of r_0(Z=0, X) +r1_hat <- as.matrix(dml_mliivm$predictions$ml_r1) # predictions of r_0(Z=1, X) +r_hat <- z * r1_hat + (1 - z) * r0_hat # predictions of r_0 +m_hat <- as.matrix(dml_mliivm$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +boost_y_mliivm <- sqrt(mean((y - g_hat)^2)) +boost_y_mliivm + +# cross-fitted RMSE: treatment +boost_d_mliivm <- sqrt(mean((d - r_hat)^2)) +boost_d_mliivm + +# cross-fitted RMSE: instrument +boost_z_mliivm <- sqrt(mean((z - m_hat)^2)) +boost_z_mliivm +``` + +```{r} +table <- matrix(0, 5, 4) +table[1, 1:4] <- c(lasso_mliivm, forest_mliivm, tree_mliivm, boost_mliivm) +table[2, 1:4] <- c(lasso_std_mliivm, forest_std_mliivm, tree_std_mliivm, boost_std_mliivm) +table[3, 1:4] <- c(lasso_y_mliivm, forest_y_mliivm, tree_y_mliivm, boost_y_mliivm) +table[4, 1:4] <- c(lasso_d_mliivm, forest_d_mliivm, tree_d_mliivm, boost_d_mliivm) +table[5, 1:4] <- c(lasso_z_mliivm, forest_z_mliivm, tree_z_mliivm, boost_z_mliivm) +rownames(table) <- c("Estimate", "Std.Error", "RMSE Y", "RMSE D", "RMSE Z") +colnames(table) <- c("Lasso", "Random Forest", "Trees", "Boosting") +tab <- xtable(table, digits = 2) +tab +``` + +We report results based on four ML methods for estimating the nuisance functions used in +forming the orthogonal estimating equations. We find again that the estimates of the treatment effect are stable across ML methods. The estimates are highly significant, hence we would reject the hypothesis +that the effect of 401(k) participation has no effect on financial health. + +We might rerun the model using the best ML method for each equation to get a final estimate for the treatment effect of participation: + +```{r} +lgr::get_logger("mlr3")$set_threshold("warn") +dml_mliivm <- DoubleMLIIVM$new(data_iv, + ml_g = randomForest, + ml_m = lasso_class, ml_r = lasso_class, n_folds = 5, subgroups = list( + always_takers = FALSE, + never_takers = TRUE + ) +) +dml_mliivm$fit(store_predictions = TRUE) +dml_mliivm$summary() +best_mliivm <- dml_mliivm$coef +best_std_mliivm <- dml_mliivm$se +``` + diff --git a/AC2/r-weak-iv-experiments.Rmd b/AC2/r-weak-iv-experiments.Rmd new file mode 100644 index 00000000..daf93cc7 --- /dev/null +++ b/AC2/r-weak-iv-experiments.Rmd @@ -0,0 +1,75 @@ +--- +title: An R Markdown document converted from "AC2/r-weak-iv-experiments.irnb" +output: html_document +--- + +# A Simple Example of Properties of IV estimator when Instruments are Weak + +```{r} +install.packages("hdm") +``` + +```{r} +library(hdm) +``` + +Simulation Design + +```{r} +# Simulation Design +set.seed(1) + +n <- 100 +beta <- .1 # .1 weak IV (change to 1.0 for strong IV) + +# One realization +U <- rnorm(n) +Z <- rnorm(n) # generate instrument +D <- beta * Z + U # generate endogenougs variable +Y <- D + U # the true causal effect is 1 + +summary(lm(D ~ Z)) # first stage is very weak here when we set beta = .1 +``` + +```{r} +summary(tsls(x = NULL, d = D, y = Y, z = Z)) +``` + +Note that the instrument is weak here (strength of the instrument is controlled by setting $\beta$) -- the t-stat is smaller than any rule-of-thumb suggested in the literature (e.g. $\sqrt{10}$) . + +# Run 10000 trials to evaluate distribution of the IV estimator + +```{r} +# Simulation Design + +set.seed(1) +B <- 10000 # trials +IVEst <- rep(0, B) + +for (i in 1:B) { + U <- rnorm(n) + Z <- rnorm(n) # generate instrument + D <- beta * Z + U # generate endogenougs variable + Y <- D + U # the true causal effect is 1 + IVEst[i] <- coef(tsls(x = NULL, d = D, y = Y, z = Z))[1, 1] +} +``` + +# Plot the Actual Distribution against the Normal Approximation (based on Strong Instrument Assumption) + +```{r} +plot(density(IVEst - 1, n = 1000, from = -5, to = 5), + col = 4, xlim = c(-5, 5), + xlab = "IV Estimator -True Effect", main = "Actual Distribution vs Gaussian" +) + +val <- seq(-5, 5, by = .05) +var <- (1 / beta^2) * (1 / 100) # theoretical variance of IV +sd <- sqrt(var) +lines(val, dnorm(val, sd = sd), col = 2, lty = 2) + +rejection_frequency <- sum((abs(IVEst - 1) / sd > 1.96)) / B + +cat(c("Rejection Frequency is ", rejection_frequency, " while we expect it to be .05")) +``` + diff --git a/CM1/r-rct-penn-precision-adj.Rmd b/CM1/r-rct-penn-precision-adj.Rmd new file mode 100644 index 00000000..5a2dc0fa --- /dev/null +++ b/CM1/r-rct-penn-precision-adj.Rmd @@ -0,0 +1,238 @@ +--- +title: An R Markdown document converted from "CM1/r-rct-penn-precision-adj.irnb" +output: html_document +--- + +# Analyzing RCT data with Precision Adjustment + +```{r} +install.packages("sandwich") +install.packages("lmtest") +install.packages("xtable") +install.packages("hdm") +``` + +```{r} +library(sandwich) +library(lmtest) +library(xtable) +library(hdm) +``` + +## Data + +In this lab, we analyze the Pennsylvania re-employment bonus experiment, which was previously studied in "Sequential testing of duration data: the case of the Pennsylvania ‘reemployment bonus’ experiment" (Bilias, 2000), among others. These experiments were conducted in the 1980s by the U.S. Department of Labor to test the incentive effects of alternative compensation schemes for unemployment insurance (UI). In these experiments, UI claimants were randomly assigned either to a control group or one of five treatment groups. Actually, there are six treatment groups in the experiments. Here we focus on treatment group 4, but feel free to explore other treatment groups. In the control group the current rules of the UI applied. Individuals in the treatment groups were offered a cash bonus if they found a job within some pre-specified period of time (qualification period), provided that the job was retained for a specified duration. The treatments differed in the level of the bonus, the length of the qualification period, and whether the bonus was declining over time in the qualification period; see http://qed.econ.queensu.ca/jae/2000-v15.6/bilias/readme.b.txt for further details on data. + + +```{r} +## loading the data +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/penn_jae.dat" +Penn <- as.data.frame(read.table(file, header = TRUE)) + +n <- dim(Penn)[1] +p_1 <- dim(Penn)[2] +Penn <- subset(Penn, tg == 4 | tg == 0) +attach(Penn) +``` + +```{r} +T4 <- (tg == 4) +summary(T4) +``` + +```{r} +head(Penn) +``` + +### Model +To evaluate the impact of the treatments on unemployment duration, we consider the linear regression model: + +$$ +Y = D \beta_1 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W')' = 0, +$$ + +where $Y$ is the log of duration of unemployment, $D$ is a treatment indicators, and $W$ is a set of controls including age group dummies, gender, race, number of dependents, quarter of the experiment, location within the state, existence of recall expectations, and type of occupation. Here $\beta_1$ is the ATE, if the RCT assumptions hold rigorously. + + +We also consider interactive regression model: + +$$ +Y = D \alpha_1 + D W' \alpha_2 + W'\beta_2 + \varepsilon, \quad E \varepsilon (D,W', DW')' = 0, +$$ +where $W$'s are demeaned (apart from the intercept), so that $\alpha_1$ is the ATE, if the RCT assumptions hold rigorously. + +Under RCT, the projection coefficient $\beta_1$ has +the interpretation of the causal effect of the treatment on +the average outcome. We thus refer to $\beta_1$ as the average +treatment effect (ATE). Note that the covariates, here are +independent of the treatment $D$, so we can identify $\beta_1$ by +just linear regression of $Y$ on $D$, without adding covariates. +However we do add covariates in an effort to improve the +precision of our estimates of the average treatment effect. + +### Analysis + +We consider + +* classical 2-sample approach, no adjustment (CL) +* classical linear regression adjustment (CRA) +* interactive regression adjusment (IRA) + +and carry out robust inference using the *estimatr* R packages. + +# Carry out covariate balance check + + +We first look at the coefficients individually with a $t$-test, and then we adjust the $p$-values to control for family-wise error. + +The regression below is done using "type='HC1'" which computes the correct Eicker-Huber-White standard errors, instead of the classical non-robust formula based on homoscedasticity. + +```{r} +data <- model.matrix(T4 ~ (female + black + othrace + factor(dep) + q2 + q3 + q4 + q5 + q6 + + agelt35 + agegt54 + durable + lusd + husd)^2) + +# individual t-tests +m <- lm(T4 ~ (female + black + othrace + factor(dep) + q2 + q3 + q4 + q5 + q6 + + agelt35 + agegt54 + durable + lusd + husd)^2, data = as.data.frame(data)) +coeftest(m, vcov = vcovHC(m, type = "HC1")) +``` + + + +To test balance conditions, we employ the Holm-Bonferroni step-down method. With 100+ hypotheses, the family-wise type I error, or the probability of making at least one type I error treating all hypotheses independently, is close to 1. To control for this, we adjust p-values with the following procedure. + +First, set $\alpha=0.05$ and denote the list of $n$ p-values from the regression with the vector $p$. + +1. Sort $p$ from smallest to largest, so $p_{(1)} \leq p_{(2)} \leq \cdots \leq p_{(n)}$. Denote the corresponding hypothesis for $p_{(i)}$ as $H_{(i)}$. +2. For $i=1,\ldots, n$, +- If $$p_{(i)} > \frac{\alpha}{n-i+1} $$ Break the loop and do not reject any $H_{(j)}$ for $j \geq i$. +- Else reject $H_{(i)}$ if $$p_{(i)} \leq \frac{\alpha}{n-i+1} $$ Increment $i := i+1$. + + + + +```{r} +holm_bonferroni <- function(p, alpha = 0.05) { + n <- length(p) + sig_beta <- c() + + for (i in 1:n) { + if (sort(p)[i] > alpha / (n - i + 1)) { + break + } else { + sig_beta <- c(sig_beta, order(p)[i]) + } + } + + return(sig_beta) +} + +p_values <- as.vector(coeftest(m, vcov = vcovHC(m, type = "HC1"))[, 4]) +significant_indices <- holm_bonferroni(p_values, alpha = 0.05) +print(paste("Significant Coefficients (Indices): ", significant_indices)) +``` + +There is also a built in R function to do this. + +```{r} +p_values <- as.vector(coeftest(m, vcov = vcovHC(m, type = "HC1"))[, 4]) +holm_reject <- p.adjust(sort(p_values), "holm") <= 0.05 +holm_reject +``` + +We see that that even though this is a randomized experiment, balance conditions are failed. + + +# Model Specification + +```{r} +# model specifications + +# no adjustment (2-sample approach) +formula_cl <- log(inuidur1) ~ T4 + +# adding controls +formula_cra <- log(inuidur1) ~ T4 + (female + black + othrace + factor(dep) + q2 + q3 + q4 + q5 + q6 + + agelt35 + agegt54 + durable + lusd + husd)^2 +# Omitted dummies: q1, nondurable, muld + +ols_cl <- lm(formula_cl) +ols_cra <- lm(formula_cra) + +ols_cl <- coeftest(ols_cl, vcov = vcovHC(ols_cl, type = "HC1")) +ols_cra <- coeftest(ols_cra, vcov = vcovHC(ols_cra, type = "HC1")) + +print(ols_cl) +print(ols_cra) +``` + +The interactive specificaiton corresponds to the approach introduced in Lin (2013). + +```{r} +# interactive regression model; + +X <- model.matrix(~ (female + black + othrace + factor(dep) + q2 + q3 + q4 + q5 + q6 + + agelt35 + agegt54 + durable + lusd + husd)^2)[, -1] +dim(X) + +demean <- function(x) { + x - mean(x) +} + +X <- apply(X, 2, demean) + +ols_ira <- lm(log(inuidur1) ~ T4 * X) +ols_ira <- coeftest(ols_ira, vcov = vcovHC(ols_ira, type = "HC1")) +print(ols_ira) +``` + +Next we try out partialling out with lasso + +```{r} +T4 <- demean(T4) + +DX <- model.matrix(~ T4 * X)[, -1] + +rlasso_ira <- summary(rlassoEffects(DX, log(inuidur1), index = 1)) + +print(rlasso_ira) +``` + +### Results + +```{r} +str(ols_ira) +ols_ira[2, 1] +``` + +```{r} +table <- matrix(0, 2, 4) +table[1, 1] <- ols_cl[2, 1] +table[1, 2] <- ols_cra[2, 1] +table[1, 3] <- ols_ira[2, 1] +table[1, 4] <- rlasso_ira[[1]][1] + +table[2, 1] <- ols_cl[2, 2] +table[2, 2] <- ols_cra[2, 2] +table[2, 3] <- ols_ira[2, 2] +table[2, 4] <- rlasso_ira[[1]][2] + + +colnames(table) <- c("CL", "CRA", "IRA", "IRA w Lasso") +rownames(table) <- c("estimate", "standard error") +tab <- xtable(table, digits = 5) +tab + +print(tab, type = "latex", digits = 5) +``` + +Treatment group 4 experiences an average decrease of about $7.8\%$ in the length of unemployment spell. + + +Observe that regression estimators delivers estimates that are slighly more efficient (lower standard errors) than the simple 2 mean estimator, but essentially all methods have very similar standard errors. From IRA results we also see that there is not any statistically detectable heterogeneity. We also see the regression estimators offer slightly lower estimates -- these difference occur perhaps to due minor imbalance in the treatment allocation, which the regression estimators try to correct. + + + diff --git a/CM1/r-rct-vaccines.Rmd b/CM1/r-rct-vaccines.Rmd new file mode 100644 index 00000000..43c78d85 --- /dev/null +++ b/CM1/r-rct-vaccines.Rmd @@ -0,0 +1,275 @@ +--- +title: An R Markdown document converted from "CM1/r-rct-vaccines.irnb" +output: html_document +--- + +This notebook contains some RCT examples for teaching. + +```{r} +install.packages("PropCIs") # Exact CI exploiting Bernoulli outcome using the Cornfield Procedure +``` + +```{r} +library(PropCIs) +``` + + +# Polio RCT + +One of the earliest randomized experiments were the Polio vaccination trials conducted by the Public Health Service in 1954. The question was whether Salk vaccine prevented polio. Children in the study were randomly assigned either a treatment (polio vaccine shot) or a placebo (saline solution shot), without knowing which one they received. The doctors in the study, making the diagnosis, did not know whether a child received a vaccine or not. In other words, the trial was a double-blind, randomized controlled trial. The trial had to be large, because the rate at which Polio occured in the population was 50 per 100,000. The treatment group saw 33 polio cases per 200,745; the control group saw 115 cases per 201,229. The estimated average treatment effect is about +$$ +-40 +$$ +with the 95% confidence band (based on approximate normality of the two sample means and their differences): +$$[-52, -28].$$ +As this is an RCT, the confidence band suggests that the Polio vaccine **caused** the reduction in the risk of polio. + +The interesting thing here is that we don't need the underlying individual data to evaluate the effectivess of the vaccine. This is because the outcomes are Bernoulli random variables, and we have enough information to compute the estimate of ATE as well as the confidence intervals from the group case counts. + +We also compute the Vaccine Efficacy metric, which refers to the following measure according to the [CDC](https://www.cdc.gov/csels/dsepd/ss1978/lesson3/section6.html): +$$ +VE = \frac{\text{Risk for Unvaccinated - Risk for Vaccinated}}{\text{Risk for Unvaccinated}}. +$$ +It describes the relative reduction in risk caused by vaccination. + + +It is staightforward to get the VE estimate by just plugging-in the numbers, but how do we get the approximate variance estimate? I am too lazy to do calculations for the delta method, so I will just use a simulation (a form of approximate bootstrap) to obtain the confidence intervals. + + +```{r} +NV <- 200745 # number of vaccinated (treated) +NU <- 201229 # number of unvaccinated (control) +RV <- 33 / NV # average outcome for vaccinated +RU <- 115 / NU # average outcome for unvaccinated +VE <- (RU - RV) / RU +# vaccine efficacy + +# incidence per 100000 +incidence_rv <- RV * 100000 +incidence_ru <- RU * 100000 + +print(paste("Incidence per 100000 among treated:", round(incidence_rv, 4))) + +print(paste("Incidence per 100000 among controlled:", round(incidence_ru, 4))) + +# treatment effect - estimated reduction in incidence per 100000 people +delta_hat <- 100000 * (RV - RU) + +print(paste("Estimated ATE of occurances per 100,000 is", round(delta_hat, 4))) + +# variance, standard deviation and confidence interval of ATE using that outcomes are Bernoulli +var_rv <- RV * (1 - RV) / NV +var_ru <- RU * (1 - RU) / NU +var_delta_hat <- 100000^2 * (var_rv + var_ru) +std_delta_hat <- sqrt(var_delta_hat) + +print(paste("Standard deviation for ATE is", round(std_delta_hat, 4))) + +ci_delta <- c( + delta_hat - 1.96 * sqrt(var_delta_hat), + delta_hat + 1.96 * sqrt(var_delta_hat) +) + +print(paste( + "95% confidence interval of ATE is [", round(ci_delta[1], 4), ",", + round(ci_delta[2], 4), "]" +)) + +print(paste("Overall VE is", round(VE, 4))) + +# we use an approximate bootstrap to find the confidence interval of vaccine efficacy +# via Monte Carlo draws +set.seed(1) +B <- 10000 # number of bootstraps +RVs <- RV + rnorm(B) * sqrt(var_rv) +RUs <- RU + rnorm(B) * sqrt(var_ru) +VEs <- (RUs - RVs) / RUs + +plot(density(VEs), col = 2, main = "Approximate Distribution of VE estimates") + +ci_ve <- quantile(VEs, c(.025, .975)) # use the empirical quantiles from the bootstraps + +print(paste( + "95% confidence interval of VE is [", round(ci_ve[1], 4), ",", + round(ci_ve[2], 4), "]" +)) +``` + +# Pfizer/BNTX Covid-19 RCT + +Here is a link to the FDA [briefing](https://www.fda.gov/media/144245/download) and an interesting [discussion]( +https://garycornell.com/2020/12/09/statistics-in-the-pfizer-data-how-good-is-the-vaccine/?fbclid=IwAR282lS0Vl3tWmicQDDhIJAQCMO8NIsCXyWbUWwTtPuKcnuJ2v0VWXRDQac), as well as data. + +Pfizer/BNTX was the first vaccine approved for emergency use to reduce the risk of Covid-19 decease. In studies to assess vaccine efficacy, volunteers were randomly assigned to receive either a treatment (2-dose vaccination) or a placebo, without knowing which they received. The doctors making the diagnoses did not know now whether a given volunteer received a vaccination or not. The results of the study are given in the following table. + +![](https://lh6.googleusercontent.com/oiO6gYom1UZyrOhgpFx2iq8ike979u3805JHiVygP-Efh1Yaz2ttyPcgWKlT1AqHDM4v46th3EPIkOvRLyXA0fNUloPL-mL9eOFmSAzfbNOHyCZSQ0DyzMhcFUtQuZ520R5Qd2lj): + +Here we see both the overall effects and the effects by age group. The confidence intervals for the overall ATE are tight and suggest high effectiveness of the vaccine. The confidence intervals for the age groups 65-74 and 75+ are much wider due to the relatively small number of people in these groups. We could group 65-74 and >75 groups to evaluate the effectiveness of the vaccine for this broader age group and narrow the width of the confidence band. + +We use the same approach as that for the Polio example. This gives slightly different results than the FDA result, because the FDA used inversion of exact binomial tests to construct confidence intervals. We use asymptotic approches based on approximate normality, which is more crude, but delivers a rather similar result. + +```{r} +NV <- 19965 +# number vaccinated +NU <- 20172 +# number unvaccinated +RV <- 9 / NV +# average outcome for vaccinated +RU <- 169 / NU +# average outcome for unvaccinated +VE <- (RU - RV) / RU +# vaccine efficacy + +# incidence per 100000 +incidence_rv <- RV * 100000 +incidence_ru <- RU * 100000 + +print(paste("Incidence per 100000 among vaccinated:", round(incidence_rv, 4))) + +print(paste("Incidence per 100000 among unvaccinated:", round(incidence_ru, 4))) + +# treatment effect - estimated reduction in incidence per 100000 people +delta_hat <- 100000 * (RV - RU) + +print(paste("Estimated ATE of occurances per 100,000 is", round(delta_hat, 4))) + +# variance, standard deviation and confidence interval of ATE using that outcomes are Bernoulli +var_rv <- RV * (1 - RV) / NV +var_ru <- RU * (1 - RU) / NU +var_delta_hat <- 100000^2 * (var_rv + var_ru) +std_delta_hat <- sqrt(var_delta_hat) + +print(paste("Standard deviation for ATE is", round(std_delta_hat, 4))) + +ci_delta <- c( + delta_hat - 1.96 * sqrt(var_delta_hat), + delta_hat + 1.96 * sqrt(var_delta_hat) +) + +print(paste( + "95% confidence interval of ATE is [", round(ci_delta[1], 4), ",", + round(ci_delta[2], 4), "]" +)) + +print(paste("Overall VE is", round(VE, 4))) + +# we use an approximate bootstrap to find the VE confidence interval +# using Monte Carlo draws as before +set.seed(1) +B <- 10000 +RVs <- RV + rnorm(B) * sqrt(var_rv) +RUs <- RU + rnorm(B) * sqrt(var_ru) +VEs <- (RUs - RVs) / RUs + +plot(density(VEs), col = 2, main = "Approximate Distribution of VE estimates") + +ci_ve <- quantile(VEs, c(.025, .975)) + +print(paste( + "95% confidence interval of VE is [", round(ci_ve[1], 4), ",", + round(ci_ve[2], 4), "]" +)) +``` + +In the code cell below we calculate the effectiveness of the vaccine for the two groups that are 65 or older + +```{r} +# Here we calculate the overall effectiveness of the vaccine for the two groups that are 65 or older +NV <- 3239 + 805 +NU <- 3255 + 812 +RV <- 1 / NV +RU <- (14 + 5) / NU +VE <- (RU - RV) / RU +print(paste("Overall VE is", round(VE, 4))) + +var_rv <- RV * (1 - RV) / NV +var_ru <- RU * (1 - RU) / NU + +# As before, we use an approximate bootstrap to find the confidence intervals +# using Monte Carlo draws + +set.seed(1) +B <- 10000 +RVs <- RV + rnorm(B) * sqrt(var_rv) + 10^(-10) +RUs <- RU + rnorm(B) * sqrt(var_ru) + 10^(-10) +VEs <- (RUs - RVs) / RUs + +plot(density(VEs), col = 2, main = "Approximate Distribution of VE estimates") + +ci_ve <- quantile(VEs, c(.025, .975)) + +print(paste( + "two-sided 95 % confidence interval is [", ci_ve[1], ",", + ci_ve[2], "]" +)) + +one_sided_ci_ve <- quantile(VEs, c(.05)) + +print(paste( + "one-sided 95 % confidence interval is [", one_sided_ci_ve[1], ",", + 1, "]" +)) +``` + +Let's try the parametric boostrap next, using the fact that the outcome is Bernouli. + +```{r} +NV <- 3239 + 805 +NU <- 3255 + 812 +RV <- 1 / NV +RU <- (14 + 5) / NU +VE <- (RU - RV) / RU +print(paste("Overall VE is", VE)) + +set.seed(1) +B <- 10000 # number of simulation draw +RVs <- rbinom(100000, size = NV, prob = RV) +RUs <- rbinom(100000, size = NU, prob = RU) +VEs <- (RUs - RVs) / RUs + +plot(density(VEs), col = 2, main = "Approximate Distribution of VE estimates") + +ci_ve <- quantile(VEs, c(.025, .975)) + +print(paste( + "two-sided 95 % confidence interval is [", ci_ve[1], ",", + ci_ve[2], "]" +)) + +one_sided_ci_ve <- quantile(VEs, c(.05)) + +print(paste("one-sided 95 % confidence interval is [", one_sided_ci_ve[1], ",", 1, "]")) +``` + +# Exact Binomial Test Inversion + +It is possible to use exact inference by inverting tests based on the exact binomial nature of the outcome variable. Here, we perform the Cornfield Procedure to find the exact confidence interval on the estimate of vaccine efficacy. + +```{r} +# Exact CI exploiting Bernoulli outcome using the Cornfield Procedure +NV <- 19965 +NU <- 20172 +RV <- 9 / NV +RU <- 169 / NU +VE <- (RU - RV) / RU + +1 - riskscoreci(9, NV, 169, NU, 0.95)$conf.int[2] +1 - riskscoreci(9, NV, 169, NU, 0.95)$conf.int[1] +``` + +Note that this exactly recovers the result in the FDA table (first row). + +Next we repeat the cornfield procedure to find the exact confidence interval on vaccine effectiveness for the two groups that are 65 or older. Here we see a big discrepancy between various asymptotic approaches and the exact finite-sample inference. This occurs because the binomial counts are too low for central limit theorems to work successfully. + +```{r} +# Exact CI exploiting Bernoulli outcome for the two groups that are 65 or older +NV <- 3239 + 805 +NU <- 3255 + 812 +RV <- 1 / NV +RU <- (14 + 5) / NU +VE <- (RU - RV) / RU + +1 - riskscoreci(1, NV, 19, NU, 0.95)$conf.int[2] +1 - riskscoreci(1, NV, 19, NU, 0.95)$conf.int[1] +``` + diff --git a/CM1/r-sim-precision-adj.Rmd b/CM1/r-sim-precision-adj.Rmd new file mode 100644 index 00000000..da358e95 --- /dev/null +++ b/CM1/r-sim-precision-adj.Rmd @@ -0,0 +1,117 @@ +--- +title: An R Markdown document converted from "CM1/r-sim-precision-adj.irnb" +output: html_document +--- + +# Analyzing RCT with Precision by Adjusting for Baseline Covariates + +```{r} +install.packages("sandwich") +install.packages("lmtest") +``` + +```{r} +library(sandwich) # heterokedasticity robust standard errors +library(lmtest) # coefficient testing +``` + +# Jonathan Roth's DGP + +Here we set up a DGP with heterogenous effects. In this example, which is due to Jonathan Roth, we have +$$ +E [Y(0) | Z] = - Z, \quad E [Y(1) |Z] = Z, \quad Z \sim N(0,1). +$$ +The CATE is +$$ +E [Y(1) - Y(0) | Z ]= 2 Z. +$$ +and the ATE is +$$ +2 E Z = 0. +$$ + +We would like to estimate the ATE as precisely as possible. + +An economic motivation for this example could be provided as follows: Let D be the treatment of going to college, and let $Z$ be academic skills. Suppose that academic skills cause lower earnings Y(0) in jobs that don't require a college degree, and cause higher earnings Y(1) in jobs that do require college degrees. This type of scenario is reflected in the DGP set-up above. + + +```{r} +# generate the simulated dataset +set.seed(123) # set MC seed +n <- 1000 # sample size +Z <- rnorm(n) # generate Z +Y0 <- -Z + rnorm(n) # conditional average baseline response is -Z +Y1 <- Z + rnorm(n) # conditional average treatment effect is +Z +D <- (runif(n) < .2) # treatment indicator; only 20% get treated +Y <- Y1 * D + Y0 * (1 - D) # observed Y +D <- D - mean(D) # demean D +Z <- Z - mean(Z) # demean Z +``` + +# Analyze the RCT data with Precision Adjustment + +Consider the follow regression models: + +* classical 2-sample approach, no adjustment (CL) +* classical linear regression adjustment (CRA) +* interactive regression adjusment (IRA) + +We carry out inference using heteroskedasticity robust inference, using the sandwich formulas for variance (Eicker-Huber-White). + +We observe that the CRA delivers estimates that are less efficient than the CL (pointed out by Freedman), whereas the IRA delivers a more efficient approach (pointed out by Lin). In order for the CRA to be more efficient than the CL, we need the linear model to be a correct model of the conditional expectation function of Y given D and X, which is not the case here. + +```{r} +# implement each of the models on the simulated data +CL <- lm(Y ~ D) +CRA <- lm(Y ~ D + Z) # classical +IRA <- lm(Y ~ D + Z + Z * D) # interactive approach + +# we are interested in the coefficients on variable "D". +coeftest(CL, vcov = vcovHC(CL, type = "HC1")) +coeftest(CRA, vcov = vcovHC(CRA, type = "HC1")) +coeftest(IRA, vcov = vcovHC(IRA, type = "HC1")) +``` + +# Using classical standard errors (non-robust) is misleading here. + +We don't teach non-robust standard errors in econometrics courses, but the default statistical inference for lm() procedure in R, summary.lm(), still uses 100-year old concepts, perhaps in part due to historical legacy. + +Here the non-robust standard errors suggest that there is not much difference between the different approaches, contrary to the conclusions reached using the robust standard errors. + +```{r} +summary(CL) +summary(CRA) +summary(IRA) +``` + +# Verify Asymptotic Approximations Hold in Finite-Sample Simulation Experiment + +```{r} +set.seed(123) +n <- 1000 +B <- 1000 + +CLs <- rep(0, B) +CRAs <- rep(0, B) +IRAs <- rep(0, B) + +for (i in 1:B) { + Z <- rnorm(n) + Y0 <- -Z + rnorm(n) + Y1 <- Z + rnorm(n) + Z <- Z - mean(Z) + D <- (runif(n) < .1) + D <- D - mean(D) + Y <- Y1 * D + Y0 * (1 - D) + CLs[i] <- lm(Y ~ D)$coef[2] + CRAs[i] <- lm(Y ~ D + Z)$coef[2] + IRAs[i] <- lm(Y ~ D + Z + Z * D)$coef[2] +} + +print("Standard deviations for estimators") + +sqrt(mean(CLs^2)) +sqrt(mean(CRAs^2)) +sqrt(mean(IRAs^2)) +``` + diff --git a/CM2/r-colliderbias-hollywood.Rmd b/CM2/r-colliderbias-hollywood.Rmd new file mode 100644 index 00000000..674a968e --- /dev/null +++ b/CM2/r-colliderbias-hollywood.Rmd @@ -0,0 +1,59 @@ +--- +title: An R Markdown document converted from "CM2/r-colliderbias-hollywood.irnb" +output: html_document +--- + +# Collider Bias + +Here is a simple mnemonic example to illustate the collider or M-bias. + +Here the idea is that people who get to Hollywood have to have high congenility = talent + beauty. Funnily enough this induces a negative correlation between talents and looks, when we condition on the set of actors or celebrities. This simple example explains an anecdotal observation that "talent and beauty are negatively correlated" for celebrities. + +```{r} +install.packages("dagitty") +``` + +```{r} +library(dagitty) +``` + +```{r} +causal_graph <- dagitty("dag{ Talent -> Congen <- Beauty }") +plot(causal_graph) +``` + +```{r} +# collider bias +n <- 1000000 +Talent <- rnorm(n) # talent +Beauty <- rnorm(n) # beauty +Congen <- Talent + Beauty + rnorm(n) # congeniality +TalentH <- subset(Talent, Congen > 0) # condition on Congen>0 +BeautyH <- subset(Beauty, Congen > 0) # condition on Congen>0 + +summary(lm(Talent ~ Beauty)) # regression of Talent on Beauty +summary(lm(Talent ~ Beauty + Congen)) # regression of Talent on Beauty and Congen +summary(lm(TalentH ~ BeautyH)) # regression of Talent on Beauty, conditional on Congen>0. +``` + +We can also use package Dagitty to illustrate collider bias, also known as M-bias. + +```{r} +## If we want to infer causal effect of Beauty on Talent, +## we can apply the command to figure out +## variables we should condition on: + +adjustmentSets(causal_graph, "Talent", "Beauty") + +## empty set -- we should not condition on the additional +## variable Congen. +``` + +```{r} +## Generate data where Congen = .5*Talent + .5*Beauty +set.seed(123) +data <- simulateSEM(causal_graph, .5) +confint(lm(Talent ~ Beauty, data))["Beauty", ] +confint(lm(Talent ~ Beauty + Congen, data))["Beauty", ] +``` + diff --git a/CM3/r-dagitty.Rmd b/CM3/r-dagitty.Rmd new file mode 100644 index 00000000..d7885b91 --- /dev/null +++ b/CM3/r-dagitty.Rmd @@ -0,0 +1,199 @@ +--- +title: An R Markdown document converted from "CM3/r-dagitty.irnb" +output: html_document +--- + +There are system packages that some of the R packages need. We install them here. + +```{r} +system("sudo add-apt-repository ppa:ubuntugis/ubuntugis-unstable") +system("sudo apt-get update") +system("sudo apt-get install libglpk-dev libgmp-dev libxml2-dev") +``` + +# Causal Identification in DAGs using Backdoor and Swigs, Equivalence Classes, Falsifiability Tests + +```{r} +# install and load package +install.packages("dagitty") +install.packages("ggdag") +``` + +```{r} +library(dagitty) +library(ggdag) +``` + +# Graph Generation and Plotting + +The following DAG is due to Judea Pearl + +```{r} +# generate a couple of DAGs and plot them + +G <- dagitty('dag{ +Z1 [pos="-2,-1.5"] +X1 [pos="-2,0"] +Z2 [pos="1.5,-1.5"] +X3 [pos="1.5, 0"] +Y [outcome,pos="1.5,1.5"] +D [exposure,pos="-2,1.5"] +M [mediator, pos="0,1.5"] +X2 [pos="0,0"] +Z1 -> X1 +X1 -> D +Z1 -> X2 +Z2 -> X3 +X3 -> Y +Z2 -> X2 +D -> Y +X2 -> Y +X2 -> D +M->Y +D->M +}') + + +ggdag(G) + theme_dag() +``` + +# Report Relatives of X2 + +```{r} +print(parents(G, "X2")) +print(children(G, "X2")) +print(ancestors(G, "X2")) +print(descendants(G, "X2")) +``` + +# Find Paths Between D and Y + + +```{r} +paths(G, "D", "Y") +``` + +# List All Testable Implications of the Model + +```{r} +print(impliedConditionalIndependencies(G)) +``` + +# Identification by Backdoor: List minimal adjustment sets to identify causal effecs $D \to Y$ + +```{r} +print(adjustmentSets(G, "D", "Y")) +``` + +# Identification via SWIG and D-separation + +```{r} +SWIG <- dagitty('dag{ +Z1 [pos="-2,-1.5"] +X1 [pos="-2,0"] +Z2 [pos="1.5,-1.5"] +X3 [pos="1.5, 0"] +Yd [outcome,pos="1.5,1.5"] +D [exposure,pos="-2,1.5"] +d [pos="-1, 1.5"] +Md [mediator, pos="0,1.5"] +X2 [pos="0,0"] +Z1 -> X1 +X1 -> D +Z1 -> X2 +Z2 -> X3 +X3 -> Yd +Z2 -> X2 +X2 -> Yd +X2 -> D +X3-> Yd +Md-> Yd +d-> Md +}') + +ggdag(SWIG) + theme_dag() +``` + + +# Deduce Conditional Exogeneity or Ignorability by D-separation + +```{r} +print(impliedConditionalIndependencies(SWIG)[5:8]) +``` + +This coincides with the backdoor criterion for this graph. + +# Print All Average Effects Identifiable by Conditioning + +```{r} +for (n in names(G)) { + for (m in children(G, n)) { + a <- adjustmentSets(G, n, m) + if (length(a) > 0) { + cat("The effect ", n, "->", m, " is identifiable by controlling for:\n", sep = "") + print(a, prefix = " * ") + } + } +} +``` + +# Equivalence Classes + +```{r} +P <- equivalenceClass(G) +plot(P) +``` + +Next Consider the elemntary Triangular Model: +$$ +D \to Y, \quad X \to (D,Y). +$$ +This model has no testable implications and is Markov-equivalent to any other DAG difined on names $(X, D, Y)$. + +```{r} +G3 <- dagitty("dag{ +D -> Y +X -> D +X -> Y +} +") + +ggdag(G3) + theme_dag() + +print(impliedConditionalIndependencies(G3)) +``` + +```{r} +P <- equivalenceClass(G3) +plot(P) +equivalentDAGs(G3, 10) +``` + +# Example of Testing DAG Validity + +Next we simulate the data from a Linear SEM associated to DAG G, and perform a test of conditional independence restrictions, exploting linearity. + + +There are many other options for nonlinear models and discrete categorical variabales. Type help(localTests). + +```{r} +set.seed(1) +x <- simulateSEM(G) +head(x) +localTests(G, data = x, type = c("cis")) +``` + +Next we replaced $D$ by $\bar D$ generated differently: +$$ +\bar D= (D + Y)/2. +$$ +$\bar D$ is an average of $D$ and $Y$ generated by $D$. We then test if the resulting collection of random variables satisifes conditional independence restrictions, exploiting linearity. We end up rejectiong these restrictions and therefore the validity of this model for the data generated in this way. This makes sense, because the new data no longer obeys the previous DAG structure. + + +```{r} +xR <- x +xR$D <- (x$D + x$Y) / 2 + +localTests(G, data = xR, type = c("cis")) +``` + diff --git a/CM3/r-dosearch.Rmd b/CM3/r-dosearch.Rmd new file mode 100644 index 00000000..b5d2b0e0 --- /dev/null +++ b/CM3/r-dosearch.Rmd @@ -0,0 +1,215 @@ +--- +title: An R Markdown document converted from "CM3/r-dosearch.irnb" +output: html_document +--- + +# Dosearch for Causal Identification in DAGs. + + +This a simple notebook for teaching that illustrates capabilites of the "dosearch" package, which is a great tool. + +NB. In my experience, the commands are sensitive to syntax ( e.g. spacing when -> are used), so be careful when changing to other examples. + +```{r} +install.packages("dosearch") +``` + +```{r} +library("dosearch") +``` + +We start with the simplest graph, with the simplest example +where $D$ is policy, $Y$ is outcomes, $X$ is a confounder: +$$ +D\to Y, \quad X \to (D,Y) +$$ + +Now suppose we want conditional average policy effect. + +```{r} +data <- "p(y, d, x)" # data structure + +query <- "p(y | do(d), x)" # query -- target parameter + +graph <- " + x -> y + x -> d + d -> y +" + +dosearch(data, query, graph) +``` + +This recovers the correct identification formula for law of the counterfactual $Y(d)$ induced by $do(D=d)$: +$$ +p_{Y(d)|X}(y|x) := p(y|do(d),x) = p(y|d,x). +$$ + +```{r} +data <- "p(y, d, x)" + +query <- "p(y | do(d))" + +graph <- " + x -> y + x -> d + d -> y +" + +dosearch(data, query, graph) +``` + +This recovers the correct identification formula: +$$ +p_{Y(d)}(y) := p(y: do(d)) = \sum_{x}\left(p(x)p(y|d,x)\right) +$$ +We integrate out $x$ in the previous formula. + +Suppose we don't observe the confounder. The effect is generally not identified. + +```{r} +data <- "p(y, d)" + +query <- "p(y | do(d))" + +graph <- " + x -> y + x -> d + d -> y +" + +dosearch(data, query, graph) +``` + +The next graph is an example of J. Pearl (different notation), where the graph is considerably more complicated. See "Pearl's Example" in the book - e.g. Figure 7.14. We are interested in $D \to Y$. + +Here we try conditioning on $X_2$. This would block one backdoor path from $D$ to $Y$, but would open another path on which $X_2$ is a collider, so this shouldn't work. The application below gave a correct answer (after I put the spacings carefully). + +```{r} +data <- "p(y, d, x2)" # observed only (Y, D, X_2) + +query <- "p(y | do(d))" # target parameter + +graph <- " + z1 -> x1 + z1 -> x2 + z2 -> x2 + z2 -> x3 + x2 -> d + x2 -> y + x3 -> y + x1 -> d + d -> m + m -> y +" + +dosearch(data, query, graph) +``` + +Intuitively, we should add more common causes. For example, adding $X_3$ and using $S = (X_2, X_3)$ should work. + +```{r} +data <- "p(y, d, x2, x3)" + +conditional_query <- "p(y | do(d), x2, x3)" # can ID conditional average effect? +query <- "p(y | do(d))" # can ID unconditional effect? + +graph <- " + z1 -> x1 + z1 -> x2 + z2 -> x2 + z2 -> x3 + x2 -> d + x2 -> y + x3 -> y + x1 -> d + d -> m + m -> y +" + +print(dosearch(data, conditional_query, graph)) +print(dosearch(data, query, graph)) +``` + +This retrieves the correct formulas for counterfactual distributions of $Y(d)$ induced by $Do(D=d)$: + +The conditional distribution is identified by +$$ +p_{Y(d) \mid X_2, X_3}(y) := p(y |x_2, x_3: do(d)) = p(y|x_2,x_3,d). +$$ + +The unconditional distribution is obtained by integration out over $x_2$ and $x_3$: + +$$ +p_{Y(d) }(y) := p(y do(d)) = \sum_{x2,x3}\left(p(x_2,x_3)p(y|x_2,x_3,d)\right). +$$ + + + +Next we suppose that we observe only $(Y,D, M)$. Can we identify the effect $D \to Y$? Can we use back-door-criterion twice to get $D \to M$ and $M \to Y$ affect? Yes, that's called front-door criterion -- so we just need to remember only the back-door and the fact that we can use it iteratively. + +```{r} +data <- "p(y, d, m)" + +query_dm <- "p(m | do(d))" +query_md <- "p(y | do(m))" +query <- "p(y | do(d))" + +graph <- " + z1 -> x1 + z1 -> x2 + z2 -> x2 + z2 -> x3 + x2 -> d + x2 -> y + x3 -> y + x1 -> d + d -> m + m -> y +" +print(dosearch(data, query_dm, graph)) +print(dosearch(data, query_md, graph)) +print(dosearch(data, query, graph)) +``` + +So we get identification results: +First, +$$ +p_{M(d)}(m) := p(m|do(d)) = p(m|d). +$$ +Second, +$$ +p_{Y(m)}(y) := p(y|do(m)) = \sum_{d}\left(p(d)p(y|d,m)\right), +$$ +and the last by integrating the product of these two formulas: +$$ +p_{Y(d)}(y) := p(y|do(d)) = \sum_{m}\left(p(m|d)\sum_{d}\left(p(d)p(y|d,m)\right)\right) +$$ + +The package is very rich and allows identification analysis, when the data comes from multiple sources. Suppose we observe marginal distributions $(Y,D)$ and $(D,M)$ only. Can we identify the effect of $D \to Y$. The answer is (guess) and the package correctly recovers it. + +```{r} +data <- "p(y, m) + p(m, d)" + +query_dm <- "p(m | do(d))" +query_md <- "p(y | do(m))" +query <- "p(y | do(d))" + +graph <- " + z1 -> x1 + z1 -> x2 + z2 -> x2 + z2 -> x3 + x2 -> d + x2 -> y + x3 -> y + x1 -> d + d -> m + m -> y +" +print(dosearch(data, query_dm, graph)) +print(dosearch(data, query_md, graph)) +print(dosearch(data, query, graph)) +``` + diff --git a/PM1/r-linear-model-overfitting.Rmd b/PM1/r-linear-model-overfitting.Rmd new file mode 100644 index 00000000..2dbba836 --- /dev/null +++ b/PM1/r-linear-model-overfitting.Rmd @@ -0,0 +1,64 @@ +--- +title: An R Markdown document converted from "PM1/r-linear-model-overfitting.irnb" +output: html_document +--- + +# Simple Exercise on Overfitting + +First set p=n + +```{r} + +set.seed(123) +n <- 1000 + +p <- n +X <- matrix(rnorm(n * p), n, p) +y <- rnorm(n) + +print("p/n is") +print(p / n) +print("R2 is") +print(summary(lm(y ~ X))$r.squared) +print("Adjusted R2 is") +print(summary(lm(y ~ X))$adj.r.squared) +``` + +Second, set p=n/2. + +```{r} + +set.seed(123) +n <- 1000 + +p <- n / 2 +X <- matrix(rnorm(n * p), n, p) +y <- rnorm(n) + +print("p/n is") +print(p / n) +print("R2 is") +print(summary(lm(y ~ X))$r.squared) +print("Adjusted R2 is") +print(summary(lm(y ~ X))$adj.r.squared) +``` + +Third, set p/n =.05 + +```{r} + +set.seed(123) +n <- 1000 + +p <- .05 * n +X <- matrix(rnorm(n * p), n, p) +y <- rnorm(n) + +print("p/n is") +print(p / n) +print("R2 is") +print(summary(lm(y ~ X))$r.squared) +print("Adjusted R2 is") +print(summary(lm(y ~ X))$adj.r.squared) +``` + diff --git a/PM1/r-ols-and-lasso-for-wage-gap-inference.Rmd b/PM1/r-ols-and-lasso-for-wage-gap-inference.Rmd new file mode 100644 index 00000000..ea7e6892 --- /dev/null +++ b/PM1/r-ols-and-lasso-for-wage-gap-inference.Rmd @@ -0,0 +1,398 @@ +--- +title: An R Markdown document converted from "PM1/r-ols-and-lasso-for-wage-gap-inference.irnb" +output: html_document +--- + +# An inferential problem: The Gender Wage Gap + +In the previous lab, we analyzed data from the March Supplement of the U.S. Current Population Survey (2015) and answered the question of how to use job-relevant characteristics, such as education and experience, to best predict wages. Now, we focus on the following inference question: + +What is the difference in predicted wages between men and women with the same job-relevant characteristics? + +Thus, we analyze if there is a difference in the payment of men and women (*gender wage gap*). The gender wage gap may partly reflect *discrimination* against women in the labor market or may partly reflect a *selection effect*, namely that women are relatively more likely to take on occupations that pay somewhat less (for example, school teaching). + +To investigate the gender wage gap, we consider the following log-linear regression model + +\begin{align} +\log(Y) &= \beta'X + \epsilon\\ +&= \beta_1 D + \beta_2' W + \epsilon, +\end{align} + +where $Y$ is hourly wage, $D$ is the indicator of being female ($1$ if female and $0$ otherwise) and the +$W$'s are a vector of worker characteristics explaining variation in wages. Considering transformed wages by the logarithm, we are analyzing the relative difference in the payment of men and women. + +```{r} +install.packages("xtable") +install.packages("hdm") # a library for high-dimensional metrics +install.packages("sandwich") # a package used to compute robust standard errors +``` + +```{r} +library(hdm) +library(xtable) +library(sandwich) +``` + +## Data analysis + +We consider the same subsample of the U.S. Current Population Survey (2015) as in the previous lab. Let us load the data set. + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv" +data <- read.csv(file) +dim(data) +``` + +To start our (causal) analysis, we compare the sample means given gender: + +```{r} +z <- data[which(colnames(data) %in% c("lwage", "sex", "shs", "hsg", "scl", "clg", + "ad", "ne", "mw", "so", "we", "exp1"))] + +data_female <- data[data$sex == 1, ] +z_female <- data_female[which(colnames(data) %in% c("lwage", "sex", "shs", "hsg", "scl", + "clg", "ad", "ne", "mw", "so", "we", "exp1"))] + +data_male <- data[data$sex == 0, ] +z_male <- data_male[which(colnames(data) %in% c("lwage", "sex", "shs", "hsg", "scl", "clg", "ad", + "ne", "mw", "so", "we", "exp1"))] + +table <- matrix(0, 12, 3) +table[1:12, 1] <- as.numeric(lapply(z, mean)) +table[1:12, 2] <- as.numeric(lapply(z_male, mean)) +table[1:12, 3] <- as.numeric(lapply(z_female, mean)) +rownames(table) <- c("Log Wage", "Sex", "Less then High School", "High School Graduate", "Some College", + "College Graduate", "Advanced Degree", "Northeast", "Midwest", "South", "West", "Experience") +colnames(table) <- c("All", "Men", "Women") +tab <- xtable(table, digits = 4) +tab +``` + +```{r} +print(tab, type = "html") # set type="latex" for printing table in LaTeX +``` + + + + + + + + + + + + + + + + + +
All Men Women
Log Wage 2.9708 2.9878 2.9495
Sex 0.4445 0.0000 1.0000
Less then High School 0.0233 0.0318 0.0127
High School Graduate 0.2439 0.2943 0.1809
Some College 0.2781 0.2733 0.2840
Gollage Graduate 0.3177 0.2940 0.3473
Advanced Degree 0.1371 0.1066 0.1752
Northeast 0.2596 0.2590 0.2604
Midwest 0.2965 0.2981 0.2945
South 0.2161 0.2209 0.2101
West 0.2278 0.2220 0.2350
Experience 13.7606 13.7840 13.7313
+ +In particular, the table above shows that the difference in average *logwage* between men and women is equal to $0.038$ + +```{r} +mean(data_female$lwage) - mean(data_male$lwage) +``` + +Thus, the unconditional gender wage gap is about $3,8$\% for the group of never married workers (women get paid less on average in our sample). We also observe that never married working women are relatively more educated than working men and have lower working experience. + +This unconditional (predictive) effect of gender equals the coefficient $\beta$ in the univariate ols regression of $Y$ on $D$: + +\begin{align} +\log(Y) &=\beta D + \epsilon. +\end{align} + +We verify this by running an ols regression in R. + +```{r} +nocontrol_fit <- lm(lwage ~ sex, data = data) +nocontrol_est <- summary(nocontrol_fit)$coef["sex", 1] +# HC - "heteroskedasticity cosistent" -- HC3 is the SE that remains consistent in high dimensions +hcv_coefs <- vcovHC(nocontrol_fit, type = "HC3") +nocontrol_se <- sqrt(diag(hcv_coefs))[2] # Estimated std errors + +# print unconditional effect of gender and the corresponding standard error +cat("The estimated coefficient on the dummy for gender is", nocontrol_est, + " and the corresponding robust standard error is", nocontrol_se) +``` + +Note that the standard error is computed with the *R* package *sandwich* to be robust to heteroskedasticity. + +Next, we run an ols regression of $Y$ on $(D,W)$ to control for the effect of covariates summarized in $W$: + +\begin{align} +\log(Y) &=\beta_1 D + \beta_2' W + \epsilon. +\end{align} + +Here, we are considering the flexible model from the previous lab. Hence, $W$ controls for experience, education, region, and occupation and industry indicators plus transformations and two-way interactions. + +Let us run the ols regression with controls. + +```{r} +# ols regression with controls + +flex <- lwage ~ sex + (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we) + +# Note that ()*() operation in formula objects in R creates a formula of the sort: +# '(exp1+exp2+exp3+exp4) + (shs+hsg+scl+clg+occ2+ind2+mw+so+we) +# + (exp1+exp2+exp3+exp4) * (shs+hsg+scl+clg+occ2+ind2+mw+so+we)' +# This is not intuitive at all, but that's what it does. + +control_fit <- lm(flex, data = data) +control_est <- summary(control_fit)$coef[2, 1] + +summary(control_fit) + +cat("Coefficient for OLS with controls", control_est) + +hcv_coefs <- vcovHC(control_fit, type = "HC3") +control_se <- sqrt(diag(hcv_coefs))[2] # Estimated std errors +``` + +The estimated regression coefficient $\beta_1\approx-0.0696$ measures how our linear prediction of wage changes if we set the gender variable $D$ from 0 to 1, holding the controls $W$ fixed. +We can call this the *predictive effect* (PE), as it measures the impact of a variable on the prediction we make. Overall, we see that the unconditional wage gap of size $4$\% for women increases to about $7$\% after controlling for worker characteristics. + +We now show how the conditional gap and the remainder decompose the marginal wage gap into the parts explained and unexplained by the additional controls. (Note that this does *not* explain why there is a difference in the controls to begin with in the two groups.) + +```{r} +xx0 <- model.matrix(~ (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we), + data = data[data$sex == 0, ]) +y0 <- data[data$sex == 0, ]$lwage +xx1 <- model.matrix(~ (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we), + data = data[data$sex == 1, ]) +y1 <- data[data$sex == 1, ]$lwage +mu1 <- colMeans(xx1) +mu0 <- colMeans(xx0) +betarest <- summary(control_fit)$coef[3:(ncol(xx0) + 1), 1] # the coefficients excluding intercept and "sex" + +cat("The marginal gap:", mean(data_female$lwage) - mean(data_male$lwage), "\n") +diff.unexplained <- control_est +cat("The unexplained difference: ", diff.unexplained, "\n") +diff.explained <- sum(betarest * (mu1[2:ncol(xx0)] - mu0[2:ncol(xx0)])) +cat("The explained difference:", diff.explained, "\n") +cat("The sum of these differences:", diff.unexplained + diff.explained, "\n") +``` + +We next consider a Oaxaca-Blinder decomposition that also incorporates an interaction term. + +```{r} +svd0 <- svd(xx0) +svd1 <- svd(xx1) +svd0$d[svd0$d <= 1e-10] <- 0 +svd0$d[svd0$d > 1e-10] <- 1 / svd0$d[svd0$d > 1e-10] +beta0 <- (svd0$v %*% (svd0$d * svd0$d * t(svd0$v))) %*% t(xx0) %*% y0 +svd1$d[svd1$d <= 1e-10] <- 0 +svd1$d[svd1$d > 1e-10] <- 1 / svd1$d[svd1$d > 1e-10] +beta1 <- (svd1$v %*% (svd1$d * svd1$d * t(svd1$v))) %*% t(xx1) %*% y1 + +cat("The marginal gap:", mean(data_female$lwage) - mean(data_male$lwage), "\n") +cat("The unexplained difference:", beta1[1] - beta0[1], "\n") +cat("The difference explained by endowment:", sum(beta0[2:ncol(xx0)] * (mu1[2:ncol(xx0)] - mu0[2:ncol(xx0)])), "\n") +cat("The difference explained by coefficient:", sum((beta1[2:ncol(xx0)] - beta0[2:ncol(xx0)]) * mu1[2:ncol(xx0)]), "\n") +cat("The sum of these differences:", + (beta1[1] - beta0[1] + sum(beta0[2:ncol(xx0)] * (mu1[2:ncol(xx0)] - mu0[2:ncol(xx0)])) + + sum((beta1[2:ncol(xx0)] - beta0[2:ncol(xx0)]) * mu1[2:ncol(xx0)])), "\n") +``` + +Next, we use the Frisch-Waugh-Lovell (FWL) theorem from lecture, partialling-out the linear effect of the controls via ols. + +```{r} +# Partialling-out using ols + +# model for Y +flex_y <- lwage ~ (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we) +# model for D +flex_d <- sex ~ (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we) + +# partialling-out the linear effect of W from Y +t_y <- lm(flex_y, data = data)$res +# partialling-out the linear effect of W from D +t_d <- lm(flex_d, data = data)$res + +# regression of Y on D after partialling-out the effect of W +partial_fit <- lm(t_y ~ t_d) +partial_est <- summary(partial_fit)$coef[2, 1] + +cat("Coefficient for D via partialling-out", partial_est) + +# standard error +hcv_coefs <- vcovHC(partial_fit, type = "HC3") +partial_se <- sqrt(diag(hcv_coefs))[2] + +# confidence interval +confint(partial_fit)[2, ] +``` + +Again, the estimated coefficient measures the linear predictive effect (PE) of $D$ on $Y$ after taking out the linear effect of $W$ on both of these variables. This coefficient is numerically equivalent to the estimated coefficient from the ols regression with controls, confirming the FWL theorem. + +We know that the partialling-out approach works well when the dimension of $W$ is low +in relation to the sample size $n$. When the dimension of $W$ is relatively high, we need to use variable selection +or penalization for regularization purposes. + +In the following, we illustrate the partialling-out approach using lasso instead of ols. + +```{r} +# Partialling-out using lasso + +# model for Y +flex_y <- lwage ~ (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we) +# model for D +flex_d <- sex ~ (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we) + +# partialling-out the linear effect of W from Y +t_y <- rlasso(flex_y, data = data)$res +# partialling-out the linear effect of W from D +t_d <- rlasso(flex_d, data = data)$res + +# regression of Y on D after partialling-out the effect of W +partial_lasso_fit <- lm(t_y ~ t_d) +partial_lasso_est <- summary(partial_lasso_fit)$coef[2, 1] + +cat("Coefficient for D via partialling-out using lasso", partial_lasso_est) + +# standard error +hcv_coefs <- vcovHC(partial_lasso_fit, type = "HC3") +partial_lasso_se <- sqrt(diag(hcv_coefs))[2] +``` + +Using lasso for partialling-out here provides similar results as using ols. + +Next, we summarize the results. + +```{r} +table <- matrix(0, 4, 2) +table[1, 1] <- nocontrol_est +table[1, 2] <- nocontrol_se +table[2, 1] <- control_est +table[2, 2] <- control_se +table[3, 1] <- partial_est +table[3, 2] <- partial_se +table[4, 1] <- partial_lasso_est +table[4, 2] <- partial_lasso_se +colnames(table) <- c("Estimate", "Std. Error") +rownames(table) <- c("Without controls", "full reg", "partial reg", "partial reg via lasso") +tab <- xtable(table, digits = c(3, 3, 4)) +tab +``` + +```{r} +print(tab, type = "html") +``` + + + + + + + + + +
Estimate Std. Error
Without controls -0.038 0.0159
full reg -0.070 0.0150
partial reg -0.070 0.0150
partial reg via lasso -0.072 0.0154
+ +It it worth noticing that controlling for worker characteristics increases the gender wage gap from less than 4\% to 7\%. The controls we used in our analysis include 5 educational attainment indicators (less than high school graduates, high school graduates, some college, college graduate, and advanced degree), 4 region indicators (midwest, south, west, and northeast); a quartic term (first, second, third, and fourth power) in experience and 22 occupation and 23 industry indicators. + +Keep in mind that the predictive effect (PE) does not only measures discrimination (causal effect of being female), it also may reflect +selection effects of unobserved differences in covariates between men and women in our sample. + +## OLS Overfitting + +Next we motivate the usage of lasso. We try an "extra" flexible model, where we take interactions of all controls, giving us about 1000 controls. To highlight the potential impact of overfitting on inference, we subset to the first 1000 observations so that $p \approx n$. + +```{r} +set.seed(2724) +subset_size <- 1000 +random <- sample(seq_len(nrow(data)), subset_size) +subset <- data[random, ] +``` + +For a linear model, the covariance matrix of the estimated $\hat{\beta}$ coefficients is given by $$\Sigma_{\hat{\beta}} = (X'X)^{-1} X' \Omega X (X'X)^{-1}$$ Under homoskedasticity, $\Omega = \sigma^2 I$ so $\Sigma_{\hat{\beta}}$ reduces to $\sigma^2(X'X)^{-1}$ with $\sigma^2$ estimated with the mean squared residuals. Under heteroskedasticity, $\Omega \neq \sigma^2 I$, so we must use an approach that yields valid standard errors. Under heteroskedasticity, there exists a variety of consistent "sandwich" estimators proposed for $\Sigma_{\hat{\beta}}$. With $e_i$ denoting the residual of observation $i: + +$ HC0 = (X'X)^{-1} X' \text{diag} [e_i^2] X(X'X)^{-1}$ + +$ HC1 = \frac{n}{n-p-1} (X'X)^{-1} X' \text{diag} [e_i^2] X(X'X)^{-1}$ + +$ HC2 = (X'X)^{-1} X' \text{diag} \left[\frac{e_i^2}{1-h_{ii}} \right] X(X'X)^{-1}$ + +$ HC3 = (X'X)^{-1} X' \text{diag} \left[\frac{e_i^2}{(1-h_{ii})^2} \right] X(X'X)^{-1}$ + + +For small sample sizes, the errors from HC0 are biased (usually downward). HC1 is a simple degree-of-freedom adjustment. HC2 is inspired by the insight that HC0's bias in finite samples results from points of high leverage in the design matrix $X$ (intuitively, outliers with respect to the independent variables). Thus, HC2 weights the $i$th squared residual by the reciprocal of $(1-h_{ii})$, with leverage values $h_{ii}$ as the $i$th diagonal element of the "hat" matrix $H = X(X'X)^{-1}X'$ to adjust for the finite-sample bias present in HC0. + +HC3 is similar to HC2, weighting by the squared $(1-h_{ii})^2$ in the denominator instead. HC3 is also equivalent to jackknife standard errors. HC3 has been shown to perform well regardless of the absence/presence of homoskedasticity and remains valid, in the sense of being biased upward under regularity conditions, in high dimensional settings. + +```{r} +# extra flexible model +extraflex <- lwage ~ sex + (exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we)^2 + +control_fit <- lm(extraflex, data = subset) +control_est <- summary(control_fit)$coef[2, 1] +cat("Number of Extra-Flex Controls", length(control_fit$coef) - 1, "\n") +cat("Coefficient for OLS with extra flex controls", control_est) + + +n <- subset_size +p <- length(control_fit$coef) + +# HC0 SE +hcv_coefs_hc0 <- vcovHC(control_fit, type = "HC0") +control_se_hc0 <- sqrt(diag(hcv_coefs_hc0))[2] + +# For a more correct approach, we +# would implement the approach of Cattaneo, Jannson, and Newey (2018)'s procedure. + +# Jackknife. Need to trim some leverages or otherwise regularize. Theory shouldn't +# really work here. +coefs <- hatvalues(control_fit) +trim <- 0.99999999999 +coefs_trimmed <- coefs * (coefs < trim) + trim * (coefs >= trim) +omega <- (control_fit$residuals^2) / ((1 - coefs_trimmed)^2) +hcv_coefs <- vcovHC(control_fit, omega = as.vector(omega), type = "HC3") +control_se_hc3 <- sqrt(diag(hcv_coefs))[2] +``` + +```{r} +# model for Y +extraflex_y <- lwage ~ (exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + C(occ2) + C(ind2))^2 +# model for D +extraflex_d <- sex ~ (exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + C(occ2) + C(ind2))^2 + +# partialling-out the linear effect of W from Y +t_y <- rlasso(extraflex_y, data = subset)$res +# partialling-out the linear effect of W from D +t_d <- rlasso(extraflex_d, data = subset)$res + +# regression of Y on D after partialling-out the effect of W +partial_lasso_fit <- lm(t_y ~ t_d) +partial_lasso_est <- summary(partial_lasso_fit)$coef[2, 1] + +cat("Coefficient for D via partialling-out using lasso", partial_lasso_est) + +# standard error +hcv_coefs <- vcovHC(partial_lasso_fit, type = "HC3") +partial_lasso_se <- sqrt(diag(hcv_coefs))[2] +``` + +```{r} +table <- matrix(0, 3, 2) +table[1, 1] <- control_est +table[1, 2] <- control_se_hc0 +table[2, 1] <- control_est +table[2, 2] <- control_se_hc3 +table[3, 1] <- partial_lasso_est +table[3, 2] <- partial_lasso_se +colnames(table) <- c("Estimate", "Std. Error") +rownames(table) <- c("full reg, HC0", "full reg, HC3", "partial reg via lasso") +tab <- xtable(table, digits = c(3, 3, 4)) +tab + +print(tab, type = "latex") +``` + +In this case $p/n \approx 1$, that is $p/n$ is no longer small and we start seeing the differences between +unregularized partialling out and regularized partialling out with lasso (double lasso). The results based on +double lasso have rigorous guarantees in this non-small p/n regime under approximate sparsity. The results based on OLS still +have guarantees in p/n< 1 regime under assumptions laid out in Cattaneo, Newey, and Jansson (2018), without approximate +sparsity, although other regularity conditions are needed. + diff --git a/PM1/r-ols-and-lasso-for-wage-prediction.Rmd b/PM1/r-ols-and-lasso-for-wage-prediction.Rmd new file mode 100644 index 00000000..cc690a06 --- /dev/null +++ b/PM1/r-ols-and-lasso-for-wage-prediction.Rmd @@ -0,0 +1,405 @@ +--- +title: An R Markdown document converted from "PM1/r-ols-and-lasso-for-wage-prediction.irnb" +output: html_document +--- + +## Introduction + +An important question in labor economics is what determines the wage of workers. This is a causal question, but we can begin to investigate it from a predictive perspective. + +In the following wage example, $Y$ is the (log) hourly wage of a worker and $X$ is a vector of worker's characteristics, e.g., education, experience, gender. Two main questions here are: + +* How can we use job-relevant characteristics, such as education and experience, to best predict wages? + +* What is the difference in predicted wages between men and women with the same job-relevant characteristics? + +In this lab, we focus on the prediction question first. + +## Data + +The data set we consider is from the 2015 March Supplement of the U.S. Current Population Survey. We select white non-hispanic individuals, aged 25 to 64 years, and working more than 35 hours per week for at least 50 weeks of the year. We exclude self-employed workers; individuals living in group quarters; individuals in the military, agricultural or private household sectors; individuals with inconsistent reports on earnings and employment status; individuals with allocated or missing information in any of the variables used in the analysis; and individuals with hourly wage below $3$. + +The variable of interest $Y$ is the (log) hourly wage rate constructed as the ratio of the annual earnings to the total number of hours worked, which is constructed in turn as the product of number of weeks worked and the usual number of hours worked per week. In our analysis, we also focus on single (never married) workers. The final sample is of size $n=5150$. + +```{r} +install.packages("xtable") +install.packages("hdm") # a library for high-dimensional metrics +install.packages("glmnet") # for lasso CV +``` + +```{r} +library(hdm) +library(xtable) +library(glmnet) +``` + +## Data analysis + +We start by loading the data set. + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv" +data <- read.csv(file) +dim(data) +``` + +Let's have a look at the structure of the data. + +```{r} +str(data) +``` + +We construct the output variable $Y$ and the matrix $Z$ which includes the characteristics of workers that are given in the data. + +```{r} +# construct matrices for estimation from the data +y <- log(data$wage) +n <- length(y) +Z <- data[- which(colnames(data) %in% c("wage", "lwage"))] +p <- dim(Z)[2] + +cat("Number of observations:", n, "\n") +cat("Number of raw regressors:", p) +``` + +For the outcome variable *wage* and a subset of the raw regressors, we calculate the empirical mean to get familiar with the data. + +```{r} +# generate a table of means of variables +Zsubset <- data[which(colnames(data) %in% c("lwage", "sex", "shs", "hsg", "scl", + "clg", "ad", "mw", "so", "we", "ne", "exp1"))] +table <- matrix(0, 12, 1) +table[1:12, 1] <- as.numeric(lapply(Zsubset, mean)) +rownames(table) <- c("Log Wage", "Sex", "Some High School", + "High School Graduate", "Some College", "College Graduate", + "Advanced Degree", "Midwest", "South", "West", "Northeast", "Experience") +colnames(table) <- c("Sample mean") +tab <- xtable(table, digits = 2) +tab +``` + +E.g., the share of female workers in our sample is ~44% ($sex=1$ if female). + +Alternatively, using the xtable package, we can also print the table in LaTeX. + +```{r} +print(tab, type = "latex") # type="latex" for printing table in LaTeX +``` + +## Prediction Question + +Now, we will construct a prediction rule for (log) hourly wage $Y$, which depends linearly on job-relevant characteristics $X$: + +\begin{equation} +Y = \beta'X+ \epsilon. +\end{equation} + +Our goals are + +* Predict wages using various characteristics of workers. + +* Assess the predictive performance of a given model using the (adjusted) sample MSE, the (adjusted) sample $R^2$ and the out-of-sample MSE and $R^2$. + + +Toward answering the latter, we measure the prediction quality of the two models via data splitting: + +- Randomly split the data into one training sample and one testing sample. Here we just use a simple method (stratified splitting is a more sophisticated version of splitting that we might consider). +- Use the training sample to estimate the parameters of the Basic Model and the Flexible Model. +- Before using the testing sample, we evaluate in-sample fit. + +```{r} +# splitting the data +set.seed(1) # to make the results replicable (we will generate random numbers) +random <- sample(1:n, floor(n * 4 / 5)) # draw (4/5)*n random numbers from 1 to n without replacing +train <- data[random, ] +test <- data[-random, ] +``` + + +We employ two different specifications for prediction: + + +1. Basic Model: $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, occupation and industry indicators and regional indicators). + + +2. Flexible Model: $X$ consists of all raw regressors from the basic model plus a dictionary of transformations (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions of a polynomial in experience with other regressors. An example of a regressor created through a two-way interaction is *experience* times the indicator of having a *college degree*. + +Using the **Flexible Model** enables us to approximate the real relationship by a more complex regression model and therefore to reduce the bias. The **Flexible Model** increases the range of potential shapes of the estimated regression function. In general, flexible models often deliver higher prediction accuracy but are harder to interpret. + +## Data-Splitting: In-sample performance + +Let us fit both models to our data by running ordinary least squares (ols): + +```{r} +# 1. basic model +basic <- lwage ~ (sex + exp1 + shs + hsg + scl + clg + mw + so + we + C(occ2) + C(ind2)) +regbasic <- lm(basic, data = train) # perform ols using the defined model +# number of regressors in the Basic Model +cat("Number of regressors in the basic model:", length(regbasic$coef), "\n") +``` + +##### Note that the basic model consists of $51$ regressors. + +```{r} +# 2. flexible model +flex <- lwage ~ sex + shs + hsg + scl + clg + mw + so + we + C(occ2) + C(ind2) + + (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we) +regflex <- lm(flex, data = train) +# number of regressors in the Flexible Model +cat("Number of regressors in the flexible model:", length(regflex$coef)) +``` + +##### Note that the flexible model consists of $246$ regressors. + +#### Re-estimating the flexible model using Lasso +We re-estimate the flexible model using Lasso (the least absolute shrinkage and selection operator) rather than ols. Lasso is a penalized regression method that can be used to reduce the complexity of a regression model when the ratio $p/n$ is not small. We will introduce this approach formally later in the course, but for now, we try it out here as a black-box method. + +```{r} +# Flexible model using Lasso, in-sample fit +train_flex <- model.matrix(flex, train) # all regressors +fit_lasso_cv <- cv.glmnet(as.matrix(train_flex), train$lwage, family = "gaussian", alpha = 1, nfolds = 5) +# in-sample fit right now, not out-of-sample using "test" +yhat_lasso_cv <- predict(fit_lasso_cv, newx = as.matrix(train_flex), s = "lambda.min") +``` + +#### Evaluating the predictive performance of the basic and flexible models in-sample +Now, we can evaluate the performance of both models based on the (adjusted) $R^2_{sample}$ and the (adjusted) $MSE_{sample}$: + +```{r} +# Assess predictive performance +sumbasic <- summary(regbasic) +sumflex <- summary(regflex) +# no summary() for lassocv + +ntrain <- nrow(train) + +# R-squared and adjusted R-squared +r2_1 <- sumbasic$r.squared +cat("R-squared for the basic model: ", r2_1, "\n") +r2_adj1 <- sumbasic$adj.r.squared +cat("adjusted R-squared for the basic model: ", r2_adj1, "\n") + +r2_2 <- sumflex$r.squared +cat("R-squared for the flexible model: ", r2_2, "\n") +r2_adj2 <- sumflex$adj.r.squared +cat("adjusted R-squared for the flexible model: ", r2_adj2, "\n") + +p_l <- fit_lasso_cv$nzero[fit_lasso_cv$index[1]] +r2_l <- 1 - sum((yhat_lasso_cv - train$lwage)^2) / sum((train$lwage - mean(train$lwage))^2) +cat("R-squared for the lasso with flexible model: ", r2_l, "\n") +r2_adj_l <- 1 - + (sum((yhat_lasso_cv - train$lwage)^2) / (ntrain - p_l - 1)) / + (sum((train$lwage - mean(train$lwage))^2) / (ntrain - 1)) +cat("adjusted R-squared for the flexible model: ", r2_adj_l, "\n") + +# MSE and adjusted MSE +mse1 <- mean(sumbasic$res^2) +cat("MSE for the basic model: ", mse1, "\n") +p1 <- sumbasic$df[1] # number of regressors +mse_adj1 <- (ntrain / (ntrain - p1)) * mse1 +cat("adjusted MSE for the basic model: ", mse_adj1, "\n") + +mse2 <- mean(sumflex$res^2) +cat("MSE for the flexible model: ", mse2, "\n") +p2 <- sumflex$df[1] +mse_adj2 <- (ntrain / (ntrain - p2)) * mse2 +cat("adjusted MSE for the lasso flexible model: ", mse_adj2, "\n") + +lasso_res <- train$lwage - yhat_lasso_cv +mse_l <- mean(lasso_res^2) +cat("MSE for the lasso flexible model: ", mse_l, "\n") +mse_adj_l <- (ntrain / (ntrain - p_l - 1)) * mse_l +cat("adjusted MSE for the lasso flexible model: ", mse_adj_l, "\n") +``` + +```{r} +# Output the table +table <- matrix(0, 3, 5) +table[1, 1:5] <- c(p1, r2_1, mse1, r2_adj1, mse_adj1) +table[2, 1:5] <- c(p2, r2_2, mse2, r2_adj2, mse_adj2) +table[3, 1:5] <- c(p_l, r2_l, mse_l, r2_adj_l, mse_adj_l) +colnames(table) <- c("p", "$R^2_{sample}$", "$MSE_{sample}$", "$R^2_{adjusted}$", "$MSE_{adjusted}$") +rownames(table) <- c("basic reg", "flexible reg", "lasso flex") +tab <- xtable(table, digits = c(0, 0, 2, 2, 2, 2)) +print(tab, type = "latex") +tab +``` + +Considering the measures above, the flexible model performs slightly better than the basic model. + +As $p/n$ is not large, the discrepancy between the adjusted and unadjusted measures is not large. However, if it were, we might still like to apply **data splitting** as a more general procedure to deal with potential overfitting if $p/n$. We illustrate the approach in the following. + +## Data Splitting: Out-of-sample performance + +Now that we have seen in-sample fit, we evaluate our models on the out-of-sample performance: +- Use the testing sample for evaluation. Predict the $\mathtt{wage}$ of every observation in the testing sample based on the estimated parameters in the training sample. +- Calculate the Mean Squared Prediction Error $MSE_{test}$ based on the testing sample for both prediction models. + +```{r} +# basic model +options(warn = -1) # ignore warnings +regbasic <- lm(basic, data = train) + +# calculating the out-of-sample MSE +yhat_bas <- predict(regbasic, newdata = test) +y_test <- test$lwage +mean_train <- mean(train$lwage) +mse_test1 <- sum((y_test - yhat_bas)^2) / length(y_test) +r2_test1 <- 1 - mse_test1 / mean((y_test - mean_train)^2) + +cat("Test MSE for the basic model: ", mse_test1, " ") +cat("Test R2 for the basic model: ", r2_test1) +``` + +In the basic model, the $MSE_{test}$ is quite close to the $MSE_{sample}$. + +```{r} +# flexible model +options(warn = -1) # ignore warnings +regflex <- lm(flex, data = train) + +# calculating the out-of-sample MSE +yhat_flex <- predict(regflex, newdata = test) +y_test <- test$lwage +mean_train <- mean(train$lwage) +mse_test2 <- sum((y_test - yhat_flex)^2) / length(y_test) +r2_test2 <- 1 - mse_test2 / mean((y_test - mean_train)^2) + +cat("Test MSE for the flexible model: ", mse_test2, " ") + +cat("Test R2 for the flexible model: ", r2_test2) +``` + +In the flexible model too, the discrepancy between the $MSE_{test}$ and the $MSE_{sample}$ is not large. + +It is worth noticing that the $MSE_{test}$ varies across different data splits. Hence, it is a good idea to average the out-of-sample MSE over different data splits to get valid results. + +Nevertheless, we observe that, based on the out-of-sample $MSE$, the basic model using ols regression performs **about as well (or slightly better)** than the flexible model. + +Next, let us use lasso regression in the flexible model instead of ols regression. The out-of-sample $MSE$ on the test sample can be computed for any black-box prediction method, so we also compare the performance of lasso regression in the flexible model to ols regression. + +```{r} +# Flexible model using Lasso +# model matrix should be formed before train/test as some levels dropped +flex_data <- model.matrix(flex, data) +train_flex <- flex_data[random, ] +test_flex <- flex_data[-random, ] + +fit_lasso_cv <- cv.glmnet(train_flex, train$lwage, family = "gaussian", alpha = 1, nfolds = 5) +yhat_lasso_cv <- predict(fit_lasso_cv, newx = test_flex, s = "lambda.min") + +# calculating the out-of-sample MSE +mse_lasso <- sum((y_test - yhat_lasso_cv)^2) / length(y_test) +r2_lasso <- 1 - mse_lasso / mean((y_test - mean(train$lwage))^2) + +cat("Test MSE for the lasso on flexible model: ", mse_lasso, " ") + +cat("Test R2 for the lasso flexible model: ", r2_lasso) +``` + +Finally, let us summarize the results: + +```{r} +# Output the comparison table +table2 <- matrix(0, 3, 2) +table2[1, 1] <- mse_test1 +table2[2, 1] <- mse_test2 +table2[3, 1] <- mse_lasso +table2[1, 2] <- r2_test1 +table2[2, 2] <- r2_test2 +table2[3, 2] <- r2_lasso + +rownames(table2) <- c("basic reg", "flexible reg", "lasso regression") +colnames(table2) <- c("$MSE_{test}$", "$R^2_{test}$") +tab2 <- xtable(table2, digits = 3) +tab2 +``` + +```{r} +print(tab2, type = "latex") +``` + +## Extra flexible model and Overfitting +Given the results above, it is not immediately clear why one would choose to use Lasso as results are fairly similar. To motivate, we consider an extra flexible model to show how OLS can overfit significantly to the in-sample train data and perform poorly on the out-of-sample testing data. + + +```{r} +# extra flexible model +extraflex <- lwage ~ sex + (exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we)^2 +regextra <- lm(extraflex, data = train) +sumextra <- summary(regextra) +cat("Number of Extra-Flex Controls", length(regextra$coef) - 1, "\n") +n <- length(data$wage) +p <- length(regextra$coef) +ntrain <- length(train$wage) +``` + +```{r} +## In-sample +# R-squared and adjusted R-squared +r2_extra <- sumextra$r.squared +cat("R-squared for the extra flexible model (in-sample): ", r2_extra, "\n") +r2_adjextra <- sumextra$adj.r.squared +cat("adjusted R-squared for the extra flexible model (in-sample): ", r2_adjextra, "\n") + +# MSE and adjusted MSE +mse_extra <- mean(sumextra$res^2) +cat("MSE for the extra flexible model (in-sample): ", mse_extra, "\n") +mse_adjextra <- (ntrain / (ntrain - p)) * mse_extra +cat("adjusted MSE for the basic model (in-sample): ", mse_adj1, "\n") +``` + +```{r} +## Out-of-sample +yhat_ex <- predict(regextra, newdata = test) +y_test_ex <- test$lwage +mse_test_ex <- sum((y_test_ex - yhat_ex)^2) / length(y_test_ex) +r2_test_ex <- 1 - mse_test_ex / mean((y_test_ex - mean(train$lwage))^2) + +cat("Test MSE for the basic model: ", mse_test_ex, " ") +cat("Test R2 for the basic model: ", r2_test_ex) +``` + +As we can see, a simple OLS overfits when the dimensionality of covariates is high, as the out-of-sample performance suffers dramatically in comparison to the in-sample performance. + +Contrast this with Lasso: + +```{r} +# model matrix should be formed before train/test as some levels dropped +flex_data <- model.matrix(extraflex, data) +train_flex <- flex_data[random, ] +test_flex <- flex_data[-random, ] + +# fit model +fit_lcv <- cv.glmnet(train_flex, train$lwage, family = "gaussian", alpha = 1, nfolds = 5) + +# in-sample +yhat_lcv <- predict(fit_lcv, newx = train_flex, s = "lambda.min") + +r2_l <- 1 - sum((yhat_lcv - train$lwage)^2) / sum((train$lwage - mean(train$lwage))^2) +p_l <- fit_lcv$nzero[fit_lcv$index[1]] +r2_adj_l <- 1 - + (sum((yhat_lcv - train$lwage) ^ 2) / (ntrain - p_l - 1)) / + (sum((train$lwage - mean(train$lwage))^2) / (ntrain - 1)) + +lasso_res <- train$lwage - yhat_lcv +mse_l <- mean(lasso_res^2) +mse_adj_l <- (ntrain / (ntrain - p_l - 1)) * mse_l + +cat("R-squared for the lasso with the extra flexible model (in-sample): ", r2_l, "\n") +cat("adjusted R-squared for the extra flexible model (in-sample): ", r2_adj_l, "\n") +cat("MSE for the lasso with the extra flexible model (in-sample): ", mse_l, "\n") +cat("adjusted MSE for the lasso with the extraflexible model (in-sample): ", mse_adj_l, "\n") + +# out-of-sample +yhat_lcv_test <- predict(fit_lcv, newx = test_flex, s = "lambda.min") +mse_lasso <- sum((test$lwage - yhat_lcv_test)^2) / length(test$lwage) +r2_lasso <- 1 - mse_lasso / mean((test$lwage - mean(train$lwage))^2) + +cat("\n") +cat("Test R2 for the lasso the extra flexible model: ", r2_lasso, "\n") +cat("Test MSE for the lasso on the extra flexible model: ", mse_lasso) +``` + +As shown above, the overfitting effect is mitigated with the penalized regression model. + diff --git a/PM2/r_convergence_hypothesis_double_lasso.Rmd b/PM2/r_convergence_hypothesis_double_lasso.Rmd new file mode 100644 index 00000000..0ab246fd --- /dev/null +++ b/PM2/r_convergence_hypothesis_double_lasso.Rmd @@ -0,0 +1,268 @@ +--- +title: An R Markdown document converted from "PM2/r_convergence_hypothesis_double_lasso.irnb" +output: html_document +--- + +# Testing the Convergence Hypothesis + +```{r} +install.packages("hdm") +install.packages("xtable") +install.packages("lmtest") +install.packages("sandwich") +install.packages("glmnet") +install.packages("ggplot2") +``` + +```{r} +library(hdm) +library(xtable) +library(lmtest) +library(sandwich) +library(glmnet) # For LassoCV +library(ggplot2) +``` + +## Introduction + +We provide an additional empirical example of partialling-out with Lasso to estimate the regression coefficient $\beta_1$ in the high-dimensional linear regression model: + $$ + Y = \beta_1 D + \beta_2'W + \epsilon. + $$ + +Specifically, we are interested in how the rates at which economies of different countries grow ($Y$) are related to the initial wealth levels in each country ($D$) controlling for country's institutional, educational, and other similar characteristics ($W$). + +The relationship is captured by $\beta_1$, the *speed of convergence/divergence*, which measures the speed at which poor countries catch up $(\beta_1< 0)$ or fall behind $(\beta_1> 0)$ rich countries, after controlling for $W$. Our inference question here is: do poor countries grow faster than rich countries, controlling for educational and other characteristics? In other words, is the speed of convergence negative: $ \beta_1 <0?$ This is the Convergence Hypothesis predicted by the Solow Growth Model. This is a structural economic model. Under some strong assumptions, that we won't state here, the predictive exercise we are doing here can be given causal interpretation. + +## Data Analysis + +We consider the data set GrowthData which is included in the package *hdm*. First, let us load the data set to get familiar with the data. + +```{r} +getdata <- function(...) { + e <- new.env() + name <- data(..., envir = e)[1] + e[[name]] +} + +# now load your data calling getdata() +growth <- getdata(GrowthData) +``` + +The sample contains $90$ countries and $63$ controls. + +```{r} +growth +``` + +Thus $p \approx 60$, $n=90$ and $p/n$ is not small. We expect the least squares method to provide a poor estimate of $\beta_1$. We expect the method based on partialling-out with Lasso to provide a high quality estimate of $\beta_1$. +To check this hypothesis, we analyze the relation between the output variable $Y$ and the other country's characteristics by running a linear regression in the first step. + +```{r} +## Create the outcome variable y and covariates x +y <- growth$Outcome +X <- growth[-which(colnames(growth) %in% c("intercept"))] +``` + +```{r} +fit <- lm(Outcome ~ ., data = X) +est <- summary(fit)$coef["gdpsh465", 1] + +hcv_coefs <- vcovHC(fit, type = "HC1") # HC - "heteroskedasticity cosistent" +se <- sqrt(diag(hcv_coefs))[2] # Estimated std errors + +# print unconditional effect of gdpsh465 and the corresponding standard error +cat("The estimated coefficient on gdpsh465 is", est, + " and the corresponding robust standard error is", se) + +# Calculate the 95% confidence interval for 'gdpsh465' +lower_ci <- est - 1.96 * se +upper_ci <- est + 1.96 * se + +cat("95% Confidence Interval: [", lower_ci, ",", upper_ci, "]") +``` + +## Summarize OLS results + +```{r} +# Create an empty data frame with column names +table <- data.frame( + Method = character(0), + Estimate = character(0), + `Std. Error` = numeric(0), + `Lower Bound CI` = numeric(0), + `Upper Bound CI` = numeric(0) +) + +# Add OLS results to the table +table <- rbind(table, c("OLS", est, se, lower_ci, upper_ci)) + +# Rename the columns to match the Python table +colnames(table) <- c("Method", "Estimate", "Std. Error", "lower bound CI", "upper bound CI") + +# Print the table +print(table) +``` + +Least squares provides a rather noisy estimate (high standard error) of the +speed of convergence, and does not allow us to answer the question +about the convergence hypothesis since the confidence interval includes zero. + +In contrast, we can use the partialling-out approach based on lasso regression ("Double Lasso"). + +```{r} +y <- growth$Outcome +W <- growth[-which(colnames(growth) %in% c("Outcome", "intercept", "gdpsh465"))] +D <- growth$gdpsh465 +``` + +## Method 1: Lasso with Theoretical Penalty using HDM + +While cross validation is commonly employed for choosing penalty parameters in Lasso, it can be very noisy and tends to choose relatively small penalty leading to some overfitting. For this reason, one should not use cross validation to choose tuning parameters unless sample splitting is employed. We illustrate the use of sample combined with cross validation in later chapters in the book. Since we are using the full sample here, it is much better (and theoretically valid) to use penalties that provably control overfitting, which is what we do here. + +We report the results using cross validation at the end of this notebook for comparison. There, we observe overfitting for the prediction of the outcome. + +```{r} +double_lasso <- function(y, D, W) { + + # residualize outcome with Lasso + yfit_rlasso <- hdm::rlasso(W, y, post = FALSE) + yhat_rlasso <- predict(yfit_rlasso, as.data.frame(W)) + yres <- y - as.numeric(yhat_rlasso) + + + # residualize treatment with Lasso + dfit_rlasso <- hdm::rlasso(W, D, post = FALSE) + dhat_rlasso <- predict(dfit_rlasso, as.data.frame(W)) + dres <- D - as.numeric(dhat_rlasso) + + # rest is the same as in the OLS case + hat <- mean(yres * dres) / mean(dres^2) + epsilon <- yres - hat * dres + V <- mean(epsilon^2 * dres^2) / mean(dres^2)^2 + stderr <- sqrt(V / length(y)) + + return(list(hat = hat, stderr = stderr)) +} +``` + +```{r} +results <- double_lasso(y, D, W) +hat <- results$hat +stderr <- results$stderr +# Calculate the 95% confidence interval +ci_lower <- hat - 1.96 * stderr +ci_upper <- hat + 1.96 * stderr +``` + +The least square method provides a rather noisy estimate of the speed of convergence. We can not answer the question if poor countries grow faster than rich countries. The least square method does not work when the ratio $p/n$ is large. + +In sharp contrast, partialling-out via Lasso provides a more precise estimate. The Lasso based point estimate is $-5\%$ and the $95\%$ confidence interval for the (annual) rate of convergence $[-7.8\%,-2.2\%]$ only includes negative numbers. This empirical evidence does support the convergence hypothesis. + +```{r} +# Add Double Lasso results to the table +table <- rbind(table, c("Double Lasso", hat, stderr, ci_lower, ci_upper)) + +# Print the table +print(table) +``` + +## Method 2: Lasso with Cross-Validation + +This section is for illustration purposes only. Given that we are using the full sample, cross validation *should not* be used for choosing tuning parameters here. Cross validation tends to (mildly) overfit, and this overfitting can lead to substantial problems when inference about parameters is the goal. + +```{r} +# Choose penalty based on KFold cross validation +set.seed(123) +# Given small sample size, we use an aggressive number of 20 folds +n_folds <- 20 + + +# Define LassoCV models for y and D +model_y <- cv.glmnet( + x = as.matrix(W), + y = y, + alpha = 1, # Lasso penalty + nfolds = n_folds, + family = "gaussian" +) + +model_d <- cv.glmnet( + x = as.matrix(W), + y = D, + alpha = 1, # Lasso penalty + nfolds = n_folds, + family = "gaussian" +) + +# Get the best lambda values for y and D +best_lambda_y <- model_y$lambda.min +best_lambda_d <- model_d$lambda.min + +# Fit Lasso models with the best lambda values +lasso_model_y <- glmnet(as.matrix(W), y, alpha = 1, lambda = best_lambda_y) +lasso_model_d <- glmnet(as.matrix(W), D, alpha = 1, lambda = best_lambda_d) + +# Calculate the residuals +res_y <- y - predict(lasso_model_y, s = best_lambda_y, newx = as.matrix(W)) +res_d <- D - predict(lasso_model_d, s = best_lambda_d, newx = as.matrix(W)) +``` + +```{r} +tmp_df <- as.data.frame(cbind(res_y, res_d)) +colnames(tmp_df) <- c("res_y", "res_d") +``` + +```{r} +fit_cv <- lm(res_y ~ res_d, data = tmp_df) +est_cv <- summary(fit_cv)$coef["res_d", 1] + +hcv_cv_coefs <- vcovHC(fit_cv, type = "HC1") # HC - "heteroskedasticity cosistent" +se_cv <- sqrt(diag(hcv_cv_coefs))[2] # Estimated std errors + +# Calculate the 95% confidence interval for 'gdpsh465' +lower_ci_cv <- est_cv - 1.96 * se_cv +upper_ci_cv <- est_cv + 1.96 * se_cv +``` + +```{r} +# Add LassoCV results to the table +table <- rbind(table, c("Double Lasso CV", est_cv, se_cv, lower_ci_cv, upper_ci_cv)) + +# Print the table +print(table) +``` + +We find that the outcome model chooses too small of a penalty based on cross-validation, leading to overfitting of the outcome and tiny outcome residuals. This leads to artificially small standard errors and a zero treatment effect. Theoretically driven penalty should be preferred for such small sample sizes. + +```{r} +# Create a data frame to store the results +results_y <- data.frame( + Alphas = model_y$lambda, + OutOfSampleR2 = 1 - model_y$cvm / var(y) +) + +results_d <- data.frame( + Alphas = model_d$lambda, + OutOfSampleR2 = 1 - model_d$cvm / var(D) +) + +# Plot Outcome Lasso-CV Model +ggplot(data = results_y, aes(x = Alphas, y = OutOfSampleR2)) + + geom_line() + + labs( + title = "Outcome Lasso-CV Model: Out-of-sample R-squared as function of penalty level", + x = "Penalty Level", + y = "Out-of-sample R-squared" + ) + +# Plot Treatment Lasso-CV Model +ggplot(data = results_d, aes(x = (Alphas), y = OutOfSampleR2)) + + geom_line() + + labs( + title = "Treatment Lasso-CV Model: Out-of-sample R-squared as function of penalty level", + x = "Penalty Level", + y = "Out-of-sample R-squared" + ) +``` + diff --git a/PM2/r_experiment_non_orthogonal.Rmd b/PM2/r_experiment_non_orthogonal.Rmd new file mode 100644 index 00000000..42778dec --- /dev/null +++ b/PM2/r_experiment_non_orthogonal.Rmd @@ -0,0 +1,516 @@ +--- +title: An R Markdown document converted from "PM2/r_experiment_non_orthogonal.irnb" +output: html_document +--- + +# Simulation Design + +```{r} +install.packages("hdm") +``` + +```{r} +library(hdm) +``` + +## Generating RCT data + +```{r} +gen_data <- function(n, d, p, delta, base) { + X <- matrix(rnorm(n * d), nrow = n, ncol = d) + D <- rbinom(n, 1, p) + y0 <- base - X[, 1] + rnorm(n, mean = 0, sd = 0.1) + y1 <- delta + base - X[, 1] + rnorm(n, mean = 0, sd = 0.1) + y <- y1 * D + y0 * (1 - D) + return(list(y = y, D = D, X = X)) +} +``` + +```{r} +n <- 100 # n samples +d <- 100 # n features +delta <- 1.0 # treatment effect +base <- 0.3 # baseline outcome +``` + +## Two Means Estimator + +```{r} +# Simple two means estimate and calculation of variance +twomeans <- function(y, D) { + hat0 <- mean(y[D == 0]) # mean of outcome of un-treated + hat1 <- mean(y[D == 1]) # mean of outcome of treated + V0 <- var(y[D == 0]) / mean(1 - D) # asymptotic variance of the mean of outcome of untreated + V1 <- var(y[D == 1]) / mean(D) # asymptotic variance of the mean of outcome of treated + hat <- hat1 - hat0 # estimate of the treatment effect + # standard error of the estimate of the treatment effect + stderr <- sqrt((V0 + V1) / length(y)) + return(list(hat = hat, stderr = stderr)) +} +``` + +```{r} +# Set the random seed for reproducibility +set.seed(125) + +# Generate RCT data +data <- gen_data(n, d, 0.2, delta, base) +y <- data$y +D <- data$D +X <- data$X + +# Calculate estimation quantities +twomeans(y, D) +``` + +## Partialling-Out Estimator + +```{r} +# We implement the partialling out version of OLS (for pedagogical purposes) +partialling_out <- function(y, D, W) { + # Residualize outcome with OLS + yfit <- lm(y ~ W) + yhat <- predict(yfit, as.data.frame(W)) + yres <- y - as.numeric(yhat) + + # Residualize treatment with OLS + Dfit <- lm(D ~ W) + Dhat <- predict(Dfit, as.data.frame(W)) + Dres <- D - as.numeric(Dhat) + + # Calculate final residual ~ residual OLS estimate + hat <- mean(yres * Dres) / mean(Dres^2) + + # Calculate residual of final regression (epsilon in the BLP decomposition) + epsilon <- yres - hat * Dres + + # Calculate variance of the treatment effect + V <- mean(epsilon^2 * Dres^2) / mean(Dres^2)^2 + stderr <- sqrt(V / length(y)) + + return(list(hat = hat, stderr = stderr)) +} +``` + +```{r} +partialling_out(y, D, cbind(D * X, X)) +``` + +## Double Lasso Partialling-Out Estimator + +```{r} +# Now we simply replace OLS with Lasso to implement the Double Lasso process + +double_lasso <- function(y, D, W) { + + # residualize outcome with Lasso + yfit_rlasso <- hdm::rlasso(W, y, post = FALSE) + yhat_rlasso <- predict(yfit_rlasso, as.data.frame(W)) + yres <- y - as.numeric(yhat_rlasso) + + + # residualize treatment with Lasso + dfit_rlasso <- hdm::rlasso(W, D, post = FALSE) + dhat_rlasso <- predict(dfit_rlasso, as.data.frame(W)) + Dres <- D - as.numeric(dhat_rlasso) + + # rest is the same as in the OLS case + hat <- mean(yres * Dres) / mean(Dres^2) + epsilon <- yres - hat * Dres + V <- mean(epsilon^2 * Dres^2) / mean(Dres^2)^2 + stderr <- sqrt(V / length(y)) + + return(list(hat = hat, stderr = stderr)) +} +``` + +```{r} +double_lasso(y, D, cbind(D * X, X)) +``` + +# Simulation + +### Two-Means + +```{r} +# We now check the distributional properties of the different estimators across experiments +# First is the simple two means estimate + +n_experiments <- 100 +# we will keep track of coverage (truth is in CI) and of the point estimate and stderr +cov <- numeric(n_experiments) +hats <- numeric(n_experiments) +stderrs <- numeric(n_experiments) + +for (i in 1:n_experiments) { + # Generate data for each experiment + data <- gen_data(n, d, 0.2, delta, base) + y <- data$y + D <- data$D + X <- data$X + + # Calculate two-means estimate + results <- twomeans(y, D) + hat <- results$hat + stderr <- results$stderr + + # Calculate the 95% confidence interval + ci_lower <- hat - 1.96 * stderr + ci_upper <- hat + 1.96 * stderr + + # Check if the true parameter delta is within the confidence interval + cov[i] <- (ci_lower <= delta) & (delta <= ci_upper) + + hats[i] <- hat + stderrs[i] <- stderr +} +``` + +```{r} +# Calculate average coverage (should be .95 ideally) +coverage_rate <- mean(cov) + +cat("Coverage Rate (95% CI):", coverage_rate, "\n") +``` + +```{r} +hist(hats, main = "Distribution of Estimates", xlab = "Estimate", col = "skyblue") +``` + +```{r} +mean(hats) # mean of estimate; measures how biased the estimate is (should be =delta ideally) +``` + +```{r} +sd(hats)# standard deviation of estimates; should be close to the standard errors we calculated for the CIs +``` + +```{r} +mean(stderrs) +``` + +### Partialling Out + +```{r} +# Let's repeat this for the partialling out process (OLS), controlling for X + +n_experiments <- 100 +cov <- numeric(n_experiments) +hats <- numeric(n_experiments) +stderrs <- numeric(n_experiments) + +for (i in 1:n_experiments) { + # Generate data for each experiment + data <- gen_data(n, d, 0.2, delta, base) + y <- data$y + D <- data$D + X <- data$X + + # Calculate partialling out estimate with OLS + results <- partialling_out(y, D, X) + hat <- results$hat + stderr <- results$stderr + + # Calculate the 95% confidence interval + ci_lower <- hat - 1.96 * stderr + ci_upper <- hat + 1.96 * stderr + + # Check if the true parameter delta is within the confidence interval + cov[i] <- (ci_lower <= delta) & (delta <= ci_upper) + + hats[i] <- hat + stderrs[i] <- stderr +} +``` + +```{r} +mean(cov) +``` + +```{r} +hist(hats, main = "Distribution of Estimates", xlab = "Estimate", col = "skyblue") +``` + +```{r} +mean(hats) # ols is heavily biased... mean of estimates very far from delta=1 +``` + +```{r} +sd(hats) +``` + +```{r} +mean(stderrs) # standard error severely under estimates the variance of the estimate; all this is due to overfitting +``` + +### Double Lasso + +```{r} +# Now let's try the double Lasso. +n_experiments <- 100 +cov <- numeric(n_experiments) +hats <- numeric(n_experiments) +stderrs <- numeric(n_experiments) + +for (i in 1:n_experiments) { + # Generate data for each experiment + data <- gen_data(n, d, 0.2, delta, base) + y <- data$y + D <- data$D + X <- data$X + + # Calculate partialling out estimate with OLS + results <- double_lasso(y, D, X) + hat <- results$hat + stderr <- results$stderr + + # Calculate the 95% confidence interval + ci_lower <- hat - 1.96 * stderr + ci_upper <- hat + 1.96 * stderr + + # Check if the true parameter delta is within the confidence interval + cov[i] <- (ci_lower <= delta) & (delta <= ci_upper) + + hats[i] <- hat + stderrs[i] <- stderr +} +``` + +```{r} +mean(cov) +``` + +```{r} +hist(hats, main = "Distribution of Estimates", xlab = "Estimate", col = "skyblue") +``` + +```{r} +mean(hats) # much closer to 1... (almost the same as two-means) +sd(hats) # standard deviation much smaller than two means, which did not adjust for X +mean(stderrs) # and close to the calculate standard errors; we correctly estimated uncertainty +``` + +### Single Lasso + +```{r} +# Now we simply replace OLS with Lasso to implement the Double Lasso process + +double_lasso <- function(y, D, W) { + + # residualize outcome with Lasso + yfit_rlasso <- hdm::rlasso(W, y, post = FALSE) + yhat_rlasso <- predict(yfit_rlasso, as.data.frame(W)) + yres <- y - as.numeric(yhat_rlasso) + + + # residualize treatment with Lasso + dfit_rlasso <- hdm::rlasso(W, D, post = FALSE) + dhat_rlasso <- predict(dfit_rlasso, as.data.frame(W)) + Dres <- D - as.numeric(dhat_rlasso) + + # rest is the same as in the OLS case + hat <- mean(yres * Dres) / mean(Dres^2) + epsilon <- yres - hat * Dres + V <- mean(epsilon^2 * Dres^2) / mean(Dres^2)^2 + stderr <- sqrt(V / length(y)) + + return(list(hat = hat, stderr = stderr)) +} +``` + +```{r} +# Now let's try the double Lasso. + +n_experiments <- 100 +hats <- numeric(n_experiments) + +for (i in 1:n_experiments) { + # Generate data for each experiment + data <- gen_data(n, d, 0.2, delta, base) + y <- data$y + D <- data$D + X <- data$X + + # Calculate single lasso estimate + + + yfit_rlasso <- hdm::rlasso(cbind(D, X), y, post = FALSE) + hat <- yfit_rlasso$coefficients[2] + + hats[i] <- hat +} +``` + +```{r} +hist(hats, main = "Distribution of Estimates", xlab = "Estimate", col = "skyblue") +``` + +```{r} +# bias is comparable and larger than standard deviation. +# Even if we could estimate the standard deviation, confidence intervals would undercover +1 - mean(hats) +sd(hats) +``` + +### Post-Lasso OLS + +```{r} +# Now let's try the post-Lasso. +n_experiments <- 100 +cov <- numeric(n_experiments) +hats <- numeric(n_experiments) +stderrs <- numeric(n_experiments) + +for (i in 1:n_experiments) { + # Generate data for each experiment + data <- gen_data(n, d, 0.2, delta, base) + y <- data$y + D <- data$D + X <- data$X + + + # run a big lasso y ~ D, X + DX <- cbind(D, X) + yfit_rlasso <- hdm::rlasso(DX, y, post = FALSE) # could just use this functionality + coefs <- yfit_rlasso$coefficients[2:n] + selected_columns <- X[, abs(coefs) > 0.0] + # run OLS on y ~ D, X[chosen by lasso] + # calculate standard error as if lasso step never happened + results <- partialling_out(y, D - mean(D), selected_columns) + hat <- results$hat + stderr <- results$stderr + + # Calculate the 95% confidence interval + ci_lower <- hat - 1.96 * stderr + ci_upper <- hat + 1.96 * stderr + # Check if the true parameter delta is within the confidence interval + cov[i] <- (ci_lower <= delta) & (delta <= ci_upper) + hats[i] <- hat + stderrs[i] <- stderr +} +``` + +```{r} +mean(cov) +``` + +```{r} +hist(hats, main = "Distribution of Estimates", xlab = "Estimate", col = "skyblue") +``` + +```{r} +1 - mean(hats) # quite un-biased; bias < standard deviation +sd(hats) +``` + +```{r} +# we under-estimated a bit the uncertainty; smaller estimated stderr than true std. +# this is most prob a finite sample error, from ignoring the lasso variable selection step +# this is an RCT and so even post lasso ols is Neyman orthogonal. We should expect good behavior. +mean(stderrs) +``` + +### Not RCT Data + +```{r} +gen_data_non_rct <- function(n, d, p, delta, base) { + X <- matrix(rnorm(n * d), nrow = n, ncol = d) + D <- X[, 1] + rnorm(n, mean = 0, sd = 1 / 4) + y <- delta * D + base - X[, 1] + rnorm(n, mean = 0, sd = 1) + return(list(y = y, D = D, X = X)) +} +``` + +```{r} +# post-lasso +n_experiments <- 100 +cov <- numeric(n_experiments) +hats <- numeric(n_experiments) +stderrs <- numeric(n_experiments) + +for (i in 1:n_experiments) { + # Generate data for each experiment + data <- gen_data_non_rct(n, d, p, delta, base) + y <- data$y + D <- data$D + X <- data$X + + + # run a big lasso y ~ D, X + DX <- cbind(D, X) + yfit_rlasso <- hdm::rlasso(DX, y, post = FALSE) # could just use this functionality + coefs <- yfit_rlasso$coefficients[2:n] + selected_columns <- X[, abs(coefs) > 0.0] + # run OLS on y ~ D, X[chosen by lasso] + # calculate standard error as if lasso step never happened + results <- partialling_out(y, D - mean(D), selected_columns) + hat <- results$hat + stderr <- results$stderr + + # Calculate the 95% confidence interval + ci_lower <- hat - 1.96 * stderr + ci_upper <- hat + 1.96 * stderr + # Check if the true parameter delta is within the confidence interval + cov[i] <- (ci_lower <= delta) & (delta <= ci_upper) + hats[i] <- hat + stderrs[i] <- stderr +} +``` + +```{r} +mean(cov) # Oops! Post Lasso OLS severely undercovers; It is not Neyman orthogonal when D is correlated with X +``` + +```{r} +hist(hats, main = "Distribution of Estimates", xlab = "Estimate", col = "skyblue") +``` + +```{r} +mean(hats) # very heavily biased +``` + +```{r} +# But now let's try the double Lasso. +n_experiments <- 100 +cov <- numeric(n_experiments) +hats <- numeric(n_experiments) +stderrs <- numeric(n_experiments) + +for (i in 1:n_experiments) { + # Generate data for each experiment + data <- gen_data_non_rct(n, d, p, delta, base) + y <- data$y + D <- data$D + X <- data$X + + # Calculate partialling out estimate with OLS + results <- double_lasso(y, D, X) + hat <- results$hat + stderr <- results$stderr + + # Calculate the 95% confidence interval + ci_lower <- hat - 1.96 * stderr + ci_upper <- hat + 1.96 * stderr + + # Check if the true parameter delta is within the confidence interval + cov[i] <- (ci_lower <= delta) & (delta <= ci_upper) + + hats[i] <- hat + stderrs[i] <- stderr +} +``` + +```{r} +mean(cov) # great coverage +``` + +```{r} +hist(hats, main = "Distribution of Estimates", xlab = "Estimate", col = "skyblue") +``` + +```{r} +1 - mean(hats) +sd(hats) # very small bias compared to standard deviation +mean(stderrs) +``` + diff --git a/PM2/r_heterogenous_wage_effects.Rmd b/PM2/r_heterogenous_wage_effects.Rmd new file mode 100644 index 00000000..d2709e27 --- /dev/null +++ b/PM2/r_heterogenous_wage_effects.Rmd @@ -0,0 +1,102 @@ +--- +title: An R Markdown document converted from "PM2/r_heterogenous_wage_effects.irnb" +output: html_document +--- + +# Application: Heterogeneous Effect of Sex on Wage Using Double Lasso + + We use US census data from the year 2015 to analyse the effect of gender and interaction effects of other variables with gender on wage jointly. The dependent variable is the logarithm of the wage, the target variable is *female* (in combination with other variables). All other variables denote some other socio-economic characteristics, e.g. marital status, education, and experience. + + + +This analysis allows a closer look how the gender wage gap is related to other socio-economic variables. + + +```{r} +install.packages("hdm") +install.packages("xtable") +``` + +```{r} +library(hdm) +library(xtable) +``` + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv" +data <- read.csv(file) +str(data) +data +``` + +```{r} +y <- data$lwage +Z <- subset(data, select = -c(lwage, wage)) +``` + +```{r} +center_colmeans <- function(x) { + xcenter <- colMeans(x) + x - rep(xcenter, rep.int(nrow(x), ncol(x))) +} +``` + +```{r} +# create the model matrix for the covariates +controls_formula <- "~ 0 + (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we + exp1 + exp2 + exp3 + exp4)**2" +Zcontrols <- model.matrix(as.formula(controls_formula), data = Z) +Zcontrols <- center_colmeans(Zcontrols) +``` + +Construct all the variables that we will use to model heterogeneity of effect in a linear manner + +```{r} +# create the model matrix for the linear heterogeneity +linear_het_formula <- "~ -1 + (shs + hsg + scl + clg + mw + so + we + exp1 + exp2 + exp3 + exp4)" +Zhet <- model.matrix(as.formula(linear_het_formula), data = Z) +Zhet <- center_colmeans(Zhet) +``` + +Construct all interaction variables between sex and heterogeneity variables + +```{r} +# create the model matrix for the higher order heterogeneity +Zhet <- as.data.frame(cbind(Zhet, "sex" = Z$sex)) +nonlin_het_formula <- "~ -1 + sex + sex * (shs + hsg + scl + clg + mw + so + we + exp1 + exp2 + exp3 + exp4)" +Zinteractions <- model.matrix(as.formula(nonlin_het_formula), data = Zhet) +interaction_cols <- Zinteractions[, grepl("sex", colnames(Zinteractions))] +``` + +Put variables all together + +```{r} +X <- cbind(Zinteractions, Zcontrols) +``` + +Get estimates and CIs + +```{r} +# this cell takes 30 minutes to run +index_gender <- grep("sex", colnames(Zinteractions)) +effects_female <- hdm::rlassoEffects(x = X, y = y, index = index_gender, post = FALSE) +result <- summary(effects_female) +result$coef +print(xtable(result$coef[, c(1, 2, 4)], type = "latex"), digits = 3) +``` + +Now, we estimate and plot confidence intervals, first "pointwise" and then the joint confidence intervals. + +```{r} +pointwise_ci <- confint(effects_female, level = 0.95) +pointwise_ci +print(xtable(pointwise_ci), type = "latex") +``` + +Finally, we compare the pointwise confidence intervals to joint confidence intervals. + +```{r} +joint_ci <- confint(effects_female, level = 0.95, joint = TRUE) +joint_ci +print(xtable(joint_ci), type = "latex") +``` + diff --git a/PM2/r_linear_penalized_regs.Rmd b/PM2/r_linear_penalized_regs.Rmd new file mode 100644 index 00000000..a2f52db0 --- /dev/null +++ b/PM2/r_linear_penalized_regs.Rmd @@ -0,0 +1,707 @@ +--- +title: An R Markdown document converted from "PM2/r_linear_penalized_regs.irnb" +output: html_document +--- + +# Penalized Linear Regressions: A Simulation Experiment + +```{r} +install.packages("xtable") +install.packages("hdm") +install.packages("glmnet") +install.packages("ggplot2") +install.packages("tidyr") +``` + +```{r} +library(hdm) +library(xtable) +library(glmnet) +library(ggplot2) +``` + +## Data Generating Process + +We define a simple data generating process that allows for sparse, dense, and sparse+dense coefficients + +```{r} +gen_data <- function(n, p, regime = "sparse") { + # constants chosen to get R^2 of approximately .80 + if (regime == "sparse") { + beta <- (1 / seq(1:p)^2) * 7 + } else if (regime == "dense") { + beta <- rnorm(p) * 0.5 + } else if (regime == "sparsedense") { + beta_1 <- (1 / seq(1:p)^2) * 6.5 + beta_2 <- rnorm(p, 0, 0.5) * 0.7 + beta <- beta_1 + beta_2 + } + + true_fn <- function(x) { + x[, seq_len(dim(x)[2])] %*% beta + } + + X <- matrix(runif(n * p, min = -0.5, max = 0.5), n, p) + gX <- true_fn(X) + y <- gX + rnorm(n) + + Xtest <- matrix(runif(n * p, min = -0.5, max = 0.5), n, p) + gXtest <- true_fn(Xtest) + ytest <- gXtest + rnorm(n) + + Xpop <- matrix(runif(100000 * p, min = -0.5, max = 0.5), 100000, p) + gXpop <- true_fn(Xpop) + ypop <- gXpop + rnorm(100000) + + return(list( + X = X, y = y, gX = gX, Xtest = Xtest, ytest = ytest, gXtest = gXtest, + Xpop = Xpop, ypop = ypop, gXpop = gXpop, beta = beta + )) +} +``` + +## Data Generating Process: Approximately Sparse + +```{r} +set.seed(1) +n <- 100 +p <- 400 +res <- gen_data(n, p, regime = "sparse") +``` + +```{r} +X <- res$X +y <- res$y +gX <- res$gX +Xtest <- res$Xtest +ytest <- res$ytest +gXtest <- res$gXtest +Xpop <- res$Xpop +ypop <- res$ypop +gXpop <- res$gXpop +betas <- res$beta +``` + +```{r} +plot(gX, y, xlab = "g(X)", ylab = "y") # plot V vs g(X) +print(c("theoretical R2:", var(gX) / var(y))) # theoretical R-square in the simulation example +``` + +```{r} +# Plot betas +plot(seq_along(betas), abs(betas), + log = "y", pch = 20, col = "blue", + xlab = expression(beta), ylab = "Magnitude (log scale)", + main = expression(paste("Beta Magnitude")) +) +``` + +## Lasso, Ridge, ElasticNet + +We use glmnet's penalized estimators, which choose the penalty parameter via cross-validation (by default 10-fold cross-validation). These methods search over an adaptively chosen grid of hyperparameters. The parameter `alpha` controls what penalty (or allows for a convex combination of `l1` and `l2` penalty). Set `alpha=0.5` for elastic net. + +Features will be standardized (by glmnet) so that penalization does not favor different features asymmetrically. + +```{r} +r2_score <- function(preds, actual, ytrain = y) { + rss <- sum((preds - actual)^2) # residual sum of squares + # total sum of squares, we take mean(ytrain) as mean(actual) is an out-of-sample object + tss <- sum((actual - mean(ytrain))^2) + rsq <- 1 - rss / tss + return(rsq) +} +``` + +```{r} +# family gaussian means that we'll be using square loss +fit_lasso_cv <- cv.glmnet(X, y, family = "gaussian", alpha = 1, nfolds = 5) +# family gaussian means that we'll be using square loss +fit_ridge <- cv.glmnet(X, y, family = "gaussian", alpha = 0, nfolds = 5) +# family gaussian means that we'll be using square loss +fit_elnet <- cv.glmnet(X, y, family = "gaussian", alpha = .5, nfolds = 5) +``` + +We calculate the R-squared on the small test set that we have + +```{r} +cat( + "lassocv R2 (Test): ", r2_score(predict(fit_lasso_cv, newx = Xtest, s = "lambda.min"), ytest), + "\nridge R2 (Test): ", r2_score(predict(fit_ridge, newx = Xtest, s = "lambda.min"), ytest), + "\nelnet R2 (Test): ", r2_score(predict(fit_elnet, newx = Xtest, s = "lambda.min"), ytest) +) +``` + +We also calculate what the R-squared would be in the population limit (in our case for practical purposes when we have a very very large test sample) + +```{r} +r2_lasso_cv <- r2_score(predict(fit_lasso_cv, newx = Xpop, s = "lambda.min"), ypop) +r2_ridge <- r2_score(predict(fit_ridge, newx = Xpop, s = "lambda.min"), ypop) +r2_elnet <- r2_score(predict(fit_elnet, newx = Xpop, s = "lambda.min"), ypop) + +cat( + "lassocv R2 (Pop): ", r2_lasso_cv, + "\nridge R2 (Pop): ", r2_ridge, + "\nelnet R2 (Pop): ", r2_elnet +) +``` + +#### glmnet failure in Ridge + +**Note**: Ridge regression performs worse relatively to the Ridge in the correponding [Python notebook](https://colab.research.google.com/github/CausalAIBook/MetricsMLNotebooks/blob/main/PM2/python_linear_penalized_regs.ipynb). Even if one were to control for the randomness in the data and use the same data for both, R's glmnet fails. + +To understand why, look at the cross-validated MSE curve with different $\lambda$ (). + +```{r} +plot(fit_ridge) +``` + +From the [glmnet documentation](https://glmnet.stanford.edu/articles/glmnet.html): + + + +> This plots the cross-validation curve (red dotted line) along with upper and lower standard deviation curves along the $\lambda$ sequence (error bars). Two special values along the $\lambda$ sequence are indicated by the vertical dotted lines. ```lambda.min``` is the value of $\lambda$ that gives minimum mean cross-validated error, while ```lambda.1se``` is the value of $\lambda$ that gives the most regularized model such that the cross-validated error is within one standard error of the minimum. + +Notice that the chosen ```lambda.min``` is at the boundary of the sequence. An easy way to check this instead of plotting is to extract the $\lambda$ sequence and the minimum chosen $\lambda_{min}$ from the fitted object. + +```{r} +cat("lambda sequence: ", fit_ridge$lambda) +cat("\nChosen minimum lambda: ", fit_ridge$lambda.min) +``` + +In general, it is good practice to examine the lambda sequence that R produces and searches over in cross-validation. When the penalty is chosen at the boundary like we see here, this indicates the generated penalty sequence is likely misspecified. Thus, we choose to supply our own sequence. In particular, we choose values that match up with those in Python's ```sklearn``` Ridge implementation. + + +```glmnet``` minimizes the elastic net loss function as follows: +$$\min_{\beta} \frac{1}{N} \| X\beta - y\|_2^2 + \lambda_{R} \left( \frac{1}{2} (1-\alpha) \|\beta\|_2^2 + \alpha \|\beta\|_1 \right) $$ + +For ridge, $\alpha=0$, so $$\min_{\beta} \frac{1}{N} \| X\beta - y\|_2^2 + \frac{\lambda_{R}}{2} \|\beta\|_2^2 $$ + +Meanwhile, ```sklearn``` minimizes $$\min_{\beta} \frac{1}{N} \|X\beta-y\|_2^2 + \frac{\lambda_{python}}{N} \|\beta\|_2^2$$ where $\lambda_{python}$ is chosen from the grid $(0.1,1,10)$. + +To translate this into R, we must set in glmnet $$\lambda_{R} :=\frac{2}{N} \lambda_{python}$$ + +```{r} +# sklearn lambdas (penalty) +lambdas_sklearn <- c(0.1, 1, 10) # defaults +l_seq <- 2 / nrow(X) * lambdas_sklearn +l_seq # note how different these are to the actual lambdas generated by glmnet +``` + +```{r} +fit_ridge <- cv.glmnet(X, y, family = "gaussian", alpha = 0, nfolds = 5, lambda = l_seq) +r2_ridge <- r2_score(predict(fit_ridge, newx = Xpop, s = "lambda.min"), ypop) +``` + +```{r} +cat( + "lassocv R2 (Pop): ", r2_lasso_cv, + "\nridge R2 (Pop): ", r2_ridge, + "\nelnet R2 (Pop): ", r2_elnet +) +``` + +## Plug-in Hyperparameter Lasso and Post-Lasso OLS + +Here we compute the lasso and ols post lasso using plug-in choices for penalty levels. + +\We use "plug-in" tuning with a theoretically valid choice of penalty $\lambda = 2 \cdot c \hat{\sigma} \sqrt{n} \Phi^{-1}(1-\alpha/2p)$, where $c>1$ and $1-\alpha$ is a confidence level, and $\Phi^{-1}$ denotes the quantile function. Under homoskedasticity, this choice ensures that the Lasso predictor is well behaved, delivering good predictive performance under approximate sparsity. In practice, this formula will work well even in the absence of homoskedasticity, especially when the random variables $\epsilon$ and $X$ in the regression equation decay quickly at the tails. + +In practice, many people choose to use cross-validation, which is perfectly fine for predictive tasks. However, when conducting inference, to make our analysis valid we will require cross-fitting in addition to cross-validation. As we have not yet discussed cross-fitting, we rely on this theoretically-driven penalty in order to allow for accurate inference in the upcoming notebooks. + +We pull an anaue of R's rlasso. Rlasso functionality: it is searching the right set of regressors. This function was made for the case of ***p*** regressors and ***n*** observations where ***p >>>> n***. It assumes that the error is i.i.d. The errors may be non-Gaussian or heteroscedastic.\ +The post lasso function makes OLS with the selected ***T*** regressors. +To select those parameters, they use $\lambda$ as variable to penalize\ +**Funny thing: the function rlasso was named like that because it is the "rigorous" Lasso.** + +```{r} +fit_rlasso <- hdm::rlasso(y ~ X, post = FALSE) # lasso with plug-in penalty level +fit_rlasso_post <- hdm::rlasso(y ~ X, post = TRUE) # post-lasso with plug-in penalty level +``` + +```{r} +r2_lasso <- r2_score(predict(fit_rlasso, newdata = Xtest), ytest) +r2_lasso_post <- r2_score(predict(fit_rlasso_post, newdata = Xtest), ytest) + +cat( + "rlasso R2 (Test): ", r2_lasso, + "\nrlasso-post R2 (Test): ", r2_lasso_post +) +``` + +```{r} +r2_lasso <- r2_score(predict(fit_rlasso, newdata = (Xpop)), (ypop)) +r2_lasso_post <- r2_score(predict(fit_rlasso_post, newdata = (Xpop)), (ypop)) + +cat( + "rlasso R2 (Pop): ", r2_lasso, + "\nrlasso-post R2 (Pop): ", r2_lasso_post +) +``` + +## LAVA: Dense + Sparse Coefficients + +Next we code up lava, which alternates the fitting of lasso and ridge + +```{r} +# Define function to compute lava estimator. Doing an iterative scheme with fixed +# number of iteration. Could iterate until a convergence criterion is met. +lava_predict <- function(X, Y, newX, lambda1, lambda2, iter = 5) { + + # Need to demean internally + dy <- Y - mean(Y) + dx <- scale(X, scale = FALSE) + + sp1 <- glmnet::glmnet(dx, dy, lambda = lambda1) # lasso step fits "sparse part" + de1 <- glmnet::glmnet(dx, dy - predict(sp1, newx = dx), alpha = 0, lambda = lambda2) + + i <- 1 + while (i <= iter) { + sp1 <- glmnet::glmnet(dx, dy - predict(de1, newx = dx, s = "lambda.min"), lambda = lambda1) + de1 <- glmnet::glmnet(dx, dy - predict(sp1, newx = dx, s = "lambda.min"), alpha = 0, lambda = lambda2) + i <- i + 1 + } + + bhat <- sp1$beta + de1$beta + a0 <- mean(Y) - sum(colMeans(X) * bhat) + + # Need to add intercept to output + + yhat <- newX %*% bhat + a0 + + return(yhat) +} +``` + +```{r} +# define function to get predictions and r2 scores for lava estimator + +lava_yhat_r2 <- function(xtr_mod, xte_mod, ytr, yte, num_folds = 5) { + # 5-fold CV. glmnet does cross-validation internally and + # relatively efficiently. We're going to write out all the steps to make sure + # we're using the same CV folds across all procedures in a transparent way and + # to keep the overall structure clear as well. + + # Setup for brute force K-Fold CV + n <- length(ytr) + Kf <- num_folds # Number of folds + sampleframe <- rep(1:Kf, ceiling(n / Kf)) + cvgroup <- sample(sampleframe, size = n, replace = FALSE) # CV groups + + + ## ------------------------------------------------------------ + # We're going to take a shortcut and use the range of lambda values that come out + # of the default implementation in glmnet for everything. Could do better here - maybe + + ## Fit ridge on grid of lambda values (chosen by default using glmnet) using basic model. + ridge_mod <- glmnet::glmnet(xtr_mod, ytr, alpha = 0) # alpha = 0 gives ridge + ridge_lambda <- ridge_mod$lambda # values of penalty parameter + + ## Fit lasso on grid of lambda values (chosen by default using glmnet) using basic model. + lasso_mod <- glmnet::glmnet(xtr_mod, ytr) # default is lasso (equivalent to alpha = 1) + lasso_lambda <- lasso_mod$lambda # values of penalty parameter + + ## ------------------------------------------------------------ + + + # Lava - Using a double loop over candidate penalty parameter values. + + lambda1_lava_mod <- lasso_mod$lambda[seq(5, length(lasso_lambda), 10)] + lambda2_lava_mod <- ridge_mod$lambda[seq(5, length(ridge_lambda), 10)] + + cv_mod_lava <- matrix(0, length(lambda1_lava_mod), length(lambda2_lava_mod)) + + for (k in 1:Kf) { + indk <- cvgroup == k + + k_xtr_mod <- xtr_mod[!indk, ] + k_ytr <- ytr[!indk] + k_xte_mod <- xtr_mod[indk, ] + k_yte <- ytr[indk] + + for (ii in seq_along(lambda1_lava_mod)) { + for (jj in seq_along(lambda2_lava_mod)) { + cv_mod_lava[ii, jj] <- cv_mod_lava[ii, jj] + + sum((k_yte - lava_predict(k_xtr_mod, k_ytr, + newX = k_xte_mod, + lambda1 = lambda1_lava_mod[ii], + lambda2 = lambda2_lava_mod[jj]))^2) + } + } + } + + # Get CV min values of tuning parameters + cvmin_lava_mod <- which(cv_mod_lava == min(cv_mod_lava), arr.ind = TRUE) + cvlambda1_lava_mod <- lambda1_lava_mod[cvmin_lava_mod[1]] + cvlambda2_lava_mod <- lambda2_lava_mod[cvmin_lava_mod[2]] + + cat("Min Lava Lasso CV Penalty: ", cvlambda1_lava_mod) + cat("\nMin Lava Ridge CV Penalty: ", cvlambda2_lava_mod) + + + #### Look at performance on test sample + + # Calculate R^2 in training data and in validation data as measures + # Refit on entire training sample + + + #### CV-min model + + # In sample fit + cvmin_yhat_lava_tr <- lava_predict(xtr_mod, ytr, + newX = xtr_mod, + lambda1 = cvlambda1_lava_mod, + lambda2 = cvlambda2_lava_mod + ) + r2_lava_mod <- 1 - sum((ytr - cvmin_yhat_lava_tr)^2) / sum((ytr - mean(ytr))^2) + + # Out of sample fit + cvmin_yhat_lava_test <- lava_predict(xtr_mod, ytr, + newX = xte_mod, + lambda1 = cvlambda1_lava_mod, + lambda2 = cvlambda2_lava_mod + ) + r2v_lava_mod <- 1 - sum((yte - cvmin_yhat_lava_test)^2) / sum((yte - mean(ytr))^2) + + + cat("\nIn sample R2 (CV-min): ", r2_lava_mod) + cat("\nOut of Sample R2 (CV-min): ", r2v_lava_mod) + + + #### Use average model across cv-folds and refit model using all training data + ###### we won't report these results. + ###### Averaging is theoretically more solid, but cv-min is more practical. + n_tr <- length(ytr) + n_te <- length(yte) + yhat_tr_lava_mod <- matrix(0, n_tr, Kf) + yhat_te_lava_mod <- matrix(0, n_te, Kf) + + + for (k in 1:Kf) { + indk <- cvgroup == k + + k_xtr_mod <- xtr_mod[!indk, ] + k_ytr <- ytr[!indk] + + # Lava + yhat_tr_lava_mod[, k] <- as.vector(lava_predict(k_xtr_mod, k_ytr, + newX = xtr_mod, + lambda1 = cvlambda1_lava_mod, + lambda2 = cvlambda2_lava_mod + )) + yhat_te_lava_mod[, k] <- as.vector(lava_predict(k_xtr_mod, k_ytr, + newX = xte_mod, + lambda1 = cvlambda1_lava_mod, + lambda2 = cvlambda2_lava_mod + )) + } + + avg_yhat_lava_tr <- rowMeans(yhat_tr_lava_mod) + avg_yhat_lava_test <- rowMeans(yhat_te_lava_mod) + + r2_cv_ave_lava_mod <- 1 - sum((ytr - avg_yhat_lava_tr)^2) / sum((ytr - mean(ytr))^2) + r2v_cv_ave_lava_mod <- 1 - sum((yte - avg_yhat_lava_test)^2) / sum((yte - mean(ytr))^2) + + cat("\nIn sample R2 (Average Across Folds): ", r2_cv_ave_lava_mod) + cat("\nOut of Sample R2 (Average Across Folds): ", r2v_cv_ave_lava_mod) + + return(c( + cvlambda1_lava_mod, + cvlambda2_lava_mod, + cvmin_yhat_lava_tr, # CV_min + cvmin_yhat_lava_test, # CV_min + r2_lava_mod, # CV_min + r2v_lava_mod, # CV_min + avg_yhat_lava_tr, # Average across Folds + avg_yhat_lava_test, # Average across Folds + r2_cv_ave_lava_mod, # Average across Folds + r2v_cv_ave_lava_mod # Average across Folds + )) +} +``` + +```{r} +# Results for Test +cat("Test Results ...\n") +r2_lava_traintest <- lava_yhat_r2(X, Xtest, y, ytest) +``` + +```{r} +# Results for Pop +## note we don't have to re-train the entire model +## this is just due to the way the function is defined above +cat("Population Results ...\n") +r2_lava_pop <- lava_yhat_r2(X, Xpop, y, ypop) +``` + +```{r} +# report R2 using CV min +cat("LAVA R2 (Test): ", r2_lava_traintest[[6]]) +cat("\nLAVA R2 (Pop) ", r2_lava_pop[[6]]) +``` + +## Summarizing Results + +```{r} +table <- matrix(0, 6, 1) +table[1, 1] <- r2_lasso_cv +table[2, 1] <- r2_ridge +table[3, 1] <- r2_elnet +table[4, 1] <- r2_lasso +table[5, 1] <- r2_lasso_post +table[6, 1] <- r2_lava_pop[[6]] + +colnames(table) <- c("R2 (Population)") +rownames(table) <- c( + "Cross-Validated Lasso", "Cross-Validated ridge", "Cross-Validated elnet", + "Lasso", "Post-Lasso", "Lava" +) +tab <- xtable(table, digits = 3) +print(tab, type = "latex") # set type="latex" for printing table in LaTeX +tab +``` + +```{r} +# Creating a data frame with the predicted values for test +data <- data.frame( + gXtest = gXtest, + Ridge = predict(fit_ridge, newx = Xtest, s = "lambda.min"), + ENet = predict(fit_elnet, newx = Xtest, s = "lambda.min"), + RLasso = predict(fit_rlasso, newdata = Xtest), + RLassoPost = predict(fit_rlasso_post, newdata = Xtest), + LassoCV = predict(fit_lasso_cv, newx = Xtest, s = "lambda.min"), + Lava = as.vector(r2_lava_traintest[[4]]) +) +colnames(data) <- c("gXtest", "Ridge", "ENet", "RLasso", "RlassoPost", "LassoCV", "Lava") + +# Reshaping data into longer format for ggplot +data_long <- tidyr::gather(data, Model, Predicted, -gXtest) + +# Plotting +ggplot(data_long, aes(x = gXtest, y = Predicted, color = Model)) + + geom_point(aes(shape = Model)) + + geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") + # gX by gX + scale_color_manual(values = c("brown", "yellow", "red", "green", "blue", "magenta"), + guide = guide_legend(title = "Model")) + + theme_minimal() + + labs( + title = "Comparison of Methods on Predicting gX", + x = "gXtest", + y = "Predictions" + ) + + guides(shape = "none") # Remove the shape legend +``` + +## Data Generating Process: Dense Coefficients + +```{r} +set.seed(1) +n <- 100 +p <- 400 +res <- gen_data(n, p, regime = "dense") + +X <- res$X +y <- res$y +gX <- res$gX +Xtest <- res$Xtest +ytest <- res$ytest +gXtest <- res$gXtest +Xpop <- res$Xpop +ypop <- res$ypop +gXpop <- res$gXpop +betas <- res$beta +``` + +```{r} +plot(gX, y, xlab = "g(X)", ylab = "y") # plot V vs g(X) +print(c("theoretical R2:", var(gX) / var(y))) # theoretical R-square in the simulation example +``` + +```{r} +# plot betas +plot(seq_along(betas), abs(betas), + log = "y", pch = 20, col = "blue", + xlab = expression(beta), ylab = "Magnitude (log scale)", + main = expression(paste("Beta Magnitude")) +) +``` + +```{r} +# family gaussian means that we'll be using square loss +fit_lasso_cv <- cv.glmnet(X, y, family = "gaussian", alpha = 1, nfolds = 5) +# family gaussian means that we'll be using square loss +fit_ridge <- cv.glmnet(X, y, family = "gaussian", alpha = 0, nfolds = 5) +# family gaussian means that we'll be using square loss +fit_elnet <- cv.glmnet(X, y, family = "gaussian", alpha = .5, nfolds = 5) +fit_rlasso <- hdm::rlasso(y ~ X, post = FALSE) # lasso with plug-in penalty level +fit_rlasso_post <- hdm::rlasso(y ~ X, post = TRUE) # post-lasso with plug-in penalty level + +r2_lasso_cv <- r2_score(predict(fit_lasso_cv, newx = Xpop, s = "lambda.min"), ypop) +r2_ridge <- r2_score(predict(fit_ridge, newx = Xpop, s = "lambda.min"), ypop) +r2_elnet <- r2_score(predict(fit_elnet, newx = Xpop, s = "lambda.min"), ypop) +r2_rlasso <- r2_score(predict(fit_rlasso, newdata = Xpop), ypop) +r2_rlasso_post <- r2_score(predict(fit_rlasso_post, newdata = Xpop), ypop) +r2_lava <- lava_yhat_r2(X, Xpop, y, ypop)[[6]] +``` + +```{r} +table <- matrix(0, 6, 1) +table[1, 1] <- r2_lasso_cv +table[2, 1] <- r2_ridge +table[3, 1] <- r2_elnet +table[4, 1] <- r2_rlasso +table[5, 1] <- r2_rlasso_post +table[6, 1] <- r2_lava + +colnames(table) <- c("R2") +rownames(table) <- c( + "Cross-Validated Lasso", "Cross-Validated ridge", "Cross-Validated elnet", + "Lasso", "Post-Lasso", "Lava" +) +tab <- xtable(table, digits = 3) +print(tab, type = "latex") # set type="latex" for printing table in LaTeX +tab +``` + +```{r} +# get lava prediction on test set for plot below +lava_yhat <- lava_yhat_r2(X, Xtest, y, ytest)[[4]] +``` + +```{r} +# Creating a data frame with the predicted values for test +data <- data.frame( + gXtest = gXtest, + Ridge = predict(fit_ridge, newx = Xtest, s = "lambda.min"), + ENet = predict(fit_elnet, newx = Xtest, s = "lambda.min"), + RLasso = predict(fit_rlasso, newdata = Xtest), + RLassoPost = predict(fit_rlasso_post, newdata = Xtest), + LassoCV = predict(fit_lasso_cv, newx = Xtest, s = "lambda.min"), + Lava = as.vector(lava_yhat) +) +colnames(data) <- c("gXtest", "Ridge", "ENet", "RLasso", "RlassoPost", "LassoCV", "Lava") + +# Reshaping data into longer format for ggplot +data_long <- tidyr::gather(data, Model, Predicted, -gXtest) + +# Plotting +ggplot(data_long, aes(x = gXtest, y = Predicted, color = Model)) + + geom_point(aes(shape = Model)) + + geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") + # gX by gX + scale_color_manual(values = c("brown", "yellow", "red", "green", "blue", "magenta"), + guide = guide_legend(title = "Model")) + + theme_minimal() + + labs( + title = "Comparison of Methods on Predicting gX", + x = "gXtest", + y = "Predictions" + ) + + guides(shape = "none") # Remove the shape legend +``` + +## Data Generating Process: Approximately Sparse + Small Dense Part + +```{r} +set.seed(1) +n <- 100 +p <- 400 +res <- gen_data(n, p, regime = "sparsedense") + +X <- res$X +y <- res$y +gX <- res$gX +Xtest <- res$Xtest +ytest <- res$ytest +gXtest <- res$gXtest +Xpop <- res$Xpop +ypop <- res$ypop +gXpop <- res$gXpop +betas <- res$beta +``` + +```{r} +plot(gX, y, xlab = "g(X)", ylab = "y") # plot V vs g(X) +print(c("theoretical R2:", var(gX) / var(y))) # theoretical R-square in the simulation example +``` + +```{r} +# plot betas +plot(seq_along(betas), abs(betas), + log = "y", pch = 20, col = "blue", + xlab = expression(beta), ylab = "Magnitude (log scale)", + main = expression(paste("Beta Magnitude")) +) +``` + +```{r} +# family gaussian means that we'll be using square loss +fit_lasso_cv <- cv.glmnet(X, y, family = "gaussian", alpha = 1, nfolds = 5) +# family gaussian means that we'll be using square loss +fit_ridge <- cv.glmnet(X, y, family = "gaussian", alpha = 0, nfolds = 5) +# family gaussian means that we'll be using square loss +fit_elnet <- cv.glmnet(X, y, family = "gaussian", alpha = .5, nfolds = 5) +fit_rlasso <- rlasso(y ~ X, post = FALSE) # lasso with plug-in penalty level +fit_rlasso_post <- rlasso(y ~ X, post = TRUE) # post-lasso with plug-in penalty level + +r2_lasso_cv <- r2_score(predict(fit_lasso_cv, newx = Xpop, s = "lambda.min"), ypop) +r2_ridge <- r2_score(predict(fit_ridge, newx = Xpop, s = "lambda.min"), ypop) +r2_elnet <- r2_score(predict(fit_elnet, newx = Xpop, s = "lambda.min"), ypop) +r2_rlasso <- r2_score(predict(fit_rlasso, newdata = Xpop), ypop) +r2_rlasso_post <- r2_score(predict(fit_rlasso_post, newdata = Xpop), ypop) +r2_lava <- lava_yhat_r2(X, Xpop, y, ypop)[[6]] +``` + +```{r} +table <- matrix(0, 6, 1) +table[1, 1] <- r2_lasso_cv +table[2, 1] <- r2_ridge +table[3, 1] <- r2_elnet +table[4, 1] <- r2_rlasso +table[5, 1] <- r2_rlasso_post +table[6, 1] <- r2_lava + +colnames(table) <- c("R2") +rownames(table) <- c( + "Cross-Validated Lasso", "Cross-Validated ridge", "Cross-Validated elnet", + "Lasso", "Post-Lasso", "Lava" +) +tab <- xtable(table, digits = 3) +print(tab, type = "latex") # set type="latex" for printing table in LaTeX +tab +``` + +```{r} +# get lava prediction on test set for plot below +lava_yhat <- lava_yhat_r2(X, Xtest, y, ytest)[[4]] +``` + +```{r} +# Creating a data frame with the predicted values for test +data <- data.frame( + gXtest = gXtest, + Ridge = predict(fit_ridge, newx = Xtest, s = "lambda.min"), + ENet = predict(fit_elnet, newx = Xtest, s = "lambda.min"), + RLasso = predict(fit_rlasso, newdata = Xtest), + RLassoPost = predict(fit_rlasso_post, newdata = Xtest), + LassoCV = predict(fit_lasso_cv, newx = Xtest, s = "lambda.min"), + Lava = as.vector(lava_yhat) +) +colnames(data) <- c("gXtest", "Ridge", "ENet", "RLasso", "RlassoPost", "LassoCV", "Lava") + +# Reshaping data into longer format for ggplot +data_long <- tidyr::gather(data, Model, Predicted, -gXtest) + +# Plotting +ggplot(data_long, aes(x = gXtest, y = Predicted, color = Model)) + + geom_point(aes(shape = Model)) + + geom_abline(slope = 1, intercept = 0, linetype = "dashed", color = "black") + # gX by gX + scale_color_manual(values = c("brown", "yellow", "red", "green", "blue", "magenta"), + guide = guide_legend(title = "Model")) + + theme_minimal() + + labs( + title = "Comparison of Methods on Predicting gX", + x = "gXtest", + y = "Predictions" + ) + + guides(shape = "none") # Remove the shape legend +``` + diff --git a/PM2/r_ml_for_wage_prediction.Rmd b/PM2/r_ml_for_wage_prediction.Rmd new file mode 100644 index 00000000..9856acde --- /dev/null +++ b/PM2/r_ml_for_wage_prediction.Rmd @@ -0,0 +1,447 @@ +--- +title: An R Markdown document converted from "PM2/r_ml_for_wage_prediction.irnb" +output: html_document +--- + +# A Simple Case Study using Wage Data from 2015 + +We illustrate how to predict an outcome variable $Y$ in a high-dimensional setting, where the number of covariates $p$ is large in relation to the sample size $n$. We use linear prediction rules for estimation, including OLS and the penalized linear methods we've studied. Later, we will also consider nonlinear prediction rules including tree-based methods and neural nets. + +```{r} +install.packages("xtable") +install.packages("hdm") +install.packages("glmnet") +install.packages("MLmetrics") +``` + +```{r} +library(hdm) +library(xtable) +library(glmnet) +library(MLmetrics) +``` + +## Data + +Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015. +The preproccessed sample consists of $5150$ never-married individuals. + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv" +data <- read.csv(file) +dim(data) +``` + +The outcomes $Y_i$'s are hourly (log) wages of never-married workers living in the U.S. The raw regressors $Z_i$'s consist of a variety of characteristics, including experience, education and industry and occupation indicators. + +```{r} +Z <- subset(data, select = -c(lwage, wage)) # regressors +colnames(Z) +``` + +The following figure shows the weekly wage distribution from the US survey data. + +```{r} +hist(data$wage, xlab = "hourly wage", main = "Empirical wage distribution from the US survey data", breaks = 35) +``` + +Wages show a high degree of skewness. Hence, wages are transformed in almost all studies by +the logarithm. + +## Analysis + +Due to the skewness of the data, we are considering log wages which leads to the following regression model + +$$log(wage) = g(Z) + \epsilon.$$ + +In this notebook, we will evaluate *linear* prediction rules. In later notebooks, we will also utilize nonlinear prediction methods. In linear models, we estimate the prediction rule of the form + +$$\hat g(Z) = \hat \beta'X.$$ + +Again, we generate $X$ in three ways: + +1. Basic Model: $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, regional indicators). + + +2. Flexible Model: $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., $\operatorname{exp}^2$ and $\operatorname{exp}^3$) and additional two-way interactions. + +3. Extra Flexible Model: $X$ takes the flexible model and takes all pairwise interactions. + +To evaluate the out-of-sample performance, we split the data first. + +```{r} +set.seed(1234) +training <- sample(nrow(data), nrow(data) * (3 / 4), replace = FALSE) + +data_train <- data[training, ] +data_test <- data[-training, ] +``` + +```{r} +y_train <- data_train$lwage +y_test <- data_test$lwage +``` + +We are starting by running a simple OLS regression. We fit the basic and flexible model to our training data by running an ols regression and compute the R-squared on the test sample + +As known from our first lab, the basic model consists of $51$ regressors and the flexible model of $246$ regressors. Let us fit our models to the training sample using the two different model specifications. We are starting by running a simple ols regression and computing the mean squared error and $R^2$ on the test sample. + +### Low dimensional specification (basic) + +```{r} +x_basic <- "sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2)+ C(ind2)" +formula_basic <- as.formula(paste("lwage", "~", x_basic)) +model_x_basic_train <- model.matrix(formula_basic, data_train) +model_x_basic_test <- model.matrix(formula_basic, data_test) +p_basic <- dim(model_x_basic_train)[2] +p_basic +``` + +```{r} +# ols (basic model) +fit_lm_basic <- lm(formula_basic, data_train) +# Compute the Out-Of-Sample Performance +yhat_lm_basic <- predict(fit_lm_basic, newdata = data_test) +cat("Basic model MSE (OLS): ", mean((y_test - yhat_lm_basic)^2)) # MSE OLS (basic model) +``` + +To determine the out-of-sample $MSE$ and the standard error in one step, we can use the function *lm*: + +```{r} +mse_lm_basic <- summary(lm((y_test - yhat_lm_basic)^2 ~ 1))$coef[1:2] +mse_lm_basic +``` + +We also compute the out-of-sample $R^2$: + +```{r} +r2_lm_basic <- 1 - mse_lm_basic[1] / var(y_test) +cat("Basic model R^2 (OLS): ", r2_lm_basic) # MSE OLS (basic model) +``` + +### High-dimensional specification (flexible) + +```{r} +x_flex <- paste("sex + exp1 + shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we ", + "+ (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we)") +formula_flex <- as.formula(paste("lwage", "~", x_flex)) +model_x_flex_train <- model.matrix(formula_flex, data_train) +model_x_flex_test <- model.matrix(formula_flex, data_test) +p_flex <- dim(model_x_flex_train)[2] +p_flex +``` + +We repeat the same procedure for the flexible model. + +```{r} +# ols (flexible model) +fit_lm_flex <- lm(formula_flex, data_train) +# Compute the Out-Of-Sample Performance +options(warn = -1) +yhat_lm_flex <- predict(fit_lm_flex, newdata = data_test) +mse_lm_flex <- summary(lm((y_test - yhat_lm_flex)^2 ~ 1))$coef[1:2] +r2_lm_flex <- 1 - mse_lm_flex[1] / var(y_test) +cat("Flexible model R^2 (OLS): ", r2_lm_flex) # MSE OLS (flexible model) +``` + +### Penalized regressions (flexible model) + + + + +We observe that ols regression works better for the basic model with smaller $p/n$ ratio. We proceed by running penalized regressions first for the flexible model, tuned via cross-validation. + +```{r} +fit_lasso_cv_flex <- cv.glmnet(model_x_flex_train, y_train, family = "gaussian", alpha = 1) +fit_ridge_flex <- cv.glmnet(model_x_flex_train, y_train, family = "gaussian", alpha = 0) +fit_elnet_flex <- cv.glmnet(model_x_flex_train, y_train, family = "gaussian", alpha = .5) + +yhat_lasso_cv_flex <- predict(fit_lasso_cv_flex, newx = model_x_flex_test) +yhat_ridge_flex <- predict(fit_ridge_flex, newx = model_x_flex_test) +yhat_elnet_flex <- predict(fit_elnet_flex, newx = model_x_flex_test) + +mse_lasso_cv_flex <- summary(lm((y_test - yhat_lasso_cv_flex)^2 ~ 1))$coef[1:2] +mse_ridge_flex <- summary(lm((y_test - yhat_ridge_flex)^2 ~ 1))$coef[1:2] +mse_elnet_flex <- summary(lm((y_test - yhat_elnet_flex)^2 ~ 1))$coef[1:2] + +r2_lasso_cv_flex <- 1 - mse_lasso_cv_flex[1] / var(y_test) +r2_ridge_flex <- 1 - mse_ridge_flex[1] / var(y_test) +r2_elnet_flex <- 1 - mse_elnet_flex[1] / var(y_test) + +# R^2 using cross-validation (flexible model) +cat("Flexible model R^2 (Lasso): ", r2_lasso_cv_flex) +cat("\nFlexible model R^2 (Ridge): ", r2_ridge_flex) +cat("\nFlexible model R^2 (Elastic Net): ", r2_elnet_flex) +``` + +We can also try a variant of the `l1` penalty, where the weight is chosen based on theoretical derivations. We use package *hdm* and the function *rlasso*, relying on a theoretical based choice of the penalty level $\lambda$ in the lasso regression. + +Specifically, we use "plug-in" tuning with a theoretically valid choice of penalty $\lambda = 2 \cdot c \hat{\sigma} \sqrt{n} \Phi^{-1}(1-\alpha/2p)$, where $c>1$ and $1-\alpha$ is a confidence level, $\Phi^{-1}$ denotes the quantile function, and $\hat{\sigma}$ is estimated in an iterative manner (see corresponding notes in book). Under homoskedasticity, this choice ensures that the Lasso predictor is well behaved, delivering good predictive performance under approximate sparsity. In practice, this formula will work well even in the absence of homoskedasticity, especially when the random variables $\epsilon$ and $X$ in the regression equation decay quickly at the tails. + +In practice, many people choose to use cross-validation, which is perfectly fine for predictive tasks. However, when conducting inference, to make our analysis valid we will require cross-fitting in addition to cross-validation. As we have not yet discussed cross-fitting, we rely on this theoretically-driven penalty in order to allow for accurate inference in the upcoming notebooks. + +Now, we repeat the same procedure for the flexible model. + +```{r} +fit_rlasso_flex <- hdm::rlasso(formula_flex, data_train, post = FALSE) +fit_rlasso_post_flex <- hdm::rlasso(formula_flex, data_train, post = TRUE) +yhat_rlasso_flex <- predict(fit_rlasso_flex, newdata = data_test) +yhat_rlasso_post_flex <- predict(fit_rlasso_post_flex, newdata = data_test) + +mse_lasso_flex <- summary(lm((y_test - yhat_rlasso_flex)^2 ~ 1))$coef[1:2] +mse_lasso_post_flex <- summary(lm((y_test - yhat_rlasso_post_flex)^2 ~ 1))$coef[1:2] + +r2_lasso_flex <- 1 - mse_lasso_flex[1] / var(y_test) +r2_lasso_post_flex <- 1 - mse_lasso_post_flex[1] / var(y_test) +# R^2 theoretically chosen penalty (flexible model) +cat("Flexible model R^2 (RLasso): ", r2_lasso_flex) +cat("\nFlexible model R^2 (RLasso post): ", r2_lasso_post_flex) +``` + +Finally, we try the combination of a sparse and a dense coefficient using the LAVA method + +```{r} +# Define function to compute lava estimator. Doing an iterative scheme with fixed +# number of iteration. Could iterate until a convergence criterion is met. +lava_predict <- function(X, Y, newX, lambda1, lambda2, iter = 5) { + + # Need to demean internally + dy <- Y - mean(Y) + dx <- scale(X, scale = FALSE) + + sp1 <- glmnet::glmnet(dx, dy, lambda = lambda1) # lasso step fits "sparse part" + de1 <- glmnet::glmnet(dx, dy - predict(sp1, newx = dx), alpha = 0, lambda = lambda2) + + i <- 1 + while (i <= iter) { + sp1 <- glmnet::glmnet(dx, dy - predict(de1, newx = dx, s = "lambda.min"), lambda = lambda1) + de1 <- glmnet::glmnet(dx, dy - predict(sp1, newx = dx, s = "lambda.min"), alpha = 0, lambda = lambda2) + i <- i + 1 + } + + bhat <- sp1$beta + de1$beta + a0 <- mean(Y) - sum(colMeans(X) * bhat) + + # Need to add intercept to output + + yhat <- newX %*% bhat + a0 + + return(yhat) +} +``` + +```{r} +# define function to get predictions and r2 scores for lava estimator +lava_yhat_r2 <- function(xtr_mod, xte_mod, ytr, yte, num_folds = 5) { + # 5-fold CV. glmnet does cross-validation internally and + # relatively efficiently. We're going to write out all the steps to make sure + # we're using the same CV folds across all procedures in a transparent way and + # to keep the overall structure clear as well. + + # Setup for brute force K-Fold CV + n <- length(ytr) + Kf <- num_folds # Number of folds + sampleframe <- rep(1:Kf, ceiling(n / Kf)) + cvgroup <- sample(sampleframe, size = n, replace = FALSE) # CV groups + + + ## ------------------------------------------------------------ + # We're going to take a shortcut and use the range of lambda values that come out + # of the default implementation in glmnet for everything. Could do better here - maybe + + ## Fit ridge on grid of lambda values (chosen by default using glmnet) using basic model. + ridge_mod <- glmnet::glmnet(xtr_mod, ytr, alpha = 0) # alpha = 0 gives ridge + ridge_lambda <- ridge_mod$lambda # values of penalty parameter + + ## Fit lasso on grid of lambda values (chosen by default using glmnet) using basic model. + lasso_mod <- glmnet::glmnet(xtr_mod, ytr) # default is lasso (equivalent to alpha = 1) + lasso_lambda <- lasso_mod$lambda # values of penalty parameter + + ## ------------------------------------------------------------ + + + # Lava - Using a double loop over candidate penalty parameter values. + + lambda1_lava_mod <- lasso_mod$lambda[seq(5, length(lasso_lambda), 10)] + lambda2_lava_mod <- ridge_mod$lambda[seq(5, length(ridge_lambda), 10)] + + cv_mod_lava <- matrix(0, length(lambda1_lava_mod), length(lambda2_lava_mod)) + + for (k in 1:Kf) { + indk <- cvgroup == k + + k_xtr_mod <- xtr_mod[!indk, ] + k_ytr <- ytr[!indk] + k_xte_mod <- xtr_mod[indk, ] + k_yte <- ytr[indk] + + for (ii in seq_along(lambda1_lava_mod)) { + for (jj in seq_along(lambda2_lava_mod)) { + cv_mod_lava[ii, jj] <- cv_mod_lava[ii, jj] + + sum((k_yte - lava_predict(k_xtr_mod, k_ytr, + newX = k_xte_mod, + lambda1 = lambda1_lava_mod[ii], + lambda2 = lambda2_lava_mod[jj]))^2) + } + } + } + + # Get CV min values of tuning parameters + cvmin_lava_mod <- which(cv_mod_lava == min(cv_mod_lava), arr.ind = TRUE) + cvlambda1_lava_mod <- lambda1_lava_mod[cvmin_lava_mod[1]] + cvlambda2_lava_mod <- lambda2_lava_mod[cvmin_lava_mod[2]] + + #### Look at performance on test sample + + # Calculate R^2 in training data and in validation data as measures + # Refit on entire training sample + + #### CV-min model + + # In sample fit + cvmin_yhat_lava_tr <- lava_predict(xtr_mod, ytr, + newX = xtr_mod, + lambda1 = cvlambda1_lava_mod, + lambda2 = cvlambda2_lava_mod + ) + r2_lava_mod <- 1 - sum((ytr - cvmin_yhat_lava_tr)^2) / sum((ytr - mean(ytr))^2) + + # Out of sample fit + cvmin_yhat_lava_test <- lava_predict(xtr_mod, ytr, + newX = xte_mod, + lambda1 = cvlambda1_lava_mod, + lambda2 = cvlambda2_lava_mod + ) + r2v_lava_mod <- 1 - sum((yte - cvmin_yhat_lava_test)^2) / sum((yte - mean(ytr))^2) + + #### Use average model across cv-folds and refit model using all training data + ###### we won't report these results. + ###### Averaging is theoretically more solid, but cv-min is more practical. + n_tr <- length(ytr) + n_te <- length(yte) + yhat_tr_lava_mod <- matrix(0, n_tr, Kf) + yhat_te_lava_mod <- matrix(0, n_te, Kf) + + + for (k in 1:Kf) { + indk <- cvgroup == k + + k_xtr_mod <- xtr_mod[!indk, ] + k_ytr <- ytr[!indk] + + # Lava + yhat_tr_lava_mod[, k] <- as.vector(lava_predict(k_xtr_mod, k_ytr, + newX = xtr_mod, + lambda1 = cvlambda1_lava_mod, + lambda2 = cvlambda2_lava_mod + )) + yhat_te_lava_mod[, k] <- as.vector(lava_predict(k_xtr_mod, k_ytr, + newX = xte_mod, + lambda1 = cvlambda1_lava_mod, + lambda2 = cvlambda2_lava_mod + )) + } + + avg_yhat_lava_tr <- rowMeans(yhat_tr_lava_mod) + avg_yhat_lava_test <- rowMeans(yhat_te_lava_mod) + + r2_cv_ave_lava_mod <- 1 - sum((ytr - avg_yhat_lava_tr)^2) / sum((ytr - mean(ytr))^2) + r2v_cv_ave_lava_mod <- 1 - sum((yte - avg_yhat_lava_test)^2) / sum((yte - mean(ytr))^2) + + return(c( + cvlambda1_lava_mod, + cvlambda2_lava_mod, + cvmin_yhat_lava_tr, # CV_min + cvmin_yhat_lava_test, # CV_min + r2_lava_mod, # CV_min + r2v_lava_mod, # CV_min + avg_yhat_lava_tr, # Average across Folds + avg_yhat_lava_test, # Average across Folds + r2_cv_ave_lava_mod, # Average across Folds + r2v_cv_ave_lava_mod # Average across Folds + )) +} +``` + +```{r} +fit_lava_flex <- lava_yhat_r2(model_x_flex_train, model_x_flex_test, y_train, y_test) +cat("Flexible model R^2 (LAVA): ", fit_lava_flex[[6]]) # using CV_min +``` + + + +We find that for this dataset the low dimensional OLS is sufficient. The high-dimensional approaches did not manage to substantively increase predictive power. + +### Extra high-dimensional specification (extra flexible) + +We repeat the same procedure for the extra flexible model. + + + + + + +```{r} +x_extra <- " sex + (exp1 + exp2 + exp3 + exp4 + shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we)^2" +formula_extra <- as.formula(paste("lwage", "~", x_extra)) +model_x_extra_train <- model.matrix(formula_extra, data_train) +model_x_extra_test <- model.matrix(formula_extra, data_test) +p_extra <- dim(model_x_extra_train)[2] +p_extra +``` + +```{r} +# ols (extra flexible model) +fit_lm_extra <- lm(formula_extra, data_train) +options(warn = -1) +yhat_lm_extra <- predict(fit_lm_extra, newdata = data_test) +mse_lm_extra <- summary(lm((y_test - yhat_lm_extra)^2 ~ 1))$coef[1:2] +r2_lm_extra <- 1 - mse_lm_extra[1] / var(y_test) +cat("Extra flexible model R^2 (OLS): ", r2_lm_extra) +``` + +#### Penalized regressions (extra flexible model) + +Now let's repeat our penalized regression analysis for the extra flexible model. Note this block takes a while ~ 1 hour 15 minutes. To reduce time substantially, reduce the number of folds in LAVA. + +```{r} +# penalized regressions +fit_lasso_cv_extra <- cv.glmnet(model_x_extra_train, y_train, family = "gaussian", alpha = 1) +fit_ridge_extra <- cv.glmnet(model_x_extra_train, y_train, family = "gaussian", alpha = 0) +fit_elnet_extra <- cv.glmnet(model_x_extra_train, y_train, family = "gaussian", alpha = .5) +fit_rlasso_extra <- hdm::rlasso(formula_extra, data_train, post = FALSE) +fit_rlasso_post_extra <- hdm::rlasso(formula_extra, data_train, post = TRUE) +fit_lava_extra <- lava_yhat_r2(model_x_extra_train, model_x_extra_test, y_train, y_test) + +yhat_lasso_cv_extra <- predict(fit_lasso_cv_extra, newx = model_x_extra_test) +yhat_ridge_extra <- predict(fit_ridge_extra, newx = model_x_extra_test) +yhat_elnet_extra <- predict(fit_elnet_extra, newx = model_x_extra_test) +yhat_rlasso_extra <- predict(fit_rlasso_extra, newdata = data_test) +yhat_rlasso_post_extra <- predict(fit_rlasso_post_extra, newdata = data_test) +yhat_lava_extra <- fit_lava_extra[[4]] + +mse_lasso_cv_extra <- summary(lm((y_test - yhat_lasso_cv_extra)^2 ~ 1))$coef[1:2] +mse_ridge_extra <- summary(lm((y_test - yhat_ridge_extra)^2 ~ 1))$coef[1:2] +mse_elnet_extra <- summary(lm((y_test - yhat_elnet_extra)^2 ~ 1))$coef[1:2] +mse_lasso_extra <- summary(lm((y_test - yhat_rlasso_extra)^2 ~ 1))$coef[1:2] +mse_lasso_post_extra <- summary(lm((y_test - yhat_rlasso_post_extra)^2 ~ 1))$coef[1:2] +mse_lava_extra <- summary(lm(as.vector(y_test - yhat_lava_extra)^2 ~ 1))$coef[1:2] + +r2_lasso_cv_extra <- 1 - mse_lasso_cv_extra[1] / var(y_test) +r2_ridge_extra <- 1 - mse_ridge_extra[1] / var(y_test) +r2_elnet_extra <- 1 - mse_elnet_extra[1] / var(y_test) +r2_lasso_extra <- 1 - mse_lasso_extra[1] / var(y_test) +r2_lasso_post_extra <- 1 - mse_lasso_post_extra[1] / var(y_test) +r2_lava_extra <- 1 - mse_lava_extra[1] / var(y_test) + +# R^2 (extra flexible) +cat("\nExtra flexible model R^2 (Lasso): ", r2_lasso_cv_extra) +cat("\nExtra flexible model R^2 (Ridge): ", r2_ridge_extra) +cat("\nExtra flexible model R^2 (Elastic Net): ", r2_elnet_extra) +cat("\nExtra flexible model R^2 (RLasso): ", r2_lasso_extra) +cat("\nExtra flexible model R^2 (RLasso post): ", r2_lasso_post_extra) +cat("\nExtra flexible model R^2 (LAVA): ", r2_lava_extra) # using CV_min +``` + + + diff --git a/PM2/r_orthogonal_orig.Rmd b/PM2/r_orthogonal_orig.Rmd new file mode 100644 index 00000000..a49aef69 --- /dev/null +++ b/PM2/r_orthogonal_orig.Rmd @@ -0,0 +1,116 @@ +--- +title: An R Markdown document converted from "PM2/r_orthogonal_orig.irnb" +output: html_document +--- + +# Simulation on Orthogonal Estimation + +We compare the performance of the naive and orthogonal methods in a computational experiment where +$p=n=100$, $\beta_j = 1/j^2$, $(\gamma_{DW})_j = 1/j^2$ and $$Y = 1 \cdot D + \beta' W + \epsilon_Y$$ + +where $W \sim N(0,I)$, $\epsilon_Y \sim N(0,1)$, and $$D = \gamma'_{DW} W + \tilde{D}$$ where $\tilde{D} \sim N(0,1)/4$. + +The true treatment effect here is 1. From the plots produced in this notebook (estimate minus ground truth), we show that the naive single-selection estimator is heavily biased (lack of Neyman orthogonality in its estimation strategy), while the orthogonal estimator based on partialling out, is approximately unbiased and Gaussian. + +```{r} +install.packages("hdm") +install.packages("ggplot2") +``` + +```{r} +library(hdm) +library(ggplot2) +``` + +```{r} +# Initialize constants +B <- 10000 # Number of iterations +n <- 100 # Sample size +p <- 100 # Number of features + +# Initialize arrays to store results +Naive <- rep(0, B) +Orthogonal <- rep(0, B) + + +lambdaYs <- rep(0, B) +lambdaDs <- rep(0, B) + +for (i in 1:B) { + # Generate parameters + beta <- 1 / (1:p)^2 + gamma <- 1 / (1:p)^2 + + # Generate covariates / random data + X <- matrix(rnorm(n * p), n, p) + D <- X %*% gamma + rnorm(n) / 4 + + # Generate Y using DGP + Y <- D + X %*% beta + rnorm(n) + + # Single selection method + rlasso_result <- hdm::rlasso(Y ~ D + X) # Fit lasso regression + sx_ids <- which(rlasso_result$coef[-c(1, 2)] != 0) # Selected covariates + + # Check if any Xs are selected + if (sum(sx_ids) == 0) { + Naive[i] <- lm(Y ~ D)$coef[2] # Fit linear regression with only D if no Xs are selected + } else { + Naive[i] <- lm(Y ~ D + X[, sx_ids])$coef[2] # Fit linear regression with selected X otherwise + } + + # Partialling out / Double Lasso + + fitY <- hdm::rlasso(Y ~ X, post = TRUE) + resY <- fitY$res + + fitD <- hdm::rlasso(D ~ X, post = TRUE) + resD <- fitD$res + + Orthogonal[i] <- lm(resY ~ resD)$coef[2] # Fit linear regression for residuals +} +``` + +## Make a Nice Plot + +```{r} +# Specify ratio +img_width <- 15 +img_height <- img_width / 2 +``` + +```{r} +# Create a data frame for the estimates +df <- data.frame(Method = rep(c("Naive", "Orthogonal"), each = B), + Value = c(Naive - 1, Orthogonal - 1)) + +# Create the histogram using ggplot2 +hist_plot <- ggplot(df, aes(x = Value, fill = Method)) + + geom_histogram(binwidth = 0.1, color = "black", alpha = 0.7) + + facet_wrap(~Method, scales = "fixed") + + labs( + title = "Distribution of Estimates (Centered around Ground Truth)", + x = "Bias", + y = "Frequency" + ) + + scale_x_continuous(breaks = seq(-2, 1.5, 0.5)) + + theme_minimal() + + theme( + plot.title = element_text(hjust = 0.5), # Center the plot title + strip.text = element_text(size = 10), # Increase text size in facet labels + legend.position = "none", # Remove the legend + panel.grid.major = element_blank(), # Make major grid lines invisible + # panel.grid.minor = element_blank(), # Make minor grid lines invisible + strip.background = element_blank() # Make the strip background transparent + ) + + theme(panel.spacing = unit(2, "lines")) # Adjust the ratio to separate subplots wider + +# Set a wider plot size +options(repr.plot.width = img_width, repr.plot.height = img_height) + +# Display the histogram +print(hist_plot) +``` + +As we can see from the above bias plots (estimates minus the ground truth effect of 1), the double lasso procedure concentrates around zero whereas the naive estimator does not. + diff --git a/PM3/r_functional_approximation_by_nn_and_rf.Rmd b/PM3/r_functional_approximation_by_nn_and_rf.Rmd new file mode 100644 index 00000000..aca4a796 --- /dev/null +++ b/PM3/r_functional_approximation_by_nn_and_rf.Rmd @@ -0,0 +1,193 @@ +--- +title: An R Markdown document converted from "PM3/r_functional_approximation_by_nn_and_rf.irnb" +output: html_document +--- + +# Functional Approximations by Trees and Neural Networks + +Here we show how the function +$$ +x \mapsto exp(4 x) +$$ +can be easily approximated by a tree-based methods (Trees, Random Forest) and a neural network (2 Layered Neural Network) + +```{r} +install.packages("randomForest") +install.packages("rpart") +install.packages("gbm") +install.packages("keras") +``` + +```{r} +library(randomForest) +library(rpart) +library(gbm) +library(keras) +``` + +# Functional Approximation by a Tree + +We play around with the penalty level $cp$ below to illustrate how it affects the complexity of tree. Recall we may use this to prune the tree to improve predictive performance and lessen the noise in our final estimate. A simple penalty would be the number of leaves times a penalty level $\alpha$. + +Specifics on the penalty can be found [here](https://cran.r-project.org/web/packages/rpart/rpart.pdf). + +```{r} +set.seed(1) +x_train <- matrix(runif(1000), 1000, 1) +y_train <- exp(4 * x_train) # Noiseless case Y=g(X) +dim(x_train) + + +# shallow tree +TreeModel <- rpart(y_train ~ x_train, cp = .01) # cp is penalty level +pred_tm <- predict(TreeModel, newx = x_train) +plot(x_train, y_train, type = "p", pch = 19, xlab = "z", ylab = "g(z)") +points(x_train, pred_tm, col = 3, pch = 19) +``` + +```{r} +set.seed(1) +x_train <- matrix(runif(1000), 1000, 1) +y_train <- exp(4 * x_train) # Noiseless case Y=g(X) +dim(x_train) + + +TreeModel <- rpart(y_train ~ x_train, cp = .0005) # cp is penalty level +pred_tm <- predict(TreeModel, newx = x_train) +plot(x_train, y_train, type = "p", pch = 19, xlab = "z", ylab = "g(z)") +points(x_train, pred_tm, col = 3, pch = 19) +``` + +# Functional Approximation by RF + +Here we show how the function +$$ +x \mapsto exp(4 x) +$$ +can be easily approximated by a tree-based method (Random Forest) and a neural network (2 Layered Neural Network) + +```{r} +RFmodel <- randomForest(y_train ~ x_train) +pred_rf <- predict(RFmodel, newdata = x_train) +plot(x_train, y_train, type = "p", pch = 19, xlab = "z", ylab = "g(z)") +points(x_train, pred_rf, col = 4, pch = 19) +``` + +# Boosted Trees + +```{r} +data_train <- as.data.frame(cbind(x_train, y_train)) +BoostTreemodel <- gbm(y_train ~ x_train, + distribution = "gaussian", n.trees = 100, shrinkage = .01, + interaction.depth = 3 +) + +# shrinkage is "learning rate" +# n.trees is the number of boosting steps +# interaction.depth is the max depth of each tree +pred_bt <- predict(BoostTreemodel, newdata = data_train, n.trees = 100) +plot(x_train, y_train, type = "p", pch = 19, xlab = "z", ylab = "g(z)") +points(x_train, pred_bt, col = 4, pch = 19) +``` + +```{r} +data_train <- as.data.frame(cbind(x_train, y_train)) +BoostTreemodel <- gbm(y_train ~ x_train, + distribution = "gaussian", n.trees = 1000, shrinkage = .01, + interaction.depth = 3 +) +# shrinkage is "learning rate" +# n.trees is the number of boosting steps +# interaction.depth is the max depth of each tree +pred_bt <- predict(BoostTreemodel, newdata = data_train, n.trees = 1000) +plot(x_train, y_train, type = "p", pch = 19, xlab = "z", ylab = "g(z)") +points(x_train, pred_bt, col = 4, pch = 19) +``` + +# Same Example with a Neural Network + +```{r} +build_model <- function() { + + model <- keras_model_sequential() %>% + layer_dense( + units = 200, activation = "relu", + input_shape = 1 + ) %>% + layer_dense(units = 20, activation = "relu") %>% + layer_dense(units = 1) + + model %>% compile( + optimizer = optimizer_adam(lr = 0.01), + loss = "mse", + metrics = c("mae"), + ) +} +``` + +```{r} +model <- build_model() +summary(model) +``` + +```{r} +num_epochs <- 1 +model %>% fit(x_train, y_train, + epochs = num_epochs, batch_size = 10, verbose = 0 +) +pred_nn <- model %>% predict(x_train) +plot(x_train, y_train, type = "p", pch = 19, xlab = "z", ylab = "g(z)") +points(x_train, pred_nn, col = 4, pch = 19, ) +``` + +```{r} +num_epochs <- 100 +model %>% fit(x_train, y_train, + epochs = num_epochs, batch_size = 10, verbose = 0 +) +pred_nn <- model %>% predict(x_train) +plot(x_train, y_train, type = "p", pch = 19, xlab = "z", ylab = "g(z)") +points(x_train, pred_nn, col = 4, pch = 19, ) +``` + +### Using Early Stopping + +```{r} +# Define the neural network architecture +model <- keras_model_sequential() %>% + layer_dense(units = 200, activation = "relu", input_shape = 1) %>% + layer_dense(units = 20, activation = "relu") %>% + layer_dense(units = 1) # Output layer with 1 unit for regression task + +# Compile the model +model %>% compile( + optimizer = optimizer_adam(lr = 0.01), + loss = "mse", + metrics = c("mae"), +) + +summary(model) +``` + +```{r} +num_epochs <- 100 + +# Define early stopping based on validation set (20%) performance +# Patience set to 5 epochs (default in skorch is 5) +early_stopping <- callback_early_stopping(monitor = "val_loss", patience = 5) + +# Train the model +model %>% fit( + x_train, y_train, + epochs = num_epochs, + batch_size = 10, + validation_split = 0.2, # 20% validation set + verbose = 0, + callbacks = list(early_stopping) +) + +pred_nn <- model %>% predict(x_train) +plot(x_train, y_train, type = "p", pch = 19, xlab = "z", ylab = "g(z)") +points(x_train, pred_nn, col = 4, pch = 19) +``` + diff --git a/PM3/r_ml_wage_prediction.Rmd b/PM3/r_ml_wage_prediction.Rmd new file mode 100644 index 00000000..c5cd0bf7 --- /dev/null +++ b/PM3/r_ml_wage_prediction.Rmd @@ -0,0 +1,594 @@ +--- +title: An R Markdown document converted from "PM3/r_ml_wage_prediction.irnb" +output: html_document +--- + +# Machine Learning Estimators for Wage Prediction + +We illustrate how to predict an outcome variable Y in a high-dimensional setting, where the number of covariates $p$ is large in relation to the sample size $n$. So far we have used linear prediction rules, e.g. Lasso regression, for estimation. +Now, we also consider nonlinear prediction rules including tree-based methods. + +```{r} +# Import relevant packages +install.packages("xtable") +install.packages("hdm") +install.packages("glmnet") +install.packages("randomForest") +install.packages("rpart") +install.packages("nnet") +install.packages("gbm") +install.packages("rpart.plot") +install.packages("keras") +``` + +```{r} +library(hdm) +library(xtable) +library(glmnet) +library(randomForest) +library(rpart) +library(nnet) +library(gbm) +library(rpart.plot) +library(keras) +``` + +## Data + +Again, we consider data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015. +The preproccessed sample consists of $5150$ never-married individuals. + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/wage2015_subsample_inference.csv" +data <- read.csv(file) +dim(data) +``` + +The outcomes $Y_i$'s are hourly (log) wages of never-married workers living in the U.S. The raw regressors $Z_i$'s consist of a variety of characteristics, including experience, education and industry and occupation indicators. + +```{r} +Z <- subset(data, select = -c(lwage, wage)) # regressors +colnames(Z) +``` + +The following figure shows the weekly wage distribution from the US survey data. + +```{r} +hist(data$wage, xlab = "hourly wage", main = "Empirical wage distribution from the US survey data", breaks = 35) +``` + +Wages show a high degree of skewness. Hence, wages are transformed in almost all studies by +the logarithm. + +## Analysis + +Due to the skewness of the data, we are considering log wages which leads to the following regression model + +$$log(wage) = g(Z) + \epsilon.$$ + +We will estimate the two sets of prediction rules: Linear and Nonlinear Models. +In linear models, we estimate the prediction rule of the form + +$$\hat g(Z) = \hat \beta'X.$$ +Again, we generate $X$ in two ways: + +1. Basic Model: $X$ consists of a set of raw regressors (e.g. gender, experience, education indicators, regional indicators). + + +2. Flexible Model: $X$ consists of all raw regressors from the basic model plus occupation and industry indicators, transformations (e.g., ${exp}^2$ and ${exp}^3$) and additional two-way interactions. + +To evaluate the out-of-sample performance, we split the data first. + +```{r} +set.seed(1234) +training <- sample(nrow(data), nrow(data) * (3 / 4), replace = FALSE) + +data_train <- data[training, ] +data_test <- data[-training, ] +``` + +We construct the two different model matrices $X_{basic}$ and $X_{flex}$ for both the training and the test sample: + +```{r} +x_basic <- "sex + exp1 + shs + hsg+ scl + clg + mw + so + we + C(occ2)+ C(ind2)" +x_flex <- paste("sex + exp1 + shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we ", + "+ (exp1 + exp2 + exp3 + exp4) * (shs + hsg + scl + clg + C(occ2) + C(ind2) + mw + so + we)") +formula_basic <- as.formula(paste("lwage", "~", x_basic)) +formula_flex <- as.formula(paste("lwage", "~", x_flex)) + +model_x_basic_train <- model.matrix(formula_basic, data_train) +model_x_basic_test <- model.matrix(formula_basic, data_test) +p_basic <- dim(model_x_basic_train)[2] +model_x_flex_train <- model.matrix(formula_flex, data_train) +model_x_flex_test <- model.matrix(formula_flex, data_test) +p_flex <- dim(model_x_flex_train)[2] +``` + +```{r} +y_train <- data_train$lwage +y_test <- data_test$lwage +``` + +```{r} +p_basic +p_flex +``` + +As known from our first lab, the basic model consists of $51$ regressors and the flexible model of $246$ regressors. Let us fit our models to the training sample using the two different model specifications. We are starting by running a simple ols regression. + +### OLS + +We fit the basic model to our training data by running an ols regression and compute the mean squared error on the test sample. + +```{r} +# ols (basic model) +fit_lm_basic <- lm(formula_basic, data_train) +``` + +```{r} +# Compute the Out-Of-Sample Performance +yhat_lm_basic <- predict(fit_lm_basic, newdata = data_test) +# MSE OLS (basic model) +cat("The mean squared error (MSE) using the basic model is equal to", mean((y_test - yhat_lm_basic)^2)) +``` + +To determine the out-of-sample $MSE$ and the standard error in one step, we can use the function *lm*: + +```{r} +mse_lm_basic <- summary(lm((y_test - yhat_lm_basic)^2 ~ 1))$coef[1:2] +mse_lm_basic +``` + +We also compute the out-of-sample $R^2$: + +```{r} +r2_lm_basic <- 1 - mse_lm_basic[1] / var(y_test) +# MSE OLS (basic model) +cat("The R^2 using the basic model is equal to", r2_lm_basic) +``` + +We repeat the same procedure for the flexible model. + +```{r} +# ols (flexible model) +fit_lm_flex <- lm(formula_flex, data_train) +# Compute the Out-Of-Sample Performance +options(warn = -1) +yhat_lm_flex <- predict(fit_lm_flex, newdata = data_test) +mse_lm_flex <- summary(lm((y_test - yhat_lm_flex)^2 ~ 1))$coef[1:2] +r2_lm_flex <- 1 - mse_lm_flex[1] / var(y_test) +cat("The R^2 using the flexible model is equal to", r2_lm_flex) # MSE OLS (flexible model) +``` + +We observe that ols regression works better for the basic model with smaller $p/n$ ratio. We now proceed by running lasso regressions and related penalized methods. + +### Lasso, Ridge and Elastic Net + +Considering the basic model, we run a lasso/post-lasso regression first and then we compute the measures for the out-of-sample performance. Note that applying the package *hdm* and the function *rlasso* we rely on a theoretical based choice of the penalty level $\lambda$ in the lasso regression. + +```{r} +# lasso and variants +fit_rlasso <- hdm::rlasso(formula_basic, data_train, post = FALSE) +fit_rlasso_post <- hdm::rlasso(formula_basic, data_train, post = TRUE) +yhat_rlasso <- predict(fit_rlasso, newdata = data_test) +yhat_rlasso_post <- predict(fit_rlasso_post, newdata = data_test) + +mse_lasso <- summary(lm((y_test - yhat_rlasso)^2 ~ 1))$coef[1:2] +mse_lasso_post <- summary(lm((y_test - yhat_rlasso_post)^2 ~ 1))$coef[1:2] + +r2_lasso <- 1 - mse_lasso[1] / var(y_test) +r2_lasso_post <- 1 - mse_lasso_post[1] / var(y_test) +# R^2 lasso/post-lasso (basic model) +cat("The R^2 using the basic model is equal to", r2_lasso, "for lasso and", r2_lasso_post, "for post-lasso") +``` + +Now, we repeat the same procedure for the flexible model. + +```{r} +fit_rlasso_flex <- hdm::rlasso(formula_flex, data_train, post = FALSE) +fit_rlasso_post_flex <- hdm::rlasso(formula_flex, data_train, post = TRUE) +yhat_rlasso_flex <- predict(fit_rlasso_flex, newdata = data_test) +yhat_rlasso_post_flex <- predict(fit_rlasso_post_flex, newdata = data_test) + +mse_lasso_flex <- summary(lm((y_test - yhat_rlasso_flex)^2 ~ 1))$coef[1:2] +mse_lasso_post_flex <- summary(lm((y_test - yhat_rlasso_post_flex)^2 ~ 1))$coef[1:2] + +# R^2 lasso/post-lasso (flexible model) +r2_lasso_flex <- 1 - mse_lasso_flex[1] / var(y_test) +r2_lasso_post_flex <- 1 - mse_lasso_post_flex[1] / var(y_test) +cat("The R^2 using the flexible model is equal to", r2_lasso_flex, + "for lasso and", r2_lasso_post_flex, "for post-lasso") +``` + +In contrast to a theoretical based choice of the tuning parameter $\lambda$ in the lasso regression, we can also use cross-validation to determine the penalty level by applying the package *glmnet* and the function cv.glmnet. In this context, we also run a ridge and a elastic net regression by adjusting the parameter *alpha*. + +```{r} +fit_lasso_cv <- cv.glmnet(model_x_basic_train, y_train, family = "gaussian", alpha = 1) +fit_ridge <- cv.glmnet(model_x_basic_train, y_train, family = "gaussian", alpha = 0) +fit_elnet <- cv.glmnet(model_x_basic_train, y_train, family = "gaussian", alpha = .5) + +yhat_lasso_cv <- predict(fit_lasso_cv, newx = model_x_basic_test) +yhat_ridge <- predict(fit_ridge, newx = model_x_basic_test) +yhat_elnet <- predict(fit_elnet, newx = model_x_basic_test) + +mse_lasso_cv <- summary(lm((y_test - yhat_lasso_cv)^2 ~ 1))$coef[1:2] +mse_ridge <- summary(lm((y_test - yhat_ridge)^2 ~ 1))$coef[1:2] +mse_elnet <- summary(lm((y_test - yhat_elnet)^2 ~ 1))$coef[1:2] + +r2_lasso_cv <- 1 - mse_lasso_cv[1] / var(y_test) +r2_ridge <- 1 - mse_ridge[1] / var(y_test) +r2_elnet <- 1 - mse_elnet[1] / var(y_test) + +# R^2 using cross-validation (basic model) +cat("R^2 using cross-validation for lasso, ridge and elastic net in the basic model:", + r2_lasso_cv, r2_ridge, r2_elnet) +``` + +Note that the following calculations for the flexible model need some computation time. + +```{r} +fit_lasso_cv_flex <- cv.glmnet(model_x_flex_train, y_train, family = "gaussian", alpha = 1) +fit_ridge_flex <- cv.glmnet(model_x_flex_train, y_train, family = "gaussian", alpha = 0) +fit_elnet_flex <- cv.glmnet(model_x_flex_train, y_train, family = "gaussian", alpha = .5) + +yhat_lasso_cv_flex <- predict(fit_lasso_cv_flex, newx = model_x_flex_test) +yhat_ridge_flex <- predict(fit_ridge_flex, newx = model_x_flex_test) +yhat_elnet_flex <- predict(fit_elnet_flex, newx = model_x_flex_test) + +mse_lasso_cv_flex <- summary(lm((y_test - yhat_lasso_cv_flex)^2 ~ 1))$coef[1:2] +mse_ridge_flex <- summary(lm((y_test - yhat_ridge_flex)^2 ~ 1))$coef[1:2] +mse_elnet_flex <- summary(lm((y_test - yhat_elnet_flex)^2 ~ 1))$coef[1:2] + +r2_lasso_cv_flex <- 1 - mse_lasso_cv_flex[1] / var(y_test) +r2_ridge_flex <- 1 - mse_ridge_flex[1] / var(y_test) +r2_elnet_flex <- 1 - mse_elnet_flex[1] / var(y_test) + +# R^2 using cross-validation (flexible model) +cat("R^2 using cross-validation for lasso, ridge and elastic net in the flexible model:", + r2_lasso_cv_flex, r2_ridge_flex, r2_elnet_flex) +``` + +The performance of the lasso regression with cross-validated penalty is quite similar to the performance of lasso using a theoretical based choice of the tuning parameter. + +#Non-linear models + +Besides linear regression models, we consider nonlinear regression models to build a predictive model. We are applying regression trees, random forests, boosted trees and neural nets to estimate the regression function $g(X)$. + +## Regression Trees + +We fit a regression tree to the training data using the basic model. The variable *cp* controls the complexity of the regression tree, i.e. how deep we build the tree. + +```{r} +# tree +fit_trees <- rpart(formula_basic, data_train, minbucket = 5, cp = 0.001) +# plotting the tree +prp(fit_trees, leaf.round = 1, space = 2, yspace = 2, split.space = 2, shadow.col = "gray", trace = 1) +``` + +An important method to improve predictive performance is called "Pruning the Tree". This +means the process of cutting down the branches of a tree. We apply pruning to the complex tree above to reduce the depth. Initially, we determine the optimal complexity of the regression tree. + +```{r} +bestcp <- fit_trees$cptable[which.min(fit_trees$cptable[, "xerror"]), "CP"] +bestcp +``` + +Now, we can prune the tree and visualize the prediction rule. + +```{r} +fit_prunedtree <- prune(fit_trees, cp = bestcp) +prp(fit_prunedtree, leaf.round = 1, space = 3, yspace = 3, split.space = 7, + shadow.col = "gray", trace = 1, yesno = 1) +``` + +Finally, we calculate the mean-squared error and the $R^2$ on the test sample to evaluate the out-of-sample performance of the pruned tree. + +```{r} +yhat_pt <- predict(fit_prunedtree, newdata = data_test) +mse_pt <- summary(lm((y_test - yhat_pt)^2 ~ 1))$coef[1:2] +r2_pt <- 1 - mse_pt[1] / var(y_test) + +# R^2 of the pruned tree +cat("R^2 of the pruned tree:", r2_pt) +``` + +## Random Forest and Boosted Trees + +In the next step, we apply the more advanced tree-based methods random forest and boosted trees. + +```{r} +# random forest +fit_rf <- randomForest(model_x_basic_train, y_train, ntree = 2000, nodesize = 20, data = data_train) + +## Evaluating the method +yhat_rf <- predict(fit_rf, newdata = model_x_basic_test) # prediction + +mse_rf <- summary(lm((y_test - yhat_rf)^2 ~ 1))$coef[1:2] +r2_rf <- 1 - mse_rf[1] / var(y_test) +``` + +```{r} +# boosting +fit_boost <- gbm(formula_basic, data = data_train, distribution = "gaussian", bag.fraction = .5, + interaction.depth = 2, n.trees = 1000, shrinkage = .01) +best_boost <- gbm.perf(fit_boost, plot.it = FALSE) # cross-validation to determine when to stop + +## Evaluating the method +yhat_boost <- predict(fit_boost, newdata = data_test, n.trees = best_boost) + +mse_boost <- summary(lm((y_test - yhat_boost)^2 ~ 1))$coef[1:2] +r2_boost <- 1 - mse_boost[1] / var(y_test) +``` + +```{r} +# printing R^2 +cat("R^2 of the random forest and boosted trees:", r2_rf, r2_boost) +``` + +## NNets + +First, we need to determine the structure of our network. We are using the R package *keras* to build a simple sequential neural network with three dense layers -- 2 hidden and one output layer. + +```{r} +# Define the neural network architecture +model <- keras_model_sequential() %>% + layer_dense(units = 50, activation = "relu", input_shape = dim(model_x_basic_train)[2]) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dense(units = 1) # Output layer with 1 unit for regression task + +# Compile the model +model %>% compile( + optimizer = optimizer_adam(lr = 0.01), + loss = "mse", + metrics = c("mae"), +) + +summary(model) +``` + +```{r} +num_epochs <- 100 + +# Define early stopping based on validation set (20%) performance +# Patience set to 5 epochs (default in skorch is 5) +early_stopping <- callback_early_stopping(monitor = "val_loss", patience = 5) + +# Train the model +model %>% fit( + model_x_basic_train, y_train, + epochs = num_epochs, + batch_size = 10, + validation_split = 0.2, # 20% validation set + verbose = 0, + callbacks = list(early_stopping) +) +``` + +```{r} +# evaluating the performance +model %>% evaluate(model_x_basic_test, y_test, verbose = 0) +``` + +```{r} +# Calculating the performance measures +yhat_nn <- model %>% predict(model_x_basic_test) +mse_nn <- summary(lm((y_test - yhat_nn)^2 ~ 1))$coef[1:2] +r2_nn <- 1 - mse_nn[1] / var(y_test) +# printing R^2 +cat("R^2 of the neural network:", r2_nn) +``` + +To conclude, let us have a look at our results. + +## Results + +```{r} +table <- matrix(0, 16, 3) +table[1, 1:2] <- mse_lm_basic +table[2, 1:2] <- mse_lm_flex +table[3, 1:2] <- mse_lasso +table[4, 1:2] <- mse_lasso_post +table[5, 1:2] <- mse_lasso_flex +table[6, 1:2] <- mse_lasso_post_flex +table[7, 1:2] <- mse_lasso_cv +table[8, 1:2] <- mse_ridge +table[9, 1:2] <- mse_elnet +table[10, 1:2] <- mse_lasso_cv_flex +table[11, 1:2] <- mse_ridge_flex +table[12, 1:2] <- mse_elnet_flex +table[13, 1:2] <- mse_rf +table[14, 1:2] <- mse_boost +table[15, 1:2] <- mse_pt +table[16, 1:2] <- mse_nn + + +table[1, 3] <- r2_lm_basic +table[2, 3] <- r2_lm_flex +table[3, 3] <- r2_lasso +table[4, 3] <- r2_lasso_post +table[5, 3] <- r2_lasso_flex +table[6, 3] <- r2_lasso_post_flex +table[7, 3] <- r2_lasso_cv +table[8, 3] <- r2_ridge +table[9, 3] <- r2_elnet +table[10, 3] <- r2_lasso_cv_flex +table[11, 3] <- r2_ridge_flex +table[12, 3] <- r2_elnet_flex +table[13, 3] <- r2_rf +table[14, 3] <- r2_boost +table[15, 3] <- r2_pt +table[16, 3] <- r2_nn + + +colnames(table) <- c("MSE", "S.E. for MSE", "R-squared") +rownames(table) <- c( + "Least Squares (basic)", "Least Squares (flexible)", "Lasso", "Post-Lasso", + "Lasso (flexible)", "Post-Lasso (flexible)", + "Cross-Validated lasso", "Cross-Validated ridge", "Cross-Validated elnet", + "Cross-Validated lasso (flexible)", "Cross-Validated ridge (flexible)", "Cross-Validated elnet (flexible)", + "Random Forest", "Boosted Trees", "Pruned Tree", "Neural Net (Early)" +) +tab <- xtable(table, digits = 3) +print(tab, type = "latex") # set type="latex" for printing table in LaTeX +tab +``` + +Above, we displayed the results for a single split of data into the training and testing part. The table shows the test MSE in column 1 as well as the standard error in column 2 and the test $R^2$ +in column 3. We see that most models perform similarly. For most of these methods, test MSEs are within one standard error of each other. Remarkably, OLS with just the basic variables performs extremely well. However, OLS on a flexible model with many regressors performs very poorly giving nearly the highest test MSE. It is worth noticing that, as this is just a simple illustration meant to be relatively quick, the nonlinear models are not tuned. Thus, there is potential to improve the performance of the nonlinear methods we used in the analysis. + +### Ensemble learning + +In the final step, we can build a prediction model by combing the strengths of the models we considered so far. This ensemble method is of the form + $$ f(x) = \sum_{k=1}^K \alpha_k f_k(x) $$ +where the $f_k$'s denote our prediction rules from the table above and the $\alpha_k$'s are the corresponding weights. + +We first estimate the weights without penalization. + +```{r} +ensemble_ols <- summary(lm(y_test ~ yhat_lm_basic + yhat_lm_flex + yhat_rlasso + yhat_rlasso_flex + + yhat_rlasso_post + yhat_rlasso_post_flex + yhat_lasso_cv + yhat_lasso_cv_flex + + yhat_ridge + yhat_ridge_flex + yhat_elnet + yhat_elnet_flex + + yhat_pt + yhat_rf + yhat_boost + yhat_nn)) +ensemble_ols +``` + +Alternatively, we can determine the weights via lasso regression. + +```{r} +ensemble_lasso <- summary(hdm::rlasso(y_test ~ yhat_lm_basic + yhat_lm_flex + yhat_rlasso + yhat_rlasso_flex + + yhat_rlasso_post + yhat_rlasso_post_flex + yhat_lasso_cv + yhat_lasso_cv_flex + + yhat_ridge + yhat_ridge_flex + yhat_elnet + yhat_elnet_flex + + yhat_pt + yhat_rf + yhat_boost + yhat_nn)) +ensemble_lasso +``` + +The estimated weights are shown in the following table. + +```{r} +table <- matrix(0, 17, 2) +table[1:17, 1] <- ensemble_ols$coef[1:17] +table[1:17, 2] <- ensemble_lasso$coef[1:17] + +colnames(table) <- c("Weight OLS", "Weight Lasso") + +rownames(table) <- c( + "Constant", "Least Squares (basic)", "Least Squares (flexible)", "Lasso (basic)", + "Lasso (flexible)", "Post-Lasso (basic)", "Post-Lasso (flexible)", "LassoCV (basic)", + "Lasso CV (flexible)", "Ridge CV (basic)", "Ridge CV (flexible)", "ElNet CV (basic)", + "ElNet CV (flexible)", "Pruned Tree", "Random Forest", "Boosted Trees", "Neural Net" +) +tab <- xtable(table, digits = 3) +print(tab, type = "latex") # set type="latex" for printing table in LaTeX +tab +``` + +We note the superior $R^2$ performance of the ensembles. Though for more unbiased performance evaluation, we should have left out a third sample to validate the performance of the stacked model. + +```{r} +# print ensemble R^2 +cat("R^2 of stacking with LS weights:", ensemble_ols$adj.r.squared, "\n") +cat("R^2 of stacking with Lasso weights:", ensemble_lasso$adj.r.squared, "\n") +``` + +# Automatic Machine Learning with H20 AutoML + +We illustrate how to predict an outcome variable Y in a high-dimensional setting, using the AutoML package *H2O* that covers the complete pipeline from the raw dataset to the deployable machine learning model. In last few years, AutoML or automated machine learning has become widely popular among data science community. Again, we reanalyze the wage prediction problem using data from the U.S. March Supplement of the Current Population Survey (CPS) in 2015. + +We can use AutoML as a benchmark and compare it to the methods that we used previously where we applied one machine learning method after the other. + +On a unix system, installation of the package h2o also requires the RCurl package, which requires the libcurl library to have been installed. If the installation of the h2o package fails, try installing the libcurl package first and repeating the cell. You can install the libcurl package for instance by running: +```bash +sudo apt-get install -y libcurl4-openssl-dev +``` + +```{r} +# load the H2O package +install.packages("h2o") +``` + +```{r} +library(h2o) +``` + +```{r} +# start h2o cluster +h2o.init() +``` + +```{r} +# convert data as h2o type +train_h <- as.h2o(data_train) +test_h <- as.h2o(data_test) + +# have a look at the data +h2o.describe(train_h) +``` + +```{r} +y_name <- "lwage" +x_names <- setdiff(names(data), c("lwage", "wage", "occ", "ind")) + +# run AutoML for 10 base models and a maximal runtime of 100 seconds +aml <- h2o.automl( + x = x_names, y = y_name, + training_frame = train_h, + leaderboard_frame = test_h, + max_models = 10, + seed = 1, + max_runtime_secs = 100 +) +# AutoML Leaderboard +lb <- aml@leaderboard +print(lb, n = nrow(lb)) +``` + +We see that two Stacked Ensembles are at the top of the leaderboard. Stacked Ensembles often outperform a single model. The out-of-sample (test) MSE of the leading model is given by + +```{r} +aml@leaderboard$mse[1] +``` + +The in-sample performance can be evaluated by + +```{r} +aml@leader +``` + +This is in line with our previous results. To understand how the ensemble works, let's take a peek inside the Stacked Ensemble "All Models" model. The "All Models" ensemble is an ensemble of all of the individual models in the AutoML run. This is often the top performing model on the leaderboard. + +```{r} +model_ids <- as.data.frame(aml@leaderboard$model_id)[, 1] +# Get the "All Models" Stacked Ensemble model +se <- h2o.getModel(grep("StackedEnsemble_AllModels", model_ids, value = TRUE)[1]) +# Get the Stacked Ensemble metalearner model +metalearner <- se@model$metalearner_model +h2o.varimp(metalearner) +``` + +The table above gives us the variable importance of the metalearner in the ensemble. The AutoML Stacked Ensembles use the default metalearner algorithm (GLM with non-negative weights), so the variable importance of the metalearner is actually the standardized coefficient magnitudes of the GLM. + +```{r} +h2o.varimp_plot(metalearner) +``` + +## Generating Predictions Using Leader Model + +We can also generate predictions on a test sample using the leader model object. + +```{r} +pred <- as.matrix(h2o.predict(aml@leader, test_h)) # make prediction using x data from the test sample +head(pred) +``` + +```{r} +y_test <- as.matrix(test_h$lwage) +r2_test <- 1 - summary(lm((y_test - pred)^2 ~ 1))$coef[1] / var(y_test) +cat("MSE, SE, R^2:", summary(lm((y_test - pred)^2 ~ 1))$coef[1:2], r2_test) +``` + +We observe both a similar MSE and $R^2$ relative to the better performing models in our previous results. + +```{r} +h2o.shutdown(prompt = FALSE) # shut down the h20 automatically without prompting user +``` + diff --git a/PM4/r-dml-401k.Rmd b/PM4/r-dml-401k.Rmd new file mode 100644 index 00000000..ed199079 --- /dev/null +++ b/PM4/r-dml-401k.Rmd @@ -0,0 +1,1019 @@ +--- +title: An R Markdown document converted from "PM4/r-dml-401k.irnb" +output: html_document +--- + +# Inference on Predictive and Causal Effects in High-Dimensional Nonlinear Models + +## Impact of 401(k) on Financial Wealth + +As a practical illustration of the methods developed in this lecture, we consider estimation of the effect of 401(k) eligibility and participation +on accumulated assets. 401(k) plans are pension accounts sponsored by employers. The key problem in determining the effect of participation in 401(k) plans on accumulated assets is saver heterogeneity coupled with the fact that the decision to enroll in a 401(k) is non-random. It is generally recognized that some people have a higher preference for saving than others. It also seems likely that those individuals with high unobserved preference for saving would be most likely to choose to participate in tax-advantaged retirement savings plans and would tend to have otherwise high amounts of accumulated assets. The presence of unobserved savings preferences with these properties then implies that conventional estimates that do not account for saver heterogeneity and endogeneity of participation will be biased upward, tending to overstate the savings effects of 401(k) participation. + +One can argue that eligibility for enrolling in a 401(k) plan in this data can be taken as exogenous after conditioning on a few observables of which the most important for their argument is income. The basic idea is that, at least around the time 401(k)’s initially became available, people were unlikely to be basing their employment decisions on whether an employer offered a 401(k) but would instead focus on income and other aspects of the job. + +```{r} +install.packages("xtable") +install.packages("hdm") +install.packages("sandwich") +install.packages("ggplot2") +install.packages("randomForest") +install.packages("glmnet") +install.packages("rpart") +install.packages("gbm") +``` + +```{r} +library(xtable) +library(hdm) +library(sandwich) +library(ggplot2) +library(randomForest) +library(data.table) +library(glmnet) +library(rpart) +library(gbm) +``` + +### Data + +The raw dataset can be found [here](https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/401k.csv). +The data set can be loaded from the `hdm` package for R directly by typing: + + +```{r} +data(pension) +data <- pension +dim(data) +``` + +See the "Details" section on the description of the data set, which can be accessed by + +```{r} +help(pension) +``` + +The data consist of 9,915 observations at the household level drawn from the 1991 Survey of Income and Program Participation (SIPP). All the variables are referred to 1990. We use net financial assets (*net\_tfa*) as the outcome variable, $Y$, in our analysis. The net financial assets are computed as the sum of IRA balances, 401(k) balances, checking accounts, saving bonds, other interest-earning accounts, other interest-earning assets, stocks, and mutual funds less non mortgage debts. + +Among the $9915$ individuals, $3682$ are eligible to participate in the program. The variable *e401* indicates eligibility and *p401* indicates participation, respectively. + +```{r} +hist_e401 <- ggplot(data, aes(x = e401, fill = factor(e401))) + + geom_bar() +hist_e401 +``` + +Eligibility is highly associated with financial wealth: + +```{r} +dens_net_tfa <- ggplot(data, aes(x = net_tfa, color = factor(e401), fill = factor(e401))) + + geom_density() + + xlim(c(-20000, 150000)) + + facet_wrap(. ~ e401) + +dens_net_tfa +``` + +The unconditional APE of e401 is about $19559$: + +```{r} +e1 <- data[data$e401 == 1, ] +e0 <- data[data$e401 == 0, ] +round(mean(e1$net_tfa) - mean(e0$net_tfa), 0) +``` + +Among the $3682$ individuals that are eligible, $2594$ decided to participate in the program. The unconditional APE of p401 is about $27372$: + +```{r} +p1 <- data[data$p401 == 1, ] +p0 <- data[data$p401 == 0, ] +round(mean(p1$net_tfa) - mean(p0$net_tfa), 0) +``` + +As discussed, these estimates are biased since they do not account for saver heterogeneity and endogeneity of participation. + +```{r} +# outcome variable +y <- data[, "net_tfa"] +# treatment variable +D <- data[, "e401"] +D2 <- data[, "p401"] +D3 <- data[, "a401"] + +columns_to_drop <- c( + "e401", "p401", "a401", "tw", "tfa", "net_tfa", "tfa_he", + "hval", "hmort", "hequity", + "nifa", "net_nifa", "net_n401", "ira", + "dum91", "icat", "ecat", "zhat", + "i1", "i2", "i3", "i4", "i5", "i6", "i7", + "a1", "a2", "a3", "a4", "a5" +) + +# covariates +X <- data[, !(names(data) %in% columns_to_drop)] +``` + +```{r} +# Constructing the controls +x_formula <- paste("~ 0 + poly(age, 6, raw=TRUE) + poly(inc, 8, raw=TRUE) + poly(educ, 4, raw=TRUE) ", + "+ poly(fsize, 2, raw=TRUE) + male + marr + twoearn + db + pira + hown") +X <- as.data.table(model.frame(x_formula, X)) +head(X) +``` + +## Estimating the ATE of 401(k) Eligibility on Net Financial Assets + +We are interested in valid estimators of the average treatment effect of `e401` and `p401` on `net_tfa`. We start using ML approaches to estimate the function $g_0$ and $m_0$ in the following PLR model: + +\begin{align} + & Y = D\theta_0 + g_0(X) + \zeta, & E[\zeta \mid D,X]= 0,\\ + & D = m_0(X) + V, & E[V \mid X] = 0. +\end{align} + +## Partially Linear Regression Models (PLR) + +```{r} +dml2_for_plm <- function(x, d, y, dreg, yreg, nfold = 3, method = "regression") { + nobs <- nrow(x) # number of observations + foldid <- rep.int(1:nfold, times = ceiling(nobs / nfold))[sample.int(nobs)] # define folds indices + I <- split(1:nobs, foldid) # split observation indices into folds + ytil <- dtil <- rep(NA, nobs) + cat("fold: ") + for (b in seq_along(I)) { + if (method == "regression") { + dfit <- dreg(x[-I[[b]], ], d[-I[[b]]]) # take a fold out + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) # take a foldt out + dhat <- predict(dfit, x[I[[b]], ], type = "response") # predict the left-out fold + yhat <- predict(yfit, x[I[[b]], ], type = "response") # predict the left-out fold + dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual for the left-out fold + ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residial for the left-out fold + } else if (method == "randomforest") { + dfit <- dreg(x[-I[[b]], ], as.factor(d)[-I[[b]]]) # take a fold out + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) # take a fold out + dhat <- predict(dfit, x[I[[b]], ], type = "prob")[, 2] # predict the left-out fold + yhat <- predict(yfit, x[I[[b]], ], type = "response") # predict the left-out fold + dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual for the left-out fold + ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residial for the left-out fold + } else if (method == "decisiontrees") { + dfit <- dreg(x[-I[[b]], ], as.factor(d)[-I[[b]]]) # take a fold out + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) # take a fold out + dhat <- predict(dfit, x[I[[b]], ])[, 2] # predict the left-out fold + yhat <- predict(yfit, x[I[[b]], ]) # predict the left-out fold + dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual for the left-out fold + ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residial for the left-out fold + } else if (method == "boostedtrees") { + dfit <- dreg(x[-I[[b]], ], d[-I[[b]]]) # take a fold out + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) # take a fold out + dhat <- predict(dfit, x[I[[b]], ], type = "response") # predict the left-out fold + yhat <- predict(yfit, x[I[[b]], ], type = "response") # predict the left-out fold + dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual for the left-out fold + ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residial for the left-out fold + } + cat(b, " ") + } + rfit <- lm(ytil ~ dtil) # estimate the main parameter by regressing one residual on the other + coef_est <- coef(rfit)[2] # extract coefficient + se <- sqrt(vcovHC(rfit)[2, 2]) # record robust standard error + cat(sprintf("\ncoef (se) = %g (%g)\n", coef_est, se)) # printing output + return(list(coef_est = coef_est, se = se, dtil = dtil, ytil = ytil)) # save output and residuals +} +``` + +```{r} +summaryPLR <- function(point, stderr, resD, resy, name) { + data <- data.frame( + estimate = point, # point estimate + stderr = stderr, # standard error + lower = point - 1.96 * stderr, # lower end of 95% confidence interval + upper = point + 1.96 * stderr, # upper end of 95% confidence interval + `rmse y` = sqrt(mean(resy^2)), # RMSE of model that predicts outcome y + `rmse D` = sqrt(mean(resD^2)), # RMSE of model that predicts treatment D + `accuracy D` = mean(abs(resD) < 0.5) # binary classification accuracy of model for D + ) + rownames(data) <- name + return(data) +} +``` + +#### Double Lasso with Cross-Fitting + +```{r} +# DML with LassoCV +set.seed(123) +cat(sprintf("\nDML with Lasso CV \n")) + +dreg_lasso_cv <- function(x, d) { + cv.glmnet(x, d, family = "gaussian", alpha = 1, nfolds = 5) +} +yreg_lasso_cv <- function(x, y) { + cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 5) +} + +dml2_results <- dml2_for_plm(as.matrix(X), D, y, dreg_lasso_cv, yreg_lasso_cv, nfold = 5) + +sum_lasso_cv <- summaryPLR(dml2_results$coef_est, dml2_results$se, dml2_results$dtil, + dml2_results$ytil, name = "LassoCV") +tableplr <- data.frame() +tableplr <- rbind(sum_lasso_cv) +tableplr +``` + +```{r} +# Because residuals are output, reconstruct fitted values for use in ensemble +dhat_lasso <- D - dml2_results$dtil +yhat_lasso <- y - dml2_results$ytil +``` + +#### Using a $\ell_2$ Penalized Logistic Regression for D + +Note we are using the $\ell_2$ penalty here. You can use the $\ell_1$ penalty as well, but computation will take longer. + +```{r} +# DML with Lasso/Logistic +set.seed(123) +cat(sprintf("\nDML with Lasso/Logistic \n")) + +dreg_logistic_cv <- function(x, d) { + cv.glmnet(x, d, family = "binomial", alpha = 0, nfolds = 5) +} +yreg_lasso_cv <- function(x, y) { + cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 5) +} + +dml2_results <- dml2_for_plm(as.matrix(X), D, y, dreg_logistic_cv, yreg_lasso_cv, nfold = 5) +sum_lasso_logistic_cv <- summaryPLR(dml2_results$coef_est, dml2_results$se, dml2_results$dtil, + dml2_results$ytil, name = "LassoCV/LogisticCV") +tableplr <- rbind(tableplr, sum_lasso_logistic_cv) +tableplr +``` + +```{r} +# Because residuals are output, reconstruct fitted values for use in ensemble +dhat_lasso_logistic <- D - dml2_results$dtil +yhat_lasso_logistic <- y - dml2_results$ytil +``` + +#### Random Forests + +```{r} +# DML with Random Forest +set.seed(123) +cat(sprintf("\nDML with Random Forest \n")) + +dreg_rf <- function(x, d) { + randomForest(x, d, ntree = 1000, nodesize = 10) +} # ML method=Forest +yreg_rf <- function(x, y) { + randomForest(x, y, ntree = 1000, nodesize = 10) +} # ML method=Forest + +dml2_results <- dml2_for_plm(as.matrix(X), D, y, dreg_rf, yreg_rf, nfold = 5, method = "randomforest") +sum_rf <- summaryPLR(dml2_results$coef_est, dml2_results$se, dml2_results$dtil, + dml2_results$ytil, name = "Random Forest") +tableplr <- rbind(tableplr, sum_rf) +tableplr +``` + +```{r} +# Because residuals are output, reconstruct fitted values for use in ensemble +dhat_rf <- D - dml2_results$dtil +dhat_rf <- y - dml2_results$ytil +``` + +#### Decision Trees + +```{r} +# DML with Decision Trees +set.seed(123) +cat(sprintf("\nDML with Decision Trees \n")) + +dreg_tr <- function(x, d) { + rpart(as.formula("D~."), cbind(data.frame(D = d), x), method = "class", minbucket = 10, cp = 0.001) +} +dreg_tr <- function(x, y) { + rpart(as.formula("y~."), cbind(data.frame(y = y), x), minbucket = 10, cp = 0.001) +} + +# decision tree takes in X as dataframe, not matrix/array +dml2_results <- dml2_for_plm(X, D, y, dreg_tr, dreg_tr, nfold = 5, method = "decisiontrees") +sum_tr <- summaryPLR(dml2_results$coef_est, dml2_results$se, dml2_results$dtil, + dml2_results$ytil, name = "Decision Trees") +tableplr <- rbind(tableplr, sum_tr) +tableplr +``` + +```{r} +# Because residuals are output, reconstruct fitted values for use in ensemble +dhat_tr <- D - dml2_results$dtil +yhat_tr <- y - dml2_results$ytil +``` + + +Ideally, we would do (semi) cross-fitting with AutoML in order to find good first-stage models and re-run DML with these models. Unfortunately this is not easy to do in R. In the case of semi cross-fitting, we can use R's H20 AutoML trained on the entire training set $y\sim X$, $D \sim X$, $Z\sim X$ to determine the best model (eg ensemble), but H20 does not allow you to extract said model so we can re-use that in DML. + +#### Boosted Trees + +```{r} +# DML with Boosted Trees +set.seed(123) +cat(sprintf("\nDML with Boosted Trees \n")) + +# NB: early stopping cannot easily be implemented with gbm +## set n.trees = best, where best <- gbm.perf(dreg_boost, plot.it = FALSE) +dreg_boost <- function(x, d) { + gbm(as.formula("D~."), cbind(data.frame(D = d), x), distribution = "bernoulli", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +yreg_boost <- function(x, y) { + gbm(as.formula("y~."), cbind(data.frame(y = y), x), distribution = "gaussian", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} + +# passing these through regression as type="response", and D should not be factor! +dml2_results <- dml2_for_plm(X, D, y, dreg_boost, yreg_boost, nfold = 5, method = "boostedtrees") +sum_boost <- summaryPLR(dml2_results$coef_est, dml2_results$se, dml2_results$dtil, + dml2_results$ytil, name = "Boosted Trees") +tableplr <- rbind(tableplr, sum_boost) +tableplr +``` + +```{r} +# Because residuals are output, reconstruct fitted values for use in ensemble +dhat_boost <- D - dml2_results$dtil +yhat_boost <- y - dml2_results$ytil +``` + +## Ensembles + +Boosted trees give the best RMSE for both Y and D, so the ensemble based on choosing the best performing prediction rule is identical to boosting in this case. + +```{r} +# Best fit is boosted trees for both D and Y + +sum_best <- summaryPLR(dml2_results$coef_est, dml2_results$se, dml2_results$dtil, + dml2_results$ytil, name = "Best") +tableplr <- rbind(tableplr, sum_best) +tableplr +``` + +We'll form a model average with unconstrained least squares weights. + +```{r} +# Least squares model average + +ma_dtil <- lm(D ~ dhat_lasso + dhat_lasso_logistic + dhat_rf + dhat_tr + dhat_boost)$residuals +ma_ytil <- lm(y ~ yhat_lasso + yhat_lasso_logistic + dhat_rf + yhat_tr + yhat_boost)$residuals + +rfit <- lm(ma_ytil ~ ma_dtil) # estimate the main parameter by regressing one residual on the other +coef_est <- coef(rfit)[2] # extract coefficient +se <- sqrt(vcovHC(rfit)[2, 2]) # record robust standard error + +sum.ma <- summaryPLR(coef_est, se, ma_dtil, ma_ytil, name = "Model Average") +tableplr <- rbind(tableplr, sum.ma) +tableplr +``` + +## Interactive Regression Model (IRM) + +Next, we consider estimation of average treatment effects when treatment effects are fully heterogeneous: + + \begin{align} + & Y = g_0(D, X) + U, & \quad E[U \mid X, D]= 0,\\ + & D = m_0(X) + V, & \quad E[V\mid X] = 0. +\end{align} + +To reduce the disproportionate impact of extreme propensity score weights in the interactive model +we trim the propensity scores which are close to the bounds. + +```{r} +dml2_for_irm <- function(x, d, y, dreg, yreg0, yreg1, trimming = 0.01, nfold = 5, method = "regression") { + yhat0 <- rep(0, length(y)) + yhat1 <- rep(0, length(y)) + Dhat <- rep(0, length(d)) + + nobs <- nrow(x) # number of observations + foldid <- rep.int(1:nfold, times = ceiling(nobs / nfold))[sample.int(nobs)] # define folds indices + I <- split(1:nobs, foldid) # split observation indices into folds + ytil <- dtil <- rep(NA, nobs) + + cat("fold: ") + for (b in seq_along(I)) { + # define helpful variables + Dnotb <- d[-I[[b]]] + Xb <- X[I[[b]], ] + Xnotb <- X[-I[[b]], ] + + # training dfs subsetted on the -I[[b]] fold + XD0 <- X[-I[[b]], ][d[-I[[b]]] == 0] + yD0 <- y[-I[[b]]][d[-I[[b]]] == 0] + XD1 <- X[-I[[b]], ][d[-I[[b]]] == 1] + yD1 <- y[-I[[b]]][d[-I[[b]]] == 1] + + if (method == "regression") { + yfit0 <- yreg0(as.matrix(XD0), yD0) + yfit1 <- yreg1(as.matrix(XD1), yD1) + yhat0[I[[b]]] <- predict(yfit0, as.matrix(Xb)) # default is type = "response" for glmnet family gaussian + yhat1[I[[b]]] <- predict(yfit1, as.matrix(Xb)) + } else if (method == "randomforest") { + yfit0 <- yreg0(XD0, yD0) + yfit1 <- yreg1(XD1, yD1) + yhat0[I[[b]]] <- predict(yfit0, Xb) # default is type = "response" for rf + yhat1[I[[b]]] <- predict(yfit1, Xb) + } else if (method == "decisiontrees") { + yfit0 <- yreg0(XD0, yD0) + yfit1 <- yreg1(XD1, yD1) + yhat0[I[[b]]] <- predict(yfit0, Xb) # default is type = "vector" for decision + yhat1[I[[b]]] <- predict(yfit1, Xb) + } else if (method == "boostedtrees") { + yfit0 <- yreg0(as.data.frame(XD0), yD0) + yfit1 <- yreg1(as.data.frame(XD1), yD1) + yhat0[I[[b]]] <- predict(yfit0, Xb) # default is type = "response" for boosted + yhat1[I[[b]]] <- predict(yfit1, Xb) + } + + # propensity scores: + if (method == "regression") { + dfit_b <- dreg(as.matrix(Xnotb), Dnotb) + dhat_b <- predict(dfit_b, as.matrix(Xb), type = "response") # default is type="link" for family binomial! + } else if (method == "randomforest") { + dfit_b <- dreg(Xnotb, as.factor(Dnotb)) + dhat_b <- predict(dfit_b, Xb, type = "prob")[, 2] + } else if (method == "decisiontrees") { + dfit_b <- dreg(Xnotb, Dnotb) + dhat_b <- predict(dfit_b, Xb)[, 2] + } else if (method == "boostedtrees") { + dfit_b <- dreg(as.data.frame(Xnotb), Dnotb) + dhat_b <- predict(dfit_b, Xb, type = "response") + } + dhat_b <- pmax(pmin(dhat_b, 1 - trimming), trimming) # trimming so scores are between [trimming, (1-trimming)] + Dhat[I[[b]]] <- dhat_b + + cat(b, " ") + } + + # Prediction of treatment and outcome for observed instrument + yhat <- yhat0 * (1 - D) + yhat1 * D + # residuals + ytil <- y - yhat + dtil <- D - Dhat + # doubly robust quantity for every sample + drhat <- yhat1 - yhat0 + (y - yhat) * (D / Dhat - (1 - D) / (1 - Dhat)) + coef_est <- mean(drhat) + vari <- var(drhat) + se <- sqrt(vari / nrow(X)) + cat("point", coef_est) + cat("se", se) + return(list(coef_est = coef_est, se = se, ytil = ytil, dtil = dtil, drhat = drhat, + yhat0 = yhat0, yhat1 = yhat1, dhat = Dhat, yhat = yhat)) +} +``` + +```{r} +summaryIRM <- function(coef_est, se, ytil, dtil, drhat, name) { + summary_data <- data.frame( + estimate = coef_est, # point estimate + se = se, # standard error + lower = coef_est - 1.96 * se, # lower end of 95% confidence interval + upper = coef_est + 1.96 * se, # upper end of 95% confidence interval + rmse_y = sqrt(mean(ytil^2)), # res of model that predicts outcome y + rmse_D = sqrt(mean(dtil^2)), # res of model that predicts treatment D + accuracy_D = mean(abs(dtil) < 0.5) # binary classification accuracy of model for D + ) + row.names(summary_data) <- name + return(summary_data) +} +``` + +#### Repeat analysis in the IRM setting. + +```{r} +# DML with Lasso/Logistic +set.seed(123) +cat(sprintf("\nDML with LassoCV/Logistic \n")) + +dreg_lasso_cv <- function(x, d) { + cv.glmnet(x, d, family = "binomial", alpha = 0, nfolds = 5) +} +yreg0_lasso_cv <- function(x, y) { + cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 5) +} +yreg1_lasso_cv <- function(x, y) { + cv.glmnet(x, y, family = "gaussian", alpha = 1, nfolds = 5) +} + +# more folds seems to help stabilize finite sample performance +dml2_results <- dml2_for_irm(X, D, y, dreg_lasso_cv, yreg0_lasso_cv, yreg1_lasso_cv, nfold = 5) +sum_lasso_cv <- summaryIRM(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$drhat, name = "LassoCVLogistic") +tableirm <- data.frame() +tableirm <- rbind(sum_lasso_cv) +tableirm + +yhat0_lasso <- dml2_results$yhat0 +yhat1_lasso <- dml2_results$yhat1 +dhat_lasso <- dml2_results$dhat +yhat_lasso <- dml2_results$yhat +``` + +```{r} +# DML with Random Forest +set.seed(123) +cat(sprintf("\nDML with Random Forest \n")) + +dreg_rf <- function(x, d) { + randomForest(x, d, ntree = 1000, nodesize = 10) +} # ML method=Forest +yreg0_rf <- function(x, y) { + randomForest(x, y, ntree = 1000, nodesize = 10) +} # ML method=Forest +yreg1_rf <- function(x, y) { + randomForest(x, y, ntree = 1000, nodesize = 10) +} # ML method=Forest + + +dml2_results <- dml2_for_irm(as.matrix(X), D, y, dreg_rf, yreg0_rf, yreg1_rf, nfold = 5, method = "randomforest") +sum_rf <- summaryIRM(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$drhat, name = "Random Forest") +tableirm <- rbind(tableirm, sum_rf) +tableirm + +yhat0_rf <- dml2_results$yhat0 +yhat1_rf <- dml2_results$yhat1 +dhat_rf <- dml2_results$dhat +dhat_rf <- dml2_results$yhat +``` + +```{r} +# DML with Decision Trees +set.seed(123) +cat(sprintf("\nDML with Decision Trees \n")) + +dreg_tr <- function(x, d) { + rpart(as.formula("D~."), cbind(data.frame(D = d), x), method = "class", minbucket = 10, cp = 0.001) +} +yreg0_tr <- function(x, y) { + rpart(as.formula("y~."), cbind(data.frame(y = y), x), minbucket = 10, cp = 0.001) +} +yreg1_tr <- function(x, y) { + rpart(as.formula("y~."), cbind(data.frame(y = y), x), minbucket = 10, cp = 0.001) +} + +dml2_results <- dml2_for_irm(X, D, y, dreg_tr, yreg0_tr, yreg1_tr, nfold = 5, method = "decisiontrees") +sum_tr <- summaryIRM(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$drhat, name = "Decision Trees") +tableirm <- rbind(tableirm, sum_tr) +tableirm + +yhat0_tr <- dml2_results$yhat0 +yhat1_tr <- dml2_results$yhat1 +dhat_tr <- dml2_results$dhat +yhat_tr <- dml2_results$yhat +``` + +```{r} +# DML with Boosted Trees +set.seed(123) +cat(sprintf("\nDML with Boosted Trees \n")) + +# NB: early stopping cannot easily be implemented with gbm +## set n.trees = best, where best <- gbm.perf(dreg_boost, plot.it = FALSE) +dreg_boost <- function(x, d) { + gbm(as.formula("D~."), cbind(data.frame(D = d), x), distribution = "bernoulli", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +yreg0_boost <- function(x, y) { + gbm(as.formula("y~."), cbind(data.frame(y = y), x), distribution = "gaussian", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} +yreg1_boost <- function(x, y) { + gbm(as.formula("y~."), cbind(data.frame(y = y), x), distribution = "gaussian", + interaction.depth = 2, n.trees = 100, shrinkage = .1) +} + +# passing these through regression as type="response", and D should not be factor! +dml2_results <- dml2_for_irm(X, D, y, dreg_boost, yreg0_boost, yreg1_boost, nfold = 5, method = "boostedtrees") +sum_boost <- summaryIRM(dml2_results$coef_est, dml2_results$se, dml2_results$ytil, dml2_results$dtil, + dml2_results$drhat, name = "Boosted Trees") +tableirm <- rbind(tableirm, sum_boost) +tableirm + +yhat0_boost <- dml2_results$yhat0 +yhat1_boost <- dml2_results$yhat1 +dhat_boost <- dml2_results$dhat +yhat_boost <- dml2_results$yhat +``` + +```{r} +# Ensembles + +# Best +# We'll look at model that does best for Y overall. Could also use different model for Y0 and Y1 +# Here, the best performance for Y is the random forest and for D the boosted tree + +# residuals +ytil <- y - dhat_rf +dtil <- D - dhat_boost +# doubly robust quantity for every sample +drhat <- yhat1_rf - yhat0_rf + (y - dhat_rf) * (D / dhat_boost - (1 - D) / (1 - dhat_boost)) +coef_est <- mean(drhat) +vari <- var(drhat) +se <- sqrt(vari / nrow(X)) + +sum_best <- summaryIRM(coef_est, se, ytil, dtil, drhat, name = "Best") +tableirm <- rbind(tableirm, sum_best) +tableirm +``` + +```{r} +# Least squares model average +# We'll look at weights that do best job for Y overall. Could also use different weights for Y0 and Y1 + +ma_dw <- lm(D ~ dhat_lasso + dhat_rf + dhat_tr + dhat_boost)$coef +ma_yw <- lm(y ~ yhat_lasso + dhat_rf + yhat_tr + yhat_boost)$coef + +Dhats <- cbind(as.matrix(rep(1, nrow(X))), dhat_lasso, dhat_rf, dhat_tr, dhat_boost) +Y0s <- cbind(as.matrix(rep(1, nrow(X))), yhat0_lasso, yhat0_rf, yhat0_tr, yhat0_boost) +Y1s <- cbind(as.matrix(rep(1, nrow(X))), yhat1_lasso, yhat1_rf, yhat1_tr, yhat1_boost) + +dhat <- Dhats %*% as.matrix(ma_dw) +yhat0 <- Y0s %*% as.matrix(ma_yw) +yhat1 <- Y1s %*% as.matrix(ma_yw) + +# Prediction of treatment and outcome for observed instrument +yhat <- yhat0 * (1 - D) + yhat1 * D +# residuals +ytil <- y - yhat +dtil <- D - dhat +# doubly robust quantity for every sample +drhat <- yhat1 - yhat0 + (y - yhat) * (D / dhat - (1 - D) / (1 - dhat)) +coef_est <- mean(drhat) +vari <- var(drhat) +se <- sqrt(vari / nrow(X)) + +sum.ma <- summaryIRM(coef_est, se, ytil, dtil, drhat, name = "Model Average") +tableirm <- rbind(tableirm, sum.ma) +tableirm +``` + +## Double ML package + +We are interested in valid estimators of the average treatment effect of `e401` and `p401` on `net_tfa`. There exist nice packages out there that can help us do our estimation with the simple call of a function. Such packages include `EconML` (Python) and `DoubleML` (Python and R). + +We run through PLR and IRM using `DoubleML` below to illustrate. The `DoubleML` package internally builds on `mlr3`. We use the meta package `mlr3` to generate predictions with machine learning methods. A comprehensive introduction and description of the `mlr3` package is provided in the [mlr3book](https://mlr3book.mlr-org.com/). A list of all learners that you can use in `mlr3` can be found [here](https://mlr3extralearners.mlr-org.com/articles/learners/list_learners.html). The entry in the columns *mlr3 Package* and *Packages* indicate which packages must be installed/loaded in your R session. + +You find additional information about `DoubleML` on the package on the package website https://docs.doubleml.org/ and the R documentation page https://docs.doubleml.org/r/stable/. + +```{r} +install.packages("DoubleML") +install.packages("mlr3learners") +install.packages("mlr3") +install.packages("data.table") +install.packages("randomForest") +install.packages("ranger") + +library(DoubleML) +library(mlr3learners) +library(mlr3) +library(data.table) +library(randomForest) +library(ranger) +``` + +```{r} +# Constructing the data (as DoubleMLData) +formula_flex <- paste("net_tfa ~ e401 + poly(age, 6, raw=TRUE) + poly(inc, 8, raw=TRUE) ", + "+ poly(educ, 4, raw=TRUE) + poly(fsize, 2, raw=TRUE) + marr + twoearn + db + pira + hown") +model_flex <- as.data.table(model.frame(formula_flex, pension)) +x_cols <- colnames(model_flex)[-c(1, 2)] +data_ml <- DoubleMLData$new(model_flex, y_col = "net_tfa", d_cols = "e401", x_cols = x_cols) + +p <- dim(model_flex)[2] - 2 +p +``` + +As mentioned, in the tutorial we use the meta package `mlr3` to generate predictions with machine learning methods. A comprehensive introduction and description of the `mlr3` package is provided in the [mlr3book](https://mlr3book.mlr-org.com/). A list of all learners that you can use in `mlr3` can be found [here](https://mlr3extralearners.mlr-org.com/articles/learners/list_learners.html). The entry in the columns *mlr3 Package* and *Packages* indicate which packages must be installed/loaded in your R session. + +## Partially Linear Regression Models (PLR) + +```{r} +# Estimating the PLR +lgr::get_logger("mlr3")$set_threshold("warn") +lasso <- lrn("regr.cv_glmnet", nfolds = 5, s = "lambda.min") +lasso_class <- lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min") + +dml_plr <- DoubleMLPLR$new(data_ml, ml_l = lasso, ml_m = lasso_class, n_folds = 5) +dml_plr$fit(store_predictions = TRUE) +dml_plr$summary() +lasso_plr <- dml_plr$coef +lasso_std_plr <- dml_plr$se +``` + +Let us check the predictive performance of this model. + +```{r} +dml_plr$params_names() +g_hat <- as.matrix(dml_plr$predictions$ml_l) # predictions of g_o +m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o +``` + +```{r} +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +theta <- as.numeric(dml_plr$coef) # estimated regression coefficient +d <- as.matrix(pension$e401) +predictions_y <- as.matrix(d * theta) + g_hat # predictions for y +lasso_y_rmse <- sqrt(mean((y - predictions_y)^2)) +lasso_y_rmse +``` + +```{r} +# cross-fitted RMSE: treatment +d <- as.matrix(pension$e401) +lasso_d_rmse <- sqrt(mean((d - m_hat)^2)) +lasso_d_rmse + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +Then, we repeat this procedure for various machine learning methods. + +```{r} +# Random Forest +lgr::get_logger("mlr3")$set_threshold("warn") +randomForest <- lrn("regr.ranger") +random_forest_class <- lrn("classif.ranger") + +dml_plr <- DoubleMLPLR$new(data_ml, ml_l = randomForest, ml_m = random_forest_class, n_folds = 5) +dml_plr$fit(store_predictions = TRUE) # set store_predictions=TRUE to evaluate the model +dml_plr$summary() +forest_plr <- dml_plr$coef +forest_std_plr <- dml_plr$se +``` + +We can compare the accuracy of this model to the model that has been estimated with lasso. + +```{r} +# Evaluation predictions +g_hat <- as.matrix(dml_plr$predictions$ml_l) # predictions of g_o +m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o +theta <- as.numeric(dml_plr$coef) # estimated regression coefficient +predictions_y <- as.matrix(d * theta) + g_hat # predictions for y +forest_y_rmse <- sqrt(mean((y - predictions_y)^2)) +forest_y_rmse + +# cross-fitted RMSE: treatment +forest_d_rmse <- sqrt(mean((d - m_hat)^2)) +forest_d_rmse + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +```{r} +# Trees +lgr::get_logger("mlr3")$set_threshold("warn") + +trees <- lrn("regr.rpart") +trees_class <- lrn("classif.rpart") + +dml_plr <- DoubleMLPLR$new(data_ml, ml_l = trees, ml_m = trees_class, n_folds = 5) +dml_plr$fit(store_predictions = TRUE) +dml_plr$summary() +tree_plr <- dml_plr$coef +tree_std_plr <- dml_plr$se + +# Evaluation predictions +g_hat <- as.matrix(dml_plr$predictions$ml_l) # predictions of g_o +m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o +theta <- as.numeric(dml_plr$coef) # estimated regression coefficient +predictions_y <- as.matrix(d * theta) + g_hat # predictions for y +tree_y_rmse <- sqrt(mean((y - predictions_y)^2)) +tree_y_rmse + +# cross-fitted RMSE: treatment +tree_d_rmse <- sqrt(mean((d - m_hat)^2)) +tree_d_rmse + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +```{r} +# needed to run boosting +remotes::install_github("mlr-org/mlr3extralearners") +install.packages("mlr3extralearners") +install.packages("mboost") +library(mlr3extralearners) +library(mboost) +``` + +```{r} +# Boosting +boost <- lrn("regr.glmboost") +boost_class <- lrn("classif.glmboost") + +dml_plr <- DoubleMLPLR$new(data_ml, ml_l = boost, ml_m = boost_class, n_folds = 5) +dml_plr$fit(store_predictions = TRUE) +dml_plr$summary() +boost_plr <- dml_plr$coef +boost_std_plr <- dml_plr$se + +# Evaluation predictions +g_hat <- as.matrix(dml_plr$predictions$ml_l) # predictions of g_o +m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o +theta <- as.numeric(dml_plr$coef) # estimated regression coefficient +predictions_y <- as.matrix(d * theta) + g_hat # predictions for y +boost_y_rmse <- sqrt(mean((y - predictions_y)^2)) +boost_y_rmse + +# cross-fitted RMSE: treatment +boost_d_rmse <- sqrt(mean((d - m_hat)^2)) +boost_d_rmse + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +Let's sum up the results: + +```{r} +table <- matrix(0, 4, 4) +table[1, 1:4] <- c(lasso_plr, forest_plr, tree_plr, boost_plr) +table[2, 1:4] <- c(lasso_std_plr, forest_std_plr, tree_std_plr, boost_std_plr) +table[3, 1:4] <- c(lasso_y_rmse, forest_y_rmse, tree_y_rmse, boost_y_rmse) +table[4, 1:4] <- c(lasso_d_rmse, forest_d_rmse, tree_d_rmse, boost_d_rmse) +rownames(table) <- c("Estimate", "Std.Error", "RMSE Y", "RMSE D") +colnames(table) <- c("Lasso", "Random Forest", "Trees", "Boosting") +tab <- xtable(table, digits = 2) +tab +``` + +The best model with lowest RMSE is the PLR model estimated via lasso (or boosting based on the RSME Y). It gives the following estimate: + +```{r} +lasso_plr +``` + +## Interactive Regression Model (IRM) + +```{r} +lgr::get_logger("mlr3")$set_threshold("warn") +dml_irm <- DoubleMLIRM$new(data_ml, + ml_g = lasso, + ml_m = lasso_class, + trimming_threshold = 0.01, n_folds = 5 +) +dml_irm$fit(store_predictions = TRUE) +dml_irm$summary() +lasso_irm <- dml_irm$coef +lasso_std_irm <- dml_irm$se + + +# predictions +dml_irm$params_names() +g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X) +g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X) +g_hat <- d * g1_hat + (1 - d) * g0_hat # predictions of g_0 +m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o +``` + +```{r} +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$e401) +lasso_y_irm <- sqrt(mean((y - g_hat)^2)) +lasso_y_irm + +# cross-fitted RMSE: treatment +lasso_d_irm <- sqrt(mean((d - m_hat)^2)) +lasso_d_irm + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +```{r} +##### forest ##### + +dml_irm <- DoubleMLIRM$new(data_ml, + ml_g = randomForest, + ml_m = random_forest_class, + trimming_threshold = 0.01, n_folds = 5 +) +dml_irm$fit(store_predictions = TRUE) +dml_irm$summary() +forest_irm <- dml_irm$coef +forest_std_irm <- dml_plr$se + +# predictions +g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X) +g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X) +g_hat <- d * g1_hat + (1 - d) * g0_hat # predictions of g_0 +m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_0 + +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$e401) +forest_y_irm <- sqrt(mean((y - g_hat)^2)) +forest_y_irm + +# cross-fitted RMSE: treatment +forest_d_irm <- sqrt(mean((d - m_hat)^2)) +forest_d_irm + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) + +##### trees ##### + +dml_irm <- DoubleMLIRM$new(data_ml, + ml_g = trees, ml_m = trees_class, + trimming_threshold = 0.01, n_folds = 5 +) +dml_irm$fit(store_predictions = TRUE) +dml_irm$summary() +tree_irm <- dml_irm$coef +tree_std_irm <- dml_irm$se + +# predictions +g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X) +g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X) +g_hat <- d * g1_hat + (1 - d) * g0_hat # predictions of g_0 +m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$e401) +tree_y_irm <- sqrt(mean((y - g_hat)^2)) +tree_y_irm + +# cross-fitted RMSE: treatment +tree_d_irm <- sqrt(mean((d - m_hat)^2)) +tree_d_irm + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) + + +##### boosting ##### + +dml_irm <- DoubleMLIRM$new(data_ml, + ml_g = boost, ml_m = boost_class, + trimming_threshold = 0.01, n_folds = 5 +) +dml_irm$fit(store_predictions = TRUE) +dml_irm$summary() +boost_irm <- dml_irm$coef +boost_std_irm <- dml_irm$se + +# predictions +g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X) +g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X) +g_hat <- d * g1_hat + (1 - d) * g0_hat # predictions of g_0 +m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$e401) +boost_y_irm <- sqrt(mean((y - g_hat)^2)) +boost_y_irm + +# cross-fitted RMSE: treatment +boost_d_irm <- sqrt(mean((d - m_hat)^2)) +boost_d_irm + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +```{r} +table <- matrix(0, 4, 4) +table[1, 1:4] <- c(lasso_irm, forest_irm, tree_irm, boost_irm) +table[2, 1:4] <- c(lasso_std_irm, forest_std_irm, tree_std_irm, boost_std_irm) +table[3, 1:4] <- c(lasso_y_irm, forest_y_irm, tree_y_irm, boost_y_irm) +table[4, 1:4] <- c(lasso_d_irm, forest_d_irm, tree_d_irm, boost_d_irm) +rownames(table) <- c("Estimate", "Std.Error", "RMSE Y", "RMSE D") +colnames(table) <- c("Lasso", "Random Forest", "Trees", "Boosting") +tab <- xtable(table, digits = 2) +tab +``` + +Here, Random Forest gives the best prediction rule for $g_0$ and Lasso the best prediction rule for $m_0$, respectively. Let us fit the IRM model using the best ML method for each equation to get a final estimate for the treatment effect of eligibility. + +```{r} +lgr::get_logger("mlr3")$set_threshold("warn") +dml_irm <- DoubleMLIRM$new(data_ml, + ml_g = randomForest, + ml_m = lasso_class, + trimming_threshold = 0.01, n_folds = 5 +) +dml_irm$fit(store_predictions = TRUE) +dml_irm$summary() +best_irm <- dml_irm$coef +best_std_irm <- dml_irm$se +``` + +These estimates that flexibly account for confounding are +substantially attenuated relative to the baseline estimate (*19559*) that does not account for confounding. They suggest much smaller causal effects of 401(k) eligiblity on financial asset holdings. + diff --git a/PM4/r-identification-analysis-of-401-k-example-w-dags.Rmd b/PM4/r-identification-analysis-of-401-k-example-w-dags.Rmd new file mode 100644 index 00000000..e721fe31 --- /dev/null +++ b/PM4/r-identification-analysis-of-401-k-example-w-dags.Rmd @@ -0,0 +1,198 @@ +--- +title: An R Markdown document converted from "PM4/r-identification-analysis-of-401-k-example-w-dags.irnb" +output: html_document +--- + +# Using Dagitty in the Analysis of Impact of 401(k) on Net Financial Wealth + +```{r} +# install and load package +install.packages("dagitty") +install.packages("ggdag") +``` + +```{r} +library(dagitty) +library(ggdag) +``` + +# Graphs for 401(K) Analsyis + +Here we have + * $Y$ -- net financial assets; + * $X$ -- worker characteristics (income, family size, other retirement plans; see lecture notes for details); + * $F$ -- latent (unobserved) firm characteristics + * $D$ -- 401(K) eligibility, deterimined by $F$ and $X$ + +**One graph (where F determines X):** + +```{r} +# generate a DAGs and plot them + +G1 <- dagitty('dag{ +Y [outcome,pos="4, 0"] +D [exposure,pos="0, 0"] +X [confounder, pos="2,-2"] +F [uobserved, pos="0, -1"] +D -> Y +X -> D +F -> X +F -> D +X -> Y}') + + +ggdag(G1) + theme_dag() +``` + +**List minimal adjustment sets to identify causal effects $D \to Y$** + + +```{r} +adjustmentSets(G1, "D", "Y", effect = "total") +``` + +**What is the underlying principle?** + +Here conditioning on X blocks backdoor paths from Y to D (Pearl). Dagitty correctly finds X (and does many more correct decisions when we consider more elaborate structures. Why do we want to consider more elaborate structures? The very empirical problem requires us to do so!) + +**Another Graph (wherere $X$ determines $F$):** + +```{r} +# generate a couple of DAGs and plot them + +G2 <- dagitty('dag{ +Y [outcome,pos="4, 0"] +D [exposure,pos="0, 0"] +X [confounder, pos="2,-2"] +F [uobserved, pos="0, -1"] +D -> Y +X -> D +X -> F +F -> D +X -> Y}') + + +ggdag(G2) + theme_dag() +``` + +```{r} +adjustmentSets(G2, "D", "Y", effect = "total") +``` + +**One more graph (encompassing the previous ones), where (F, X) are jointly determined by latent factors $A$.** + +We can allow in fact the whole triple (D, F, X) to be jointly determined by latent factors $A$. + +This is much more realistic graph to consider. + +```{r} +G3 <- dagitty('dag{ +Y [outcome,pos="4, 0"] +D [exposure,pos="0, 0"] +X [confounder, pos="2,-2"] +F [unobserved, pos="0, -1"] +A [unobserved, pos="-1, -1"] +D -> Y +X -> D +F -> D +A -> F +A -> X +A -> D +X -> Y}') + +adjustmentSets(G3, "D", "Y", effect = "total") + +ggdag(G3) + theme_dag() +``` + +# Threat to Identification: + +What if $F$ also directly affects $Y$? (Note that there are no valid adjustment sets in this case.) + +```{r} +G4 <- dagitty('dag{ +Y [outcome,pos="4, 0"] +D [exposure,pos="0, 0"] +X [confounder, pos="2,-2"] +F [unobserved, pos="0, -1"] +A [unobserved, pos="-1, -1"] +D -> Y +X -> D +F -> D +A -> F +A -> X +A -> D +F -> Y +X -> Y}') + + +ggdag(G4) + theme_dag() +``` + +```{r} +adjustmentSets(G4, "D", "Y", effect = "total") +``` + +**Note that no output means that there is no valid adustment set (among observed variables).** + +**How can F affect Y directly? Is it reasonable?** + +Introduce Match Amount $M$. The match amount is a potential important mediator (why mediator?). $M$ is not observed. Luckily, adjusting for $X$ still works if there is no arrow $F \to M$. + +```{r} +G5 <- dagitty('dag{ +Y [outcome,pos="4, 0"] +D [exposure,pos="0, 0"] +X [confounder, pos="2,-2"] +F [unobserved, pos="0, -1"] +A [unobserved, pos="-1, -1"] +M [unobserved, pos="2, -.5"] +D -> Y +X -> D +F -> D +A -> F +A -> X +A -> D +D -> M +M -> Y +X -> M +X -> Y}') + +print(adjustmentSets(G5, "D", "Y", effect = "total")) + +ggdag(G5) + theme_dag() +``` + +If there is an $F \to M$ arrow, then adjusting for $X$ is not sufficient. + +```{r} +G6 <- dagitty('dag{ +Y [outcome,pos="4, 0"] +D [exposure,pos="0, 0"] +X [confounder, pos="2,-2"] +F [unobserved, pos="0, -1"] +A [unobserved, pos="-1, -1"] +M [uobserved, pos="2, -.5"] +D -> Y +X -> D +F -> D +A -> F +A -> X +D -> M +F -> M +A -> D +M -> Y +X -> M +X -> Y}') + +print(adjustmentSets(G6, "D", "Y"), effect = "total") + +ggdag(G6) + theme_dag() +``` + +Again, note that no output was returned for the adjustment set. There is no valid adjustment set here. + + # Question: + +Given the analysis above, do you find the adjustment for workers' characteristics a credible strategy to identify the causal (total effect) of 401 (k) elligibility on net financial wealth? + diff --git a/PM4/r_debiased_ml_for_partially_linear_model_growth.Rmd b/PM4/r_debiased_ml_for_partially_linear_model_growth.Rmd new file mode 100644 index 00000000..d35fbafb --- /dev/null +++ b/PM4/r_debiased_ml_for_partially_linear_model_growth.Rmd @@ -0,0 +1,204 @@ +--- +title: An R Markdown document converted from "PM4/r_debiased_ml_for_partially_linear_model_growth.irnb" +output: html_document +--- + +# Double/Debiased Machine Learning for the Partially Linear Regression Model + +This is a simple implementation of Debiased Machine Learning for the Partially Linear Regression Model, which provides an application of DML inference to determine the causal effect of countries' intitial wealth on the rate of economic growth. + + +Reference: + +- https://arxiv.org/abs/1608.00060 +- https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778 + +The code is based on the book. + +```{r} +install.packages("xtable") +install.packages("hdm") +install.packages("randomForest") +install.packages("glmnet") +install.packages("sandwich") +``` + +```{r} +library(xtable) +library(randomForest) +library(hdm) +library(glmnet) +library(sandwich) + +set.seed(1) +``` + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/GrowthData.csv" +data <- read.csv(file) +data <- subset(data, select = -1) # get rid of index column +head(data) +dim(data) +``` + +```{r} +y <- as.matrix(data[, 1]) # outcome: growth rate +d <- as.matrix(data[, 3]) # treatment: initial wealth +x <- as.matrix(data[, -c(1, 2, 3)]) # controls: country characteristics + +# some summary statistics +cat(sprintf("\nThe length of y is %g \n", length(y))) +cat(sprintf("\nThe number of features in x is %g \n", dim(x)[2])) + +lres <- summary(lm(y ~ d + x))$coef[2, 1:2] +cat(sprintf("\nNaive OLS that uses all features w/o cross-fitting Y ~ D+X yields: \ncoef (se) = %g (%g)\n", + lres[1], lres[2])) +``` + +# DML algorithm + +Here we perform estimation and inference of predictive coefficient $\alpha$ in the partially linear statistical model, +$$ +Y = D\alpha + g(X) + U, \quad E (U | D, X) = 0. +$$ +For $\tilde Y = Y- E(Y|X)$ and $\tilde D= D- E(D|X)$, we can write +$$ +\tilde Y = \alpha \tilde D + U, \quad E (U |\tilde D) =0. +$$ +Parameter $\alpha$ is then estimated using cross-fitting approach to obtain the residuals $\tilde D$ and $\tilde Y$. +The algorithm comsumes $Y, D, X$, and machine learning methods for learning the residuals $\tilde Y$ and $\tilde D$, where +the residuals are obtained by cross-validation (cross-fitting). + +The statistical parameter $\alpha$ has a causal interpretation of being the effect of $D$ on $Y$ in the causal DAG $$ D\to Y, \quad X\to (D,Y)$$ or the counterfactual outcome model with conditionally exogenous (conditionally random) assignment of treatment $D$ given $X$: +$$ +Y(d) = d\alpha + g(X) + U(d),\quad U(d) \text{ indep } D |X, \quad Y = Y(D), \quad U = U(D). +$$ + +```{r} +dml2_for_plm <- function(x, d, y, dreg, yreg, nfold = 2) { + nobs <- nrow(x) # number of observations + foldid <- rep.int(1:nfold, times = ceiling(nobs / nfold))[sample.int(nobs)] # define folds indices + I <- split(1:nobs, foldid) # split observation indices into folds + ytil <- dtil <- rep(NA, nobs) + cat("fold: ") + for (b in seq_along(I)) { + dfit <- dreg(x[-I[[b]], ], d[-I[[b]]]) # take a fold out + yfit <- yreg(x[-I[[b]], ], y[-I[[b]]]) # take a foldt out + dhat <- predict(dfit, x[I[[b]], ], type = "response") # predict the left-out fold + yhat <- predict(yfit, x[I[[b]], ], type = "response") # predict the left-out fold + dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual for the left-out fold + ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residial for the left-out fold + cat(b, " ") + } + rfit <- lm(ytil ~ dtil) # estimate the main parameter by regressing one residual on the other + coef.est <- coef(rfit)[2] # extract coefficient + se <- sqrt(vcovHC(rfit)[2, 2]) # record robust standard error + cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est, se)) # printing output + return(list(coef.est = coef.est, se = se, dtil = dtil, ytil = ytil)) # save output and residuals +} +``` + +We now run through DML using as first stage models: + 1. OLS + 2. (Rigorous) Lasso + 3. Random Forests + 4. Mix of Random Forest and Lasso + +```{r} +# DML with OLS +cat(sprintf("\nDML with OLS w/o feature selection \n")) +dreg <- function(x, d) { + glmnet(x, d, lambda = 0) +} # ML method= OLS using glmnet; using lm gives bugs +yreg <- function(x, y) { + glmnet(x, y, lambda = 0) +} # ML method = OLS +dml2_ols <- dml2_for_plm(x, d, y, dreg, yreg, nfold = 10) + + +# DML with Lasso: +cat(sprintf("\nDML with Lasso \n")) +dreg <- function(x, d) { + rlasso(x, d, post = FALSE) +} # ML method= lasso from hdm +yreg <- function(x, y) { + rlasso(x, y, post = FALSE) +} # ML method = lasso from hdm +dml2_lasso <- dml2_for_plm(x, d, y, dreg, yreg, nfold = 10) + + +# DML with Random Forest: +cat(sprintf("\nDML with Random Forest \n")) +dreg <- function(x, d) { + randomForest(x, d) +} # ML method=Forest +yreg <- function(x, y) { + randomForest(x, y) +} # ML method=Forest +dml2_rf <- dml2_for_plm(x, d, y, dreg, yreg, nfold = 10) + +# DML MIX: +cat(sprintf("\nDML with Lasso for D and Random Forest for Y \n")) +dreg <- function(x, d) { + rlasso(x, d, post = FALSE) +} # ML method=Forest +yreg <- function(x, y) { + randomForest(x, y) +} # ML method=Forest +dml2_mix <- dml2_for_plm(x, d, y, dreg, yreg, nfold = 10) +``` + +Now we examine the RMSE of D and Y to see which method performs well in the first-stage. We print all results below in the following table: + +```{r} +pr_res_d <- c(mean((dml2_ols$dtil)^2), mean((dml2_lasso$dtil)^2), mean((dml2_rf$dtil)^2), mean((dml2_mix$dtil)^2)) +pr_res_y <- c(mean((dml2_ols$ytil)^2), mean((dml2_lasso$ytil)^2), mean((dml2_rf$ytil)^2), mean((dml2_mix$ytil)^2)) +pr_res <- rbind(sqrt(pr_res_d), sqrt(pr_res_y)) +rownames(pr_res) <- c("RMSE D", "RMSE Y") +colnames(pr_res) <- c("OLS", "Lasso", "RF", "Mix") +``` + +```{r} +table <- matrix(0, 4, 4) + +# Point Estimate +table[1, 1] <- as.numeric(dml2_ols$coef.est) +table[2, 1] <- as.numeric(dml2_lasso$coef.est) +table[3, 1] <- as.numeric(dml2_rf$coef.est) +table[4, 1] <- as.numeric(dml2_mix$coef.est) + +# SE +table[1, 2] <- as.numeric(dml2_ols$se) +table[2, 2] <- as.numeric(dml2_lasso$se) +table[3, 2] <- as.numeric(dml2_rf$se) +table[4, 2] <- as.numeric(dml2_mix$se) + +# RMSE Y +table[1, 3] <- as.numeric(pr_res[2, 1]) +table[2, 3] <- as.numeric(pr_res[2, 2]) +table[3, 3] <- as.numeric(pr_res[2, 3]) +table[4, 3] <- as.numeric(pr_res[2, 4]) + +# RMSE D +table[1, 4] <- as.numeric(pr_res[1, 1]) +table[2, 4] <- as.numeric(pr_res[1, 2]) +table[3, 4] <- as.numeric(pr_res[1, 3]) +table[4, 4] <- as.numeric(pr_res[1, 4]) + + + +# print results +colnames(table) <- c("Estimate", "Standard Error", "RMSE Y", "RMSE D") +rownames(table) <- c("OLS", "Lasso", "RF", "RF/Lasso Mix") +table +``` + +```{r} +print(table, digit = 3) +``` + +```{r} +tab <- xtable(table, digits = 3) +print(tab, type = "latex") +``` + diff --git a/PM4/r_dml_inference_for_gun_ownership.Rmd b/PM4/r_dml_inference_for_gun_ownership.Rmd new file mode 100644 index 00000000..512479d3 --- /dev/null +++ b/PM4/r_dml_inference_for_gun_ownership.Rmd @@ -0,0 +1,573 @@ +--- +title: An R Markdown document converted from "PM4/r_dml_inference_for_gun_ownership.irnb" +output: html_document +--- + +# A Case Study: The Effect of Gun Ownership on Gun-Homicide Rates + +We consider the problem of estimating the effect of gun ownership on the homicide rate. For this purpose, we perform inference on $\beta$ in the following the partially linear model: +$$ +Y_{j, t}=\beta D_{j,(t-1)}+g\left(X_{j, t}, \bar{X}_j, \bar{X}_t, X_{j, 0}, Y_{j, 0}, t\right)+\epsilon_{j, t} +$$ +$Y_{j, t}$ is the log homicide rate in county $j$ at time $t. D_{j, t-1}$ is the log fraction of suicides committed with a firearm in county $j$ at time $t-1$, which we use as a proxy for gun ownership $G_{j, t}$, which is not observed. $X_{j, t}$ is a set of demographic and economic characteristics of county $j$ at time $t$. We use $\bar{X}_j$ to denote the within county average of $X_{j, t}$ and $\bar{X}_t$ to denote the within time period average of $X_{j, t} . X_{j, 0}$ and $Y_{j, 0}$ denote initial conditions in county $j$. We use $Z_{j, t}$ to denote the set of observed control variables $\left\{X_{j, t}, \bar{X}_j, \bar{X}_t, X_{j, 0}, Y_{j, 0}, t\right\}$, so that our model is + +$$ + Y_{i,t} = \beta D_{i,(t-1)} + g(Z_{i,t}) + \epsilon_{i,t}. +$$ + +## Data + +$Y_{j,t}$ is the log homicide rate in county $j$ at time $t$, $D_{j, t-1}$ is the log fraction of suicides committed with a firearm in county $j$ at time $t-1$, which we use as a proxy for gun ownership, and $Z_{j,t}$ is a set of demographic and economic characteristics of county $j$ at time $t$. Assuming the firearm suicide rate is a good proxy for gun ownership, the parameter $\beta$ is the effect of gun ownership on homicide rates, controlling for county-level demographic and economic characteristics. + +The sample covers 195 large United States counties between the years 1980 through 1999, giving us 3900 observations. + +```{r} +install.packages("glmnet") +install.packages("randomForest") +install.packages("xgboost") +install.packages("keras") +install.packages("tensorflow") +install.packages("xtable") +install.packages("dplyr") +install.packages("sandwich") +``` + +```{r} +library(glmnet) +library(randomForest) +library(xgboost) +library(keras) +library(tensorflow) +library(xtable) +library(dplyr) +library(sandwich) +``` + +```{r} +file <- "https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/gun_clean.csv" +data <- read.csv(file) +dim(data) +``` + +## Preprocessing + +To attempt to flexibly account for fixed heterogeneity across counties, common time factors, and deterministic time trends, we include county-level averages, time period averages, initial conditions, and the time index as additional control variables. This strategy is related to strategies for addressing latent sources of heterogeneity via conditioning. + +We first reweight time and county variables as the data are population weighted. + +```{r} +# Note: These data are population weighted. Specifically, +# looking at the JBES replication files, they seem to be multiplied +# by sqrt((1/T sum_t population_{j,t})/100000). To get the +# unweighted variables need to divide by this number - which we can +# get from the time effects. We are mostly just going to use the weighted +# variables as inputs - except for time and county. We'll take +# cross-sectional and time series means of these weighted variables +# as well. Note that there is nothing wrong with this, but it does not +# reproduce a weighted regression in a setting where covariates may +# enter nonlinearly and flexibly. + +## County FE +county_vars <- select(data, starts_with("X_J")) + +## Time variables and population weights +# Pull out time variables +time_vars <- select(data, starts_with("X_T")) + +# Use these to construct population weights +pop_weights <- rowSums(time_vars) + +# Unweighted time variables +time_vars <- time_vars / pop_weights + +# For any columns with only zero (like the first one), just drop +time_vars <- time_vars[, colSums(time_vars != 0) > 0] + +# Create time index +time_ind <- rowSums(time_vars * (seq(1:20))) +``` + +Now we create initial conditions, county-level averages, and time period averages. + +```{r} +###### Create new data frame with variables we'll use + +# Function to find variable names +var_list <- function(df = NULL, type = c("numeric", "factor", "character"), pattern = "", exclude = NULL) { + vars <- character(0) + if (any(type %in% "numeric")) { + vars <- c(vars, names(df)[sapply(df, is.numeric)]) + } + if (any(type %in% "factor")) { + vars <- c(vars, names(df)[sapply(df, is.factor)]) + } + if (any(type %in% "character")) { + vars <- c(vars, names(df)[sapply(df, is.character)]) + } + vars[(!vars %in% exclude) & grepl(vars, pattern = pattern)] +} + +# census control variables +census <- NULL +census_var <- c("^AGE", "^BN", "^BP", "^BZ", "^ED", "^EL", "^HI", "^HS", "^INC", "^LF", "^LN", + "^PI", "^PO", "^PP", "^PV", "^SPR", "^VS") + +for (i in seq_along(census_var)) { + census <- append(census, var_list(data, pattern = census_var[i])) +} + +# other control variables +X1 <- c("logrobr", "logburg", "burg_missing", "robrate_missing") +X2 <- c("newblack", "newfhh", "newmove", "newdens", "newmal") + +# "treatment" variable +d <- "logfssl" + +# outcome variable +y <- "logghomr" + +# new data frame for time index +usedata <- as.data.frame(time_ind) +colnames(usedata) <- "time_ind" +usedata[, "weights"] <- pop_weights + +var_list <- c(y, d, X1, X2, census) +for (i in seq_along(var_list)) { + usedata[, var_list[i]] <- data[, var_list[i]] +} + +####################### Construct county specific means, +# time specific means, initial conditions + +# Initial conditions +var_list0 <- c(y, X1, X2, census) +for (i in seq_along(var_list0)) { + usedata[, paste(var_list0[i], "0", sep = "")] <- kronecker( + usedata[time_ind == 1, var_list0[i]], + rep(1, 20) + ) +} + +# County means +var_list_j <- c(X1, X2, census) +county_vars <- as.matrix(county_vars) +for (i in seq_along(var_list_j)) { + usedata[, paste(var_list_j[i], "J", sep = "")] <- + county_vars %*% qr.solve(county_vars, as.matrix(usedata[, var_list_j[i]])) +} + +# Time means +time_vars <- as.matrix(time_vars) +for (i in seq_along(var_list_j)) { + usedata[, paste(var_list_j[i], "T", sep = "")] <- + time_vars %*% qr.solve(time_vars, as.matrix(usedata[, var_list_j[i]])) +} +``` + +# Estimation + + +## Baseline OLS Estimates + +After preprocessing the data, as a baseline model, we first look at simple regression of $Y_{j,t}$ on $D_{j,t-1}$ without controls in the full data set. + +```{r} +# Simple regression +lm0 <- lm(logghomr ~ logfssl, data = usedata) +vc0 <- vcovHC(lm0) +cat("Baseline OLS:", lm0$coefficients[2], " (", sqrt(vc0[2, 2]), ")\n") +# Confidence Interval with HC3 covariance +tt <- qt(c(0.025, 0.975), summary(lm0)$df[2]) +se <- sqrt(diag(vc0)) +ci <- coef(lm0) + se %o% tt +cat("2.5%: ", ci[2, 1], "97.5%: ", ci[2, 2]) +``` + +The point estimate is $0.302$ with the confidence interval ranging from 0.277 to 0.327. This +suggests that increases in gun ownership rates are related to gun homicide rates - if gun ownership increases by 1% then the predicted gun homicide rate goes up by 0.3%, without controlling for counties' characteristics. + + +Next we estimate with the baseline set of controls. + +```{r} +# Regression on baseline controls +var_list <- c(d, X1, X2, census) +lmC <- lm(paste("logghomr ~", paste(var_list, collapse = "+")), data = usedata) +vcC <- vcovHC(lmC) +cat("OLS with Controls:", lmC$coefficients["logfssl"], " (", sqrt(vcC["logfssl", "logfssl"]), ")\n") +``` + + + +We can also run our regression with time and space averages as controls. + +```{r} +# Regression on time and cross sectional averages +var_list_x <- c(X1, X2, census) +var_list_means <- c(d, X1, X2, census) +for (i in seq_along(var_list_x)) { + var_list_means <- c(var_list_means, paste(var_list_x[i], "J", sep = "")) +} +for (i in seq_along(var_list_x)) { + var_list_means <- c(var_list_means, paste(var_list_x[i], "T", sep = "")) +} +lmM <- lm(paste("logghomr ~", paste(var_list_means, collapse = "+")), data = usedata) +vcM <- vcovHC(lmM) +cat("OLS with Averages:", lmM$coefficients["logfssl"], " (", sqrt(vcM["logfssl", "logfssl"]), ")\n") +``` + +Since our goal is to estimate the effect of gun ownership after controlling for a rich set county characteristics, we now include all controls. + +```{r} +# Regression on all controls +lmA <- lm(logghomr ~ ., data = usedata) +vcA <- vcovHC(lmA) +cat("OLS All:", lmA$coefficients["logfssl"], " (", sqrt(vcA["logfssl", "logfssl"]), ")\n") +``` + +After controlling for a rich set of characteristics, the point estimate of gun ownership attenuates to 0.179. + +***NB***: In the background, `lm()` is dropping variables based on collinearity diagnostics. These depend on system linear algebra routines and can lead to large differences in high-dimensional or other ill-conditioned settings when using otherwise identical code across languages and/or machines. + +Now we turn to our double machine learning framework, employing linear and flexible estimation methods with cross-fitting. + +## DML Estimates + +We perform inference on $\beta$ in the following the partially linear model: + $$ +Y_{j, t}=\beta D_{j,(t-1)}+g(Z_{j,t})+\epsilon_{j, t}. +$$ +In the first stage, using cross-fitting, we employ modern regression methods to build estimators $\hat \ell(Z_{j,t})$ and $\hat m(Z_{j,t})$, where +- $\ell(Z_{j,t}):=E(Y_{j,t}|Z_{j,t})$ +- $m(Z_{j,t}):=E(D_{j,t}|Z_{j,t})$ + +Using these, we obtain the estimates of the residualized quantities +- $\tilde Y_{j,t} = Y_{j,t}- E(Y_{j,t}|Z_{j,t})$ +- $\tilde D_{j,t}= D_{j,t}- E(D_{j,t}|Z_{j,t})$ + +Using these residualized quantities, we note our model can be written as +$$ +\tilde Y_{j,t} = \beta \tilde D_{j,t} + \epsilon_{j,t}, \quad E (\epsilon_{j,t} |\tilde D_{j,t}) =0. +$$ +In the final stage, using ordinary least squares of $\tilde Y_{j,t}$ on $\tilde D_{j,t}$, we obtain the +estimate of $\beta$. + +In the following, we consider 10 different methods for the first-stage models for $\ell(\cdot)$ and $m(\cdot)$ covering linear, penalized linear, and flexible methods. We also report the first-stage RMSE scores for estimating $Y$ and $D$. + +```{r} +# NB: this cell takes > 3 hours to runon colab. To reduce computation time, +# reduce the number of cross-fitting folds. Note this may affect stability +# of estimates. + +set.seed(123) + +# Cross-fitting +n <- nrow(usedata) +Kf <- 5 # Number of cross-fitting folds +sampleframe <- rep(1:Kf, ceiling(n / Kf)) +cvgroup <- sample(sampleframe, size = n, replace = FALSE) # Cross-fitting groups + +# Initialize variables for cross-fit predictions +yhat_r <- matrix(NA, n, 10) # Going to consider 10 learners +dhat_r <- matrix(NA, n, 10) + +# Cross-fitting loop +for (k in 1:Kf) { + cat("fold: ", k, "\n") + indk <- cvgroup == k + + ktrain <- usedata[!indk, ] + ktest <- usedata[indk, ] + + #### Simple regression models #### + + # Simple regression + yhat_r[indk, 1] <- ktest$logghomr - mean(ktrain$logghomr) + dhat_r[indk, 1] <- ktest$logfssl - mean(ktrain$logfssl) + + # Baseline controls + var_list <- c(X1, X2, census) + lmyk_c <- lm(paste("logghomr ~", paste(var_list, collapse = "+")), data = ktrain) + yhat_r[indk, 2] <- ktest$logghomr - predict(lmyk_c, ktest) + lmdk_c <- lm(paste("logfssl ~", paste(var_list, collapse = "+")), data = ktrain) + dhat_r[indk, 2] <- ktest$logfssl - predict(lmdk_c, ktest) + + # All controls + lmyk_a <- lm(logghomr ~ . - logfssl, data = ktrain) + yhat_r[indk, 3] <- ktest$logghomr - predict(lmyk_a, ktest) + lmdk_a <- lm(logfssl ~ . - logghomr, data = ktrain) + dhat_r[indk, 3] <- ktest$logfssl - predict(lmdk_a, ktest) + + #### Penalized Linear Models #### + + # Lasso - default CV tuning + ytrain <- as.matrix(usedata[!indk, "logghomr"]) + dtrain <- as.matrix(usedata[!indk, "logfssl"]) + xtrain <- as.matrix(usedata[!indk, !names(usedata) %in% + c("logghomr", "logfssl")]) + ytest <- as.matrix(usedata[indk, "logghomr"]) + dtest <- as.matrix(usedata[indk, "logfssl"]) + xtest <- as.matrix(usedata[indk, !names(usedata) %in% + c("logghomr", "logfssl")]) + + lassoyk <- cv.glmnet(xtrain, ytrain) + yhat_r[indk, 4] <- ytest - predict(lassoyk, newx = xtest, s = "lambda.min") + + lassodk <- cv.glmnet(xtrain, dtrain) + dhat_r[indk, 4] <- dtest - predict(lassodk, newx = xtest, s = "lambda.min") + + # Ridge + ridgeyk <- cv.glmnet(xtrain, ytrain, alpha = 0) + yhat_r[indk, 5] <- ytest - predict(ridgeyk, newx = xtest, s = "lambda.min") + + ridgedk <- cv.glmnet(xtrain, dtrain, alpha = 0) + dhat_r[indk, 5] <- dtest - predict(ridgedk, newx = xtest, s = "lambda.min") + + # EN, .5 - no cv over alpha + enyk <- cv.glmnet(xtrain, ytrain, alpha = .5) + yhat_r[indk, 6] <- ytest - predict(enyk, newx = xtest, s = "lambda.min") + + endk <- cv.glmnet(xtrain, dtrain, alpha = .5) + dhat_r[indk, 6] <- dtest - predict(endk, newx = xtest, s = "lambda.min") + + #### Flexible regression models #### + + # Random forest + rfyk <- randomForest(logghomr ~ . - logfssl, data = ktrain) + yhat_r[indk, 7] <- ktest$logghomr - predict(rfyk, ktest) + rfdk <- randomForest(logfssl ~ . - logghomr, data = ktrain) + dhat_r[indk, 7] <- ktest$logfssl - predict(rfdk, ktest) + + # Boosted tree - depth 4 + xgb_train_y <- xgb.DMatrix( + data = as.matrix(usedata[!indk, !names(usedata) %in% + c("logghomr", "logfssl")]), + label = as.matrix(usedata[!indk, "logghomr"]) + ) + xgb_test_y <- xgb.DMatrix( + data = as.matrix(usedata[indk, !names(usedata) %in% + c("logghomr", "logfssl")]), + label = as.matrix(usedata[indk, "logghomr"]) + ) + xgb_train_d <- xgb.DMatrix( + data = as.matrix(usedata[!indk, !names(usedata) %in% + c("logghomr", "logfssl")]), + label = as.matrix(usedata[!indk, "logfssl"]) + ) + xgb_test_d <- xgb.DMatrix( + data = as.matrix(usedata[indk, !names(usedata) %in% + c("logghomr", "logfssl")]), + label = as.matrix(usedata[indk, "logfssl"]) + ) + + byk <- xgb.cv( + data = xgb_train_y, + nrounds = 1000, verbose = 0, eta = .1, max_depth = 4, nfold = 5 + ) + best_iter <- which.min(as.matrix(byk$evaluation_log[, 4])) + byk <- xgboost( + data = xgb_train_y, + nrounds = 1000, verbose = 0, eta = .1, max_depth = 4 + ) + yhat_r[indk, 8] <- ktest$logghomr - predict(byk, + newdata = xgb_test_y, + iterationrange = c(1, (best_iter + 1)) + ) + + bdk <- xgb.cv( + data = xgb_train_d, + nrounds = 1000, verbose = 0, eta = .1, max_depth = 4, nfold = 5 + ) + best_iter <- which.min(as.matrix(bdk$evaluation_log[, 4])) + bdk <- xgboost( + data = xgb_train_d, + nrounds = 1000, verbose = 0, eta = .1, max_depth = 4 + ) + dhat_r[indk, 8] <- ktest$logfssl - predict(bdk, + newdata = xgb_test_d, + iterationrange = c(1, (best_iter + 1)) + ) + + #### Neural Networks #### + + # normalize the covariate data + mean <- apply(xtrain, 2, mean) + std <- apply(xtrain, 2, sd) + xtrainNN <- scale(xtrain, center = mean, scale = std) + xtestNN <- scale(xtest, center = mean, scale = std) + + xtestNN <- xtestNN[, which(!is.nan(colMeans(xtrainNN)))] + xtrainNN <- xtrainNN[, which(!is.nan(colMeans(xtrainNN)))] + + # DNN 50/50/50/50, .5 dropout + NNmodely <- keras_model_sequential() + NNmodely %>% + layer_dense(units = 50, activation = "relu", input_shape = c(ncol(xtrainNN))) %>% + layer_dropout(rate = .5) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dropout(rate = .5) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dropout(rate = .5) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dropout(rate = .5) %>% + layer_dense(units = 1) + + NNmodely %>% compile( + loss = "mse", + optimizer = optimizer_rmsprop() + ) + + fit_nn_model_y <- NNmodely %>% fit( + xtrainNN, ytrain, + epochs = 200, batch_size = 200, + validation_split = .2, verbose = 0 + ) + yhat_r[indk, 9] <- ktest$logghomr - predict(NNmodely, xtestNN) + + NNmodeld <- keras_model_sequential() + NNmodeld %>% + layer_dense(units = 50, activation = "relu", input_shape = c(ncol(xtrainNN))) %>% + layer_dropout(rate = .5) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dropout(rate = .5) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dropout(rate = .5) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dropout(rate = .5) %>% + layer_dense(units = 1) + + NNmodeld %>% compile( + loss = "mse", + optimizer = optimizer_rmsprop() + ) + + fit_nn_model_d <- NNmodeld %>% fit( + xtrainNN, dtrain, + epochs = 200, batch_size = 200, + validation_split = .2, verbose = 0 + ) + dhat_r[indk, 9] <- ktest$logfssl - predict(NNmodeld, xtestNN) + + # DNN 50/50/50/50, early stopping + NNmodely <- keras_model_sequential() + NNmodely %>% + layer_dense(units = 50, activation = "relu", input_shape = c(ncol(xtrainNN))) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dense(units = 1) + + NNmodely %>% compile( + loss = "mse", + optimizer = optimizer_rmsprop() + ) + + early_stop <- callback_early_stopping( + monitor = "val_loss", patience = 25, + restore_best_weights = TRUE + ) + + fit_nn_model_y <- NNmodely %>% fit( + xtrainNN, ytrain, + epochs = 200, batch_size = 200, + validation_split = .2, verbose = 0, + callbacks = list(early_stop) + ) + yhat_r[indk, 10] <- ktest$logghomr - predict(NNmodely, xtestNN) + + NNmodeld <- keras_model_sequential() + NNmodeld %>% + layer_dense(units = 50, activation = "relu", input_shape = c(ncol(xtrainNN))) %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dense(units = 50, activation = "relu") %>% + layer_dense(units = 1) + + NNmodeld %>% compile( + loss = "mse", + optimizer = optimizer_rmsprop() + ) + + early_stop <- callback_early_stopping( + monitor = "val_loss", patience = 25, + restore_best_weights = TRUE + ) + + fit_nn_model_d <- NNmodeld %>% fit( + xtrainNN, dtrain, + epochs = 200, batch_size = 200, + validation_split = .2, verbose = 0, + callbacks = list(early_stop) + ) + dhat_r[indk, 10] <- ktest$logfssl - predict(NNmodeld, xtestNN) +} + +################################################################################ +# Predictions done, now DML + +rmse_y <- sqrt(colMeans(yhat_r^2)) +rmse_d <- sqrt(colMeans(dhat_r^2)) + +# dml coefficient estimates +b_dml <- rep(NA, 10) +s_dml <- rep(NA, 10) +for (k in 1:10) { + lm_k <- lm(yhat_r[, k] ~ dhat_r[, k] - 1) + v_k <- vcovHC(lm_k) + b_dml[k] <- lm_k$coefficients + s_dml[k] <- sqrt(v_k) +} + +# "best" coefficient estimate +lm_k <- lm(yhat_r[, which.min(rmse_y)] ~ dhat_r[, which.min(rmse_d)] - 1) +v_k <- vcovHC(lm_k) +b_dml[11] <- lm_k$coefficients +s_dml[11] <- sqrt(v_k) + +# ls model average +yhat <- usedata$logghomr - yhat_r +dhat <- usedata$logfssl - dhat_r + +ma_y <- lm(usedata$logghomr ~ yhat - 1) +ma_d <- lm(usedata$logfssl ~ dhat - 1) +weights_y <- ma_y$coefficients +weights_d <- ma_d$coefficients +lm_k <- lm(ma_y$residuals ~ ma_d$residuals - 1) +v_k <- vcovHC(lm_k) +b_dml[12] <- lm_k$coefficients +s_dml[12] <- sqrt(v_k) + +## Display results +table1 <- matrix(0, 10, 2) +table1[, 1] <- rmse_y +table1[, 2] <- rmse_d +colnames(table1) <- c("RMSE Y", "RMSE D") +rownames(table1) <- c( + "OLS - No Controls", "OLS - Basic", "OLS - All", + "Lasso (CV)", "Ridge (CV)", "Elastic Net (.5,CV)", + "Random Forest", "Boosted trees - depth 4", + "DNN - 50/50/50/50, dropout", "DNN - 50/50/50/50, early stopping" +) +tab1 <- xtable(table1, digits = c(0, 4, 4)) +tab1 + +table2 <- matrix(0, 12, 2) +table2[, 1] <- b_dml +table2[, 2] <- s_dml +colnames(table2) <- c("Point Estimate", "Std. Error") +rownames(table2) <- c( + "OLS - No Controls", "OLS - Basic", "OLS - All", + "Lasso (CV)", "Ridge (CV)", "Elastic Net (.5,CV)", + "Random Forest", "Boosted trees - depth 4", + "DNN - 50/50/50/50, dropout", "DNN - 50/50/50/50, early stopping", + "Best", "Least Squares Model Average" +) +tab2 <- xtable(table2, digits = c(0, 4, 4)) +tab2 +``` + +```{r} +print(xtable(table1, type = "latex")) +print(xtable(table2, type = "latex")) +``` + diff --git a/PM5/Autoencoders.Rmd b/PM5/Autoencoders.Rmd new file mode 100644 index 00000000..95a288c4 --- /dev/null +++ b/PM5/Autoencoders.Rmd @@ -0,0 +1,262 @@ +--- +title: An R Markdown document converted from "PM5/Autoencoders.irnb" +output: html_document +--- + +# Autoencoders + +In this notebook, we'll introduce and explore "autoencoders," which are a very successful family of models in modern deep learning. In particular we will: + + +1. Illustrate the connection between autoencoders and classical *Principal Component Analysis (PCA)* +3. Train a non-linear auto-encoder that uses a deep neural network + +### Overview +As explained in the text, autoencoders are a way of discovering *latent, low-dimensional structure* in a dataset. In particular, a random data vector $X \in \mathbb{R}^d$ can be said to have low-dimensional structure if we can find some functions $f: \mathbb{R}^d \to \mathbb{R}^k$ and $g: \mathbb{R}^k \to \mathbb{R}^d$, with $k \ll d$, such that +$$g(f(X)) \approx X.$$ + +In other words, $f(X)$ is a parsimonious, $k$-dimensional representation of $X$ that contains all of the information necessary to approximately reconstruct the full vector $X$. Traditionally, $f(X)$ is called an *encoding* of $X$. + +It turns out that this is meaningless unless we restrict what kinds of functions $f$ and $g$ are allowed to be, because it's possible to write down some (completely ugly) one-to-one function $\mathbb{R}^d \to \mathbb{R}^1$ for any $d$. This gives rise to the notion of *autoencoders* where, given some sets of reasonable functions $F$ and $G$, we aim to minimize +$$\mathbb{E}[\mathrm{loss}(X, f(g(X))]$$ +over functions $f \in F$ and $g \in G$. As usual, this is done by minimizing the sample analog. + + + +## Linear Autoencoders and PCA: Practice + +It turns out that linear autoencoders are the same as PCA. Let's do a small sanity check to verify this. In particular, let's perform PCA two ways: first using a standard (linear algebra) toolkit, and second as a linear autoencoder using a neural network library. +If all goes well, they should give you the same reconstructions! + +To make it a bit more fun, we will use the [*Labeled Faces in the Wild*](https://www.kaggle.com/jessicali9530/celeba-dataset) dataset which consists of standardized images of roughly 5,000 celebrities' faces. In this data, PCA amounts to looking for a small number of "proto-faces" such that a linear combination of them can accurately reconstruct any celebrity's face. + +```{r} +install.packages("keras") +``` + +```{r} +install.packages("reticulate") +install.packages("abind") +install.packages("grid") +install.packages("gridExtra") +install.packages("dplyr") +install.packages("purrr") +install.packages("reshape2") +install.packages("ggplot2") +``` + +```{r} +library(reticulate) + +# Import Python's sklearn.datasets +sklearn <- import("sklearn.datasets") + +# Fetch the dataset +faces <- sklearn$fetch_lfw_people() + +# Access the images and reshape the data similar to Python's reshape method +n_examples <- dim(faces$images)[1] +height <- dim(faces$images)[2] +width <- dim(faces$images)[3] +design_matrix <- array_reshape(faces$images, c(n_examples, height * width)) + +n_features <- dim(design_matrix)[2] + +# Print the dataset details +cat(sprintf( + paste("Labeled Faces in the Wild Dataset:\n Number of examples: %d\n ", + "Number of features: %d\n Image height: %d\n Image width: %d"), + n_examples, n_features, height, width +)) +``` + +```{r} +library(ggplot2) +library(gridExtra) +library(grid) + +# Find indices where the label is 'Arnold Schwarzenegger' +# faces$target uses python style indexing that starts at 0 rather than R style +# indexing that starts at 1, so we subtract 1 so the indexing lines up +arnold_labels <- which(faces$target_names == "Arnold Schwarzenegger") - 1 +# Get indices of all images corresponding to Arnold +arnold_pics <- which(faces$target %in% arnold_labels) + +plot_faces <- function(images, n_row = 2, n_col = 3, width, height) { + par(mfrow = c(n_row, n_col), mar = c(0.5, 0.5, 0.5, 0.5)) + for (i in seq_len(n_row * n_col)) { + if (i <= length(images)) { + # image needs to be transposed for and then flipped for correct orientation + # using R "image" + tmp <- t(images[[i]]) + tmp <- tmp[, rev(seq_len(ncol(tmp)))] + image(tmp, col = gray.colors(256), axes = FALSE, xlab = "", ylab = "") + } + } +} + +# Ensure arnold_images contains the right amount of data and is not NULL +arnold_images <- lapply(arnold_pics[seq_len(min(6, length(arnold_pics)))], function(idx) { + faces$images[idx, , ] +}) + +plot_faces(arnold_images, n_row = 2, n_col = 3, height = 62, width = 47) +``` + +```{r} +library(stats) + +# Perform PCA on the design matrix +pca <- prcomp(design_matrix, rank. = 128, retx = TRUE, center = TRUE, scale. = FALSE) + +# Extract the principal components (eigenfaces) +eigenfaces <- pca$rotation +``` + +```{r} +# 2. Plot the first 6 "eigenfaces," the six images whose linear span best explains the variation in our dataset +pca_images <- lapply(1:6, function(idx) { + array_reshape(eigenfaces[, idx], c(height, width)) +}) + +plot_faces(pca_images, height = height, width = width) +# we check the first six eigenvectors/projection axes, reshaped (the eigenvectors that +# captured the highest variation in our dataset of images) +# here, eigenvector1 orthog to eigenvector2 and all the others => decorrelation +# (there's no way to reconstruct eigenvector1 using a linear combination of all the other eigenvectors) +``` + +```{r} +reconstruct <- function(image_vector, n_components, eigenfaces) { + components <- eigenfaces[, 1:n_components, drop = FALSE] + compimage <- components %*% (t(components) %*% image_vector) + return(array_reshape(compimage, c(height, width))) +} + +# Select an Arnold image for reconstruction +face_vector <- t(design_matrix[arnold_pics[1], , drop = FALSE]) + +# Perform reconstructions with varying number of components +reconstructions <- lapply(c(1, 2, 8, 32, 64, 128), function(k) { + reconstruct(face_vector, k, eigenfaces) +}) + +# Plot the reconstructed faces +plot_faces(reconstructions, height = height, width = width) +``` + +```{r} +# 4. Train linear autoencoder with 64 neurons using Keras +# 5. Compare reconstructions of Arnold's face both using MSE and visually +``` + +```{r} +library(keras) + +encoding_dimension <- 64 +input_image <- layer_input(shape = n_features) +encoded <- layer_dense(units = encoding_dimension, activation = "linear")(input_image) +decoded <- layer_dense(units = n_features, activation = "linear")(encoded) +autoencoder <- keras_model(inputs = input_image, outputs = decoded) +autoencoder %>% compile( + optimizer = "adam", + loss = "mse" +) +autoencoder %>% fit( + design_matrix, + design_matrix, + epochs = 50, + batch_size = 256, + shuffle = TRUE, + verbose = 0 +) +``` + +```{r} +autoencoder %>% fit( + design_matrix, + design_matrix, + epochs = 50, + batch_size = 256, + shuffle = TRUE, + verbose = 0 +) +``` + +```{r} +library(ggplot2) +library(gridExtra) +library(reshape2) + +# Compute neural reconstruction +face_vector_flat <- as.numeric(face_vector) +reconstruction <- predict(autoencoder, matrix(face_vector_flat, nrow = 1)) + +# Do visual comparison +image_height <- 62 +image_width <- 47 +image1 <- matrix(reconstructions[[4]], nrow = image_height, ncol = image_width) +image2 <- t(matrix(reconstruction, nrow = image_width, ncol = image_height)) + +images <- list(image1, image2) +plot_faces(images, n_row = 1, n_col = 2, width = image_width, height = image_height) + + +# Do numeric comparison +# We also normalize the black/white gradient to take values in [0,1] (divide by 255) +img1 <- as.numeric(reconstructions[[4]]) / 255 +img2 <- as.numeric(reconstruction) / 255 +mse <- mean((img1 - img2)^2) +mse +``` + +## Neural Autoencoders + +Finally, let's train a nonlinear autoencoder for the same data where $F$ and $G$ are neural networks, and we restrict the dimension to be $k=64$. + +```{r} +library(tensorflow) + +# Use a nonlinear neural network +n_features <- 2914 +encoding_dimension <- 64 + +input_image <- layer_input(shape = n_features) +encoded <- input_image %>% + layer_dense(units = encoding_dimension, activation = "relu") %>% + layer_dense(units = encoding_dimension, activation = "relu") + +decoded <- encoded %>% + layer_dense(units = encoding_dimension, activation = "relu") %>% + layer_dense(units = n_features, activation = "relu") + +autoencoder <- keras_model(inputs = input_image, outputs = decoded) + +autoencoder %>% compile( + optimizer = "adam", + loss = "mse" +) +autoencoder %>% fit( + design_matrix, + design_matrix, + epochs = 50, + batch_size = 256, + shuffle = TRUE, + verbose = 0 +) + +# Compute neural reconstruction +reconstruction <- predict(autoencoder, matrix(face_vector, nrow = 1)) + +# Do visual comparison +plot_faces(list(reconstructions[[4]], t(matrix(reconstruction, nrow = image_width, ncol = image_height))), + n_row = 1, n_col = 2, width = image_width, height = image_height) + +# Do numeric comparison +# We also normalize the black/white gradient to take values in [0,1] (divide by 255) +img1 <- as.numeric(reconstructions[[4]]) / 255 +img2 <- as.numeric(reconstruction) / 255 +mse <- mean((img1 - img2)^2) +mse +``` + diff --git a/T/T-3 Diff-in-Diff Minimum Wage Example.Rmd b/T/T-3 Diff-in-Diff Minimum Wage Example.Rmd new file mode 100644 index 00000000..501ba7b7 --- /dev/null +++ b/T/T-3 Diff-in-Diff Minimum Wage Example.Rmd @@ -0,0 +1,667 @@ +--- +title: An R Markdown document converted from "T/T-3 Diff-in-Diff Minimum Wage Example.irnb" +output: html_document +--- + +# Minimum Wage Example Notebook with DiD + +This notebook implements Difference-in-Differences in an application on +the effect of minimum wage changes on teen employment. We use data from +[Callaway +(2022)](https://bcallaway11.github.io/files/Callaway-Chapter-2022/main.pdf). The data are annual county level data from the United States covering 2001 to 2007. The outcome variable is log county-level teen employment, and the treatment variable is an indicator for whether the county has a minimum wage above the federal minimum wage. Note that this definition of the treatment variable makes the analysis straightforward but ignores the nuances of the exact value of the minimum wage in each county and how far those values are from the federal minimum. The data also include county population and county average annual pay. +See [Callaway and Sant’Anna +(2021)](https://www.sciencedirect.com/science/article/abs/pii/S0304407620303948) +for additional details on the data. + +First, we will load some libraries. + +*(The installation of the packages might take up to 5 minutes)* + +```{r} +dependencies <- c("BMisc", "glmnet", "randomForest", "rpart", "xtable", "data.table") +install.packages(dependencies) +``` + +```{r} +lapply(dependencies, library, character.only = TRUE) +``` + +```{r} +set.seed(772023) +options(warn = -1) +``` + +## Loading the data + +```{r} +data <- read.csv("https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/minwage_data.csv", + row.names = 1) +data <- data.table(data) +``` + +```{r} +head(data) +``` + +### Data Preparation + +We remove observations that are already treated in the first observed period (2001). We drop all variables that we won't use in our analysis. + +```{r} +data <- subset(data, (G == 0) | (G > 2001)) +data <- data[, -c( + "countyreal", "state_name", "FIPS", "emp0A01_BS", + "quarter", "censusdiv", "pop", "annual_avg_pay", + "state_mw", "fed_mw", "ever_treated" +)] +``` + +Next, we create the treatment groups. We focus our analysis exclusively on the set of counties that had wage increases away from the federal minimum wage in 2004. That is, we treat 2003 and earlier as the pre-treatment period. + +```{r} +treat1 <- subset(data, (G == 2004) & (year == 2001)) +treat2 <- subset(data, (G == 2004) & (year == 2002)) +treat3 <- subset(data, (G == 2004) & (year == 2003)) +treat4 <- subset(data, (G == 2004) & (year == 2004)) +treat5 <- subset(data, (G == 2004) & (year == 2005)) +treat6 <- subset(data, (G == 2004) & (year == 2006)) +treat7 <- subset(data, (G == 2004) & (year == 2007)) + +cont1 <- subset(data, (G == 0 | G > 2001) & (year == 2001)) +cont2 <- subset(data, (G == 0 | G > 2002) & (year == 2002)) +cont3 <- subset(data, (G == 0 | G > 2003) & (year == 2003)) +cont4 <- subset(data, (G == 0 | G > 2004) & (year == 2004)) +cont5 <- subset(data, (G == 0 | G > 2005) & (year == 2005)) +cont6 <- subset(data, (G == 0 | G > 2006) & (year == 2006)) +cont7 <- subset(data, (G == 0 | G > 2007) & (year == 2007)) +``` + +We assume that the basic assumptions, particularly parallel trends, hold after conditioning on pre-treatment variables: 2001 population, 2001 average pay and 2001 teen employment, as well as the region in which the county is located. (The region is characterized by four +categories.) + +Consequently, we want to extract the control variables for both treatment and control group in 2001. + +```{r} +treat1 <- treat1[, -c("year", "G", "region", "treated")] + +cont1 <- cont1[, -c("year", "G", "region", "treated")] +``` + +2003 serves as the pre-treatment period for both counties that do receive the treatment in 2004 and those that do not. + +```{r} +treatB <- merge(treat3, treat1, by = "id", suffixes = c(".pre", ".0")) +treatB <- treatB[, -c("treated", "lpop.pre", "lavg_pay.pre", "year", "G")] + +contB <- merge(cont3, cont1, by = "id", suffixes = c(".pre", ".0")) +contB <- contB[, -c("treated", "lpop.pre", "lavg_pay.pre", "year", "G")] +``` + +We estimate the ATET in 2004-2007, which corresponds to the effect in the year of treatment as well as in the three years after the treatment. The control observations are the observations that still have the federal minimum wage in each year. (The control group is shrinking in each year as additional units receive treatment). + +```{r} +treat4 <- treat4[, -c("lpop", "lavg_pay", "year", "G", "region")] +treat5 <- treat5[, -c("lpop", "lavg_pay", "year", "G", "region")] +treat6 <- treat6[, -c("lpop", "lavg_pay", "year", "G", "region")] +treat7 <- treat7[, -c("lpop", "lavg_pay", "year", "G", "region")] + +cont4 <- cont4[, -c("lpop", "lavg_pay", "year", "G", "region")] +cont5 <- cont5[, -c("lpop", "lavg_pay", "year", "G", "region")] +cont6 <- cont6[, -c("lpop", "lavg_pay", "year", "G", "region")] +cont7 <- cont7[, -c("lpop", "lavg_pay", "year", "G", "region")] + +tdid04 <- merge(treat4, treatB, by = "id") +dy <- tdid04$lemp - tdid04$lemp.pre +tdid04$dy <- dy +tdid04 <- tdid04[, -c("id", "lemp", "lemp.pre")] + +tdid05 <- merge(treat5, treatB, by = "id") +dy <- tdid05$lemp - tdid05$lemp.pre +tdid05$dy <- dy +tdid05 <- tdid05[, -c("id", "lemp", "lemp.pre")] + +tdid06 <- merge(treat6, treatB, by = "id") +dy <- tdid06$lemp - tdid06$lemp.pre +tdid06$dy <- dy +tdid06 <- tdid06[, -c("id", "lemp", "lemp.pre")] + +tdid07 <- merge(treat7, treatB, by = "id") +dy <- tdid07$lemp - tdid07$lemp.pre +tdid07$dy <- dy +tdid07 <- tdid07[, -c("id", "lemp", "lemp.pre")] + +cdid04 <- merge(cont4, contB, by = "id") +dy <- cdid04$lemp - cdid04$lemp.pre +cdid04$dy <- dy +cdid04 <- cdid04[, -c("id", "lemp", "lemp.pre")] + +cdid05 <- merge(cont5, contB, by = "id") +dy <- cdid05$lemp - cdid05$lemp.pre +cdid05$dy <- dy +cdid05 <- cdid05[, -c("id", "lemp", "lemp.pre")] + +cdid06 <- merge(cont6, contB, by = "id") +dy <- cdid06$lemp - cdid06$lemp.pre +cdid06$dy <- dy +cdid06 <- cdid06[, -c("id", "lemp", "lemp.pre")] + +cdid07 <- merge(cont7, contB, by = "id") +dy <- cdid07$lemp - cdid07$lemp.pre +cdid07$dy <- dy +cdid07 <- cdid07[, -c("id", "lemp", "lemp.pre")] +``` + +### Estimation of the ATET with DML + +We estimate the ATET of the county level minimum wage being larger than the federal minimum with the DML algorithm presented in Section 16.3 in the book. This requires estimation of the nuisance functions $E[Y|D=0,X]$, $E[D|X]$ as well as $P(D = 1)$. For the conditional expectation functions, we will consider different modern ML regression methods, namely: Constant (= no controls); a linear combination of the controls; an expansion of the raw control variables including all third order interactions; Lasso (CV); Ridge (CV); Random Forest; Shallow Tree; Deep Tree; and CV Tree. +The methods indicated with CV have their tuning parameter selected by cross-validation. + +The following code block implements the DML cross-fitting procedure. + +```{r} +att <- matrix(NA, 4, 10) +se_att <- matrix(NA, 4, 10) +rmse_d <- matrix(NA, 4, 9) +rmse_y <- matrix(NA, 4, 9) +trimmed <- matrix(NA, 4, 9) + +print("DML estimation starting, please wait") +for (ii in 1:4) { # ii refer to the 4 investigated post-treatment periods + + tdata <- get(paste("tdid0", (3 + ii), sep = "")) # Treatment data + cdata <- get(paste("cdid0", (3 + ii), sep = "")) # Control data + usedata <- rbind(tdata, cdata) + + #----------------------------------------------------------------------------- + # Cross-fit setup + n <- nrow(usedata) + Kf <- 5 # Number of folds + sampleframe <- rep(1:Kf, ceiling(n / Kf)) + cfgroup <- sample(sampleframe, size = n, replace = FALSE) # Cross-fitting groups + + # Initialize variables for CV predictions + y_gd0x_fit <- matrix(NA, n, 9) + dgx_fit <- matrix(NA, n, 9) + pd_fit <- matrix(NA, n, 1) + + #----------------------------------------------------------------------------- + # Cross-fit loop + for (k in 1:Kf) { + cat("year: ", ii + 2003, "; fold: ", k, "\n") + indk <- cfgroup == k + + ktrain <- usedata[!indk, ] + ktest <- usedata[indk, ] + + # Build some matrices for later + ytrain <- as.matrix(usedata[!indk, "dy"]) + ytest <- as.matrix(usedata[indk, "dy"]) + dtrain <- as.matrix(usedata[!indk, "treated"]) + dtest <- as.matrix(usedata[indk, "treated"]) + + # Expansion for lasso/ridge (region specific cubic polynomial) + Xexpand <- model.matrix( + ~ region * (polym(lemp.0, lpop.0, lavg_pay.0, + degree = 3, raw = TRUE + )), + data = usedata + ) + + xtrain <- as.matrix(Xexpand[!indk, ]) + xtest <- as.matrix(Xexpand[indk, ]) + + #----------------------------------------------------------------------------- + # Estimating P(D = 1) + pd_fit[indk, 1] <- mean(ktrain$treated) + + #----------------------------------------------------------------------------- + # Estimating E[D|X] + + # 1) Constant + dgx_fit[indk, 1] <- mean(ktrain$treated) + + # 2) Baseline controls + glmXdk <- glm(treated ~ region + lemp.0 + lpop.0 + lavg_pay.0, + family = "binomial", data = ktrain + ) + dgx_fit[indk, 2] <- predict(glmXdk, newdata = ktest, type = "response") + + # 3) Region specific linear index + glmRXdk <- glm(treated ~ region * (lemp.0 + lpop.0 + lavg_pay.0), + family = "binomial", data = ktrain + ) + dgx_fit[indk, 3] <- predict(glmRXdk, newdata = ktest, type = "response") + + # 4) Lasso - expansion - default CV tuning + lassoXdk <- cv.glmnet(xtrain, dtrain, family = "binomial", type.measure = "mse") + dgx_fit[indk, 4] <- predict(lassoXdk, + newx = xtest, type = "response", + s = "lambda.min" + ) + + # 5) Ridge - expansion - default CV tuning + ridgeXdk <- cv.glmnet(xtrain, dtrain, + family = "binomial", + type.measure = "mse", alpha = 0 + ) + dgx_fit[indk, 5] <- predict(ridgeXdk, + newx = xtest, type = "response", + s = "lambda.min" + ) + + # 6) Random forest + rfXdk <- randomForest(as.factor(treated) ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain, mtry = 4, ntree = 1000 + ) + dgx_fit[indk, 6] <- predict(rfXdk, ktest, type = "prob")[, 2] + + # 7) Tree (start big) + btXdk <- rpart(treated ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain, method = "anova", + control = rpart.control(maxdepth = 15, cp = 0, xval = 5, minsplit = 10) + ) + # xval is the number of cross-validation splits. E.g. xval = 5 is five fold CV + dgx_fit[indk, 7] <- predict(btXdk, ktest) + + # 8) Tree (small tree) + stXdk <- rpart(treated ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain, method = "anova", + control = rpart.control(maxdepth = 3, cp = 0, xval = 0, minsplit = 10) + ) + # xval is the number of cross-validation splits. E.g. xval = 5 is five fold CV + dgx_fit[indk, 8] <- predict(stXdk, ktest) + + # 9) Tree (cv) + bestcp <- btXdk$cptable[which.min(btXdk$cptable[, "xerror"]), "CP"] + cvXdk <- prune(btXdk, cp = bestcp) + dgx_fit[indk, 9] <- predict(cvXdk, ktest) + + #----------------------------------------------------------------------------- + # Estimating E[Y|D=0,X] + + # subset to D = 0 + ktrain0 <- ktrain[ktrain$treated == 0, ] + + ytrain0 <- ytrain[ktrain$treated == 0, ] + xtrain0 <- xtrain[ktrain$treated == 0, ] + + # 1) Constant + y_gd0x_fit[indk, 1] <- mean(ktrain0$dy) + + # 2) Baseline controls + lmXyk <- lm(dy ~ region + lemp.0 + lpop.0 + lavg_pay.0, data = ktrain0) + y_gd0x_fit[indk, 2] <- predict(lmXyk, newdata = ktest) + + # 3) Region specific linear index + lmRXyk <- lm(treated ~ region * (lemp.0 + lpop.0 + lavg_pay.0), + data = ktrain + ) + y_gd0x_fit[indk, 3] <- predict(lmRXyk, newdata = ktest) + + # 4) Lasso - expansion - default CV tuning + lassoXyk <- cv.glmnet(xtrain0, ytrain0) + y_gd0x_fit[indk, 4] <- predict(lassoXyk, newx = xtest, s = "lambda.min") + + # 5) Ridge - expansion - default CV tuning + ridgeXyk <- cv.glmnet(xtrain0, ytrain0, alpha = 0) + y_gd0x_fit[indk, 5] <- predict(ridgeXyk, newx = xtest, s = "lambda.min") + + # 6) Random forest + rfXyk <- randomForest(dy ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain0, mtry = 4, ntree = 1000 + ) + y_gd0x_fit[indk, 6] <- predict(rfXyk, ktest) + + # 7) Tree (start big) + btXyk <- rpart(dy ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain0, method = "anova", + control = rpart.control(maxdepth = 15, cp = 0, xval = 5, minsplit = 10) + ) + y_gd0x_fit[indk, 7] <- predict(btXyk, ktest) + + # 8) Tree (small tree) + stXyk <- rpart(dy ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain, method = "anova", + control = rpart.control(maxdepth = 3, cp = 0, xval = 0, minsplit = 10) + ) + y_gd0x_fit[indk, 8] <- predict(stXyk, ktest) + + # 9) Tree (cv) + bestcp <- btXyk$cptable[which.min(btXyk$cptable[, "xerror"]), "CP"] + cvXyk <- prune(btXyk, cp = bestcp) + y_gd0x_fit[indk, 9] <- predict(cvXyk, ktest) + } + + rmse_d[ii, ] <- sqrt(colMeans((usedata$treated - dgx_fit)^2)) + rmse_y[ii, ] <- sqrt(colMeans((usedata$dy[usedata$treated == 0] - + y_gd0x_fit[usedata$treated == 0, ])^2)) + + # trim propensity scores of 1 to .95 + for (r in 1:9) { + trimmed[ii, r] <- sum(dgx_fit[, r] > .95) + dgx_fit[dgx_fit[, r] > .95, r] <- .95 + } + + att_num <- c( + colMeans(((usedata$treated - dgx_fit) / ((pd_fit %*% matrix(1, 1, 9)) * (1 - dgx_fit))) * + (usedata$dy - y_gd0x_fit)), + mean(((usedata$treated - dgx_fit[, which.min(rmse_d[ii, ])]) + / (pd_fit * (1 - dgx_fit[, which.min(rmse_d[ii, ])]))) * + (usedata$dy - y_gd0x_fit[, which.min(rmse_y[ii, ])])) + ) + att_den <- mean(usedata$treated / pd_fit) + + att[ii, ] <- att_num / att_den + + phihat <- cbind( + ((usedata$treated - dgx_fit) / ((pd_fit %*% matrix(1, 1, 9)) * (1 - dgx_fit))) * + (usedata$dy - y_gd0x_fit), + ((usedata$treated - dgx_fit[, which.min(rmse_d[ii, ])]) + / (pd_fit * (1 - dgx_fit[, which.min(rmse_d[ii, ])]))) * + (usedata$dy - y_gd0x_fit[, which.min(rmse_y[ii, ])]) + ) / att_den + se_att[ii, ] <- sqrt(colMeans((phihat^2)) / n) +} +``` + +We start by reporting the RMSE obtained during cross-fitting for each learner in each period. + +```{r} +table1y <- matrix(0, 9, 4) +table1y <- t(rmse_y) +colnames(table1y) <- c("2004", "2005", "2006", "2007") +rownames(table1y) <- c( + "No Controls", "Basic", "Expansion", + "Lasso (CV)", "Ridge (CV)", + "Random Forest", "Deep Tree", + "Shallow Tree", "Tree (CV)" +) +table1y +``` + +```{r} +table1d <- matrix(0, 9, 4) +table1d <- t(rmse_d) +colnames(table1d) <- c("2004", "2005", "2006", "2007") +rownames(table1d) <- c( + "No Controls", "Basic", "Expansion", + "Lasso (CV)", "Ridge (CV)", + "Random Forest", "Deep Tree", + "Shallow Tree", "Tree (CV)" +) +table1d +``` + +Here we see that the Deep Tree systematically performs worse in terms of cross-fit predictions than the other learners for both tasks and that Expansion performs similarly poorly for the outcome prediction. It also appears there is some signal in the regressors, especially for the propensity score, as all methods outside of Deep Tree and Expansion produce smaller RMSEs than the No Controls baseline. The other methods all produce similar RMSEs, with a small edge going to Ridge and Lasso. While it would be hard to reliably conclude which of the relatively good performing methods is statistically best here, one could exclude Expansion and Deep Tree from further consideration on the basis of out-of-sample performance suggesting +they are doing a poor job approximating the nuisance functions. Best (or a different ensemble) provides a good baseline that is principled in the sense that one could pre-commit to using the best learners without having first looked at the subsequent estimation results. + +We report estimates of the ATET in each period in the following table. + +```{r} +table2 <- matrix(0, 20, 4) +table2[seq(1, 20, 2), ] <- t(att) +table2[seq(2, 20, 2), ] <- t(se_att) +colnames(table2) <- c("2004", "2005", "2006", "2007") +rownames(table2) <- c( + "No Controls", "s.e.", "Basic", "s.e.", + "Expansion", "s.e.", "Lasso (CV)", "s.e.", + "Ridge (CV)", "s.e.", "Random Forest", "s.e.", + "Deep Tree", "s.e.", "Shallow Tree", "s.e.", + "Tree (CV)", "s.e.", "Best", "s.e." +) +table2 +``` + +Here, we see that the majority of methods provide point estimates that suggest the effect of the minimum wage increase leads to decreases in youth employment with small effects in the initial period that become larger in the years following the treatment. This pattern seems economically plausible as it may take time for firms to adjust employment and other input choices in response to a minimum wage change. The methods that produce estiamtes that are not consistent with this pattern are Deep Tree and Expansion which are both suspect as they systematically underperform in terms of having poor cross-fit prediction performance. In terms of point estimates, the other pattern that emerges is that all estimates that use the covariates produce ATET estimates that are systematically smaller in magnitude than the No Controls baseline, suggesting that failing to include the controls may lead to overstatement of treatment effects in this example. + +Turning to inference, we would reject the hypothesis of no minimum wage effect two or more years after the change at the 5% level, even after multiple testing correction, if we were to focus on the row "Best" (or many of the other individual rows). Focusing on "Best" is a reasonable ex ante strategy that could be committed to prior to conducting any analysis. It is, of course, reassuring that this broad conclusion is also obtained using many of the individual learners suggesting some robustness to the exact choice of learner made. + +### Assess pre-trends + +Because we have data for the period 2001-2007, we can perform a so-called pre-trends test to provide some evidence about the plausibility of the conditional parallel trends assumption. Specifically, we can continue to use 2003 as the reference period but now consider 2002 to be the treatment period. Sensible economic mechanisms underlying the assumption would then typically suggest that the ATET in 2002 - before the 2004 minimum wage change we are considering - should be zero. Finding evidence that the ATET in 2002 is non-zero then calls into question the validity of the assumption. + +We change the treatment status of those observations, which received treatment in 2004 in the 2002 data and create a placebo treatment as well as control group. + +```{r} +treat2 <- treat2[, -c("lpop", "lavg_pay", "year", "G", "region")] +treat2$treated <- 1 # Code these observations as treated + +tdid02 <- merge(treat2, treatB, by = "id") +dy <- tdid02$lemp - tdid02$lemp.pre +tdid02$dy <- dy +tdid02 <- tdid02[, -c("id", "lemp", "lemp.pre")] + +cont2 <- cont2[, -c("lpop", "lavg_pay", "year", "G", "region")] + +cdid02 <- merge(cont2, contB, by = "id") +dy <- cdid02$lemp - cdid02$lemp.pre +cdid02$dy <- dy +cdid02 <- cdid02[, -c("id", "lemp", "lemp.pre")] +``` + +We repeat the exercise for obtaining our ATET estimates and standard error for 2004-2007. Particularly, we also use all the learners as mentioned above. + +```{r} +att_p <- matrix(NA, 1, 10) +se_att_p <- matrix(NA, 1, 10) +rmse_d_p <- matrix(NA, 1, 9) +rmse_y_p <- matrix(NA, 1, 9) +trimmed_p <- matrix(NA, 1, 9) +for (ii in 1) { + tdata <- get(paste("tdid0", (3 - ii), sep = "")) # Treatment data + cdata <- get(paste("cdid0", (3 - ii), sep = "")) # Control data + usedata <- rbind(tdata, cdata) + + #----------------------------------------------------------------------------- + # Cross-fit setup + n <- nrow(usedata) + Kf <- 5 # Number of folds + sampleframe <- rep(1:Kf, ceiling(n / Kf)) + cfgroup <- sample(sampleframe, size = n, replace = FALSE) # Cross-fitting groups + + # Initialize variables for CV predictions + y_gd0x_fit <- matrix(NA, n, 9) + dgx_fit <- matrix(NA, n, 9) + pd_fit <- matrix(NA, n, 1) + + #----------------------------------------------------------------------------- + # Cross-fit loop + for (k in 1:Kf) { + cat("year: ", ii + 2001, "; fold: ", k, "\n") + indk <- cfgroup == k + + ktrain <- usedata[!indk, ] + ktest <- usedata[indk, ] + + # Build some matrices for later + ytrain <- as.matrix(usedata[!indk, "dy"]) + ytest <- as.matrix(usedata[indk, "dy"]) + dtrain <- as.matrix(usedata[!indk, "treated"]) + dtest <- as.matrix(usedata[indk, "treated"]) + + # Expansion for lasso/ridge (region specific cubic polynomial) + Xexpand <- model.matrix( + ~ region * (polym(lemp.0, lpop.0, lavg_pay.0, + degree = 3, raw = TRUE + )), + data = usedata + ) + + xtrain <- as.matrix(Xexpand[!indk, ]) + xtest <- as.matrix(Xexpand[indk, ]) + + #----------------------------------------------------------------------------- + # Estimating P(D = 1) + pd_fit[indk, 1] <- mean(ktrain$treated) + + #----------------------------------------------------------------------------- + # Estimating E[D|X] + + # 1) Constant + dgx_fit[indk, 1] <- mean(ktrain$treated) + + # 2) Baseline controls + glmXdk <- glm(treated ~ region + lemp.0 + lpop.0 + lavg_pay.0, + family = "binomial", data = ktrain + ) + dgx_fit[indk, 2] <- predict(glmXdk, newdata = ktest, type = "response") + + # 3) Region specific linear index + glmRXdk <- glm(treated ~ region * (lemp.0 + lpop.0 + lavg_pay.0), + family = "binomial", data = ktrain + ) + dgx_fit[indk, 3] <- predict(glmRXdk, newdata = ktest, type = "response") + + # 4) Lasso - expansion - default CV tuning + lassoXdk <- cv.glmnet(xtrain, dtrain, family = "binomial", type.measure = "mse") + dgx_fit[indk, 4] <- predict(lassoXdk, + newx = xtest, type = "response", + s = "lambda.min" + ) + + # 5) Ridge - expansion - default CV tuning + ridgeXdk <- cv.glmnet(xtrain, dtrain, + family = "binomial", + type.measure = "mse", alpha = 0 + ) + dgx_fit[indk, 5] <- predict(ridgeXdk, + newx = xtest, type = "response", + s = "lambda.min" + ) + + # 6) Random forest + rfXdk <- randomForest(as.factor(treated) ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain, mtry = 4, ntree = 1000 + ) + dgx_fit[indk, 6] <- predict(rfXdk, ktest, type = "prob")[, 2] + + # 7) Tree (start big) + btXdk <- rpart(treated ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain, method = "anova", + control = rpart.control(maxdepth = 15, cp = 0, xval = 5, minsplit = 10) + ) + # xval is the number of cross-validation splits. E.g. xval = 5 is five fold CV + dgx_fit[indk, 7] <- predict(btXdk, ktest) + + # 8) Tree (small tree) + stXdk <- rpart(treated ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain, method = "anova", + control = rpart.control(maxdepth = 3, cp = 0, xval = 0, minsplit = 10) + ) + # xval is the number of cross-validation splits. E.g. xval = 5 is five fold CV + dgx_fit[indk, 8] <- predict(stXdk, ktest) + + # 9) Tree (cv) + bestcp <- btXdk$cptable[which.min(btXdk$cptable[, "xerror"]), "CP"] + cvXdk <- prune(btXdk, cp = bestcp) + dgx_fit[indk, 9] <- predict(cvXdk, ktest) + + #----------------------------------------------------------------------------- + # Estimating E[Y|D=0,X] + + # subset to D = 0 + ktrain0 <- ktrain[ktrain$treated == 0, ] + + ytrain0 <- ytrain[ktrain$treated == 0, ] + xtrain0 <- xtrain[ktrain$treated == 0, ] + + # 1) Constant + y_gd0x_fit[indk, 1] <- mean(ktrain0$dy) + + # 2) Baseline controls + lmXyk <- lm(dy ~ region + lemp.0 + lpop.0 + lavg_pay.0, data = ktrain0) + y_gd0x_fit[indk, 2] <- predict(lmXyk, newdata = ktest) + + # 3) Region specific linear index + lmRXyk <- lm(treated ~ region * (lemp.0 + lpop.0 + lavg_pay.0), + data = ktrain + ) + y_gd0x_fit[indk, 3] <- predict(lmRXyk, newdata = ktest) + + # 4) Lasso - expansion - default CV tuning + lassoXyk <- cv.glmnet(xtrain0, ytrain0) + y_gd0x_fit[indk, 4] <- predict(lassoXyk, newx = xtest, s = "lambda.min") + + # 5) Ridge - expansion - default CV tuning + ridgeXyk <- cv.glmnet(xtrain0, ytrain0, alpha = 0) + y_gd0x_fit[indk, 5] <- predict(ridgeXyk, newx = xtest, s = "lambda.min") + + # 6) Random forest + rfXyk <- randomForest(dy ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain0, mtry = 4, ntree = 1000 + ) + y_gd0x_fit[indk, 6] <- predict(rfXyk, ktest) + + # 7) Tree (start big) + btXyk <- rpart(dy ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain0, method = "anova", + control = rpart.control(maxdepth = 15, cp = 0, xval = 5, minsplit = 10) + ) + y_gd0x_fit[indk, 7] <- predict(btXyk, ktest) + + # 8) Tree (small tree) + stXyk <- rpart(dy ~ region + lemp.0 + lpop.0 + lavg_pay.0, + data = ktrain, method = "anova", + control = rpart.control(maxdepth = 3, cp = 0, xval = 0, minsplit = 10) + ) + y_gd0x_fit[indk, 8] <- predict(stXyk, ktest) + + # 9) Tree (cv) + bestcp <- btXyk$cptable[which.min(btXyk$cptable[, "xerror"]), "CP"] + cvXyk <- prune(btXyk, cp = bestcp) + y_gd0x_fit[indk, 9] <- predict(cvXyk, ktest) + } + + rmse_d_p[ii, ] <- sqrt(colMeans((usedata$treated - dgx_fit)^2)) + rmse_y_p[ii, ] <- sqrt(colMeans((usedata$dy[usedata$treated == 0] - + y_gd0x_fit[usedata$treated == 0, ])^2)) + + # trim propensity scores of 1 to .95 + for (r in 1:9) { + trimmed_p[ii, r] <- sum(dgx_fit[, r] > .95) + dgx_fit[dgx_fit[, r] > .95, r] <- .95 + } + + att_num <- c( + colMeans(((usedata$treated - dgx_fit) / ((pd_fit %*% matrix(1, 1, 9)) * (1 - dgx_fit))) * + (usedata$dy - y_gd0x_fit)), + mean(((usedata$treated - dgx_fit[, which.min(rmse_d[ii, ])]) + / (pd_fit * (1 - dgx_fit[, which.min(rmse_d[ii, ])]))) * + (usedata$dy - y_gd0x_fit[, which.min(rmse_y[ii, ])])) + ) + att_den <- mean(usedata$treated / pd_fit) + + att_p[ii, ] <- att_num / att_den + + phihat <- cbind( + ((usedata$treated - dgx_fit) / ((pd_fit %*% matrix(1, 1, 9)) * (1 - dgx_fit))) * + (usedata$dy - y_gd0x_fit), + ((usedata$treated - dgx_fit[, which.min(rmse_d[ii, ])]) + / (pd_fit * (1 - dgx_fit[, which.min(rmse_d[ii, ])]))) * + (usedata$dy - y_gd0x_fit[, which.min(rmse_y[ii, ])]) + ) / att_den + se_att_p[ii, ] <- sqrt(colMeans((phihat^2)) / n) +} +``` + +We report the results in the following table. + +```{r} +tableP <- matrix(0, 4, 10) +tableP[1, ] <- c(rmse_y_p, min(rmse_y_p)) +tableP[2, ] <- c(rmse_d_p, min(rmse_d_p)) +tableP[3, ] <- att_p +tableP[4, ] <- se_att_p +rownames(tableP) <- c("RMSE Y", "RMSE D", "ATET", "s.e.") +colnames(tableP) <- c( + "No Controls", "Basic", "Expansion", + "Lasso (CV)", "Ridge (CV)", + "Random Forest", "Deep Tree", + "Shallow Tree", "Tree (CV)", "Best" +) +tableP <- t(tableP) +tableP +``` + +Here we see broad agreement across all methods in the sense of returning point estimates that are small in magnitude and small relative to standard errors. In no case would we reject the hypothesis that the pre-event effect in 2002 is different from zero at usual levels of significance. We note that failing to reject the hypothesis of no pre-event effects certainly does not imply that the conditional DiD assumption is in fact satisfied. For example, confidence intervals include values that would be consistent with relatively large pre-event effects. However, it is reassuring to see that there is not strong evidence of a violation of the underlying identifying assumption. + diff --git a/T/T_4_Regression_Discontinuity_on_Progresa_Data.Rmd b/T/T_4_Regression_Discontinuity_on_Progresa_Data.Rmd new file mode 100644 index 00000000..8c4564e4 --- /dev/null +++ b/T/T_4_Regression_Discontinuity_on_Progresa_Data.Rmd @@ -0,0 +1,601 @@ +--- +title: An R Markdown document converted from "T/T_4_Regression_Discontinuity_on_Progresa_Data.irnb" +output: html_document +--- + +# Regression Discontinuity +This notebook illustrates the use of Regression Discontinuity in an empirical study. We analyze the effect of the antipoverty program *Progresa/Oportunidades* on the consumption behavior of families in Mexico in the early 2000s. + +The program was intended for families in extreme poverty and included financial incentives for participation in measures that improved the family's health, nutrition and children's education. The effect of this program is a widely studied problem in social and economic sciences and, according to the WHO, was a very successful measure in terms of reducing extreme poverty in Mexico. + +Eligibility for the program was determined based on a pre-intervention household poverty-index. Individuals above a certain threshold received the treatment (participation in the program) while individuals below the threshold were excluded and recorded as a control group. All observations above the threshold participated in the program, which makes the analysis fall into the standard (sharp) regression discontinuity design. + +First, we need to install and load some packages. This can take up to 15 minutes. + +```{r} +dependencies <- c("rdrobust", "fastDummies", "randomForest", "hdm", "gbm", "rdd") +install.packages(dependencies) +``` + +```{r} +lapply(dependencies, library, character.only = TRUE) +``` + +We use a dataset assembled by [Calonico et al. (2014)](https://rdpackages.github.io/references/Calonico-Cattaneo-Titiunik_2014_ECMA--Supplemental.pdf) and follow the analysis in [Noack et al. (2023)](https://arxiv.org/pdf/2107.07942.pdf). + +First, we open the data and remove any observations that have NaN values. + +```{r} +df <- read.csv("https://raw.githubusercontent.com/CausalAIBook/MetricsMLNotebooks/main/data/progresa.csv", + row.names = 1) +comp <- complete.cases(df) +df <- df[comp, ] +print("Shape of Data:") +print(dim(df)) +print("Variable Names:") +print(colnames(df)) +head(df) +``` + +The data set contains 1,944 observations for which full covariate information of 27 variables is available. + +We want to measure the local average treatment effect of program participation on four outcome variables. The outcome variables are food and non-food consumption of the recorded families at two points in time, one year and two years after the implementation of the program. + +The baseline covariates, recorded prior to program implementation, include the household's size; household head's age, sex, years of education and employment status; spouse's age and years of education; number of children not older than five years and their sex, and physical characteristics of the house: whether the house has cement floors, water connection, water connection inside the house, a bathroom, electricity, number of rooms, pre-intervention consumption, and an identifier of the urban locality in which the house is located. + +The data fits to the pattern of a sharp RD design, namely, all individuals that were below the cut-off index received no intervention, and all individuals above the cut-off were eligible to join the *progresa* program and thus participated. + +## Estimation without Covariates + +First, we will perform a very simple RD estimation with a weighted linear regression. We use a triangular kernel, which assigns weights to observations based on their distance from the cutoff point. The weights decrease linearly as the distance from the cutoff point increases. + +```{r} +triangular_kernel <- function(index, h) { + weights <- 1 - abs(index) / h + weights[weights < 0] <- 0 + return(weights) +} +``` + +The parameter `h` is the bandwidth that controls the range of observations that receive non-zero weights. We use the `IKbandwidth` function from the `rdd` package that implements the *Imbens-Kalyanaraman* method. Another standard approach would be to use the standard deviation of `index`. + +```{r} +h <- IKbandwidth(X = df$index, Y = df$conspcfood_t1, cutpoint = 0) +``` + +We use the triangular kernel function to calculate weights for each observation. After that, we can fit two seperate linear regressions for both treatment and control groups. + +```{r} +weights <- triangular_kernel(df$index, h) +model_treated <- lm(conspcfood_t1 ~ index, data = df[df$index > 0, ], weights = weights[df$index > 0]) +model_control <- lm(conspcfood_t1 ~ index, data = df[df$index < 0, ], weights = weights[df$index < 0]) +``` + +The treatment effect at the cutoff point is estimated as the difference between the predictions of the two models at the cutoff point. + +```{r} +cutoff <- 0 +treatment_effect <- predict(model_treated, newdata = data.frame(index = cutoff)) - + predict(model_control, newdata = data.frame(index = cutoff)) +treatment_effect +``` + +We estimate that the participation in the program reduced food consumption by $22.1$ units in the first year after the intervention. We can repeat the estimation using the `rdd` package, which yields us an estimate as well as a confidence band calculated according to the formulas presented in the book. We look at all four targets. + +```{r} +result <- c() +for (outcome in c("conspcfood_t1", "conspcnonfood_t1", "conspcfood_t2", "conspcnonfood_t2")) { + rdd_result <- rdrobust(df[[outcome]], df$index, c = cutoff, rho = 1) + result <- rbind(result, c(rdd_result$coef[1], rdd_result$se["Robust", ])) +} +resframe <- as.data.frame(result) +colnames(resframe) <- c("LATE", "s.e.") +rownames(resframe) <- c("Food T_1", "Non-Food T_1", "Food T_2", "Non-Food T_2") +print(resframe) +``` + +While the effects in the first year after the intervention are negative, we observe significant positive effects in the second year after an individual or household was accepted in the *Progresa* program. This is in accordance to the previous analysis of this dataset. One possible explanation for this is that the program households have more money and can thus afford more. This was the desired effect of the program to combat hunger and extreme poverty. + +The following plot visualizes the two weighted regressions at the cut-off for the last outcome variable (non-food consumption in `t2`). We can clearly see the "jump" at the cut-off, which is our LATE. + +```{r} +rdplot(df$conspcfood_t1, df$index, c = 0, x.lim = c(-1, 1), y.lim = c(250, 400)) +``` + +## Estimation with Covariates + +For identification and estimation of the average treatment effect at the cutoff value no covariate information is required except the running variable, but nevertheless in many applications additional covariates are collected which might be exploited for the analysis. + + +The standard approach is simply to take up the regressors in the weighted least squares regression. + +```{r} +model_treated <- lm(conspcfood_t1 ~ index + hhownhouse + headage + heademp + headeduc, + data = df[df$index > 0, ], weights = weights[df$index > 0]) +model_control <- lm(conspcfood_t1 ~ index + hhownhouse + headage + heademp + headeduc, + data = df[df$index < 0, ], weights = weights[df$index < 0]) +prediction_treated <- predict(model_treated, newdata = data.frame( + index = cutoff, + hhownhouse = weighted.mean(df[df$index > 0, ]$hhownhouse, w = weights[df$index > 0]), + headage = weighted.mean(df[df$index > 0, ]$headage, w = weights[df$index > 0]), + heademp = weighted.mean(df[df$index > 0, ]$heademp, w = weights[df$index > 0]), + headeduc = weighted.mean(df[df$index > 0, ]$headeduc, w = weights[df$index > 0]) +)) +prediction_control <- predict(model_control, newdata = data.frame( + index = cutoff, + hhownhouse = weighted.mean(df[df$index < 0, ]$hhownhouse, w = weights[df$index < 0]), + headage = weighted.mean(df[df$index < 0, ]$headage, w = weights[df$index < 0]), + heademp = weighted.mean(df[df$index < 0, ]$heademp, w = weights[df$index < 0]), + headeduc = weighted.mean(df[df$index < 0, ]$headeduc, w = weights[df$index < 0]) +)) +treatment_effect <- prediction_treated - prediction_control +treatment_effect +``` + +Including these selected covariates does not have a significant impact on the LATE estimation. + +Again, we can also use `rdrobust` to repeat the estimation with all other outcomes. + +```{r} +result <- c() +for (outcome in c("conspcfood_t1", "conspcnonfood_t1", "conspcfood_t2", "conspcnonfood_t2")) { + rdd_result <- rdrobust(df[[outcome]], df$index, c = cutoff, rho = 1, covs = df[, c(1:8, 10:17, 19, 22)]) + result <- rbind(result, c(rdd_result$coef[1], rdd_result$se["Robust", ])) +} +resframe_adj <- as.data.frame(result) +colnames(resframe_adj) <- c("LATE", "s.e.") +rownames(resframe_adj) <- c("Food T_1", "Non-Food T_1", "Food T_2", "Non-Food T_2") +resframe_adj["% reduction"] <- (resframe_adj["s.e."] - resframe[, 2]) * 100 / resframe[, 2] +print(resframe_adj) +``` + +Overall, the adjustment by only a few covariates has not changed the estimated coefficient much from the result without covariates. However, including covariates does reduce the standard errors. + +## Estimation using ML + +As discussed in the book, including many covariates in RDD estimation can be beneficial for multiple reasons: +1. **Efficiency and power improvements**: As in randomized control trials, using covariates can increase efficiency and improve power. +2. **Auxiliary information**: In RDD the score determines the treatment assignment and measurement errors in the running variable can distort the results. Additional covariates can be exploited to overcome these issues or to deal with missing data problems. +3. **Treatment effect heterogeneity**: Covariates can be used to define subgroups in which the treatment effects differ. +4. **Other parameters of interest and extrapolation**: As the identified treatment effect in RDD is local at the cutoff, additional covariates might help for extrapolation of the treatment effects or identify other causal parameters. + +However, including a high number of covariates also comes with additional challenges, such as variables selection, non-linearities or interactions between covariates. The best way to overcome these is the use of modern ML methods. + +There are multiple ways to implement the estimators presented in the book, we will closely follow the analysis of [Noack et al. (2023)](https://arxiv.org/pdf/2107.07942.pdf). We set up running variable and outcome as above. The baseline covariates will be all the other variables in the data. + +```{r} +# Running Variable and Outcome +df_ml <- df +investigated_outcome <- "conspcfood_t1" +names(df_ml)[names(df_ml) == "index"] <- "X" +names(df_ml)[names(df_ml) == investigated_outcome] <- "Y" + +# Baseline covariates including consumption +b_covs <- names(df_ml[, c(1:8, 10:17, 19, 22)]) + +# Fixed effects for localities +i_fe <- as.matrix(dummy_cols(df_ml$clus, remove_first_dummy = TRUE)) + +# Flexible covariates including localities indicators +f_covs <- as.matrix(model.matrix(~ .^2, data = df_ml[b_covs]))[, -1] +Zlasso <- as.matrix(cbind(i_fe, f_covs)) +``` + +We will use the package `rdrobust` for the RD estimation. Before starting the DML procedure, we have to estimate a bandwidth to restrict the samples in the first stage estimation. + +```{r} +h_fs <- 2 * rdrobust(df_ml$Y, df_ml$X, rho = 1)$bws[[1]] +``` + +The next chunk sets up the crossfitting and estimates the function $\eta(Z)$, which we will use to adjust $Y$ for the second stage. We use Random Forest, a Boosting implementation, Linear Regression and Lasso with both a baseline and flexible covariate structure. + +```{r} +first_stage <- function() { + # Set up the cross-fitting + n <- nrow(df_ml) + Kf <- 5 # Number of folds + sampleframe <- rep(1:Kf, ceiling(n / Kf)) + cfgroup <- sample(sampleframe, size = n, replace = FALSE) + + # Matrix to store eta predictions + eta_fit <- matrix(NA, n, 5) + + # Create vector of observations to be considered in the first stage model + weights <- (abs(df_ml$X) < h_fs) + + for (k in 1:Kf) { + fold <- (cfgroup == k) + + data_treated <- df_ml[df_ml$X > 0 & !fold & weights > 0, ] + data_control <- df_ml[df_ml$X < 0 & !fold & weights > 0, ] + + data_fold <- df_ml[fold, ] + + model <- as.formula(paste("Y~", paste(b_covs, collapse = "+"))) + + rf1 <- randomForest(model, data = data_treated, mtry = 4, ntree = 1000) + rf0 <- randomForest(model, data = data_control, mtry = 4, ntree = 1000) + eta_fit[fold, 1] <- (predict(rf1, data_fold) + predict(rf0, data_fold)) / 2 + + gbm1 <- gbm(model, + data = data_treated, n.trees = 100, interaction.depth = 1, + shrinkage = .1, distribution = "gaussian" + ) + gbm0 <- gbm(model, + data = data_control, n.trees = 100, interaction.depth = 1, + shrinkage = .1, distribution = "gaussian" + ) + eta_fit[fold, 2] <- (predict(gbm1, data_fold, n.trees = 100) + predict(gbm0, data_fold, n.trees = 100)) / 2 + + lm1 <- lm(model, data = data_treated) + lm0 <- lm(model, data = data_control) + eta_fit[fold, 3] <- (predict(lm1, data_fold) + predict(lm0, data_fold)) / 2 + + las_base1 <- rlasso(model, data = data_treated) + las_base0 <- rlasso(model, data = data_control) + eta_fit[fold, 4] <- (predict(las_base1, data_fold) + predict(las_base0, data_fold)) / 2 + + data_treated_extended <- cbind(Zlasso[rownames(data_treated), ], data_treated) + data_control_extended <- cbind(Zlasso[rownames(data_control), ], data_control) + data_fold_extended <- cbind(Zlasso[rownames(data_fold), ], data_fold) + model_flex <- as.formula(paste("Y~", paste(c(b_covs, colnames(Zlasso)), collapse = "+"))) + + las_flex1 <- rlasso(model_flex, data = data_treated_extended) + las_flex0 <- rlasso(model_flex, data = data_control_extended) + eta_fit[fold, 5] <- (predict(las_flex1, data_fold_extended) + predict(las_flex0, data_fold_extended)) / 2 + } + return(eta_fit) +} + +eta_fit <- first_stage() +``` + +With the estimated $\hat{\eta}(Z)$ we can correct for confounding in $Y$ and now run the RDD estimation as second stage again. + +```{r} +methods <- c("Random Forest", "Gradient Boosting", "Linear Regression", "Lasso Baseline", "Lasso Flexible") + +second_stage <- function(eta_fit) { + adj_results <- NULL + + for (i in seq_along(methods)) { + m_y <- df_ml$Y - eta_fit[, i] + rdd_result <- rdrobust(m_y, df$index, c = cutoff, rho = 1) + adj_results <- rbind(adj_results, c(rdd_result$coef[1], rdd_result$se["Robust", ])) + } + return(adj_results) +} + +adj_frame <- as.data.frame(second_stage(eta_fit)) +rownames(adj_frame) <- methods +colnames(adj_frame) <- c("LATE", "s.e.") +print(adj_frame) +``` + +Finally, we create a small simulation study with only $R=20$ repetitions to show the variance reducing effect of the inclusion of ML-based estimators for the covariates. The next block runs up to ten minutes. + +```{r} +estimates <- adj_frame[, 1] +std_err <- adj_frame[, 2] +R <- 19 + +for (i in 1:R) { + eta_fit <- first_stage() + adj_results <- second_stage(eta_fit) + estimates <- cbind(estimates, adj_results[, 1]) + std_err <- cbind(std_err, adj_results[, 2]) +} +``` + +We aggregate the median of the estimates, the mean of the standard errors and also calculate the mean reduction of standard error compared to the "no covariates" estimation. We see, that including covariates can reduce the standard error of estimation around 15-20%. + +```{r} +med_est <- apply(estimates, 1, median) +mean_se <- apply(std_err, 1, mean) +adj_frame <- as.data.frame(cbind(med_est, mean_se)) +rownames(adj_frame) <- methods +colnames(adj_frame) <- c("LATE", "s.e.") +adj_frame["% reduction"] <- (adj_frame["s.e."] - resframe[1, 2]) * 100 / resframe[1, 2] +adj_frame["Linear Adjusted (no cross-fit)", ] <- resframe_adj[1, ] +print(adj_frame) +``` + +## We now repeat the exercise for the other outcomes (excluding the simulation). + +Non-Food Consumption (Year 1) + +```{r} +# Running Variable and Outcome +df_ml <- df +investigated_outcome <- "conspcnonfood_t1" +names(df_ml)[names(df_ml) == "index"] <- "X" +names(df_ml)[names(df_ml) == investigated_outcome] <- "Y" + +# Baseline covariates including consumption +b_covs <- names(df_ml[, c(1:8, 10:17, 19, 22)]) + +# Fixed effects for localities +i_fe <- as.matrix(dummy_cols(df_ml$clus, remove_first_dummy = TRUE)) + +# Flexible covariates including localities indicators +f_covs <- as.matrix(model.matrix(~ .^2, data = df_ml[b_covs]))[, -1] +Zlasso <- as.matrix(cbind(i_fe, f_covs)) + +h_fs <- 2 * rdrobust(df_ml$Y, df_ml$X, rho = 1)$bws[[1]] + +first_stage <- function() { + # Set up the cross-fitting + n <- nrow(df_ml) + Kf <- 5 # Number of folds + sampleframe <- rep(1:Kf, ceiling(n / Kf)) + cfgroup <- sample(sampleframe, size = n, replace = FALSE) + + # Matrix to store eta predictions + eta_fit <- matrix(NA, n, 5) + + # Create vector of observations to be considered in the first stage model + weights <- (abs(df_ml$X) < h_fs) + + for (k in 1:Kf) { + fold <- (cfgroup == k) + + data_treated <- df_ml[df_ml$X > 0 & !fold & weights > 0, ] + data_control <- df_ml[df_ml$X < 0 & !fold & weights > 0, ] + + data_fold <- df_ml[fold, ] + + model <- as.formula(paste("Y~", paste(b_covs, collapse = "+"))) + + rf1 <- randomForest(model, data = data_treated, mtry = 4, ntree = 1000) + rf0 <- randomForest(model, data = data_control, mtry = 4, ntree = 1000) + eta_fit[fold, 1] <- (predict(rf1, data_fold) + predict(rf0, data_fold)) / 2 + + gbm1 <- gbm(model, + data = data_treated, n.trees = 100, interaction.depth = 1, + shrinkage = .1, distribution = "gaussian" + ) + gbm0 <- gbm(model, + data = data_control, n.trees = 100, interaction.depth = 1, + shrinkage = .1, distribution = "gaussian" + ) + eta_fit[fold, 2] <- (predict(gbm1, data_fold, n.trees = 100) + predict(gbm0, data_fold, n.trees = 100)) / 2 + + lm1 <- lm(model, data = data_treated) + lm0 <- lm(model, data = data_control) + eta_fit[fold, 3] <- (predict(lm1, data_fold) + predict(lm0, data_fold)) / 2 + + las_base1 <- rlasso(model, data = data_treated) + las_base0 <- rlasso(model, data = data_control) + eta_fit[fold, 4] <- (predict(las_base1, data_fold) + predict(las_base0, data_fold)) / 2 + + data_treated_extended <- cbind(Zlasso[rownames(data_treated), ], data_treated) + data_control_extended <- cbind(Zlasso[rownames(data_control), ], data_control) + data_fold_extended <- cbind(Zlasso[rownames(data_fold), ], data_fold) + model_flex <- as.formula(paste("Y~", paste(c(b_covs, colnames(Zlasso)), collapse = "+"))) + + las_flex1 <- rlasso(model_flex, data = data_treated_extended) + las_flex0 <- rlasso(model_flex, data = data_control_extended) + eta_fit[fold, 5] <- (predict(las_flex1, data_fold_extended) + predict(las_flex0, data_fold_extended)) / 2 + } + return(eta_fit) +} + +eta_fit <- first_stage() + +methods <- c("Random Forest", "Gradient Boosting", "Linear Regression", "Lasso Baseline", "Lasso Flexible") + +second_stage <- function(eta_fit) { + adj_results <- NULL + + for (i in seq_along(methods)) { + m_y <- df_ml$Y - eta_fit[, i] + rdd_result <- rdrobust(m_y, df$index, c = cutoff, rho = 1) + adj_results <- rbind(adj_results, c(rdd_result$coef[1], rdd_result$se["Robust", ])) + } + return(adj_results) +} + +adj_frame <- as.data.frame(second_stage(eta_fit)) +rownames(adj_frame) <- methods +colnames(adj_frame) <- c("LATE", "s.e.") +print(adj_frame) +``` + +Food Consumption (Year 2) + +```{r} +# Running Variable and Outcome +df_ml <- df +investigated_outcome <- "conspcfood_t2" +names(df_ml)[names(df_ml) == "index"] <- "X" +names(df_ml)[names(df_ml) == investigated_outcome] <- "Y" + +# Baseline covariates including consumption +b_covs <- names(df_ml[, c(1:8, 10:17, 19, 22)]) + +# Fixed effects for localities +i_fe <- as.matrix(dummy_cols(df_ml$clus, remove_first_dummy = TRUE)) + +# Flexible covariates including localities indicators +f_covs <- as.matrix(model.matrix(~ .^2, data = df_ml[b_covs]))[, -1] +Zlasso <- as.matrix(cbind(i_fe, f_covs)) + +h_fs <- 2 * rdrobust(df_ml$Y, df_ml$X, rho = 1)$bws[[1]] + +first_stage <- function() { + # Set up the cross-fitting + n <- nrow(df_ml) + Kf <- 5 # Number of folds + sampleframe <- rep(1:Kf, ceiling(n / Kf)) + cfgroup <- sample(sampleframe, size = n, replace = FALSE) + + # Matrix to store eta predictions + eta_fit <- matrix(NA, n, 5) + + # Create vector of observations to be considered in the first stage model + weights <- (abs(df_ml$X) < h_fs) + + for (k in 1:Kf) { + fold <- (cfgroup == k) + + data_treated <- df_ml[df_ml$X > 0 & !fold & weights > 0, ] + data_control <- df_ml[df_ml$X < 0 & !fold & weights > 0, ] + + data_fold <- df_ml[fold, ] + + model <- as.formula(paste("Y~", paste(b_covs, collapse = "+"))) + + rf1 <- randomForest(model, data = data_treated, mtry = 4, ntree = 1000) + rf0 <- randomForest(model, data = data_control, mtry = 4, ntree = 1000) + eta_fit[fold, 1] <- (predict(rf1, data_fold) + predict(rf0, data_fold)) / 2 + + gbm1 <- gbm(model, + data = data_treated, n.trees = 100, interaction.depth = 1, + shrinkage = .1, distribution = "gaussian" + ) + gbm0 <- gbm(model, + data = data_control, n.trees = 100, interaction.depth = 1, + shrinkage = .1, distribution = "gaussian" + ) + eta_fit[fold, 2] <- (predict(gbm1, data_fold, n.trees = 100) + predict(gbm0, data_fold, n.trees = 100)) / 2 + + lm1 <- lm(model, data = data_treated) + lm0 <- lm(model, data = data_control) + eta_fit[fold, 3] <- (predict(lm1, data_fold) + predict(lm0, data_fold)) / 2 + + las_base1 <- rlasso(model, data = data_treated) + las_base0 <- rlasso(model, data = data_control) + eta_fit[fold, 4] <- (predict(las_base1, data_fold) + predict(las_base0, data_fold)) / 2 + + data_treated_extended <- cbind(Zlasso[rownames(data_treated), ], data_treated) + data_control_extended <- cbind(Zlasso[rownames(data_control), ], data_control) + data_fold_extended <- cbind(Zlasso[rownames(data_fold), ], data_fold) + model_flex <- as.formula(paste("Y~", paste(c(b_covs, colnames(Zlasso)), collapse = "+"))) + + las_flex1 <- rlasso(model_flex, data = data_treated_extended) + las_flex0 <- rlasso(model_flex, data = data_control_extended) + eta_fit[fold, 5] <- (predict(las_flex1, data_fold_extended) + predict(las_flex0, data_fold_extended)) / 2 + } + return(eta_fit) +} + +eta_fit <- first_stage() + +methods <- c("Random Forest", "Gradient Boosting", "Linear Regression", "Lasso Baseline", "Lasso Flexible") + +second_stage <- function(eta_fit) { + adj_results <- NULL + + for (i in seq_along(methods)) { + m_y <- df_ml$Y - eta_fit[, i] + rdd_result <- rdrobust(m_y, df$index, c = cutoff, rho = 1) + adj_results <- rbind(adj_results, c(rdd_result$coef[1], rdd_result$se["Robust", ])) + } + return(adj_results) +} + +adj_frame <- as.data.frame(second_stage(eta_fit)) +rownames(adj_frame) <- methods +colnames(adj_frame) <- c("LATE", "s.e.") +print(adj_frame) +``` + +Non-Food Consumption (Year 2) + +```{r} +# Running Variable and Outcome +df_ml <- df +investigated_outcome <- "conspcnonfood_t2" +names(df_ml)[names(df_ml) == "index"] <- "X" +names(df_ml)[names(df_ml) == investigated_outcome] <- "Y" + +# Baseline covariates including consumption +b_covs <- names(df_ml[, c(1:8, 10:17, 19, 22)]) + +# Fixed effects for localities +i_fe <- as.matrix(dummy_cols(df_ml$clus, remove_first_dummy = TRUE)) + +# Flexible covariates including localities indicators +f_covs <- as.matrix(model.matrix(~ .^2, data = df_ml[b_covs]))[, -1] +Zlasso <- as.matrix(cbind(i_fe, f_covs)) + +h_fs <- 2 * rdrobust(df_ml$Y, df_ml$X, rho = 1)$bws[[1]] + +first_stage <- function() { + # Set up the cross-fitting + n <- nrow(df_ml) + Kf <- 5 # Number of folds + sampleframe <- rep(1:Kf, ceiling(n / Kf)) + cfgroup <- sample(sampleframe, size = n, replace = FALSE) + + # Matrix to store eta predictions + eta_fit <- matrix(NA, n, 5) + + # Create vector of observations to be considered in the first stage model + weights <- (abs(df_ml$X) < h_fs) + + for (k in 1:Kf) { + fold <- (cfgroup == k) + + data_treated <- df_ml[df_ml$X > 0 & !fold & weights > 0, ] + data_control <- df_ml[df_ml$X < 0 & !fold & weights > 0, ] + + data_fold <- df_ml[fold, ] + + model <- as.formula(paste("Y~", paste(b_covs, collapse = "+"))) + + rf1 <- randomForest(model, data = data_treated, mtry = 4, ntree = 1000) + rf0 <- randomForest(model, data = data_control, mtry = 4, ntree = 1000) + eta_fit[fold, 1] <- (predict(rf1, data_fold) + predict(rf0, data_fold)) / 2 + + gbm1 <- gbm(model, + data = data_treated, n.trees = 100, interaction.depth = 1, + shrinkage = .1, distribution = "gaussian" + ) + gbm0 <- gbm(model, + data = data_control, n.trees = 100, interaction.depth = 1, + shrinkage = .1, distribution = "gaussian" + ) + eta_fit[fold, 2] <- (predict(gbm1, data_fold, n.trees = 100) + predict(gbm0, data_fold, n.trees = 100)) / 2 + + lm1 <- lm(model, data = data_treated) + lm0 <- lm(model, data = data_control) + eta_fit[fold, 3] <- (predict(lm1, data_fold) + predict(lm0, data_fold)) / 2 + + las_base1 <- rlasso(model, data = data_treated) + las_base0 <- rlasso(model, data = data_control) + eta_fit[fold, 4] <- (predict(las_base1, data_fold) + predict(las_base0, data_fold)) / 2 + + data_treated_extended <- cbind(Zlasso[rownames(data_treated), ], data_treated) + data_control_extended <- cbind(Zlasso[rownames(data_control), ], data_control) + data_fold_extended <- cbind(Zlasso[rownames(data_fold), ], data_fold) + model_flex <- as.formula(paste("Y~", paste(c(b_covs, colnames(Zlasso)), collapse = "+"))) + + las_flex1 <- rlasso(model_flex, data = data_treated_extended) + las_flex0 <- rlasso(model_flex, data = data_control_extended) + eta_fit[fold, 5] <- (predict(las_flex1, data_fold_extended) + predict(las_flex0, data_fold_extended)) / 2 + } + return(eta_fit) +} + +eta_fit <- first_stage() + +methods <- c("Random Forest", "Gradient Boosting", "Linear Regression", "Lasso Baseline", "Lasso Flexible") + +second_stage <- function(eta_fit) { + adj_results <- NULL + + for (i in seq_along(methods)) { + m_y <- df_ml$Y - eta_fit[, i] + rdd_result <- rdrobust(m_y, df$index, c = cutoff, rho = 1) + adj_results <- rbind(adj_results, c(rdd_result$coef[1], rdd_result$se["Robust", ])) + } + return(adj_results) +} + +adj_frame <- as.data.frame(second_stage(eta_fit)) +rownames(adj_frame) <- methods +colnames(adj_frame) <- c("LATE", "s.e.") +print(adj_frame) +``` + diff --git a/T/dml-for-conditional-average-treatment-effect.Rmd b/T/dml-for-conditional-average-treatment-effect.Rmd new file mode 100644 index 00000000..2dad6251 --- /dev/null +++ b/T/dml-for-conditional-average-treatment-effect.Rmd @@ -0,0 +1,610 @@ +--- +title: An R Markdown document converted from "T/dml-for-conditional-average-treatment-effect.irnb" +output: html_document +--- + +# DML for CATE + +This is a simple demonstration of Debiased Machine Learning estimator for the Conditional Average Treatment Effect. +Goal is to estimate the effect of 401(k) eligibility on net financial assets for each value of income. +The method is based on the following paper. + +* Title: Debiased Machine Learning of Conditional Average Treatment Effect and Other Causal Functions + +* Authors: Semenova, Vira and Chernozhukov, Victor. + +* Arxiv version: https://arxiv.org/pdf/1702.06240.pdf + +* Published version with replication code: https://academic.oup.com/ectj/advance-article/doi/10.1093/ectj/utaa027/5899048 + +Background + +The target function is Conditional Average Treatment Effect, defined as + +$$ g(x)=E [ Y(1) - Y(0) |X=x], $$ + +where $Y(1)$ and $Y(0)$ are potential outcomes in treated and control group. In our case, $Y(1)$ is the potential Net Financial Assets if a subject is eligible for 401(k), and $Y(0)$ is the potential Net Financial Assets if a subject is ineligible. $X$ is a covariate of interest, in this case, income. +$ g(x)$ shows expected effect of eligibility on NET TFA for a subject whose income level is $x$. + + + +If eligibility indicator is independent of $Y(1), Y(0)$, given pre-401-k assignment characteristics $Z$, the function can expressed in terms of observed data (as opposed to hypothetical, or potential outcomes). Observed data consists of realized NET TFA $Y = D Y(1) + (1-D) Y(0)$, eligibility indicator $D$, and covariates $Z$ which includes $X$, income. The expression for $g(x)$ is + +$$ g(x) = E [ Y (\eta_0) \mid X=x], $$ +where the transformed outcome variable is + +$$Y (\eta) = \dfrac{D}{s(Z)} \left( Y - \mu(1,Z) \right) - \dfrac{1-D}{1-s(Z)} \left( Y - \mu(0,Z) \right) + \mu(1,Z) - \mu(0,Z),$$ + +the probability of eligibility is + +$$s_0(z) = Pr (D=1 \mid Z=z),$$ + +the expected net financial asset given $D =d \in \{1,0\}$ and $Z=z$ is + +$$ \mu(d,z) = E[ Y \mid Z=z, D=d]. $$ + +Our goal is to estimate $g(x)$. + + +In step 1, we estimate the unknown functions $s_0(z), \mu(1,z), \mu(0,z)$ and plug them into $Y (\eta)$. + + +In step 2, we approximate the function $g(x)$ by a linear combination of basis functions: + +$$ g(x) = p(x)' \beta_0, $$ + + +where $p(x)$ is a vector of polynomials or splines and + +$$ \beta_0 = (E p(X) p(X))^{-1} E p(X) Y (\eta_0) $$ + +is the best linear predictor. We report + +$$ +\widehat{g}(x) = p(x)' \widehat{\beta}, +$$ + +where $\widehat{\beta}$ is the ordinary least squares estimate of $\beta_0$ defined on the random sample $(X_i, D_i, Y_i)_{i=1}^N$ + +$$ + \widehat{\beta} :=\left( \dfrac{1}{N} \sum_{i=1}^N p(X_i) p(X_i)' \right)^{-1} \dfrac{1}{N} \sum_{i=1}^N p(X_i)Y_i(\widehat{\eta}) +$$ + + + + + + + +```{r} +## load packages +install.packages("foreign") +install.packages("quantreg") +install.packages("splines") +install.packages("lattice") +install.packages("Hmisc") +install.packages("fda") +install.packages("hdm") +install.packages("randomForest") +install.packages("ranger") +install.packages("sandwich") +install.packages("ggplot2") +``` + +```{r} +library(foreign) +library(quantreg) +library(splines) +library(lattice) +library(Hmisc) +library(fda) +library(hdm) +library(randomForest) +library(ranger) +library(sandwich) +library(ggplot2) +``` + +```{r} +## 401k dataset +data(pension) +pension$net_tfa <- pension$net_tfa / 10000 +## covariate of interest -- log income -- +pension$inc <- log(pension$inc) +pension <- pension[!is.na(pension$inc) & pension$inc != -Inf & pension$inc != Inf, ] + + +## outcome variable -- total net financial assets +Y <- pension$net_tfa +## binary treatment -- indicator of 401(k) eligibility +D <- pension$e401 + + +X <- pension$inc +## target parameter is CATE = E[ Y(1) - Y(0) | X] + + +## raw covariates so that Y(1) and Y(0) are independent of D given Z +Z <- pension[, c( + "age", "inc", "fsize", "educ", "male", "db", "marr", "twoearn", "pira", "hown", "hval", "hequity", "hmort", + "nohs", "hs", "smcol" +)] + + +y_name <- "net_tfa" +d_name <- "e401" +form_z <- paste("(poly(age, 6) + poly(inc, 8) + poly(educ, 4) + poly(fsize,2) + as.factor(marr) + ", + "as.factor(twoearn) + as.factor(db) + as.factor(pira) + as.factor(hown))^2") +cat(sprintf("\n sample size is %g \n", length(Y))) +cat(sprintf("\n num raw covariates z is %g \n", dim(Z)[2])) +``` + +In Step 1, we estimate three functions: + +1. probability of treatment assignment $s_0(z)$ + +2.-3. regression functions $\mu_0(1,z)$ and $\mu_0(0,z)$. + +We use the cross-fitting procedure with $K=2$ holds. For definition of cross-fitting with $K$ folds, check the sample splitting in ```DML2.for.PLM``` function defined in https://www.kaggle.com/victorchernozhukov/debiased-ml-for-partially-linear-model-in-r + +For each function, we try random forest. + +First Stage: estimate $\mu_0(1,z)$ and $\mu_0(0,z)$ and $s_0(z)$ by lasso + +```{r} +first_stage_lasso <- function(data, d_name, y_name, form_z, seed = 1) { + # Sample size + N <- dim(data)[1] + # Estimated regression function in control group + mu0_hat <- rep(1, N) + # Estimated regression function in treated group + mu1_hat <- rep(1, N) + # Propensity score + s_hat <- rep(1, N) + seed <- 1 + ## define sample splitting + set.seed(seed) + inds_train <- sample(1:N, floor(N / 2)) + inds_eval <- setdiff(1:N, inds_train) + + print("Estimate treatment probability, first half") + ## conditional probability of 401 k eligibility (i.e., propensity score) based on random forest + fitted_lasso_pscore <- rlassologit(as.formula(paste0(d_name, "~", form_z)), data = data[inds_train, ]) + + s_hat[inds_eval] <- predict(fitted_lasso_pscore, data[inds_eval, ], type = "response") + print("Estimate treatment probability, second half") + fitted_lasso_pscore <- rlassologit(as.formula(paste0(d_name, "~", form_z)), data = data[inds_eval, ]) + s_hat[inds_train] <- predict(fitted_lasso_pscore, data[inds_train, ], type = "response") + + + data1 <- data + data1[, d_name] <- 1 + + data0 <- data + data0[, d_name] <- 0 + + print("Estimate expectation function, first half") + fitted_lasso_mu <- rlasso(as.formula(paste0(y_name, "~", d_name, "+(", form_z, ")")), data = data[inds_train, ]) + mu1_hat[inds_eval] <- predict(fitted_lasso_mu, data1[inds_eval, ]) + mu0_hat[inds_eval] <- predict(fitted_lasso_mu, data0[inds_eval, ]) + + print("Estimate expectation function, second half") + fitted_lasso_mu <- rlasso(as.formula(paste0(y_name, "~", d_name, "+(", form_z, ")")), data = data[inds_eval, ]) + mu1_hat[inds_train] <- predict(fitted_lasso_mu, data1[inds_train, ]) + mu0_hat[inds_train] <- predict(fitted_lasso_mu, data0[inds_train, ]) + + return(list( + mu1_hat = mu1_hat, + mu0_hat = mu0_hat, + s_hat = s_hat + )) +} +``` + +First Stage: estimate $\mu_0(1,z)$ and $\mu_0(0,z)$ and $s_0(z)$ by random forest + +```{r} +first_stage_rf <- function(Y, D, Z, seed = 1) { + # Sample size + N <- length(D) + # Estimated regression function in control group + mu0_hat <- rep(1, N) + # Estimated regression function in treated group + mu1_hat <- rep(1, N) + # Propensity score + s_hat <- rep(1, N) + + + ## define sample splitting + set.seed(seed) + inds_train <- sample(1:N, floor(N / 2)) + inds_eval <- setdiff(1:N, inds_train) + + print("Estimate treatment probability, first half") + ## conditional probability of 401 k eligibility (i.e., propensity score) based on random forest + Df <- as.factor(as.character(D)) + fitted_rf_pscore <- randomForest(Z, Df, subset = inds_train) + s_hat[inds_eval] <- predict(fitted_rf_pscore, Z[inds_eval, ], type = "prob")[, 2] + print("Estimate treatment probability, second half") + fitted_rf <- randomForest(Z, Df, subset = inds_eval) + s_hat[inds_train] <- predict(fitted_rf_pscore, Z[inds_train, ], type = "prob")[, 2] + + ## conditional expected net financial assets (i.e., regression function) based on random forest + + covariates <- cbind(Z, D) + + covariates1 <- cbind(Z, D = rep(1, N)) + covariates0 <- cbind(Z, D = rep(0, N)) + + print("Estimate expectation function, first half") + fitted_rf_mu <- randomForest(cbind(Z, D), Y, subset = inds_train) + mu1_hat[inds_eval] <- predict(fitted_rf_mu, covariates1[inds_eval, ]) + mu0_hat[inds_eval] <- predict(fitted_rf_mu, covariates0[inds_eval, ]) + + print("Estimate expectation function, second half") + fitted_rf_mu <- randomForest(cbind(Z, D), Y, subset = inds_eval) + mu1_hat[inds_train] <- predict(fitted_rf_mu, covariates1[inds_train, ]) + mu0_hat[inds_train] <- predict(fitted_rf_mu, covariates0[inds_train, ]) + + return(list( + mu1_hat = mu1_hat, + mu0_hat = mu0_hat, + s_hat = s_hat + )) +} +``` + +In Step 2, we approximate $Y(\eta_0)$ by a vector of basis functions. There are two use cases: +**** +2.A. Group Average Treatment Effect, described above + + +2.B. Average Treatment Effect conditional on income value. There are three smoothing options: + +1. splines offered in ```least_squares_splines``` + +2. orthogonal polynomials with the highest degree chosen by cross-validation ```least_squares_series``` + +3. standard polynomials with the highest degree input by user ```least_squares_series_old``` + + +The default option is option 3. + +2.A. The simplest use case of Conditional Average Treatment Effect is GATE, or Group Average Treatment Effect. Partition the support of income as + +$$ - \infty = \ell_0 < \ell_1 < \ell_2 \dots \ell_K = \infty $$ + +define intervals $I_k = [ \ell_{k-1}, \ell_{k})$. Let $X$ be income covariate. For $X$, define a group indicator + +$$ G_k(X) = 1[X \in I_k], $$ + +and the vector of basis functions + +$$ p(X) = (G_1(X), G_2(X), \dots, G_K(X)) $$ + +Then, the Best Linear Predictor $\beta_0$ vector shows the average treatment effect for each group. + +```{r} +## estimate first stage functions by random forest +## may take a while +fs_hat_rf <- first_stage_rf(Y, D, Z) +``` + +```{r} +X <- pension$inc +fs_hat <- fs_hat_rf +min_cutoff <- 0.01 +# regression function +mu1_hat <- fs_hat[["mu1_hat"]] +mu0_hat <- fs_hat[["mu0_hat"]] +# propensity score +s_hat <- fs_hat[["s_hat"]] +s_hat <- sapply(s_hat, max, min_cutoff) + +### Construct Orthogonal Signal +RobustSignal <- (Y - mu1_hat) * D / s_hat - (Y - mu0_hat) * (1 - D) / (1 - s_hat) + mu1_hat - mu0_hat +``` + +```{r} +qtmax <- function(C, S = 10000, alpha) { + p <- nrow(C) + tmaxs <- apply(abs(matrix(rnorm(p * S), nrow = p, ncol = S)), 2, max) + return(quantile(tmaxs, 1 - alpha)) +} + +# This function computes the square root of a symmetric matrix using the spectral decomposition; + +group_average_treatment_effect <- function(X, Y, max_grid = 5, alpha = 0.05, B = 10000) { + grid <- quantile(X, probs = c((0:max_grid) / max_grid)) + Xraw <- matrix(NA, nrow = length(Y), ncol = length(grid) - 1) + + for (k in 2:((length(grid)))) { + Xraw[, k - 1] <- sapply(X, function(x) ifelse(x >= grid[k - 1] & x < grid[k], 1, 0)) + } + k <- length(grid) + Xraw[, k - 1] <- sapply(X, function(x) ifelse(x >= grid[k - 1] & x <= grid[k], 1, 0)) + + ols_fit <- lm(Y ~ Xraw - 1) + coefs <- coef(ols_fit) + vars <- names(coefs) + hcv_coefs <- vcovHC(ols_fit, type = "HC") + coefs_se <- sqrt(diag(hcv_coefs)) # White std errors + ## this is an identity matrix + ## qtmax is simplified + c_coefs <- (diag(1 / sqrt(diag(hcv_coefs)))) %*% hcv_coefs %*% (diag(1 / sqrt(diag(hcv_coefs)))) + + + tes <- coefs + tes_se <- coefs_se + tes_cor <- c_coefs + crit_val <- qtmax(tes_cor, B, alpha) + + tes_ucb <- tes + crit_val * tes_se + tes_lcb <- tes - crit_val * tes_se + + tes_uci <- tes + qnorm(1 - alpha / 2) * tes_se + tes_lci <- tes + qnorm(alpha / 2) * tes_se + + + return(list( + beta_hat = coefs, ghat_lower_point = tes_lci, ghat_upper_point = tes_uci, + ghat_lower = tes_lcb, ghat_upper = tes_ucb, crit_val = crit_val + )) +} +``` + +```{r} +res <- group_average_treatment_effect(X = X, Y = RobustSignal) +``` + +```{r} +## this code is taken from L1 14.382 taught at MIT +## author: Mert Demirer +options(repr.plot.width = 10, repr.plot.height = 8) + +tes <- res$beta_hat +tes_lci <- res$ghat_lower_point +tes_uci <- res$ghat_upper_point + +tes_lcb <- res$ghat_lower +tes_ucb <- res$ghat_upper +tes_lev <- c("0%-20%", "20%-40%", "40%-60%", "60%-80%", "80%-100%") + +plot(c(1, 5), las = 2, xlim = c(0.6, 5.4), ylim = c(.05, 2.09), type = "n", xlab = "Income group", + ylab = "Average Effect on NET TFA (per 10 K)", + main = "Group Average Treatment Effects on NET TFA", + xaxt = "n") +axis(1, at = 1:5, labels = tes_lev) +for (i in 1:5) { + rect(i - 0.2, tes_lci[i], i + 0.2, tes_uci[i], col = NA, border = "red", lwd = 3) + rect(i - 0.2, tes_lcb[i], i + 0.2, tes_ucb[i], col = NA, border = 4, lwd = 3) + segments(i - 0.2, tes[i], i + 0.2, tes[i], lwd = 5) +} +abline(h = 0) + +legend(2.5, 2.0, + c("Regression Estimate", "95% Simultaneous Confidence Interval", "95% Pointwise Confidence Interval"), + col = c(1, 4, 2), lwd = c(4, 3, 3), horiz = FALSE, bty = "n", cex = 0.8) + +dev.off() +``` + +```{r} +least_squares_splines <- function(X, Y, max_knot = 9, norder, nderiv, ...) { + ## Create technical regressors + cv_bsp <- rep(0, max_knot - 1) + for (knot in 2:max_knot) { + breaks <- quantile(X, c(0:knot) / knot) + formula.bsp <- Y ~ bsplineS(X, breaks = breaks, norder = norder, nderiv = nderiv)[, -1] + fit <- lm(formula.bsp) + cv_bsp[knot - 1] <- sum((fit$res / (1 - hatvalues(fit)))^2) + } + ## Number of knots chosen by cross-validation + cv_knot <- which.min(cv_bsp) + 1 + breaks <- quantile(X, c(0:cv_knot) / cv_knot) + formula.bsp <- Y ~ bsplineS(X, breaks = breaks, norder = norder, nderiv = 0)[, -1] + fit <- lm(formula.bsp) + + return(list(cv_knot = cv_knot, fit = fit)) +} + + +least_squares_series <- function(X, Y, max_degree, ...) { + cv_pol <- rep(0, max_degree) + for (degree in 1:max_degree) { + formula.pol <- Y ~ poly(X, degree) + fit <- lm(formula.pol) + cv_pol[degree] <- sum((fit$res / (1 - hatvalues(fit)))^2) + } + ## Number of knots chosen by cross-validation + cv_degree <- which.min(cv_pol) + ## Estimate coefficients + formula.pol <- Y ~ poly(X, cv_degree) + fit <- lm(formula.pol) + + return(list(fit = fit, cv_degree = cv_degree)) +} +``` + +```{r} +msqrt <- function(C) { + Ceig <- eigen(C) + return(Ceig$vectors %*% diag(sqrt(Ceig$values)) %*% solve(Ceig$vectors)) +} + +tboot <- function(regressors_grid, omega_hat, alpha, B = 10000) { + numerator_grid <- regressors_grid %*% msqrt(omega_hat) + denominator_grid <- sqrt(diag(regressors_grid %*% omega_hat %*% t(regressors_grid))) + + norm_numerator_grid <- numerator_grid + for (k in seq_len(dim(numerator_grid)[1])) { + norm_numerator_grid[k, ] <- numerator_grid[k, ] / denominator_grid[k] + } + + tmaxs <- apply(abs(norm_numerator_grid %*% matrix(rnorm(dim(numerator_grid)[2] * B), + nrow = dim(numerator_grid)[2], ncol = B)), 2, max) + return(quantile(tmaxs, 1 - alpha)) +} +``` + +```{r} +second_stage <- function(fs_hat, Y, D, X, max_degree = 3, norder = 4, nderiv = 0, + ss_method = "poly", min_cutoff = 0.01, alpha = 0.05, eps = 0.1, ...) { + x_grid <- seq(min(X), max(X), eps) + mu1_hat <- fs_hat[["mu1_hat"]] + mu0_hat <- fs_hat[["mu0_hat"]] + # propensity score + s_hat <- fs_hat[["s_hat"]] + s_hat <- sapply(s_hat, max, min_cutoff) + ### Construct Orthogonal Signal + + RobustSignal <- (Y - mu1_hat) * D / s_hat - (Y - mu0_hat) * (1 - D) / (1 - s_hat) + mu1_hat - mu0_hat + + # Estimate the target function using least squares series + if (ss_method == "ortho_poly") { + res <- least_squares_series(X = X, Y = RobustSignal, eps = 0.1, max_degree = max_degree) + fit <- res$fit + cv_degree <- res$cv_degree + regressors_grid <- cbind(rep(1, length(x_grid)), poly(x_grid, cv_degree)) + } + if (ss_method == "splines") { + res <- least_squares_splines(X = X, Y = RobustSignal, eps = 0.1, norder = norder, nderiv = nderiv) + fit <- res$fit + cv_knot <- res$cv_knot + breaks <- quantile(X, c(0:cv_knot) / cv_knot) + regressors_grid <- cbind(rep(1, length(x_grid)), + bsplineS(x_grid, breaks = breaks, norder = norder, nderiv = nderiv)[, -1]) + degree <- cv_knot + } + + g_hat <- regressors_grid %*% coef(fit) + + hcv_coefs <- vcovHC(fit, type = "HC") + standard_error <- sqrt(diag(regressors_grid %*% hcv_coefs %*% t(regressors_grid))) + + ### Lower Pointwise CI + ghat_lower_point <- g_hat + qnorm(alpha / 2) * standard_error + ### Upper Pointwise CI + ghat_upper_point <- g_hat + qnorm(1 - alpha / 2) * standard_error + + max_tstat <- tboot(regressors_grid = regressors_grid, omega_hat = hcv_coefs, alpha = alpha) + + ## Lower Uniform CI + ghat_lower <- g_hat - max_tstat * standard_error + ## Upper Uniform CI + ghat_upper <- g_hat + max_tstat * standard_error + return(list( + ghat_lower = ghat_lower, g_hat = g_hat, ghat_upper = ghat_upper, fit = fit, + ghat_lower_point = ghat_lower_point, ghat_upper_point = ghat_upper_point, x_grid = x_grid + )) +} +``` + +```{r} +make_plot <- function(res, lowy, highy, degree, ss_method = "series", uniform = TRUE, ...) { + title <- paste0("Effect of 401(k) on Net TFA, ", ss_method) + x_grid <- res$x_grid + len <- length(x_grid) + + + if (uniform) { + group <- c(rep("UCI", len), rep("PCI", len), rep("Estimate", len), rep("PCIL", len), rep("UCIL", len)) + group_type <- c(rep("CI", len), rep("CI", len), rep("Estimate", len), rep("CI", len), rep("CI", len)) + group_ci_type <- c(rep("Uniform", len), rep("Point", len), + rep("Uniform", len), rep("Point", len), rep("Uniform", len)) + + df <- data.frame(income = rep(x_grid, 5), + outcome = c(res$ghat_lower, res$ghat_lower_point, + res$g_hat, res$ghat_upper_point, res$ghat_upper), + group = group, group_col = group_type, group_line = group_ci_type) + p <- ggplot(data = df) + + aes(x = exp(income), y = outcome, colour = group) + + theme_bw() + + xlab("Income") + + ylab("Net TFA, (thousand dollars)") + + scale_colour_manual(values = c("black", "blue", "blue", "blue", "blue")) + + theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 20, family = "serif")) + + theme(legend.title = element_blank()) + + theme(legend.position = "none") + + ylim(low = lowy, high = highy) + + geom_line(aes(linetype = group_line), size = 1.5) + + scale_linetype_manual(values = c("dashed", "solid")) + + ggtitle(title) + } + + if (!uniform) { + group <- c(rep("PCI", len), rep("Estimate", len), rep("PCIL", len)) + group_type <- c(rep("CI", len), rep("Estimate", len), rep("CI", len)) + group_ci_type <- c(rep("Point", len), rep("Uniform", len), rep("Point", len)) + + df <- data.frame(income = rep(x_grid, 3), + outcome = c(res$ghat_lower_point, res$g_hat, res$ghat_upper_point), + group = group, group_col = group_type, group_line = group_ci_type) + + p <- ggplot(data = df) + + aes(x = exp(income), y = outcome, colour = group) + + theme_bw() + + xlab("Income") + + ylab("Net TFA, (thousand dollars)") + + scale_colour_manual(values = c("black", "blue", "blue", "blue", "blue")) + + theme(plot.title = element_text(hjust = 0.5), text = element_text(size = 20, family = "serif")) + + theme(legend.title = element_blank()) + + theme(legend.position = "none") + + ylim(low = lowy, high = highy) + + geom_line(aes(linetype = group_line), size = 1.5) + + scale_linetype_manual(values = c("dashed", "solid")) + + ggtitle(title) + } + + return(p) +} +``` + +```{r} +res_ortho_rf_splines <- second_stage(fs_hat = fs_hat_rf, X = X, D = D, Y = Y, + ss_method = "splines", max_degree = 3) +``` + +```{r} +res_ortho_rf_ortho_poly <- second_stage(fs_hat = fs_hat_rf, X = X, D = D, Y = Y, + ss_method = "ortho_poly", max_degree = 3) +``` + +plot findings + +-- black solid line shows estimated function $p(x)' \widehat{\beta}$ + +-- blue dashed lines show pointwise confidence bands for this function + +```{r} +p <- make_plot(res_ortho_rf_ortho_poly, ss_method = "ortho_poly", uniform = FALSE, lowy = -10, highy = 20) +options(repr.plot.width = 15, repr.plot.height = 10) +print(p) +``` + +plot findings: + +-- black solid line shows estimated function $p(x)' \widehat{\beta}$ + +-- blue dashed lines show pointwise confidence bands for this function. I.e., for each fixed point $x_0$, i.e., $x_0=1$, they cover $p(x_0)'\beta_0$ with probability 0.95 + +-- blue solid lines show uniform confidence bands for this function. I.e., they cover the whole function $x \rightarrow p(x)'\beta_0$ with probability 0.95 on some compact range + +```{r} +p <- make_plot(res_ortho_rf_ortho_poly, ss_method = "ortho polynomials", uniform = TRUE, lowy = -10, highy = 25) +options(repr.plot.width = 15, repr.plot.height = 10) +print(p) +``` + +```{r} +p <- make_plot(res_ortho_rf_splines, ss_method = "splines", uniform = FALSE, lowy = -15, highy = 10) +options(repr.plot.width = 15, repr.plot.height = 10) +print(p) +``` + +```{r} +p <- make_plot(res_ortho_rf_splines, ss_method = "splines", uniform = TRUE, lowy = -20, highy = 20) +options(repr.plot.width = 15, repr.plot.height = 10) +print(p) +``` +