diff --git a/T/deprecated/debiased-ml-for-partially-linear-iv-model-in-r.irnb.Rmd b/T/deprecated/debiased-ml-for-partially-linear-iv-model-in-r.irnb.Rmd
new file mode 100644
index 00000000..7d506973
--- /dev/null
+++ b/T/deprecated/debiased-ml-for-partially-linear-iv-model-in-r.irnb.Rmd
@@ -0,0 +1,264 @@
+---
+jupyter:
+  jupytext:
+    text_representation:
+      extension: .Rmd
+      format_name: rmarkdown
+      format_version: '1.2'
+    jupytext_version: 1.13.7
+  kernelspec:
+    display_name: R
+    language: R
+    name: ir
+---
+
+
+# Double/Debiased ML for Partially Linear IV Model
+
+References:
+
+https://arxiv.org/abs/1608.00060
+
+https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778
+
+The code is based on the book.
+
+
+
+# Partially Linear IV Model
+
+We consider the partially linear structural equation model:
+\begin{eqnarray}
+ &  Y - D\theta_0 = g_0(X) + \zeta,  & E[\zeta \mid Z,X]= 0,\\
+ &  Z = m_0(X) + V,  & E[V \mid X] = 0.
+\end{eqnarray}
+
+
+Note that this model is not a regression model unless $Z=D$. The model is a canonical
+model in causal inference, going back to P. Wright's work on IV methods for estimating demand/supply equations, with the modern difference being that $g_0$ and $m_0$ are nonlinear, potentially complicated functions of high-dimensional $X$.
+
+
+The idea of this model is that there is a structural or causal relation between $Y$ and $D$, captured by $\theta_0$, and $g_0(X) + \zeta$ is the stochastic error, partly explained by the covariates $X$. $V$ and $\zeta$ are stochastic errors that are not explained by $X$. Since $Y$ and $D$ are jointly determined, we need an external factor, commonly referred to as an instrument, $Z$, to create exogenous variation
+in $D$. Note that $Z$ should affect $D$. The $X$ here serve again as confounding factors, so we can think of variation in $Z$ as being exogenous only conditional on $X$.
+
+
+The causal DAG this model corresponds to is given by:
+$$
+Z \to D, \quad X \to (Y, Z, D), \quad L \to (Y,D),
+$$
+where $L$ is a latent confounder affecting both $Y$ and $D$, but not $Z$.
+
+
+---
+
+# Example
+
+A simple contextual example is from biostatistics, where $Y$ is a health outcome and $D$ is an indicator of smoking. Thus, $\theta_0$ captures the effect of smoking on health. Health outcome $Y$ and smoking behavior $D$ are treated as being jointly determined. $X$ represents patient characteristics, and $Z$ could be a doctor's advice not to smoke (or another behavioral treatment) that may affect the outcome $Y$ only through shifting the behavior $D$, conditional on characteristics $X$.
+
+----
+
+
+# PLIVM in Residualized Form
+
+The PLIV model above can be rewritten in the following residualized form:
+$$
+  \tilde Y = \tilde D \theta_0 + \zeta, \quad E[\zeta \mid V,X]= 0,
+$$
+where
+$$
+ \tilde Y = (Y- \ell_0(X)), \quad \ell_0(X) = E[Y \mid X], \\
+ \tilde D = (D - r_0(X)), \quad r_0(X) = E[D \mid X], \\
+ \tilde Z = (Z- m_0(X)), \quad m_0(X) = E[Z \mid X].
+$$
+The tilded variables above represent the original variables after taking out or "partialling out"
+the effect of $X$. Note that $\theta_0$ is identified from this equation if $\tilde Z$ and $\tilde D$ have non-zero correlation, which in particular requires that both $\tilde Z$ and $\tilde D$ have non-zero variance.
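+
+Concretely, multiplying the residualized equation by $\tilde Z = V$ and using $E[\zeta \mid V,X]=0$ gives the IV moment condition
+$$
+E[\tilde Z(\tilde Y - \tilde D \theta_0)] = 0
+\quad \Longrightarrow \quad
+\theta_0 = \frac{E[\tilde Z \tilde Y]}{E[\tilde Z \tilde D]},
+$$
+so $\theta_0$ is identified exactly when $E[\tilde Z \tilde D] \neq 0$.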
+
+
+-----
+
+# DML for PLIV Model
+
+Given identification, DML proceeds as follows.
+
+Compute the estimates $\hat \ell_0$, $\hat r_0$, and $\hat m_0$, which amounts
+to solving the three problems of predicting $Y$, $D$, and $Z$ using
+$X$ with any generic ML method, giving us the estimated residuals
+$$
+\tilde Y = Y - \hat \ell_0(X), \\ \tilde D = D - \hat r_0(X), \\ \tilde Z = Z - \hat m_0(X).
+$$
+The estimates should be of a cross-validated form, as detailed in the algorithm below.
+
+Estimate $\theta_0$ by the instrumental
+variable regression of $\tilde Y$ on $\tilde D$ using $\tilde Z$ as an instrument.
+Use the conventional inference for the IV regression estimator, ignoring
+the estimation error in these residuals.
+
+The reason we work with this residualized form is that it eliminates the bias
+arising when solving the prediction problem in stage 1. The role of cross-validation
+is to avoid another source of bias due to potential overfitting.
+
+The estimator is adaptive,
+in the sense that the first stage estimation errors do not affect the second
+stage errors.
+
+
+```{r _kg_hide-output=TRUE}
+install.packages("hdm")
+install.packages("AER")
+install.packages("randomForest")
+```
+
+```{r}
+
+library(AER)           # applied econometrics library
+library(randomForest)  # random forest library
+library(hdm)           # high-dimensional econometrics library
+library(glmnet)        # glmnet
+
+
+# DML for PLIVM
+
+DML2.for.PLIVM <- function(x, d, z, y, dreg, yreg, zreg, nfold=2) {
+  # this implements the DML2 algorithm: the moments are estimated via cross-fitting
+  # before constructing the pooled estimate of theta; randomly split data into folds
+  nobs <- nrow(x)
+  foldid <- rep.int(1:nfold, times = ceiling(nobs/nfold))[sample.int(nobs)]
+  I <- split(1:nobs, foldid)
+  # create residualized objects to fill
+  ytil <- dtil <- ztil <- rep(NA, nobs)
+  # obtain cross-fitted residuals
+  cat("fold: ")
+  for(b in 1:length(I)){
+    dfit <- dreg(x[-I[[b]],], d[-I[[b]]])  # take a fold out
+    zfit <- zreg(x[-I[[b]],], z[-I[[b]]])  # take a fold out
+    yfit <- yreg(x[-I[[b]],], y[-I[[b]]])  # take a fold out
+    dhat <- predict(dfit, x[I[[b]],], type="response")  # predict the held-out fold
+    zhat <- predict(zfit, x[I[[b]],], type="response")  # predict the held-out fold
+    yhat <- predict(yfit, x[I[[b]],], type="response")  # predict the held-out fold
+    dtil[I[[b]]] <- (d[I[[b]]] - dhat)  # record residual
+    ztil[I[[b]]] <- (z[I[[b]]] - zhat)  # record residual
+    ytil[I[[b]]] <- (y[I[[b]]] - yhat)  # record residual
+    cat(b," ")
+  }
+  ivfit <- tsls(y=ytil, d=dtil, x=NULL, z=ztil, intercept=FALSE)
+  coef.est <- ivfit$coef  # extract coefficient
+  se <- ivfit$se          # record standard error
+  cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est, se))
+  return( list(coef.est=coef.est, se=se, dtil=dtil, ytil=ytil, ztil=ztil) )
+}
+
+
+```
+
+
+-----
+
+# Empirical Example: Acemoglu, Johnson, Robinson (AER)
+ + +* Y is log GDP; +* D is a measure of Protection from Expropriation, a proxy for +quality of insitutions; +* Z is the log of Settler's mortality; +* W are geographical variables (latitude, latitude squared, continent dummies as well as interactions) + + + + +```{r} + +data(AJR); + +y = AJR$GDP; +d = AJR$Exprop; +z = AJR$logMort +xraw= model.matrix(~ Latitude+ Africa+Asia + Namer + Samer, data=AJR) +x = model.matrix(~ -1 + (Latitude + Latitude2 + Africa + + Asia + Namer + Samer)^2, data=AJR) +dim(x) + +# DML with Random Forest +cat(sprintf("\n DML with Random Forest \n")) + +dreg <- function(x,d){ randomForest(x, d) } #ML method=Forest +yreg <- function(x,y){ randomForest(x, y) } #ML method=Forest +zreg<- function(x,z){ randomForest(x, z)} #ML method=Forest + +set.seed(1) +DML2.RF = DML2.for.PLIVM(xraw, d, z, y, dreg, yreg, zreg, nfold=20) + +# DML with PostLasso +cat(sprintf("\n DML with Post-Lasso \n")) + +dreg <- function(x,d){ rlasso(x, d) } #ML method=lasso +yreg <- function(x,y){ rlasso(x, y) } #ML method=lasso +zreg<- function(x,z){ rlasso(x, z)} #ML method=lasso + +set.seed(1) +DML2.lasso = DML2.for.PLIVM(x, d, z, y, dreg, yreg, zreg, nfold=20) + + +# Compare Forest vs Lasso + +comp.tab= matrix(NA, 3, 2) +comp.tab[1,] = c( sqrt(mean((DML2.RF$ytil)^2)), sqrt(mean((DML2.lasso$ytil)^2)) ) +comp.tab[2,] = c( sqrt(mean((DML2.RF$dtil)^2)), sqrt(mean((DML2.lasso$dtil)^2)) ) +comp.tab[3,] = c( sqrt(mean((DML2.RF$ztil)^2)), sqrt(mean((DML2.lasso$ztil)^2)) ) +rownames(comp.tab) = c("RMSE for Y:", "RMSE for D:", "RMSE for Z:") +colnames(comp.tab) = c("RF", "LASSO") +print(comp.tab, digits=3) +``` + +# Examine if we have weak instruments + +```{r} +install.packages("lfe") +library(lfe) +summary(felm(DML2.lasso$dtil~DML2.lasso$ztil), robust=T) +summary(felm(DML2.RF$dtil~DML2.RF$ztil), robust=T) +``` + +# We do have weak instruments, because t-stats in regression $\tilde D \sim \tilde Z$ are less than 4 in absolute value + + +So let's carry out DML inference combined with Anderson-Rubin Idea + +```{r} +# DML-AR (DML with Anderson-Rubin) + +DML.AR.PLIV<- function(rY, rD, rZ, grid, alpha=.05){ + n = length(rY) + Cstat = rep(0, length(grid)) + for (i in 1:length(grid)) { + Cstat[i] <- n* (mean( (rY - grid[i]*rD)*rZ) )^2/var ( (rY - grid[i]*rD) * rZ ) + }; + LB<- min(grid[ Cstat < qchisq(1-alpha,1)]); + UB <- max(grid[ Cstat < qchisq(1-alpha,1)]); + plot(range(grid),range(c( Cstat)) , type="n",xlab="Effect of institutions", ylab="Statistic", main=" "); + lines(grid, Cstat, lty = 1, col = 1); + abline(h=qchisq(1-alpha,1), lty = 3, col = 4); + abline(v=LB, lty = 3, col = 2); + abline(v=UB, lty = 3, col = 2); + return(list(UB=UB, LB=LB)) + } + +``` + +```{r} +DML.AR.PLIV(rY = DML2.lasso$ytil, rD= DML2.lasso$dtil, rZ= DML2.lasso$ztil, + grid = seq(-2, 2, by =.01)) + + +DML.AR.PLIV(rY = DML2.RF$ytil, rD= DML2.RF$dtil, rZ= DML2.RF$ztil, + grid = seq(-2, 2, by =.01)) + +``` diff --git a/T/deprecated/dml-for-ate-and-late-of-401-k-on-wealth.irnb.Rmd b/T/deprecated/dml-for-ate-and-late-of-401-k-on-wealth.irnb.Rmd new file mode 100644 index 00000000..f5daee27 --- /dev/null +++ b/T/deprecated/dml-for-ate-and-late-of-401-k-on-wealth.irnb.Rmd @@ -0,0 +1,677 @@ +--- +jupyter: + jupytext: + text_representation: + extension: .Rmd + format_name: rmarkdown + format_version: '1.2' + jupytext_version: 1.13.7 + kernelspec: + display_name: R + language: R + name: ir +--- + +# Inference on Predictive and Causal Effects in High-Dimensional Nonlinear Models + + +## Impact of 401(k) on Financial Wealth + +As a practical 
illustration of the methods developed in this lecture, we consider estimation of the effect of 401(k) eligibility and participation +on accumulated assets. 401(k) plans are pension accounts sponsored by employers. The key problem in determining the effect of participation in 401(k) plans on accumulated assets is saver heterogeneity coupled with the fact that the decision to enroll in a 401(k) is non-random. It is generally recognized that some people have a higher preference for saving than others. It also seems likely that those individuals with high unobserved preference for saving would be most likely to choose to participate in tax-advantaged retirement savings plans and would tend to have otherwise high amounts of accumulated assets. The presence of unobserved savings preferences with these properties then implies that conventional estimates that do not account for saver heterogeneity and endogeneity of participation will be biased upward, tending to overstate the savings effects of 401(k) participation. + +One can argue that eligibility for enrolling in a 401(k) plan in this data can be taken as exogenous after conditioning on a few observables of which the most important for their argument is income. The basic idea is that, at least around the time 401(k)’s initially became available, people were unlikely to be basing their employment decisions on whether an employer offered a 401(k) but would instead focus on income and other aspects of the job. + + +### Data + +The data set can be loaded from the `hdm` package for R by typing + + + +```{r} +library(hdm) +library(ggplot2) +data(pension) +data <- pension +dim(data) +``` + +See the "Details" section on the description of the data set, which can be accessed by + + +```{r} +help(pension) +``` + +The data consist of 9,915 observations at the household level drawn from the 1991 Survey of Income and Program Participation (SIPP). All the variables are referred to 1990. We use net financial assets (*net\_tfa*) as the outcome variable, $Y$, in our analysis. The net financial assets are computed as the sum of IRA balances, 401(k) balances, checking accounts, saving bonds, other interest-earning accounts, other interest-earning assets, stocks, and mutual funds less non mortgage debts. + + +Among the $9915$ individuals, $3682$ are eligible to participate in the program. The variable *e401* indicates eligibility and *p401* indicates participation, respectively. + +```{r} +hist_e401 = ggplot(data, aes(x = e401, fill = factor(e401))) + geom_bar() +hist_e401 +``` + +Eligibility is highly associated with financial wealth: + +```{r} +dens_net_tfa = ggplot(data, aes(x = net_tfa, color = factor(e401), fill = factor(e401)) ) + + geom_density() + xlim(c(-20000, 150000)) + + facet_wrap(.~e401) + +dens_net_tfa +``` + +The unconditional APE of e401 is about $19559$: + +```{r} +e1 <- data[data$e401==1,] +e0 <- data[data$e401==0,] +round(mean(e1$net_tfa)-mean(e0$net_tfa),0) +``` + +Among the $3682$ individuals that are eligible, $2594$ decided to participate in the program. The unconditional APE of p401 is about $27372$: + +```{r} +p1 <- data[data$p401==1,] +p0 <- data[data$p401==0,] +round(mean(p1$net_tfa)-mean(p0$net_tfa),0) +``` + +As discussed, these estimates are biased since they do not account for saver heterogeneity and endogeneity of participation. + + +## Double ML package + + +We are interested in valid estimators of the average treatment effect of `e401` and `p401` on `net_tfa`. 
To get those estimators, we use the `DoubleML` package that internally builds on mlr3. You find additional information on the package on the package website https://docs.doubleml.org/ and the R documentation page https://docs.doubleml.org/r/stable/. + +```{r} +# installing Double ML +remotes::install_github("DoubleML/doubleml-for-r",quiet=TRUE) + + +# loading the packages +library(DoubleML) +library(mlr3learners) +library(mlr3) +library(data.table) +library(randomForest) + +``` + +As mentioned, in the tutorial we use the meta package `mlr3` to generate predictions with machine learning methods. A comprehensive introduction and description of the `mlr3` package is provided in the [mlr3book](https://mlr3book.mlr-org.com/). A list of all learners that you can use in `mlr3` can be found [here](https://mlr3extralearners.mlr-org.com/articles/learners/list_learners.html). The entry in the columns *mlr3 Package* and *Packages* indicate which packages must be installed/loaded in your R session. + + +## Estimating the ATE of 401(k) Eligibility on Net Financial Assets + + +We first look at the treatment effect of e401 on net total financial assets. We give estimates of the ATE and ATT that corresponds to the linear model + +\begin{equation*} +Y = D \alpha + f(X)'\beta+ \epsilon, +\end{equation*} + +where $f(X)$ includes indicators of marital status, two-earner status, defined benefit pension status, IRA participation status, and home ownership status, and orthogonal polynomials of degrees 2, 4, 6 and 8 in family size, education, age and income, respectively. The dimensions of $f(X)$ is 25. + +In the first step, we report estimates of the average treatment effect (ATE) of 401(k) eligibility on net financial assets both in the partially linear regression (PLR) model and in the interactive regression model (IRM) allowing for heterogeneous treatment effects. + + +```{r} +# Constructing the data (as DoubleMLData) +formula_flex = "net_tfa ~ e401 + poly(age, 6, raw=TRUE) + poly(inc, 8, raw=TRUE) + poly(educ, 4, raw=TRUE) + poly(fsize, 2, raw=TRUE) + marr + twoearn + db + pira + hown" +model_flex = as.data.table(model.frame(formula_flex, pension)) +x_cols = colnames(model_flex)[-c(1,2)] +data_ml = DoubleMLData$new(model_flex, y_col = "net_tfa", d_cols = "e401", x_cols=x_cols) + + +p <- dim(model_flex)[2]-2 +p + +# complex model with two-way interactions +#data_interactions = fetch_401k(polynomial_features = TRUE, instrument = FALSE) + +``` + +## Partially Linear Regression Models (PLR) + + +We start using lasso to estimate the function $g_0$ and $m_0$ in the following PLR model: + + +\begin{eqnarray} + & Y = D\theta_0 + g_0(X) + \zeta, & E[\zeta \mid D,X]= 0,\\ + & D = m_0(X) + V, & E[V \mid X] = 0. +\end{eqnarray} + +```{r} +# Estimating the PLR +lgr::get_logger("mlr3")$set_threshold("warn") +set.seed(123) +lasso <- lrn("regr.cv_glmnet",nfolds = 5, s = "lambda.min") +lasso_class <- lrn("classif.cv_glmnet", nfolds = 5, s = "lambda.min") + +dml_plr <- DoubleMLPLR$new(data_ml, ml_g = lasso, ml_m = lasso_class, n_folds=3) +dml_plr$fit(store_predictions=TRUE) +dml_plr$summary() +lasso_plr <- dml_plr$coef +lasso_std_plr <- dml_plr$se +``` + +Let us check the predictive performance of this model. 
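+
+In the PLR, the cross-fitted prediction for the outcome combines the estimated coefficient with the cross-fitted estimate of $g_0$,
+$$
+\hat Y_i = D_i \hat\theta + \hat g(X_i), \qquad
+\widehat{\mathrm{RMSE}}_Y = \sqrt{\frac{1}{n}\sum_{i=1}^n (Y_i - \hat Y_i)^2},
+$$
+and analogously for the treatment equation with $\hat m(X_i)$; the next chunks compute these cross-fitted RMSEs together with the classification error of $\hat m$.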
+ +```{r} +dml_plr$params_names() +g_hat <- as.matrix(dml_plr$predictions$ml_g) # predictions of g_o +m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o +``` + +```{r} +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +theta <- as.numeric(dml_plr$coef) # estimated regression coefficient +d <- as.matrix(pension$e401) +predictions_y <- as.matrix(d*theta)+g_hat # predictions for y +lasso_y_rmse <- sqrt(mean((y-predictions_y)^2)) +lasso_y_rmse +``` + +```{r} +# cross-fitted RMSE: treatment +d <- as.matrix(pension$e401) +lasso_d_rmse <- sqrt(mean((d-m_hat)^2)) +lasso_d_rmse + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +Then, we repeat this procedure for various machine learning methods. + +```{r} +# Random Forest +lgr::get_logger("mlr3")$set_threshold("warn") +randomForest <- lrn("regr.ranger") +randomForest_class <- lrn("classif.ranger") + +dml_plr <- DoubleMLPLR$new(data_ml, ml_g = randomForest, ml_m = randomForest_class, n_folds=3) +dml_plr$fit(store_predictions=TRUE) # set store_predictions=TRUE to evaluate the model +dml_plr$summary() +forest_plr <- dml_plr$coef +forest_std_plr <- dml_plr$se +``` + +We can compare the accuracy of this model to the model that has been estimated with lasso. + +```{r} +# Evaluation predictions +g_hat <- as.matrix(dml_plr$predictions$ml_g) # predictions of g_o +m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o +theta <- as.numeric(dml_plr$coef) # estimated regression coefficient +predictions_y <- as.matrix(d*theta)+g_hat # predictions for y +forest_y_rmse <- sqrt(mean((y-predictions_y)^2)) +forest_y_rmse + +# cross-fitted RMSE: treatment +forest_d_rmse <- sqrt(mean((d-m_hat)^2)) +forest_d_rmse + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +```{r} +# Trees +lgr::get_logger("mlr3")$set_threshold("warn") + +trees <- lrn("regr.rpart") +trees_class <- lrn("classif.rpart") + +dml_plr <- DoubleMLPLR$new(data_ml, ml_g = trees, ml_m = trees_class, n_folds=3) +dml_plr$fit(store_predictions=TRUE) +dml_plr$summary() +tree_plr <- dml_plr$coef +tree_std_plr <- dml_plr$se + +# Evaluation predictions +g_hat <- as.matrix(dml_plr$predictions$ml_g) # predictions of g_o +m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o +theta <- as.numeric(dml_plr$coef) # estimated regression coefficient +predictions_y <- as.matrix(d*theta)+g_hat # predictions for y +tree_y_rmse <- sqrt(mean((y-predictions_y)^2)) +tree_y_rmse + +# cross-fitted RMSE: treatment +tree_d_rmse <- sqrt(mean((d-m_hat)^2)) +tree_d_rmse + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +```{r} +# Boosting +lgr::get_logger("mlr3")$set_threshold("warn") +boost<- lrn("regr.xgboost",objective="reg:squarederror") +boost_class <- lrn("classif.xgboost",objective = "binary:logistic",eval_metric ="logloss") + +dml_plr <- DoubleMLPLR$new(data_ml, ml_g = boost, ml_m = boost_class, n_folds=3) +dml_plr$fit(store_predictions=TRUE) +dml_plr$summary() +boost_plr <- dml_plr$coef +boost_std_plr <- dml_plr$se + +# Evaluation predictions +g_hat <- as.matrix(dml_plr$predictions$ml_g) # predictions of g_o +m_hat <- as.matrix(dml_plr$predictions$ml_m) # predictions of m_o +theta <- as.numeric(dml_plr$coef) # estimated regression coefficient +predictions_y <- as.matrix(d*theta)+g_hat # predictions for y +boost_y_rmse <- sqrt(mean((y-predictions_y)^2)) +boost_y_rmse + +# cross-fitted RMSE: treatment +boost_d_rmse <- sqrt(mean((d-m_hat)^2)) +boost_d_rmse + +# cross-fitted ce: 
treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +Let's sum up the results: + +```{r} +library(xtable) +table <- matrix(0, 4, 4) +table[1,1:4] <- c(lasso_plr,forest_plr,tree_plr,boost_plr) +table[2,1:4] <- c(lasso_std_plr,forest_std_plr,tree_std_plr,boost_std_plr) +table[3,1:4] <- c(lasso_y_rmse,forest_y_rmse,tree_y_rmse,boost_y_rmse) +table[4,1:4] <- c(lasso_d_rmse,forest_d_rmse,tree_d_rmse,boost_d_rmse) +rownames(table) <- c("Estimate","Std.Error","RMSE Y","RMSE D") +colnames(table) <- c("Lasso","Random Forest","Trees","Boosting") +tab<- xtable(table, digits = 2) +tab +``` + +The best model with lowest RMSE in both equation is the PLR model estimated via lasso. It gives the following estimate: + +```{r} +lasso_plr +``` + +## Interactive Regression Model (IRM) + + +Next, we consider estimation of average treatment effects when treatment effects are fully heterogeneous: + + + \begin{eqnarray}\label{eq: HetPL1} + & Y = g_0(D, X) + U, & \quad E[U \mid X, D]= 0,\\ + & D = m_0(X) + V, & \quad E[V\mid X] = 0. +\end{eqnarray} + + +To reduce the disproportionate impact of extreme propensity score weights in the interactive model +we trim the propensity scores which are close to the bounds. + +```{r} +set.seed(123) +lgr::get_logger("mlr3")$set_threshold("warn") +dml_irm = DoubleMLIRM$new(data_ml, ml_g = lasso, + ml_m = lasso_class, + trimming_threshold = 0.01, n_folds=3) +dml_irm$fit(store_predictions=TRUE) +dml_irm$summary() +lasso_irm <- dml_irm$coef +lasso_std_irm <- dml_irm$se + + +# predictions +dml_irm$params_names() +g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X) +g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X) +g_hat <- d*g1_hat+(1-d)*g0_hat # predictions of g_0 +m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o + +``` + +```{r} +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$e401) +lasso_y_irm <- sqrt(mean((y-g_hat)^2)) +lasso_y_irm + +# cross-fitted RMSE: treatment +lasso_d_irm <- sqrt(mean((d-m_hat)^2)) +lasso_d_irm + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +```{r} +##### forest ##### + +dml_irm = DoubleMLIRM$new(data_ml, ml_g = randomForest, + ml_m = randomForest_class, + trimming_threshold = 0.01, n_folds=3) +dml_irm$fit(store_predictions=TRUE) +dml_irm$summary() +forest_irm <- dml_irm$coef +forest_std_irm <- dml_plr$se + +# predictions +g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X) +g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X) +g_hat <- d*g1_hat+(1-d)*g0_hat # predictions of g_0 +m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$e401) +forest_y_irm <- sqrt(mean((y-g_hat)^2)) +forest_y_irm + +# cross-fitted RMSE: treatment +forest_d_irm <- sqrt(mean((d-m_hat)^2)) +forest_d_irm + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) + +##### trees ##### + +dml_irm <- DoubleMLIRM$new(data_ml, ml_g = trees, ml_m = trees_class, + trimming_threshold = 0.01, n_folds=3) +dml_irm$fit(store_predictions=TRUE) +dml_irm$summary() +tree_irm <- dml_irm$coef +tree_std_irm <- dml_irm$se + +# predictions +g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X) +g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X) +g_hat <- d*g1_hat+(1-d)*g0_hat # predictions of g_0 +m_hat <- 
as.matrix(dml_irm$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$e401) +tree_y_irm <- sqrt(mean((y-g_hat)^2)) +tree_y_irm + +# cross-fitted RMSE: treatment +tree_d_irm <- sqrt(mean((d-m_hat)^2)) +tree_d_irm + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) + + +##### boosting ##### + +dml_irm <- DoubleMLIRM$new(data_ml, ml_g = boost, ml_m = boost_class, + trimming_threshold = 0.01, n_folds=3) +dml_irm$fit(store_predictions=TRUE) +dml_irm$summary() +boost_irm <- dml_irm$coef +boost_std_irm <- dml_irm$se + +# predictions +g0_hat <- as.matrix(dml_irm$predictions$ml_g0) # predictions of g_0(D=0, X) +g1_hat <- as.matrix(dml_irm$predictions$ml_g1) # predictions of g_0(D=1, X) +g_hat <- d*g1_hat+(1-d)*g0_hat # predictions of g_0 +m_hat <- as.matrix(dml_irm$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$e401) +boost_y_irm <- sqrt(mean((y-g_hat)^2)) +boost_y_irm + +# cross-fitted RMSE: treatment +boost_d_irm <- sqrt(mean((d-m_hat)^2)) +boost_d_irm + +# cross-fitted ce: treatment +mean(ifelse(m_hat > 0.5, 1, 0) != d) +``` + +```{r} +library(xtable) +table <- matrix(0, 4, 4) +table[1,1:4] <- c(lasso_irm,forest_irm,tree_irm,boost_irm) +table[2,1:4] <- c(lasso_std_irm,forest_std_irm,tree_std_irm,boost_std_irm) +table[3,1:4] <- c(lasso_y_irm,forest_y_irm,tree_y_irm,boost_y_irm) +table[4,1:4] <- c(lasso_d_irm,forest_d_irm,tree_d_irm,boost_d_irm) +rownames(table) <- c("Estimate","Std.Error","RMSE Y","RMSE D") +colnames(table) <- c("Lasso","Random Forest","Trees","Boosting") +tab<- xtable(table, digits = 2) +tab +``` + +Here, Random Forest gives the best prediction rule for $g_0$ and Lasso the best prediction rule for $m_0$, respectively. Let us fit the IRM model using the best ML method for each equation to get a final estimate for the treatment effect of eligibility. + +```{r} +set.seed(123) +lgr::get_logger("mlr3")$set_threshold("warn") +dml_irm = DoubleMLIRM$new(data_ml, ml_g = randomForest, + ml_m = lasso_class, + trimming_threshold = 0.01, n_folds=3) +dml_irm$fit(store_predictions=TRUE) +dml_irm$summary() +best_irm <- dml_irm$coef +best_std_irm <- dml_irm$se +``` + +These estimates that flexibly account for confounding are +substantially attenuated relative to the baseline estimate (*19559*) that does not account for confounding. They suggest much smaller causal effects of 401(k) eligiblity on financial asset holdings. + + +## Local Average Treatment Effects of 401(k) Participation on Net Financial Assets + + +## Interactive IV Model (IIVM) + + +Now, we consider estimation of local average treatment effects (LATE) of participation with the binary instrument `e401`. As before, $Y$ denotes the outcome `net_tfa`, and $X$ is the vector of covariates. Here the structural equation model is: + +\begin{eqnarray} +& Y = g_0(Z,X) + U, &\quad E[U\mid Z,X] = 0,\\ +& D = r_0(Z,X) + V, &\quad E[V\mid Z, X] = 0,\\ +& Z = m_0(X) + \zeta, &\quad E[\zeta \mid X] = 0. 
+\end{eqnarray} + +```{r} +# Constructing the data (as DoubleMLData) +formula_flex2 = "net_tfa ~ p401+ e401 + poly(age, 6, raw=TRUE) + poly(inc, 8, raw=TRUE) + poly(educ, 4, raw=TRUE) + poly(fsize, 2, raw=TRUE) + marr + twoearn + db + pira + hown" +model_flex2 = as.data.table(model.frame(formula_flex2, data)) +x_cols = colnames(model_flex2)[-c(1,2,3)] +data_IV = DoubleMLData$new(model_flex2, y_col = "net_tfa", d_cols = "p401", z_cols ="e401",x_cols=x_cols) +``` + +```{r} +set.seed(123) +lgr::get_logger("mlr3")$set_threshold("warn") +dml_MLIIVM = DoubleMLIIVM$new(data_IV, ml_g = lasso, + ml_m = lasso_class, ml_r = lasso_class,n_folds=3, subgroups = list(always_takers = FALSE, + never_takers = TRUE)) +dml_MLIIVM$fit(store_predictions=TRUE) +dml_MLIIVM$summary() +lasso_MLIIVM <- dml_MLIIVM$coef +lasso_std_MLIIVM <- dml_MLIIVM$se +``` + +The confidence interval for the local average treatment effect of participation is given by + +```{r} +dml_MLIIVM$confint(level = 0.95) +``` + +Here we can also check the accuracy of the model: + +```{r} +# variables +y <- as.matrix(pension$net_tfa) # true observations +d <- as.matrix(pension$p401) +z <- as.matrix(pension$e401) + +# predictions +dml_MLIIVM$params_names() +g0_hat <- as.matrix(dml_MLIIVM$predictions$ml_g0) # predictions of g_0(z=0, X) +g1_hat <- as.matrix(dml_MLIIVM$predictions$ml_g1) # predictions of g_0(z=1, X) +g_hat <- z*g1_hat+(1-z)*g0_hat # predictions of g_0 +r0_hat <- as.matrix(dml_MLIIVM$predictions$ml_r0) # predictions of r_0(z=0, X) +r1_hat <- as.matrix(dml_MLIIVM$predictions$ml_r1) # predictions of r_0(z=1, X) +r_hat <- z*r1_hat+(1-z)*r0_hat # predictions of r_0 +m_hat <- as.matrix(dml_MLIIVM$predictions$ml_m) # predictions of m_o +``` + +```{r} +# cross-fitted RMSE: outcome +lasso_y_MLIIVM <- sqrt(mean((y-g_hat)^2)) +lasso_y_MLIIVM + +# cross-fitted RMSE: treatment +lasso_d_MLIIVM <- sqrt(mean((d-r_hat)^2)) +lasso_d_MLIIVM + +# cross-fitted RMSE: instrument +lasso_z_MLIIVM <- sqrt(mean((z-m_hat)^2)) +lasso_z_MLIIVM + +``` + +Again, we repeat the procedure for the other machine learning methods: + +```{r} +### random forest ### + +set.seed(123) +lgr::get_logger("mlr3")$set_threshold("warn") +dml_MLIIVM = DoubleMLIIVM$new(data_IV, ml_g = randomForest, + ml_m = randomForest_class, ml_r = randomForest_class,n_folds=3, subgroups = list(always_takers = FALSE, + never_takers = TRUE)) +dml_MLIIVM$fit(store_predictions=TRUE) +dml_MLIIVM$summary() +forest_MLIIVM <- dml_MLIIVM$coef +forest_std_MLIIVM <- dml_MLIIVM$se + +# predictions +g0_hat <- as.matrix(dml_MLIIVM$predictions$ml_g0) # predictions of g_0(Z=0, X) +g1_hat <- as.matrix(dml_MLIIVM$predictions$ml_g1) # predictions of g_0(Z=1, X) +g_hat <- z*g1_hat+(1-z)*g0_hat # predictions of g_0 +r0_hat <- as.matrix(dml_MLIIVM$predictions$ml_r0) # predictions of r_0(Z=0, X) +r1_hat <- as.matrix(dml_MLIIVM$predictions$ml_r1) # predictions of r_0(Z=1, X) +r_hat <- z*r1_hat+(1-z)*r0_hat # predictions of r_0 +m_hat <- as.matrix(dml_MLIIVM$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +forest_y_MLIIVM <- sqrt(mean((y-g_hat)^2)) +forest_y_MLIIVM + +# cross-fitted RMSE: treatment +forest_d_MLIIVM <- sqrt(mean((d-r_hat)^2)) +forest_d_MLIIVM + +# cross-fitted RMSE: instrument +forest_z_MLIIVM <- sqrt(mean((z-m_hat)^2)) +forest_z_MLIIVM + +### trees ### + +dml_MLIIVM = DoubleMLIIVM$new(data_IV, ml_g = trees, + ml_m = trees_class, ml_r = trees_class,n_folds=3, subgroups = list(always_takers = FALSE, + never_takers = TRUE)) +dml_MLIIVM$fit(store_predictions=TRUE) 
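+# Note on the subgroups argument used in each DoubleMLIIVM call above:
+# subgroups = list(always_takers = FALSE, never_takers = TRUE) encodes that
+# ineligible households (e401 = 0) cannot participate in a 401(k), so there are
+# no always-takers, while eligible households may still choose not to participate.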
+dml_MLIIVM$summary() +tree_MLIIVM <- dml_MLIIVM$coef +tree_std_MLIIVM <- dml_MLIIVM$se + +# predictions +g0_hat <- as.matrix(dml_MLIIVM$predictions$ml_g0) # predictions of g_0(Z=0, X) +g1_hat <- as.matrix(dml_MLIIVM$predictions$ml_g1) # predictions of g_0(Z=1, X) +g_hat <- z*g1_hat+(1-z)*g0_hat # predictions of g_0 +r0_hat <- as.matrix(dml_MLIIVM$predictions$ml_r0) # predictions of r_0(Z=0, X) +r1_hat <- as.matrix(dml_MLIIVM$predictions$ml_r1) # predictions of r_0(Z=1, X) +r_hat <- z*r1_hat+(1-z)*r0_hat # predictions of r_0 +m_hat <- as.matrix(dml_MLIIVM$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +tree_y_MLIIVM <- sqrt(mean((y-g_hat)^2)) +tree_y_MLIIVM + +# cross-fitted RMSE: treatment +tree_d_MLIIVM <- sqrt(mean((d-r_hat)^2)) +tree_d_MLIIVM + +# cross-fitted RMSE: instrument +tree_z_MLIIVM <- sqrt(mean((z-m_hat)^2)) +tree_z_MLIIVM + + +### boosting ### +dml_MLIIVM = DoubleMLIIVM$new(data_IV, ml_g = boost, + ml_m = boost_class, ml_r = boost_class,n_folds=3, subgroups = list(always_takers = FALSE, + never_takers = TRUE)) +dml_MLIIVM$fit(store_predictions=TRUE) +dml_MLIIVM$summary() +boost_MLIIVM <- dml_MLIIVM$coef +boost_std_MLIIVM <- dml_MLIIVM$se + +# predictions +g0_hat <- as.matrix(dml_MLIIVM$predictions$ml_g0) # predictions of g_0(Z=0, X) +g1_hat <- as.matrix(dml_MLIIVM$predictions$ml_g1) # predictions of g_0(Z=1, X) +g_hat <- z*g1_hat+(1-z)*g0_hat # predictions of g_0 +r0_hat <- as.matrix(dml_MLIIVM$predictions$ml_r0) # predictions of r_0(Z=0, X) +r1_hat <- as.matrix(dml_MLIIVM$predictions$ml_r1) # predictions of r_0(Z=1, X) +r_hat <- z*r1_hat+(1-z)*r0_hat # predictions of r_0 +m_hat <- as.matrix(dml_MLIIVM$predictions$ml_m) # predictions of m_o + +# cross-fitted RMSE: outcome +boost_y_MLIIVM <- sqrt(mean((y-g_hat)^2)) +boost_y_MLIIVM + +# cross-fitted RMSE: treatment +boost_d_MLIIVM <- sqrt(mean((d-r_hat)^2)) +boost_d_MLIIVM + +# cross-fitted RMSE: instrument +boost_z_MLIIVM <- sqrt(mean((z-m_hat)^2)) +boost_z_MLIIVM +``` + +```{r} +library(xtable) +table <- matrix(0, 5, 4) +table[1,1:4] <- c(lasso_MLIIVM,forest_MLIIVM,tree_MLIIVM,boost_MLIIVM) +table[2,1:4] <- c(lasso_std_MLIIVM,forest_std_MLIIVM,tree_std_MLIIVM,boost_std_MLIIVM) +table[3,1:4] <- c(lasso_y_MLIIVM,forest_y_MLIIVM,tree_y_MLIIVM,boost_y_MLIIVM) +table[4,1:4] <- c(lasso_d_MLIIVM,forest_d_MLIIVM,tree_d_MLIIVM,boost_d_MLIIVM) +table[5,1:4] <- c(lasso_z_MLIIVM,forest_z_MLIIVM,tree_z_MLIIVM,boost_z_MLIIVM) +rownames(table) <- c("Estimate","Std.Error","RMSE Y","RMSE D","RMSE Z") +colnames(table) <- c("Lasso","Random Forest","Trees","Boosting") +tab<- xtable(table, digits = 2) +tab +``` + +We report results based on four ML methods for estimating the nuisance functions used in +forming the orthogonal estimating equations. We find again that the estimates of the treatment effect are stable across ML methods. The estimates are highly significant, hence we would reject the hypothesis +that the effect of 401(k) participation has no effect on financial health. 
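+
+To make the significance statement concrete, a minimal sketch using the objects stored above reads the implied t-statistics off the estimates and standard errors:
+
+```{r}
+# implied t-statistics for the LATE estimates reported in the table above
+t_stats <- c(Lasso    = as.numeric(lasso_MLIIVM / lasso_std_MLIIVM),
+             Forest   = as.numeric(forest_MLIIVM / forest_std_MLIIVM),
+             Trees    = as.numeric(tree_MLIIVM / tree_std_MLIIVM),
+             Boosting = as.numeric(boost_MLIIVM / boost_std_MLIIVM))
+round(t_stats, 2)
+```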
+ + +We might rerun the model using the best ML method for each equation to get a final estimate for the treatment effect of participation: + +```{r} +set.seed(123) +lgr::get_logger("mlr3")$set_threshold("warn") +dml_MLIIVM = DoubleMLIIVM$new(data_IV, ml_g = randomForest, + ml_m = lasso_class, ml_r = lasso_class,n_folds=3, subgroups = list(always_takers = FALSE, + never_takers = TRUE)) +dml_MLIIVM$fit(store_predictions=TRUE) +dml_MLIIVM$summary() +best_MLIIVM <- dml_MLIIVM$coef +best_std_MLIIVM <- dml_MLIIVM$se +``` diff --git a/T/deprecated/dml-inference-for-gun-ownership.irnb.Rmd b/T/deprecated/dml-inference-for-gun-ownership.irnb.Rmd new file mode 100644 index 00000000..67145e54 --- /dev/null +++ b/T/deprecated/dml-inference-for-gun-ownership.irnb.Rmd @@ -0,0 +1,377 @@ +--- +jupyter: + jupytext: + text_representation: + extension: .Rmd + format_name: rmarkdown + format_version: '1.2' + jupytext_version: 1.13.7 + kernelspec: + display_name: R + language: R + name: ir +--- + +This notebook contains an example for teaching. + + +# A Case Study: The Effect of Gun Ownership on Gun-Homicide Rates + + +We consider the problem of estimating the effect of gun +ownership on the homicide rate. For this purpose, we estimate the following partially +linear model + +$$ + Y_{j,t} = \beta D_{j,(t-1)} + g(Z_{j,t}) + \epsilon_{j,t}. +$$ + + +## Data + + +$Y_{j,t}$ is the log homicide rate in county $j$ at time $t$, $D_{j, t-1}$ is the log fraction of suicides committed with a firearm in county $j$ at time $t-1$, which we use as a proxy for gun ownership, and $Z_{j,t}$ is a set of demographic and economic characteristics of county $j$ at time $t$. The parameter $\beta$ is the effect of gun ownership on homicide rates, controlling for county-level demographic and economic characteristics. + +The sample covers 195 large United States counties between the years 1980 through 1999, giving us 3900 observations. + +```{r} +data <- read.csv("../input/gun-example/gun_clean.csv") +dim(data)[1] +``` + +### Preprocessing + + +To account for heterogeneity across counties and time trends in all variables, we remove from them county-specific and time-specific effects in the following preprocessing. 
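+
+Concretely, for each variable $W$ in the list (the outcome, the treatment, and the controls) we regress it on a full set of county and year dummies and keep the residual,
+$$
+\check W_{j,t} = W_{j,t} - \hat E[W_{j,t} \mid \text{county}_j, \text{year}_t],
+$$
+which is what the next chunk implements variable by variable via `lm()`.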
+ +```{r} +##################### Find Variable Names from Dataset ###################### + +varlist <- function (df=NULL,type=c("numeric","factor","character"), pattern="", exclude=NULL) { + vars <- character(0) + if (any(type %in% "numeric")) { + vars <- c(vars,names(df)[sapply(df,is.numeric)]) + } + if (any(type %in% "factor")) { + vars <- c(vars,names(df)[sapply(df,is.factor)]) + } + if (any(type %in% "character")) { + vars <- c(vars,names(df)[sapply(df,is.character)]) + } + vars[(!vars %in% exclude) & grepl(vars,pattern=pattern)] +} + +############################# Create Variables ############################## + +# dummy variables for year and county fixed effects +fixed <- grep("X_Jfips", names(data), value=TRUE, fixed=TRUE) +year <- varlist(data, pattern="X_Tyear") + +# census control variables +census <- NULL +census_var <- c("^AGE", "^BN", "^BP", "^BZ", "^ED", "^EL","^HI", "^HS", "^INC", "^LF", "^LN", "^PI", "^PO", "^PP", "^PV", "^SPR", "^VS") + +for(i in 1:length(census_var)){ + census <- append(census, varlist(data, pattern=census_var[i])) +} + +################################ Variables ################################## +# treatment variable +d <- "logfssl" + +# outcome variable +y <- "logghomr" + +# other control variables +X1 <- c("logrobr", "logburg", "burg_missing", "robrate_missing") +X2 <- c("newblack", "newfhh", "newmove", "newdens", "newmal") + +######################## Partial out Fixed Effects ########################## + +# new dataset for partialled-out variables +rdata <- as.data.frame(data$CountyCode) +colnames(rdata) <- "CountyCode" + +# variables to partial out +varlist <- c(y, d,X1, X2, census) + +# partial out year and county fixed effect from variables in varlist +for(i in 1:length(varlist)){ + form <- as.formula(paste(varlist[i], "~", paste(paste(year,collapse="+"), paste(fixed,collapse="+"), sep="+"))) + rdata[, varlist[i]] <- lm(form, data)$residuals +} +``` + +Now, we can construct the treatment variable, the outcome variable and the matrix $Z$ that includes the control variables. + +```{r} +# treatment variable +D <- rdata[which(colnames(rdata) == d)] + +# outcome variable +Y <- rdata[which(colnames(rdata) == y)] + +# construct matrix Z +Z <- rdata[which(colnames(rdata) %in% c(X1,X2,census))] +dim(Z) +``` + +We have 195 control variables in total. The control variables $Z_{j,t}$ are from the U.S. Census Bureau and contain demographic and economic characteristics of the counties such as the age distribution, the income distribution, crime rates, federal spending, home ownership rates, house prices, educational attainment, voting paterns, employment statistics, and migration rates. + +```{r} +clu <- rdata[which(colnames(rdata) == "CountyCode")] # for clustering the standard errors +data <- data.frame(cbind(Y, D, Z,as.matrix(clu))) +``` + +```{r} +library(lfe) # linear group fixed effects package +``` + +## The effect of gun ownership + + +### OLS + + +After preprocessing the data, as a baseline model, we first look at simple regression of $Y_{j,t}$ on $D_{j,t-1}$ without controls. + +```{r} +# baseline_formula <- as.formula(paste(y, "~", d )) +# baseline.ols <- lm(baseline_formula,data=rdata) + +baseline.ols <- felm(logghomr ~ logfssl |0|0| CountyCode,data=data) # ols with clustered standard errors +est_baseline <- summary(baseline.ols)$coef[2,] +confint(baseline.ols)[2,] +est_baseline +``` + +The point estimate is $0.282$ with the confidence interval ranging from 0.155 to 0.41. 
This +suggests that increases in gun ownership rates are related to gun homicide rates - if gun ownership increases by 1% relative +to a trend then the predicted gun homicide rate goes up by 0.28%, without controlling for counties' characteristics. + +Since our goal is to estimate the effect of gun ownership after controlling for a rich set county characteristics, we next include the controls. First, we estimate the model by ols and then by an array of the modern regression methods using the double machine learning approach. + +```{r} +control_formula <- as.formula(paste("logghomr", "~", paste("logfssl",paste(colnames(Z),collapse="+"), + sep="+"),"|0|0| CountyCode")) +control.ols <- felm(control_formula,data=data) # fixed effects lm function +est_ols <- summary(control.ols)$coef[2,] +confint(control.ols)[2,] +est_ols +``` + +After controlling for a rich set of characteristics, the point estimate of gun ownership reduces to $0.19$. + + +# DML algorithm + +Here we perform inference on the predictive coefficient $\beta$ in our partially linear statistical model, + +$$ +Y = D\beta + g(Z) + \epsilon, \quad E (\epsilon | D, Z) = 0, +$$ + +using the **double machine learning** approach. + +For $\tilde Y = Y- E(Y|Z)$ and $\tilde D= D- E(D|Z)$, we can write +$$ +\tilde Y = \alpha \tilde D + \epsilon, \quad E (\epsilon |\tilde D) =0. +$$ + +Using cross-fitting, we employ modern regression methods +to build estimators $\hat \ell(Z)$ and $\hat m(Z)$ of $\ell(Z):=E(Y|Z)$ and $m(Z):=E(D|Z)$ to obtain the estimates of the residualized quantities: + +$$ +\tilde Y_i = Y_i - \hat \ell (Z_i), \quad \tilde D_i = D_i - \hat m(Z_i), \quad \text{ for each } i = 1,\dots,n. +$$ + +Finally, using ordinary least squares of $\tilde Y_i$ on $\tilde D_i$, we obtain the +estimate of $\beta$. + + +The following algorithm comsumes $Y, D, Z$, and a machine learning method for learning the residuals $\tilde Y$ and $\tilde D$, where the residuals are obtained by cross-validation (cross-fitting). Then, it prints the estimated coefficient $\beta$ and the corresponding standard error from the final OLS regression. 
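+
+In closed form, the final stage is just the OLS slope from regressing one residual on the other,
+$$
+\hat\beta = \frac{\sum_{i=1}^n (\tilde D_i - \bar{\tilde D})(\tilde Y_i - \bar{\tilde Y})}{\sum_{i=1}^n (\tilde D_i - \bar{\tilde D})^2},
+$$
+with standard errors clustered at the county level; the function below implements this via `felm`, which also handles the clustering.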
+ +```{r} +DML2.for.PLM <- function(z, d, y, dreg, yreg, nfold=2, clu) { + nobs <- nrow(z) # number of observations + foldid <- rep.int(1:nfold,times = ceiling(nobs/nfold))[sample.int(nobs)] # define folds indices + I <- split(1:nobs, foldid) # split observation indices into folds + ytil <- dtil <- rep(NA, nobs) + cat("fold: ") + for(b in 1:length(I)){ + dfit <- dreg(z[-I[[b]],], d[-I[[b]]]) # take a fold out + yfit <- yreg(z[-I[[b]],], y[-I[[b]]]) # take a fold out + dhat <- predict(dfit, z[I[[b]],], type="response") # predict the left-out fold + yhat <- predict(yfit, z[I[[b]],], type="response") # predict the left-out fold + dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual for the left-out fold + ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residial for the left-out fold + cat(b," ") + } + #rfit <- lm(ytil ~ dtil) # estimate the main parameter by regressing one residual on the other + data <- data.frame(cbind(ytil, dtil, as.matrix(clu))) + rfit <- felm(ytil ~ dtil|0|0|CountyCode,data=data) + coef.est <- coef(rfit)[2] # extract coefficient + #HC <- vcovHC(rfit) + se <- summary(rfit,robust=T)$coefficients[2,2] # record robust standard error by county + cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est , se)) # print output + return( list(coef.est =coef.est , se=se, dtil=dtil, ytil=ytil, rfit=rfit) ) # save output and residuals +} +``` + +Now, we apply the Double Machine Learning (DML) approach with different machine learning methods. First, we load the relevant libraries. + +```{r} +library(hdm) +library(glmnet) +library(sandwich) +library(randomForest) +``` + +Let us, construct the input matrices. + +```{r} +y <- as.matrix(Y) +d <- as.matrix(D) +z <- as.matrix(Z) +clu <- rdata[which(colnames(rdata) == "CountyCode")] +head(data.frame(cbind(y,d,as.matrix(clu)))) +``` + +In the following, we apply the DML approach with the different versions of lasso. 
+
+
+## Lasso
+
+```{r}
+# DML with Lasso:
+set.seed(123)
+dreg <- function(z,d){ rlasso(z,d, post=FALSE) }  # ML method = lasso from hdm
+yreg <- function(z,y){ rlasso(z,y, post=FALSE) }  # ML method = lasso from hdm
+DML2.lasso = DML2.for.PLM(z, d, y, dreg, yreg, nfold=10, clu)
+```
+
+```{r}
+# DML with Post-Lasso:
+dreg <- function(z,d){ rlasso(z,d, post=T) }  # ML method = post-lasso from hdm
+yreg <- function(z,y){ rlasso(z,y, post=T) }  # ML method = post-lasso from hdm
+DML2.post = DML2.for.PLM(z, d, y, dreg, yreg, nfold=10, clu)
+```
+
+```{r}
+# DML with cross-validated Lasso:
+dreg <- function(z,d){ cv.glmnet(z,d,family="gaussian", alpha=1) }  # ML method = lasso from glmnet
+yreg <- function(z,y){ cv.glmnet(z,y,family="gaussian", alpha=1) }  # ML method = lasso from glmnet
+DML2.lasso.cv = DML2.for.PLM(z, d, y, dreg, yreg, nfold=10, clu)
+
+dreg <- function(z,d){ cv.glmnet(z,d,family="gaussian", alpha=0.5) }  # ML method = elastic net from glmnet
+yreg <- function(z,y){ cv.glmnet(z,y,family="gaussian", alpha=0.5) }  # ML method = elastic net from glmnet
+DML2.elnet = DML2.for.PLM(z, d, y, dreg, yreg, nfold=10, clu)
+
+dreg <- function(z,d){ cv.glmnet(z,d,family="gaussian", alpha=0) }  # ML method = ridge from glmnet
+yreg <- function(z,y){ cv.glmnet(z,y,family="gaussian", alpha=0) }  # ML method = ridge from glmnet
+DML2.ridge = DML2.for.PLM(z, d, y, dreg, yreg, nfold=10, clu)
+```
+
+Here we also compute DML with OLS used as the ML method.
+
+```{r}
+dreg <- function(z,d){ glmnet(z,d,family="gaussian", lambda=0) }  # ML method = ols via glmnet
+yreg <- function(z,y){ glmnet(z,y,family="gaussian", lambda=0) }  # ML method = ols via glmnet
+DML2.ols = DML2.for.PLM(z, d, y, dreg, yreg, nfold=10, clu)
+```
+
+Next, we also apply Random Forest for comparison purposes.
+
+
+### Random Forest
+
+
+```{r}
+# DML with Random Forest:
+dreg <- function(z,d){ randomForest(z, d) }  # ML method = random forest
+yreg <- function(z,y){ randomForest(z, y) }  # ML method = random forest
+set.seed(1)
+DML2.RF = DML2.for.PLM(z, d, y, dreg, yreg, nfold=2, clu)  # set folds to 2 to limit computation time
+```
+
+We conclude that gun ownership rates are related to gun homicide rates: if gun ownership increases by 1% relative
+to a trend, then the predicted gun homicide rate goes up by about 0.20%, controlling for counties' characteristics.
+
+
+Finally, let's see which method is best. We compute the RMSE for predicting D and Y and see which
+of the methods works better.
+
+
+```{r}
+mods <- list(DML2.ols, DML2.lasso, DML2.post, DML2.lasso.cv, DML2.ridge, DML2.elnet, DML2.RF)
+
+RMSE.mdl <- function(mdl) {
+  RMSEY <- sqrt(mean((mdl$ytil)^2))  # root mean squared residual for Y
+  RMSED <- sqrt(mean((mdl$dtil)^2))  # root mean squared residual for D
+  return( list(RMSEY=RMSEY, RMSED=RMSED) )
+}
+
+#RMSE.mdl(DML2.lasso)
+#DML2.lasso$ytil
+
+Res <- lapply(mods, RMSE.mdl)
+
+prRes.Y <- c( Res[[1]]$RMSEY, Res[[2]]$RMSEY, Res[[3]]$RMSEY, Res[[4]]$RMSEY, Res[[5]]$RMSEY, Res[[6]]$RMSEY, Res[[7]]$RMSEY)
+prRes.D <- c( Res[[1]]$RMSED, Res[[2]]$RMSED, Res[[3]]$RMSED, Res[[4]]$RMSED, Res[[5]]$RMSED, Res[[6]]$RMSED, Res[[7]]$RMSED)
+
+prRes <- rbind(prRes.Y, prRes.D)
+rownames(prRes) <- c("RMSE Y", "RMSE D")  # first row: Y residual RMSEs; second row: D residual RMSEs
+colnames(prRes) <- c("OLS", "Lasso", "Post-Lasso", "CV Lasso", "CV Ridge", "CV Elnet", "RF")
+print(prRes, digits=6)
+```
+
+It looks like the best method for predicting D is Lasso, and the best method for predicting Y is CV Ridge.
+ + +```{r} +dreg <- function(z,d){ rlasso(z,d, post=T) } # ML method = lasso from hdm +yreg <- function(z,y){ cv.glmnet(z,y,family="gaussian", alpha=0) } # ML method = ridge from glmnet +DML2.best= DML2.for.PLM(z, d, y, dreg, yreg, nfold=10, clu) +``` + +Let's organize the results in a table. + +```{r} +library(xtable) + +table <- matrix(0,9,2) +table[1,1] <- as.numeric(est_baseline[1]) +table[2,1] <- as.numeric(est_ols[1]) +table[3,1] <- as.numeric(DML2.lasso$coef.est) +table[4,1] <- as.numeric(DML2.post$coef.est) +table[5,1] <-as.numeric(DML2.lasso.cv$coef.est) +table[6,1] <-as.numeric(DML2.elnet$coef.est) +table[7,1] <-as.numeric(DML2.ridge$coef.est) +table[8,1] <-as.numeric(DML2.RF$coef.est) +table[9,1] <-as.numeric(DML2.best$coef.est) +table[1,2] <- as.numeric(est_baseline[2]) +table[2,2] <- as.numeric(est_ols[2]) +table[3,2] <- as.numeric(DML2.lasso$se) +table[4,2] <- as.numeric(DML2.post$se) +table[5,2] <-as.numeric(DML2.lasso.cv$se) +table[6,2] <-as.numeric(DML2.elnet$se) +table[7,2] <-as.numeric(DML2.ridge$se) +table[8,2] <-as.numeric(DML2.RF$se) +table[9,2] <-as.numeric(DML2.best$se) + +# print results +colnames(table) <- c("Estimate","Standard Error") +rownames(table) <- c("Baseline OLS", "Least Squares with controls", "Lasso", "Post-Lasso", "CV Lasso","CV Elnet", "CV Ridge", "Random Forest", + "Best") +table +``` + +```{r} +print(table, digit=3) +``` + +```{r} +tab<- xtable(table, digits=3) +print(tab, type="latex") +``` diff --git a/T/deprecated/dml-inference-using-nn-for-gun-ownership.irnb.Rmd b/T/deprecated/dml-inference-using-nn-for-gun-ownership.irnb.Rmd new file mode 100644 index 00000000..621999bf --- /dev/null +++ b/T/deprecated/dml-inference-using-nn-for-gun-ownership.irnb.Rmd @@ -0,0 +1,172 @@ +--- +jupyter: + jupytext: + text_representation: + extension: .Rmd + format_name: rmarkdown + format_version: '1.2' + jupytext_version: 1.13.7 + kernelspec: + display_name: R + language: R + name: ir +--- + + +This notebook contains an example for teaching. + + +# The Effect of Gun Ownership on Gun-Homicide Rates using DML for neural nets + + +In this lab, we estimate the effect of gun ownership on the homicide rate using a neural network. + +```{r} +library(keras) +library(lfe) +``` + +First, we need to load and preprocess the data. 
+ +```{r} +# read in dataset +data <- read.csv("../input/gun-example/gun_clean.csv") + + +################## Find Variable Names from the Dataset ################### + +varlist <- function (df=NULL,type=c("numeric","factor","character"), pattern="", exclude=NULL) { + vars <- character(0) + if (any(type %in% "numeric")) { + vars <- c(vars,names(df)[sapply(df,is.numeric)]) + } + if (any(type %in% "factor")) { + vars <- c(vars,names(df)[sapply(df,is.factor)]) + } + if (any(type %in% "character")) { + vars <- c(vars,names(df)[sapply(df,is.character)]) + } + vars[(!vars %in% exclude) & grepl(vars,pattern=pattern)] +} + +########################### Create Variables ############################## + +# dummy variables for year and county fixed effects +fixed <- grep("X_Jfips", names(data), value=TRUE, fixed=TRUE) +year <- varlist(data, pattern="X_Tyear") + +# census control variables +census <- NULL +census_var <- c("^AGE", "^BN", "^BP", "^BZ", "^ED", "^EL","^HI", "^HS", "^INC", "^LF", "^LN", "^PI", "^PO", "^PP", "^PV", "^SPR", "^VS") + +for(i in 1:length(census_var)){ + census <- append(census, varlist(data, pattern=census_var[i])) +} + +############################### Variables ################################# + +# treatment variable +d <- "logfssl" + +# outcome variable +y <- "logghomr" + +# other control variables +X1 <- c("logrobr", "logburg", "burg_missing", "robrate_missing") +X2 <- c("newblack", "newfhh", "newmove", "newdens", "newmal") + +###################### Partial-out Fixed Effects ######################### + +# new dataset for partialled-out variables +rdata <- as.data.frame(data$CountyCode) +colnames(rdata) <- "CountyCode" + +# variables to partial-out +varlist <- c(y, d,X1, X2, census) + +# partial out year and county fixed effects from variables in varlist +for(i in 1:length(varlist)){ + form <- as.formula(paste(varlist[i], "~", paste(paste(year,collapse="+"), paste(fixed,collapse="+"), sep="+"))) + rdata[, varlist[i]] <- lm(form, data)$residuals +} +``` + +# DML for neural nets + + + +The following algorithm consumes $Y$,$D$ and $Z$, and learns the residuals $\tilde{Y}$ and $\tilde{D}$ via a neural network, where the residuals are obtained by cross-validation (cross-fitting). Then, it prints the estimated coefficient $\beta$ and the clustered standard error from the final OLS regression. 
+
+```{r}
+DML2.for.NN <- function(z, d, y, nfold=2, clu, num_epochs, batch_size) {
+  nobs <- nrow(z) # number of observations
+  foldid <- rep.int(1:nfold, times = ceiling(nobs/nfold))[sample.int(nobs)] # define folds indices
+  I <- split(1:nobs, foldid) # split observation indices into folds
+  ytil <- dtil <- rep(NA, nobs)
+  cat("fold: ")
+  for(b in 1:length(I)){
+    # normalize the data using the training-fold means and standard deviations
+    mean <- apply(z[-I[[b]],], 2, mean)
+    std <- apply(z[-I[[b]],], 2, sd)
+    z[-I[[b]],] <- scale(z[-I[[b]],], center = mean, scale = std)
+    z[I[[b]],] <- scale(z[I[[b]],], center = mean, scale = std)
+    # build the model: two hidden ReLU layers and a linear output, mse loss, rmsprop optimizer
+    build_model <- function(){
+      model <- keras_model_sequential() %>%
+        layer_dense(units = 16, activation = "relu",
+                    input_shape = dim(z[-I[[b]],])[2]) %>%  # number of input features
+        layer_dense(units = 16, activation = "relu") %>%
+        layer_dense(units = 1)
+
+      model %>% compile(
+        optimizer = "rmsprop",
+        loss = "mse",
+        metrics = c("mae")
+      )
+    }
+    model.Y <- build_model()
+    model.D <- build_model()
+    # fit the models on the training folds
+    model.D %>% fit(z[-I[[b]],], d[-I[[b]]],
+                    epochs = num_epochs, batch_size = batch_size, verbose = 0)
+    model.Y %>% fit(z[-I[[b]],], y[-I[[b]]],
+                    epochs = num_epochs, batch_size = batch_size, verbose = 0)
+    dhat <- model.D %>% predict(z[I[[b]],])
+    yhat <- model.Y %>% predict(z[I[[b]],])
+    dtil[I[[b]]] <- (d[I[[b]]] - dhat) # record residual for the left-out fold
+    ytil[I[[b]]] <- (y[I[[b]]] - yhat) # record residual for the left-out fold
+    cat(b," ")
+  }
+  #rfit <- lm(ytil ~ dtil) # estimate the main parameter by regressing one residual on the other
+  data <- data.frame(cbind(ytil, dtil, as.matrix(clu)))
+  rfit <- felm(ytil ~ dtil|0|0|CountyCode, data=data)
+  coef.est <- coef(rfit)[2] # extract the coefficient
+  #HC <- vcovHC(rfit)
+  se <- summary(rfit, robust=T)$coefficients[2,2] # record standard error clustered by county
+  cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est, se)) # print the output
+  return( list(coef.est=coef.est, se=se, dtil=dtil, ytil=ytil, rfit=rfit) ) # save the output and residuals
+}
+```
+
+# Estimating the effect with DML for neural nets
+
+```{r}
+# treatment variable
+D <- rdata[which(colnames(rdata) == d)]
+# outcome variable
+Y <- rdata[which(colnames(rdata) == y)]
+# construct matrix Z
+Z <- rdata[which(colnames(rdata) %in% c(X1,X2,census))]
+
+# inputs
+y_nn <- as.matrix(Y)
+d_nn <- as.matrix(D)
+z_nn <- as.matrix(Z)
+clu <- rdata[which(colnames(rdata) == "CountyCode")]
+```
+
+```{r}
+# DML with a NN
+set.seed(123)
+DML2.nn = DML2.for.NN(z_nn, d_nn, y_nn, nfold=2, clu, 100, 10)
+```
diff --git a/T/deprecated/r-weak-iv-experiments.irnb.Rmd b/T/deprecated/r-weak-iv-experiments.irnb.Rmd
new file mode 100644
index 00000000..68758644
--- /dev/null
+++ b/T/deprecated/r-weak-iv-experiments.irnb.Rmd
@@ -0,0 +1,92 @@
+---
+jupyter:
+  jupytext:
+    text_representation:
+      extension: .Rmd
+      format_name: rmarkdown
+      format_version: '1.2'
+    jupytext_version: 1.13.7
+  kernelspec:
+    display_name: R
+    language: R
+    name: ir
+---
+
+# A Simple Example of Properties of the IV Estimator when Instruments are Weak
+
+
+Simulation Design
+
+```{r}
+# Simulation Design
+
+library(hdm)
+set.seed(1)
+B = 10000 # trials
+IVEst = rep(0, B)
+n = 100
+beta = .25 # weak IV (small first-stage coefficient)
+#beta = 1  # strong IV
+
+
+U = rnorm(n)
+Z = rnorm(n)     # generate instrument
+D = beta*Z + U   # generate endogenous variable
+Y = D + U        # the true causal effect is 1
+
+
+summary(lm(D~Z)) # first stage is very weak here
+
+summary(tsls(x=NULL, d=D, y=Y, z=Z))
+
+```
+
+Note that the instrument is weak here (controlled by $\beta$) -- the t-stat is less than 4.
+
+
+# Run 10,000 trials to evaluate the distribution of the IV estimator
+
+```{r}
+# Simulation Design
+
+set.seed(1)
+B = 10000 # trials
+IVEst = rep(0, B)
+
+for(i in 1:B){
+  U = rnorm(n)
+  Z = rnorm(n)     # generate instrument
+  D = beta*Z + U   # generate endogenous variable
+  Y = D + U        # the true causal effect is 1
+  IVEst[i] = coef(tsls(x=NULL, d=D, y=Y, z=Z))[1,1]
+}
+
+
+```
+
+# Plot the Actual Distribution against the Normal Approximation (based on the Strong Instrument Assumption)
+
+```{r}
+plot(density(IVEst-1, n=1000, from=-5, to=5), col=4, xlim= c(-5, 5),
+     xlab= "IV Estimator - True Effect", main="Actual Distribution vs Gaussian")
+
+val = seq(-5, 5, by=.05)
+var = (1/beta^2)*(1/100) # theoretical variance of IV (strong-instrument approximation)
+sd = sqrt(var)
+lines(val, dnorm(val, sd=sd), col=2, lty=2)
+
+rejection.frequency = sum(( abs(IVEst-1)/sd > 1.96))/B
+
+cat(c("Rejection Frequency is ", rejection.frequency, " while we expect it to be .05"))
+
+```
+
+# Some Help Functions
+
+```{r}
+help(tsls)
+```
+
+```{r}
+help(density)
+```
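+
+# A Weak-Instrument-Robust Check: Anderson-Rubin
+
+A minimal sketch, mirroring the Anderson-Rubin construction used in the PLIV notebook above: invert the AR statistic over a grid of candidate effects for the last simulated draw (the true effect is 1). This construction remains valid even when the instrument is weak.
+
+```{r}
+# grid of candidate values for the causal effect
+grid <- seq(-2, 4, by = .01)
+# AR statistic for each candidate value, using the last simulated (Y, D, Z)
+AR.stat <- sapply(grid, function(b) n * mean((Y - b * D) * Z)^2 / var((Y - b * D) * Z))
+accepted <- grid[AR.stat < qchisq(.95, df = 1)]
+c(LB = min(accepted), UB = max(accepted))
+```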