---
jupyter:
  jupytext:
    text_representation:
      extension: .Rmd
      format_name: rmarkdown
      format_version: '1.2'
      jupytext_version: 1.13.7
  kernelspec:
    display_name: R
    language: R
    name: ir
---

<!-- #region -->
# Double/Debiased ML for Partially Linear IV Model

References:

https://arxiv.org/abs/1608.00060

https://www.amazon.com/Business-Data-Science-Combining-Accelerate/dp/1260452778

The code is based on the book.
<!-- #endregion -->

<!-- #region -->
# Partially Linear IV Model

We consider the partially linear structural equation model:
\begin{eqnarray}
 & Y - D\theta_0 = g_0(X) + \zeta, & E[\zeta \mid Z,X]= 0,\\
 & Z = m_0(X) + V, & E[V \mid X] = 0.
\end{eqnarray}

Note that this model is not a regression model unless $Z=D$. The model is a canonical model in causal inference, going back to P. Wright's work on IV methods for estimating demand/supply equations, with the modern difference being that $g_0$ and $m_0$ are nonlinear, potentially complicated functions of high-dimensional $X$.

The idea of this model is that there is a structural or causal relation between $Y$ and $D$, captured by $\theta_0$, and $g_0(X) + \zeta$ is the stochastic error, partly explained by covariates $X$. $V$ and $\zeta$ are stochastic errors that are not explained by $X$. Since $Y$ and $D$ are jointly determined, we need an external factor, commonly referred to as an instrument, $Z$, to create exogenous variation in $D$. Note that $Z$ should affect $D$. The $X$ here again serve as confounding factors, so we can think of variation in $Z$ as being exogenous only conditional on $X$.

The causal DAG this model corresponds to is given by:
$$
Z \to D, \quad X \to (Y, Z, D), \quad L \to (Y, D),
$$
where $L$ is the latent confounder affecting both $Y$ and $D$, but not $Z$.

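<!-- #endregion -->

To make the model concrete, here is a minimal simulation sketch (not part of the original analysis) that draws data from one hypothetical instance of this DAG, with illustrative choices $g_0(X) = X^2$, $m_0(X) = \sin(X)$, a latent confounder $L$, and true effect $\theta_0 = 1$:

```{r}
# Minimal simulation sketch (illustrative only; functional forms and theta0 are hypothetical)
set.seed(123)
n <- 1000
X <- rnorm(n)                          # observed covariate
L <- rnorm(n)                          # latent confounder affecting Y and D, but not Z
Z <- sin(X) + rnorm(n)                 # instrument: Z = m0(X) + V, with m0(X) = sin(X)
D <- Z + L + 0.5 * X + rnorm(n)        # treatment, driven by Z, X, and the confounder L
theta0 <- 1                            # true structural effect (hypothetical)
Y <- theta0 * D + X^2 + L + rnorm(n)   # outcome: Y = D*theta0 + g0(X) + zeta, with zeta = L + noise
c(cor(Z, D), cor(D, Y))                # Z is relevant for D; D and Y share the confounder L
```

<!-- #region -->
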
---

# Example

A simple contextual example is from biostatistics, where $Y$ is a health outcome and $D$ is an indicator of smoking. Thus, $\theta_0$ captures the effect of smoking on health. Health outcome $Y$ and smoking behavior $D$ are treated as being jointly determined. $X$ represents patient characteristics, and $Z$ could be a doctor's advice not to smoke (or another behavioral treatment) that may affect the outcome $Y$ only through shifting the behavior $D$, conditional on characteristics $X$.

----

# PLIVM in Residualized Form

The PLIV model above can be rewritten in the following residualized form:
$$
\tilde Y = \tilde D \theta_0 + \zeta, \quad E[\zeta \mid V,X]= 0,
$$
where
$$
\tilde Y = (Y - \ell_0(X)), \quad \ell_0(X) = E[Y \mid X], \\
\tilde D = (D - r_0(X)), \quad r_0(X) = E[D \mid X], \\
\tilde Z = (Z - m_0(X)), \quad m_0(X) = E[Z \mid X].
$$
The tilded variables above represent the original variables after taking out, or "partialling out," the effect of $X$. Note that $\theta_0$ is identified from this equation if $V$ and $\tilde D$ have non-zero correlation, which automatically means that both $V$ and $\tilde D$ must have non-zero variation.

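Concretely, multiplying the residualized equation by $\tilde Z$ and using $E[\zeta \tilde Z] = 0$ (which follows from $E[\zeta \mid V,X] = 0$ and $\tilde Z = V$) gives the identification formula
$$
\theta_0 = \frac{E[\tilde Z \tilde Y]}{E[\tilde Z \tilde D]},
$$
which is well defined exactly when the partialled-out instrument and treatment are correlated, i.e. $E[\tilde Z \tilde D] \neq 0$.
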
-----

# DML for PLIV Model

Given identification, DML proceeds as follows:

Compute the estimates $\hat \ell_0$, $\hat r_0$, and $\hat m_0$, which amounts to solving the three problems of predicting $Y$, $D$, and $Z$ using $X$, with any generic ML method, giving us estimated residuals
$$
\tilde Y = Y - \hat \ell_0(X), \\ \tilde D = D - \hat r_0(X), \\ \tilde Z = Z - \hat m_0(X).
$$
The estimates should be of a cross-validated form, as detailed in the algorithm below.

Estimate $\theta_0$ by the instrumental variable regression of $\tilde Y$ on $\tilde D$ using $\tilde Z$ as an instrument. Use the conventional inference for the IV regression estimator, ignoring the estimation error in these residuals.

The reason we work with this residualized form is that it eliminates the bias arising when solving the prediction problem in stage 1. The role of cross-validation is to avoid another source of bias due to potential overfitting.

The estimator is adaptive, in the sense that the first stage estimation errors do not affect the second stage errors.
<!-- #endregion -->

```{r _kg_hide-output=TRUE}
install.packages("hdm")
install.packages("AER")
install.packages("randomForest")
install.packages("glmnet")
```

```{r}
library(AER)           # applied econometrics library
library(randomForest)  # random forest library
library(hdm)           # high-dimensional econometrics library
library(glmnet)        # lasso and elastic net

# DML for PLIVM
DML2.for.PLIVM <- function(x, d, z, y, dreg, yreg, zreg, nfold=2) {
  # this implements the DML2 algorithm, where the moments are estimated via DML
  # before constructing the pooled estimate of theta
  # randomly split data into folds
  nobs <- nrow(x)
  foldid <- rep.int(1:nfold, times = ceiling(nobs/nfold))[sample.int(nobs)]
  I <- split(1:nobs, foldid)
  # create residualized objects to fill
  ytil <- dtil <- ztil <- rep(NA, nobs)
  # obtain cross-fitted residuals
  cat("fold: ")
  for(b in 1:length(I)){
    dfit <- dreg(x[-I[[b]],], d[-I[[b]]])  # take a fold out
    zfit <- zreg(x[-I[[b]],], z[-I[[b]]])  # take a fold out
    yfit <- yreg(x[-I[[b]],], y[-I[[b]]])  # take a fold out
    dhat <- predict(dfit, x[I[[b]],], type="response")  # predict the left-out fold
    zhat <- predict(zfit, x[I[[b]],], type="response")  # predict the left-out fold
    yhat <- predict(yfit, x[I[[b]],], type="response")  # predict the left-out fold
    dtil[I[[b]]] <- (d[I[[b]]] - dhat)  # record residual
    ztil[I[[b]]] <- (z[I[[b]]] - zhat)  # record residual
    ytil[I[[b]]] <- (y[I[[b]]] - yhat)  # record residual
    cat(b, " ")
  }
  ivfit <- tsls(y=ytil, d=dtil, x=NULL, z=ztil, intercept=FALSE)
  coef.est <- ivfit$coef  # extract coefficient
  se <- ivfit$se          # record standard error
  cat(sprintf("\ncoef (se) = %g (%g)\n", coef.est, se))
  return( list(coef.est=coef.est, se=se, dtil=dtil, ytil=ytil, ztil=ztil) )
}
```
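
Note that the final `tsls` call runs a just-identified IV regression of the pooled residuals $\tilde Y$ on $\tilde D$, with instrument $\tilde Z$ and no intercept, so the point estimate coincides with the sample analog of the identification formula above:
$$
\hat \theta_0 = \frac{\frac{1}{n}\sum_{i=1}^n \tilde Z_i \tilde Y_i}{\frac{1}{n}\sum_{i=1}^n \tilde Z_i \tilde D_i},
$$
with its standard error taken from the conventional IV formula, ignoring the estimation error in the residuals, as described in the algorithm above.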

<!-- #region -->
-----

# Empirical Example: Acemoglu, Johnson, Robinson (AER)

* Y is log GDP;
* D is a measure of Protection from Expropriation, a proxy for quality of institutions;
* Z is the log of settler mortality;
* X are geographical variables (latitude, latitude squared, continent dummies as well as interactions).
<!-- #endregion -->

```{r}
data(AJR)

y <- AJR$GDP
d <- AJR$Exprop
z <- AJR$logMort
xraw <- model.matrix(~ Latitude + Africa + Asia + Namer + Samer, data=AJR)
x <- model.matrix(~ -1 + (Latitude + Latitude2 + Africa +
                          Asia + Namer + Samer)^2, data=AJR)
dim(x)

# DML with Random Forest
cat(sprintf("\n DML with Random Forest \n"))
dreg <- function(x,d){ randomForest(x, d) }  # ML method = Forest
yreg <- function(x,y){ randomForest(x, y) }  # ML method = Forest
zreg <- function(x,z){ randomForest(x, z) }  # ML method = Forest
set.seed(1)
DML2.RF <- DML2.for.PLIVM(xraw, d, z, y, dreg, yreg, zreg, nfold=20)

# DML with Post-Lasso
cat(sprintf("\n DML with Post-Lasso \n"))
dreg <- function(x,d){ rlasso(x, d) }  # ML method = lasso
yreg <- function(x,y){ rlasso(x, y) }  # ML method = lasso
zreg <- function(x,z){ rlasso(x, z) }  # ML method = lasso
set.seed(1)
DML2.lasso <- DML2.for.PLIVM(x, d, z, y, dreg, yreg, zreg, nfold=20)

# Compare Forest vs Lasso
comp.tab <- matrix(NA, 3, 2)
comp.tab[1,] <- c( sqrt(mean((DML2.RF$ytil)^2)), sqrt(mean((DML2.lasso$ytil)^2)) )
comp.tab[2,] <- c( sqrt(mean((DML2.RF$dtil)^2)), sqrt(mean((DML2.lasso$dtil)^2)) )
comp.tab[3,] <- c( sqrt(mean((DML2.RF$ztil)^2)), sqrt(mean((DML2.lasso$ztil)^2)) )
rownames(comp.tab) <- c("RMSE for Y:", "RMSE for D:", "RMSE for Z:")
colnames(comp.tab) <- c("RF", "LASSO")
print(comp.tab, digits=3)
```
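
For reference, conventional 95% confidence intervals can be read off directly from the `coef.est` and `se` components returned above; the short sketch below is not part of the original notebook.

```{r}
# Sketch (not in the original notebook): conventional 95% confidence intervals
# from the DML2 point estimates and standard errors computed above.
ci95 <- function(fit) as.numeric(fit$coef.est) + c(-1, 1) * qnorm(0.975) * as.numeric(fit$se)
cis <- rbind(RF = ci95(DML2.RF), LASSO = ci95(DML2.lasso))
colnames(cis) <- c("lower", "upper")
print(cis, digits=3)
```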

# Examine if we have weak instruments

```{r}
install.packages("lfe")
library(lfe)
summary(felm(DML2.lasso$dtil ~ DML2.lasso$ztil), robust=TRUE)
summary(felm(DML2.RF$dtil ~ DML2.RF$ztil), robust=TRUE)
```

# We do have weak instruments, because the t-statistics in the regression $\tilde D \sim \tilde Z$ are less than 4 in absolute value

So let's carry out DML inference combined with the Anderson-Rubin idea.
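
The function below inverts the Anderson-Rubin test: for each candidate value $\theta$ on a grid it computes the statistic
$$
C(\theta) = \frac{n \, \big(\tfrac{1}{n}\sum_{i=1}^n (\tilde Y_i - \theta \tilde D_i)\tilde Z_i\big)^2}{\widehat{\mathrm{Var}}\big((\tilde Y - \theta \tilde D)\tilde Z\big)},
$$
compares it to the $1-\alpha$ quantile of the $\chi^2_1$ distribution, and reports the smallest and largest grid values that are not rejected as the bounds of the confidence interval. This approach remains valid under weak instruments because it does not rely on the strength of the first stage.
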
```{r}
# DML-AR (DML with Anderson-Rubin)
DML.AR.PLIV <- function(rY, rD, rZ, grid, alpha=.05){
  n <- length(rY)
  Cstat <- rep(0, length(grid))
  for (i in 1:length(grid)) {
    # Anderson-Rubin statistic for the candidate value grid[i]
    Cstat[i] <- n * (mean((rY - grid[i]*rD)*rZ))^2 / var((rY - grid[i]*rD)*rZ)
  }
  # confidence bounds: grid values at which the AR test does not reject
  LB <- min(grid[Cstat < qchisq(1-alpha, 1)])
  UB <- max(grid[Cstat < qchisq(1-alpha, 1)])
  plot(range(grid), range(Cstat), type="n", xlab="Effect of institutions", ylab="Statistic", main=" ")
  lines(grid, Cstat, lty=1, col=1)
  abline(h=qchisq(1-alpha, 1), lty=3, col=4)  # chi-squared critical value
  abline(v=LB, lty=3, col=2)
  abline(v=UB, lty=3, col=2)
  return(list(UB=UB, LB=LB))
}
```

```{r}
DML.AR.PLIV(rY = DML2.lasso$ytil, rD = DML2.lasso$dtil, rZ = DML2.lasso$ztil,
            grid = seq(-2, 2, by=.01))

DML.AR.PLIV(rY = DML2.RF$ytil, rD = DML2.RF$dtil, rZ = DML2.RF$ztil,
            grid = seq(-2, 2, by=.01))
```