Update math
thomasmarwitz committed Aug 20, 2024
1 parent ab1c702 commit 1c9c45a
Showing 2 changed files with 63 additions and 555 deletions.
101 changes: 63 additions & 38 deletions docs/background.md
@@ -157,10 +157,12 @@ $$

The T-Learner was introduced by [Kuenzel et al. (2019)](https://arxiv.org/pdf/1706.03461.pdf). In the T-Learner, the conditional average outcomes are estimated using one estimator for each treatment variant:

\[
\begin{align*} % Use the align environment for several lines separated by double backslash
\mu_0 (x) &:= \mathbb{E}[Y(0) | X = x] \\
\mu_1 (x) &:= \mathbb{E}[Y(1) | X = x]
\end{align*}
\]

$\hat{\mu}_0$ and $\hat{\mu}_1$ are estimated using the untreated and treated observations respectively.
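
The idea can be illustrated with a minimal sketch, assuming a toy dataset with a single discrete covariate so that each outcome model reduces to a per-group mean. The helper names `fit_group_means` and `t_learner_cate` are hypothetical and not part of any library API. The CATE estimate is the difference of the two fitted outcome models.

```python
from collections import defaultdict


def fit_group_means(xs, ys):
    """Toy outcome model: the mean of y for each distinct value of x."""
    sums, counts = defaultdict(float), defaultdict(int)
    for x, y in zip(xs, ys):
        sums[x] += y
        counts[x] += 1
    return {x: sums[x] / counts[x] for x in sums}


def t_learner_cate(X, W, Y):
    """Fit one outcome model per treatment variant and return their difference."""
    # mu_0 is fit on untreated observations only, mu_1 on treated ones only.
    mu0 = fit_group_means(
        [x for x, w in zip(X, W) if w == 0],
        [y for y, w in zip(Y, W) if w == 0],
    )
    mu1 = fit_group_means(
        [x for x, w in zip(X, W) if w == 1],
        [y for y, w in zip(Y, W) if w == 1],
    )
    # CATE estimate tau(x) = mu_1(x) - mu_0(x), for x seen in both groups.
    return {x: mu1[x] - mu0[x] for x in mu0 if x in mu1}


# Hypothetical toy data: covariate X, treatment W, outcome Y.
X = [0, 0, 0, 0, 1, 1, 1, 1]
W = [0, 0, 1, 1, 0, 0, 1, 1]
Y = [1.0, 3.0, 4.0, 6.0, 2.0, 2.0, 7.0, 9.0]
tau_hat = t_learner_cate(X, W, Y)
```

In practice the group means would be replaced by any regression model; the split of the training data by treatment variant is the part that defines the T-Learner.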

@@ -190,17 +192,21 @@ The X-Learner was introduced by [Kuenzel et al. (2019)](https://arxiv.org/pdf/17

1. Estimate the conditional average outcomes for each variant:

\[
\begin{align*}
\mu_0 (x) &:= \mathbb{E}[Y(0) | X = x] \\
\mu_1 (x) &:= \mathbb{E}[Y(1) | X = x]
\end{align*}
\]

2. Impute the treatment effect for the observations in the treated group based on the control-outcome estimator as well as the treatment effect for the observations in the control group based on the treatment-outcome estimator:

\[
\begin{align*}
\widetilde{D}_1^i &:= Y^i_1 - \hat{\mu}_0(X^i_1) \\
\widetilde{D}_0^i &:= \hat{\mu}_1(X^i_0) - Y^i_0
\end{align*}
\]

Then estimate $\tau_1(x) := \mathbb{E}[\widetilde{D}^i_1 | X=x]$ and $\tau_0(x) := \mathbb{E}[\widetilde{D}^i_0 | X=x]$ using the observations in the treatment group and the ones in the control group respectively.
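
The imputation and second-stage steps above can be sketched as follows. This is an illustration under assumptions: the first-stage estimates `mu0_hat` and `mu1_hat` are hypothetical, already-fitted functions, and the second-stage model is a toy per-group mean (`mean_by_x`) rather than a real regressor.

```python
def impute_effects(X, W, Y, mu0_hat, mu1_hat):
    """Return imputed effects as (x, d) pairs for the treated and control groups."""
    # Treated units: observed outcome minus predicted control outcome.
    d1 = [(x, y - mu0_hat(x)) for x, w, y in zip(X, W, Y) if w == 1]
    # Control units: predicted treated outcome minus observed outcome.
    d0 = [(x, mu1_hat(x) - y) for x, w, y in zip(X, W, Y) if w == 0]
    return d1, d0


def mean_by_x(pairs):
    """Toy second-stage model: average the imputed effects per value of x."""
    grouped = {}
    for x, d in pairs:
        grouped.setdefault(x, []).append(d)
    return {x: sum(ds) / len(ds) for x, ds in grouped.items()}


def mu0_hat(x):
    """Hypothetical fitted control-outcome model."""
    return 2.0


def mu1_hat(x):
    """Hypothetical fitted treatment-outcome model."""
    return 5.0 + 3.0 * x


# Hypothetical toy data: covariate X, treatment W, outcome Y.
X = [0, 0, 1, 1, 0, 1]
W = [1, 1, 1, 1, 0, 0]
Y = [4.0, 6.0, 7.0, 9.0, 1.0, 3.0]

d1, d0 = impute_effects(X, W, Y, mu0_hat, mu1_hat)
tau1_hat = mean_by_x(d1)  # estimated from the treatment group
tau0_hat = mean_by_x(d0)  # estimated from the control group
```

Note that each imputed effect uses the outcome model of the *other* group, which is what distinguishes this step from a plain difference of outcome models.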

@@ -218,17 +224,21 @@ In the case of multiple discrete treatments, the stages are similar to the binar

1. One outcome model is estimated for each variant (including the control), and one propensity model is trained as a multiclass classifier, $\forall k \in \{0,\dots, K-1\}$:

\[
\begin{align*}
\mu_k (x) &:= \mathbb{E}[Y(k) | X = x]\\
e(x, k) &:= \mathbb{E}[\mathbb{I}\{W = k\} | X=x] = \mathbb{P}[W = k | X=x]
\end{align*}
\]

2. The treatment effects are imputed using the corresponding outcome estimator, $\forall k \in \{1,\dots, K-1\}$:

\[
\begin{align*}
\widetilde{D}_k^i &:= Y^i_k - \hat{\mu}_0(X^i_k) \\
\widetilde{D}_{0,k}^i &:= \hat{\mu}_k(X^i_0) - Y^i_0
\end{align*}
\]

Then $\tau_k(x) := \mathbb{E}[\widetilde{D}^i_k | X=x]$ is estimated using the observations which received treatment $k$ and $\tau_{0,k}(x) := \mathbb{E}[\widetilde{D}^i_{0,k} | X=x]$ using the observations in the control group.

@@ -250,26 +260,31 @@ The R-Learner was introduced by [Nie et al. (2017)](https://arxiv.org/pdf/1712.0

1. Estimate a general outcome model and a propensity model:

\[
\begin{align*}
m(x) &:= \mathbb{E}[Y | X=x] \\
e(x) &:= \mathbb{P}[W = 1 | X=x]
\end{align*}
\]

2. Estimate the treatment effect by minimizing the R-Loss:

\[
\begin{align*}
\hat{\tau}^R (\cdot) &:= \operatorname*{arg\,min}_{\tau}\Bigg\{\mathbb{E}\Bigg[\bigg(\left\{Y^i - \hat{m}(X^i)\right\} - \left\{W^i - \hat{e}(X^i)\right\}\tau(X^i)\bigg)^2\Bigg]\Bigg\} \\
&= \operatorname*{arg\,min}_{\tau}\left\{\mathbb{E}\left[\left\{W^i - \hat{e}(X^i)\right\}^2\bigg(\frac{\left\{Y^i - \hat{m}(X^i)\right\}}{\left\{W^i - \hat{e}(X^i)\right\}} - \tau(X^i)\bigg)^2\right]\right\} \\
&= \operatorname*{arg\,min}_{\tau}\left\{\mathbb{E}\left[{\widetilde{W}^i}^2\bigg(\frac{\widetilde{Y}^i}{\widetilde{W}^i} - \tau(X^i)\bigg)^2\right]\right\}
\end{align*}
\]

Where

\[
\begin{align*}
\widetilde{W}^i &= W^i - \hat{e}(X^i) \\
\widetilde{Y}^i &= Y^i - \hat{m}(X^i)
\end{align*}
\]

Therefore, any ML model that supports per-observation weights can be used as the final model: it regresses the targets $\widetilde{Y}^i / \widetilde{W}^i$ on $X^i$ with weights ${\widetilde{W}^i}^2$.
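
The weighted-regression form of the R-Loss can be sketched for the simplest possible final model, a constant $\tau(x) = c$, where the weighted least-squares solution is the weighted mean of the targets. This is a toy illustration: the function name `r_learner_constant_effect` and the nuisance predictions are hypothetical, and the propensity estimates are assumed to lie strictly between 0 and 1 so that no treatment residual is zero.

```python
def r_learner_constant_effect(W, Y, e_hat, m_hat):
    """Minimise the R-Loss over constant functions tau(x) = c."""
    # Residualise treatment and outcome with the nuisance estimates.
    w_res = [w - e for w, e in zip(W, e_hat)]
    y_res = [y - m for y, m in zip(Y, m_hat)]
    # Weighted regression form: targets y_res / w_res with weights w_res ** 2.
    weights = [wr ** 2 for wr in w_res]
    targets = [yr / wr for yr, wr in zip(y_res, w_res)]
    # For a constant model, weighted least squares gives the weighted mean.
    return sum(wt * t for wt, t in zip(weights, targets)) / sum(weights)


# Hypothetical data and nuisance predictions from previously fitted models.
W = [1, 0, 1, 0]
Y = [3.0, 1.0, 4.0, 2.0]
e_hat = [0.5, 0.5, 0.5, 0.5]
m_hat = [2.0, 2.0, 3.0, 3.0]

tau_hat = r_learner_constant_effect(W, Y, e_hat, m_hat)
```

With a flexible final model, the same `targets` and `weights` would instead be passed to a regressor that accepts sample weights.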

@@ -279,10 +294,12 @@ In the case of multiple discrete treatments, the stages are similar to the binar

1. Estimate a general outcome model and a propensity model:

\[
\begin{align*}
m(x) &:= \mathbb{E}[Y | X=x] \\
e(x, k) &:= \mathbb{P}[W = k | X=x]
\end{align*}
\]

2. For each $k \neq 0$, estimate the pairwise treatment effect $\hat{\tau}_{0,k}^R$ between 0 and $k$ by minimizing the R-Loss from above. Each of these models is fit on the pseudo-outcomes of only those observations that belong to either the control group or treatment variant group $k$.

@@ -300,17 +317,21 @@ The DR-Learner was introduced by [Kennedy (2020)](https://arxiv.org/pdf/2004.144

1. Estimate the conditional average outcomes for each variant and a propensity model:

\[
\begin{align*}
\mu_0 (x) &:= \mathbb{E}[Y(0) | X = x] \\
\mu_1 (x) &:= \mathbb{E}[Y(1) | X = x] \\
e(x) &:= \mathbb{P}[W = 1 | X=x]
\end{align*}
\]

And construct the pseudo-outcomes:

\[
\begin{align*}
\varphi(X^i, W^i, Y^i) := \frac{W^i - \hat{e}(X^i)}{\hat{e}(X^i)(1-\hat{e}(X^i))}\big\{Y^i - \hat{\mu}_{W^i}(X^i)\big\} + \hat{\mu}_{1}(X^i) - \hat{\mu}_{0}(X^i)
\end{align*}
\]
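
As a worked illustration of the pseudo-outcome formula, the following sketch evaluates it for single observations. The function name `dr_pseudo_outcome` and the numeric nuisance values are hypothetical; in practice $\hat{e}$, $\hat{\mu}_0$ and $\hat{\mu}_1$ would come from fitted models evaluated at $X^i$.

```python
def dr_pseudo_outcome(w, y, e_hat, mu0_hat, mu1_hat):
    """Doubly robust pseudo-outcome for one observation with binary treatment w."""
    # Outcome prediction for the variant the observation actually received.
    mu_w = mu1_hat if w == 1 else mu0_hat
    # Propensity-weighted residual correction term.
    correction = (w - e_hat) / (e_hat * (1.0 - e_hat)) * (y - mu_w)
    # Add the plug-in difference of the two outcome models.
    return correction + mu1_hat - mu0_hat


# Hypothetical nuisance estimates evaluated at the observation's covariates.
phi_treated = dr_pseudo_outcome(w=1, y=6.0, e_hat=0.5, mu0_hat=2.0, mu1_hat=5.0)
phi_control = dr_pseudo_outcome(w=0, y=1.5, e_hat=0.5, mu0_hat=2.0, mu1_hat=5.0)
```

For the treated observation the residual is $6 - 5 = 1$ and the weight is $0.5 / 0.25 = 2$, so the pseudo-outcome is $2 + 3 = 5$. For the control observation the residual is $1.5 - 2 = -0.5$ and the weight is $-2$, giving $1 + 3 = 4$.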

2. Estimate the CATE by regressing $\varphi$ on $X$:

@@ -324,19 +345,23 @@ In the case of multiple discrete treatments, the stages are similar to the binar

1. One outcome model is estimated for each variant (including the control), and one propensity model is trained as a multiclass classifier, $\forall k \in \{0,\dots, K-1\}$:

\[
\begin{align*}
\mu_k (x) &:= \mathbb{E}[Y(k) | X = x]\\
e(x, k) &:= \mathbb{E}[\mathbb{I}\{W = k\} | X=x] = \mathbb{P}[W = k | X=x]
\end{align*}
\]

The pseudo-outcomes are constructed for each treatment variant, $\forall k \in \{1,\dots, K-1\}$:

\[
\begin{align*}
\varphi_k(X^i, W^i, Y^i) := &\frac{Y^i - \hat{\mu}_{k}(X^i)}{\hat{e}(X^i, k)}\mathbb{I}\{W^i = k\} + \hat{\mu}_k(X^i) \\
&- \frac{Y^i - \hat{\mu}_{0}(X^i)}{\hat{e}(X^i, 0)}\mathbb{I}\{W^i = 0\} - \hat{\mu}_0(X^i)
\end{align*}
\]
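
The multi-treatment pseudo-outcome can be sketched in the same spirit. The function name `dr_pseudo_outcome_multi` is hypothetical, and the lists `e_hat` and `mu_hat` are assumed to hold the estimated propensity and conditional average outcome of each variant at the covariates of one observation.

```python
def dr_pseudo_outcome_multi(k, w, y, e_hat, mu_hat):
    """Pseudo-outcome comparing variant k against control 0 for one observation.

    e_hat[j] and mu_hat[j] are the estimated propensity and conditional
    average outcome of variant j at this observation's covariates.
    """
    # The indicator (w == k) switches the residual correction on or off.
    variant_part = (y - mu_hat[k]) / e_hat[k] * (w == k) + mu_hat[k]
    control_part = (y - mu_hat[0]) / e_hat[0] * (w == 0) + mu_hat[0]
    return variant_part - control_part


# Hypothetical nuisance estimates for K = 3 variants.
e_hat = [0.5, 0.25, 0.25]
mu_hat = [2.0, 5.0, 4.0]

phi_1_for_variant_1_obs = dr_pseudo_outcome_multi(1, w=1, y=6.0, e_hat=e_hat, mu_hat=mu_hat)
phi_2_for_variant_1_obs = dr_pseudo_outcome_multi(2, w=1, y=6.0, e_hat=e_hat, mu_hat=mu_hat)
phi_1_for_control_obs = dr_pseudo_outcome_multi(1, w=0, y=1.0, e_hat=e_hat, mu_hat=mu_hat)
```

An observation that received neither variant $k$ nor the control contributes only the plug-in difference $\hat{\mu}_k - \hat{\mu}_0$, as in the second call.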

2. Finally, the CATE is estimated by regressing $\varphi_k$ on $X$ for each treatment variant, $\forall k \in \{1,\dots, K-1\}$:

$$
\hat{\tau}_k^{DR}(x) := \mathbb{E}[\varphi_k(X^i, W^i, Y^i) | X^i=x]