\section{Distributional results}
In this chapter we assume that $\bY \sim N(\bX \bbeta, \sigma^2 \bI)$. Under this assumption we will investigate some distributional results for the least-squares estimate $\hat \bbeta$ and the residual variance estimate $s^2$.
\subsection{Quadratic forms}
\label{sec:qf}
Let $\bA$ be a symmetric matrix (not necessarily full rank)
and let $\bX \sim N(\bmu, \bSigma)$. Then
$(\bX - \bmu)^t \bA (\bX - \bmu) \sim \chi^2_{p}$ if
$\bA \bSigma$ is idempotent and $p=\mbox{Rank}(\bA)$.

As an example, note that $(\bX - \bmu)^t \bSigma^{-1} (\bX - \bmu)$
is clearly Chi-squared($n$). This is most easily seen from the fact that
$\bSigma^{-1/2}(\bX - \bmu) = \bZ$ is a vector of iid standard normals,
so that the quadratic form is the sum of their squares. In terms of
our result, $\bA = \bSigma^{-1}$ in this case and
$\bA \bSigma = \bI$, which is idempotent; the rank of $\bSigma^{-1}$ is $n$.

Let's prove our result. Let $\bA = \bV \bD^2 \bV^t$ where $\bD$ is diagonal
with $p$ non-zero entries and $\bV^t \bV = \bI$. The assumption of idempotency
gives us that $\bA \bSigma \bA \bSigma = \bA \bSigma$. Plugging our decomposition
for $\bA$ into this equation and using the orthonormality of the columns of $\bV$,
we get that $\bD \bV^t \bSigma \bV \bD = \bI$. Then note that
\begin{equation}
\label{eq:quad}
(\bX - \bmu)^t \bA (\bX - \bmu) = (\bX - \bmu)^t \bV \bD \bD \bV^t (\bX - \bmu).
\end{equation}
But $\bD \bV^t (\bX - \bmu) \sim N(\bzero, \bD \bV^t \bSigma \bV \bD)$,
which has variance equal to $\bI$. Thus the right-hand side of Equation
\eqref{eq:quad} is the sum of $p$ squared iid standard normals, and is thus
Chi-squared with $p$ degrees of freedom.
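A quick simulation can make this result concrete. Below is a minimal sketch
(the dimension, mean, and covariance are arbitrary illustrative choices)
checking that, with $\bA = \bSigma^{-1}$, the quadratic form behaves like a
Chi-squared with $n$ degrees of freedom:
\begin{verbatim}
library(MASS)  # for mvrnorm
set.seed(1)
n = 3
mu = c(1, 2, 3)
Sigma = diag(3) + .5  # an arbitrary positive definite covariance
A = solve(Sigma)      # A Sigma = I, which is idempotent with rank n
sims = replicate(10000, {
  x = mvrnorm(1, mu, Sigma)
  drop(t(x - mu) %*% A %*% (x - mu))
})
## a Chi-squared n has mean n and variance 2 n
c(mean(sims), n)
c(var(sims), 2 * n)
\end{verbatim}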
\subsection{Statistical properties}
Note that $\hat \bbeta =(\bfX' \bfX)^{-1} \bfX' \bfy$ is a linear transformation of the normally distributed vector $\bfy$, and hence is itself normally distributed, with moments:
$$
E[\hat \bbeta] = \bbeta ~~~\mbox{and}~~~ \Var(\hat \bbeta) = (\bfX' \bfX)^{-1} \sigma^2.
$$
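These moments follow from the standard rules for linear transformations of a random vector:
\begin{eqnarray*}
E[\hat \bbeta] & = & (\bfX' \bfX)^{-1} \bfX' E[\bfy] = (\bfX' \bfX)^{-1} \bfX' \bfX \bbeta = \bbeta \\
\Var(\hat \bbeta) & = & (\bfX' \bfX)^{-1} \bfX' \Var(\bfy) \bfX (\bfX' \bfX)^{-1} = (\bfX' \bfX)^{-1} \sigma^2,
\end{eqnarray*}
using $\Var(\bfy) = \sigma^2 \bI$.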
The residual variance estimate is
\begin{eqnarray*}
s^2 &=& \frac{1}{n-p}\be' \be,
\end{eqnarray*}
which we previously showed was unbiased, i.e. $E[s^2] = \sigma^2$.
In addition, note that
\begin{eqnarray*}
\frac{n-p}{\sigma^2} s^2 & = & \frac{1}{\sigma^2}\by' (\bI - \bfX (\bfX' \bfX)^{-1} \bfX') \by
\end{eqnarray*}
is a quadratic form of the kind discussed in Section \ref{sec:qf}. (Since
$\{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\} \bfX \bbeta = \bzero$, we may replace $\by$ by the
centered vector $\by - \bfX \bbeta$ in this expression, so the result of that section applies.)
Furthermore,
$$
\frac{1}{\sigma^2}\{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\} \Var(\bfy) = \{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\},
$$
which is idempotent. For symmetric idempotent matrices, the rank equals
the trace, and the trace is easily calculated as
$$
\mbox{tr}\{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\} = \mbox{tr}\{\bI\} - \mbox{tr}\{\bfX (\bfX' \bfX)^{-1} \bfX'\}
= n - \mbox{tr}\{ (\bfX' \bfX)^{-1} \bfX' \bfX\} = n - p.
$$
Thus, $\frac{n-p}{\sigma^2} s^2$ is Chi-squared with $n-p$ degrees of freedom. The special
case where $\bX$ contains only an intercept yields the usual
empirical variance estimate.
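As a sanity check, we can simulate from the model and verify this result numerically.
Here is a minimal sketch (the design matrix, $\bbeta$, and $\sigma$ are arbitrary
choices for illustration):
\begin{verbatim}
## Monte Carlo check that (n - p) s^2 / sigma^2 is Chi-squared n - p
set.seed(1)
n = 50; sigma = 2
x = cbind(1, rnorm(n), rnorm(n))  # arbitrary design with an intercept
p = ncol(x)
beta = c(1, 2, 3)
stats = replicate(10000, {
  y = x %*% beta + rnorm(n, sd = sigma)
  e = y - x %*% solve(t(x) %*% x, t(x) %*% y)
  sum(e^2) / sigma^2
})
## a Chi-squared n - p has mean n - p and variance 2 (n - p)
c(mean(stats), n - p)
c(var(stats), 2 * (n - p))
\end{verbatim}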
\subsection{T statistics}
We are now in a position to develop $T$ statistics.
Consider the linear contrast $\bfc' \hat \bbeta$. First note that
$\bfc' \hat \bbeta$ is $N(\bfc' \bbeta, \bfc' (\bfX' \bfX)^{-1} \bfc \sigma^2)$.
Furthermore,
$$\Cov(\hat \bbeta, \be) = \Cov((\bfX' \bfX)^{-1} \bX' \bY, \{\bI - \bfX (\bfX' \bfX)^{-1} \bfX' \} \bY) = 0$$
since $\xtxinv \bX' \{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\} = \bzero$. Because $\hat \bbeta$
and $\be$ are jointly normal, zero covariance implies that the residuals
and estimated coefficients are independent, and hence that $\bfc'\hat \bbeta$
and $s^2$ are independent.
Therefore,
$$
\frac{\bfc' \hat \bbeta - \bfc' \bbeta }{\sqrt{\bfc' (\bfX' \bfX)^{-1} \bfc \sigma^2}}
/ \sqrt{\frac{n-p}{\sigma^2} s^2 / (n-p)}
=\frac{\bfc' \hat \bbeta - \bfc'\bbeta}{\sqrt{\bfc' (\bfX' \bfX)^{-1} \bfc s^2}}
$$
is a standard normal divided by the square root of an independent Chi-squared
over its degrees of freedom, and is thus $T$ distributed with $n-p$ degrees of freedom.
Taking $\bfc$ to have a one in the position of a single coefficient and zeros elsewhere
yields the usual coefficient $t$ statistics reported by R.
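For a general contrast, we can compute the statistic manually. Here is a sketch
using R's built-in \texttt{swiss} dataset (also used in the coding example below);
the particular contrast, Agriculture minus Examination, is an arbitrary choice for
illustration:
\begin{verbatim}
## t test for the contrast c' beta
x = cbind(1, as.matrix(swiss[,-1]))
y = swiss$Fertility
n = nrow(x); p = ncol(x)
xtxinv = solve(t(x) %*% x)
beta = xtxinv %*% t(x) %*% y
s = sqrt(sum((y - x %*% beta)^2) / (n - p))
cc = c(0, 1, -1, 0, 0, 0)  # Agriculture minus Examination
tstat = sum(cc * beta) / (s * sqrt(drop(t(cc) %*% xtxinv %*% cc)))
2 * pt(-abs(tstat), n - p)  # two-sided p-value
\end{verbatim}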
\subsection{F statistic}
\label{sec:ftest}
Consider testing the hypothesis $H_0:\bK \bbeta = \bzero$ versus the alternative
$H_a:\bK \bbeta \neq \bzero$, where $\bK$ has full row rank $q$.
Notice that $\bK \hat \bbeta \sim N(\bK \bbeta, \bK (\bfX' \bfX)^{-1} \bK' \sigma^2)$ and thus
$$
(\bK \hat \bbeta - \bK \bbeta)' \{\bK (\bfX' \bfX)^{-1} \bK' \sigma^2 \}^{-1} (\bK \hat \bbeta - \bK \bbeta)
$$
is Chi-squared with $q$ degrees of freedom. Furthermore, being a function of
$\hat \bbeta$, it is independent of $\be$ and hence of $s^2$. Thus:
$$
(\bK \hat \bbeta - \bK \bbeta)' \{\bK (\bfX' \bfX)^{-1} \bK'\}^{-1} (\bK \hat \bbeta - \bK \bbeta) / (q s^2)
$$
is the ratio of two independent Chi-squared random variables, each divided by its
degrees of freedom, and therefore follows an F distribution with $q$ and $n-p$ degrees of freedom.
\subsection{Coding example}
Consider the \texttt{swiss} dataset. Let's first make sure that we can replicate
the coefficient table obtained by R.
\begin{verbatim}
> ## First let's see the coefficient table
> fit = lm(Fertility ~ ., data = swiss)
> round(summary(fit)$coef, 3)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.915 10.706 6.250 0.000
Agriculture -0.172 0.070 -2.448 0.019
Examination -0.258 0.254 -1.016 0.315
Education -0.871 0.183 -4.758 0.000
Catholic 0.104 0.035 2.953 0.005
Infant.Mortality 1.077 0.382 2.822 0.007
> # Now let's do it more manually
> x = cbind(1, as.matrix(swiss[,-1]))
> y = swiss$Fertility
> beta = solve(t(x) %*% x, t(x) %*% y)
> e = y - x %*% beta
> n = nrow(x); p = ncol(x)
> s = sqrt(sum(e^2) / (n - p))
> #Compare with lm
> c(s, summary(fit)$sigma)
[1] 7.165369 7.165369
> #calculate the t statistics
> betaVar = solve(t(x) %*% x) * s ^ 2
> ## Show that standard errors agree with lm
> cbind(summary(fit)$coef[,2], sqrt(diag(betaVar)))
[,1] [,2]
(Intercept) 10.70603759 10.70603759
Agriculture 0.07030392 0.07030392
Examination 0.25387820 0.25387820
Education 0.18302860 0.18302860
Catholic 0.03525785 0.03525785
Infant.Mortality 0.38171965 0.38171965
> # Show that the tstats agree
> tstat = beta / sqrt(diag(betaVar))
> cbind(summary(fit)$coef[,3], tstat)
[,1] [,2]
(Intercept) 6.250229 6.250229
Agriculture -2.448142 -2.448142
Examination -1.016268 -1.016268
Education -4.758492 -4.758492
Catholic 2.952969 2.952969
Infant.Mortality 2.821568 2.821568
> # Show that the P-values agree
> cbind(summary(fit)$coef[,4], 2 * pt(- abs(tstat), n - p))
[,1] [,2]
(Intercept) 1.906051e-07 1.906051e-07
Agriculture 1.872715e-02 1.872715e-02
Examination 3.154617e-01 3.154617e-01
Education 2.430605e-05 2.430605e-05
Catholic 5.190079e-03 5.190079e-03
Infant.Mortality 7.335715e-03 7.335715e-03
> # Get the F statistic
> # Set K to grab everything except the intercept
> k = cbind(0, diag(rep(1, p - 1)))
> kvar = k %*% solve(t(x) %*% x) %*% t(k)
> fstat = t(k %*% beta) %*% solve(kvar) %*% (k %*% beta) / (p - 1) / s ^ 2
> #Show that it's equal to what lm is giving
> c(summary(fit)$fstat[1], fstat)
> #Calculate the p-value
> pf(fstat, p - 1, n - p, lower.tail = FALSE)
[,1]
[1,] 5.593799e-10
> summary(fit)
## ... only showing the one relevant line ...
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
\end{verbatim}
\subsection{Confidence ellipsoids}
A hyper-ellipsoid with center $\bv$ is defined as the set of solutions in $\bx$ of
$(\bx - \bv)' \bA (\bx - \bv)=1$. The eigenvalues of $\bA$ determine the lengths
of the axes of the ellipsoid. The set of points $\{\bx ~|~ (\bx - \bv)' \bA (\bx - \bv) \leq 1\}$
lies in the interior of the hyper-ellipsoid.
Now consider our F statistic from Section \ref{sec:ftest}:
$$
(\bK \hat \bbeta - \bm)' \{\bK \xtxinv \bK'\}^{-1} (\bK \hat \bbeta - \bm) / (q s^2).
$$
We would fail to reject $H_0:\bK \bbeta = \bm$ if this statistic is less than the
appropriate cutoff from an $F$ distribution, say $F_{1-\alpha}$. So, the set of points
$$
\{\bm ~|~
(\bK \hat \bbeta - \bm)' \{\bK \xtxinv \bK'\}^{-1} (\bK \hat \bbeta - \bm) / (q s^2 F_{1-\alpha}) \leq 1
\}
$$
forms a $100(1-\alpha)\%$ confidence set. From the discussion above, we see that this is a
hyper-ellipsoid. This multivariate form of a confidence interval is called a confidence
ellipsoid. These are of course most useful when the dimension is 2 or 3, so that we can
visualize the set as an actual ellipse or ellipsoid.
\subsubsection{Coding example}
\begin{verbatim}
library(rgl)  # for open3d, plot3d, ellipse3d
fit = lm(mpg ~ disp + hp, data = mtcars)
open3d()
plot3d(ellipse3d(fit), col = "red", alpha = .5, aspect = TRUE)
## Doing it directly
beta = coef(fit)
Sigma = vcov(fit)
n = nrow(mtcars); p = length(beta)
## q = 3 parameters, so the cutoff is 3 F_{.95}(3, n - p)
A = Sigma * (3 * qf(.95, 3, n - p))
nms = names(beta)
open3d()
## Using the definition of an ellipsoid; ellipse3d treats its matrix
## argument as a covariance, plotting (x - b)' A^{-1} (x - b) = t^2
plot3d(ellipse3d(A, centre = beta, t = 1),
       color = "blue", alpha = .5, aspect = TRUE,
       xlab = nms[1], ylab = nms[2], zlab = nms[3])
## Using the more statistical version: provide ellipse3d
## with the variance-covariance matrix directly
plot3d(ellipse3d(Sigma, centre = beta, t = sqrt(3 * qf(.95, 3, n - p))),
       color = "green", alpha = .5, aspect = TRUE,
       xlab = nms[1], ylab = nms[2], zlab = nms[3], add = TRUE)
\end{verbatim}
\subsection{Confidence intervals}
Consider creating a confidence interval for $\bfc' \bfbet$. Begin by recalling that
\begin{eqnarray*}
\frac{\bfc' \hat \bfbet - \bfc'\bfbet}{s \sqrt{\bfc' (\bfX' \bfX)^{-1} \bfc }} \sim t_{n-p}.
\end{eqnarray*}
Thus, we can show that a $100(1-\alpha)\%$ confidence interval for $\bfc' \bfbet$ is given by:
\begin{eqnarray*}
\bfc' \hat \bfbet \pm t_{1-\alpha/2, n-p} \, s \sqrt{\bfc' (\bfX' \bfX)^{-1} \bfc }.
\end{eqnarray*}
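As a quick check, the interval for a single coefficient can be computed manually and
compared with R's \texttt{confint}. Here is a minimal sketch using the \texttt{swiss}
fit from the earlier coding example (the choice of the Agriculture coefficient is
arbitrary):
\begin{verbatim}
fit = lm(Fertility ~ ., data = swiss)
x = cbind(1, as.matrix(swiss[,-1]))
n = nrow(x); p = ncol(x)
xtxinv = solve(t(x) %*% x)
s = summary(fit)$sigma
cc = c(0, 1, 0, 0, 0, 0)  # picks out the Agriculture coefficient
sum(cc * coef(fit)) +
  qt(c(.025, .975), n - p) * s * sqrt(drop(t(cc) %*% xtxinv %*% cc))
## compare with R's built-in intervals
confint(fit)["Agriculture", ]
\end{verbatim}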
\subsection{Prediction intervals}
It's worth discussing prediction intervals. The obvious prediction
at a new set of covariates, $\bx_0$, is $\bx_0' \hat \bbeta$. This is
just a linear contrast of $\bbeta$, and so the interval for the mean response would be
$$
\bx_0' \hat \bbeta
\pm t_{n-p, 1-\alpha/2} s \sqrt{\bx_0' \xtxinv \bx_0}.
$$
If you've taken an introductory regression class, you will have noted
the difference between a prediction interval and a confidence interval
for the mean at a new value of $\bx$. For a prediction interval, we
want to estimate a range of possible values for $y$ at that value
of $\bx$, a different statement than trying to estimate the average
value of $y$ at that value of $\bx$. As we collect more and more data,
we can estimate the average value exactly. However, predicting a new
value involves intrinsic variability that can't go away no matter
how much data we use to build our model.
As an example, consider the following two tasks:
(i) guessing the sales price of a diamond given its weight; and (ii)
guessing the average sales price of diamonds
given a particular weight. With enough data, we should be able to estimate the average
sale price very precisely. However, we still won't know the exact sales
price of an individual diamond of that weight.
To account for this, we develop prediction intervals. These
are not confidence intervals, because they are trying to estimate
something random, not a fixed parameter.
Consider predicting a new response $y_0$ at covariate value $\bx_0$.
Since $y_0$ is a new observation, it is independent of $\hat \bbeta$, so that
$$
\Var(y_0 - \bx_0' \hat \bbeta) = \Var(y_0) + \Var(\bx_0' \hat \bbeta)
= (1 + \bx_0' (\bfX' \bfX)^{-1} \bx_0)\sigma^2.
$$
For homework,
show that
$$
\frac{y_0 - \bx_0' \hat \bbeta}{s\sqrt{1 + \bx_0' (\bfX' \bfX)^{-1} \bx_0}}
$$
follows a $T$ distribution with $n-p$ degrees of freedom. Finish by showing that
$$
P\{y_0 \in [\bx_0' \hat \bbeta
\pm t_{n-p, 1-\alpha/2} s \sqrt{1 + \bx_0' (\bfX' \bfX)^{-1} \bx_0}]\} = 1 - \alpha.
$$
This is called a prediction interval. Notice that the variability under consideration
contains two pieces: $\bx_0' (\bfX' \bfX)^{-1} \bx_0$, which goes to 0 as we accumulate
more $\bX$ variability, and $1$, which represents the intrinsic part of the variability
that doesn't go away.
\subsection{Coding example}
Let's try to predict a car's MPG from other characteristics.
\begin{verbatim}
> fit = lm(mpg ~ hp + wt, data = mtcars)
> newcar = data.frame(hp = 90, wt = 2.2)
> predict(fit, newdata = newcar)
1
25.83648
> predict(fit, newdata = newcar, interval = "confidence")
fit lwr upr
1 25.83648 24.46083 27.21212
> predict(fit, newdata = newcar, interval = "prediction")
fit lwr upr
1 25.83648 20.35687 31.31609
> #Doing it manually
> library(dplyr)
> y = mtcars$mpg
> x = as.matrix(cbind(1, select(mtcars, hp, wt)))
> n = length(y)
> p = ncol(x)
> xtxinv = solve(t(x) %*% x)
> beta = xtxinv %*% t(x) %*% y
> x0 = c(1, 90, 2.2)
> yhat = x %*% beta
> e = y - yhat
> s = sqrt(sum(e^2) / (n - p))
> yhat0 = sum(x0 * beta)
> # confidence interval
> yhat0 + qt(c(0.025, .975), n - p) * s * sqrt(drop(t(x0) %*% xtxinv %*% x0))
[1] 24.46083 27.21212
> # prediction interval
> yhat0 + qt(c(0.025, .975), n - p) * s * sqrt(1 + drop(t(x0) %*% xtxinv %*% x0))
[1] 20.35687 31.31609
\end{verbatim}
\subsection{Confidence interval for the variance}
We can use the fact that $\frac{(n-p) s^2}{\sigma^2}$ is Chi-squared with $n-p$
degrees of freedom to develop a confidence interval
for the variance. Let $\chi^2_{n-p, \alpha}$ be the $\alpha$ quantile
of the Chi-squared distribution with $n-p$ degrees of freedom.
Then inverting the probability statement
$$
1 - \alpha =
P\left(
\chi^2_{n-p, \alpha/2}
\leq \frac{(n-p) s^2}{\sigma^2}
\leq
\chi^2_{n-p,1 - \alpha/2}
\right)
$$
yields the $100(1-\alpha)\%$ confidence interval
$$
\frac{(n-p)s^2}{\chi^2_{n-p, 1-\alpha/2}}
\leq \sigma^2
\leq
\frac{(n-p)s^2}{\chi^2_{n-p, \alpha/2}}.
$$
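Continuing the \texttt{swiss} example from earlier (our choice of data and of
$\alpha = 0.05$, for illustration), a minimal sketch of this calculation:
\begin{verbatim}
fit = lm(Fertility ~ ., data = swiss)
n = nrow(swiss); p = length(coef(fit))
s2 = summary(fit)$sigma^2
## note: the upper quantile gives the lower endpoint and vice versa
(n - p) * s2 / qchisq(c(.975, .025), n - p)
\end{verbatim}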