\section{Distributional results}
In this chapter we assume that $\bY \sim N(\bX \bbeta, \sigma^2 \bI)$. Under this assumption we will investigate some distributional results for the least-squares estimate $\hat \bbeta$ and the residual variance estimate $s^2$.
\subsection{Quadratic forms}
\label{sec:qf}
Let $\bA$ be a symmetric matrix (not necessarily full rank)
and let $\bX \sim N(\bmu, \bSigma)$. Then
$(\bX - \bmu)^t \bA (\bX - \bmu) \sim \chi^2_{p}$ if
$\bA \bSigma$ is idempotent and $p=\mbox{Rank}(\bA)$.

As an example, note that $(\bX - \bmu)^t \bSigma^{-1} (\bX - \bmu)$
is clearly Chi-squared($n$). This is most easily seen from the fact that
$\bSigma^{-1/2}(\bX - \bmu) = \bZ$ is a vector of iid standard normals,
so that the quadratic form is the sum of their squares. In terms of
our result, $\bA = \bSigma^{-1}$ in this case and
$\bA \bSigma = \bI$, which is idempotent; the rank of $\bSigma^{-1}$ is $n$.

Let's prove our result. Let $\bA = \bV \bD^2 \bV^t$ where $\bD$ is diagonal
with $p$ non-zero entries and $\bV^t \bV = \bI$. The assumption of idempotency
gives us that $\bA \bSigma \bA \bSigma = \bA \bSigma$. Plugging our decomposition
for $\bA$ into this equation and using the orthonormality of the columns of $\bV$,
we get that $\bD \bV^t \bSigma \bV \bD = \bI$. Then note that
\begin{equation}
\label{eq:quad}
(\bX - \bmu)^t \bA (\bX - \bmu) = (\bX - \bmu)^t \bV \bD \bD \bV^t (\bX - \bmu).
\end{equation}
But $\bD \bV^t (\bX - \bmu) \sim N(\bzero, \bD \bV^t \bSigma \bV \bD)$,
which has variance equal to $\bI$. Thus the right-hand side of Equation
\eqref{eq:quad} is the sum of $p$ squared iid standard normals, and is thus
Chi-squared with $p$ degrees of freedom.
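A quick simulation can make this result concrete. Below is a minimal sketch
(the dimension, mean, and covariance are arbitrary illustrative choices)
checking that, with $\bA = \bSigma^{-1}$, the quadratic form behaves like a
Chi-squared with $n$ degrees of freedom:
\begin{verbatim}
library(MASS)  # for mvrnorm
set.seed(1)
n = 3
mu = c(1, 2, 3)
Sigma = diag(3) + .5  # an arbitrary positive definite covariance
A = solve(Sigma)      # A Sigma = I, which is idempotent with rank n
sims = replicate(10000, {
  x = mvrnorm(1, mu, Sigma)
  drop(t(x - mu) %*% A %*% (x - mu))
})
## a Chi-squared n has mean n and variance 2 n
c(mean(sims), n)
c(var(sims), 2 * n)
\end{verbatim}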
\subsection{Statistical properties}
Note that $\hat \bbeta =(\bfX' \bfX)^{-1} \bfX' \bfy$ is a linear transformation of the normally distributed vector $\bfy$, and hence is itself normally distributed, with moments:
$$
E[\hat \bbeta] = \bbeta ~~~\mbox{and}~~~ \Var(\hat \bbeta) = (\bfX' \bfX)^{-1} \sigma^2.
$$
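These moments follow from the standard rules for linear transformations of a random vector:
\begin{eqnarray*}
E[\hat \bbeta] & = & (\bfX' \bfX)^{-1} \bfX' E[\bfy] = (\bfX' \bfX)^{-1} \bfX' \bfX \bbeta = \bbeta \\
\Var(\hat \bbeta) & = & (\bfX' \bfX)^{-1} \bfX' \Var(\bfy) \bfX (\bfX' \bfX)^{-1} = (\bfX' \bfX)^{-1} \sigma^2,
\end{eqnarray*}
using $\Var(\bfy) = \sigma^2 \bI$.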
The residual variance estimate is
\begin{eqnarray*}
s^2 &=& \frac{1}{n-p}\be' \be,
\end{eqnarray*}
which we previously showed was unbiased, i.e. $E[s^2] = \sigma^2$.
In addition, note that
\begin{eqnarray*}
\frac{n-p}{\sigma^2} s^2 & = & \frac{1}{\sigma^2}\by' (\bI - \bfX (\bfX' \bfX)^{-1} \bfX') \by
\end{eqnarray*}
is a quadratic form of the kind discussed in Section \ref{sec:qf}. (Since
$\{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\} \bfX \bbeta = \bzero$, we may replace $\by$ by the
centered vector $\by - \bfX \bbeta$ in this expression, so the result of that section applies.)
Furthermore,
$$
\frac{1}{\sigma^2}\{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\} \Var(\bfy) = \{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\},
$$
which is idempotent. For symmetric idempotent matrices, the rank equals
the trace, and the trace is easily calculated as
$$
\mbox{tr}\{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\} = \mbox{tr}\{\bI\} - \mbox{tr}\{\bfX (\bfX' \bfX)^{-1} \bfX'\}
= n - \mbox{tr}\{ (\bfX' \bfX)^{-1} \bfX' \bfX\} = n - p.
$$
Thus, $\frac{n-p}{\sigma^2} s^2$ is Chi-squared with $n-p$ degrees of freedom. The special
case where $\bX$ contains only an intercept yields the usual
empirical variance estimate.
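As a sanity check, we can simulate from the model and verify this result numerically.
Here is a minimal sketch (the design matrix, $\bbeta$, and $\sigma$ are arbitrary
choices for illustration):
\begin{verbatim}
## Monte Carlo check that (n - p) s^2 / sigma^2 is Chi-squared n - p
set.seed(1)
n = 50; sigma = 2
x = cbind(1, rnorm(n), rnorm(n))  # arbitrary design with an intercept
p = ncol(x)
beta = c(1, 2, 3)
stats = replicate(10000, {
  y = x %*% beta + rnorm(n, sd = sigma)
  e = y - x %*% solve(t(x) %*% x, t(x) %*% y)
  sum(e^2) / sigma^2
})
## a Chi-squared n - p has mean n - p and variance 2 (n - p)
c(mean(stats), n - p)
c(var(stats), 2 * (n - p))
\end{verbatim}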
\subsection{T statistics}
We are now in a position to develop $T$ statistics.
Consider the linear contrast $\bfc' \hat \bbeta$. First note that
$\bfc' \hat \bbeta$ is $N(\bfc' \bbeta, \bfc' (\bfX' \bfX)^{-1} \bfc \sigma^2)$.
Furthermore,
$$\Cov(\hat \bbeta, \be) = \Cov((\bfX' \bfX)^{-1} \bX' \bY, \{\bI - \bfX (\bfX' \bfX)^{-1} \bfX' \} \bY) = 0$$
since $\xtxinv \bX' \{\bI - \bfX (\bfX' \bfX)^{-1} \bfX'\} = \bzero$. Because $\hat \bbeta$
and $\be$ are jointly normal, zero covariance implies that the residuals
and estimated coefficients are independent, and hence that $\bfc'\hat \bbeta$
and $s^2$ are independent.
Therefore,
$$
\frac{\bfc' \hat \bbeta - \bfc' \bbeta }{\sqrt{\bfc' (\bfX' \bfX)^{-1} \bfc \sigma^2}}
/ \sqrt{\frac{n-p}{\sigma^2} s^2 / (n-p)}
=\frac{\bfc' \hat \bbeta - \bfc'\bbeta}{\sqrt{\bfc' (\bfX' \bfX)^{-1} \bfc s^2}}
$$
is a standard normal divided by the square root of an independent Chi-squared
over its degrees of freedom, and is thus $T$ distributed with $n-p$ degrees of freedom.
Taking $\bfc$ to have a one in the position of a single coefficient and zeros elsewhere
yields the usual coefficient $t$ statistics reported by R.
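For a general contrast, we can compute the statistic manually. Here is a sketch
using R's built-in \texttt{swiss} dataset (also used in the coding example below);
the particular contrast, Agriculture minus Examination, is an arbitrary choice for
illustration:
\begin{verbatim}
## t test for the contrast c' beta
x = cbind(1, as.matrix(swiss[,-1]))
y = swiss$Fertility
n = nrow(x); p = ncol(x)
xtxinv = solve(t(x) %*% x)
beta = xtxinv %*% t(x) %*% y
s = sqrt(sum((y - x %*% beta)^2) / (n - p))
cc = c(0, 1, -1, 0, 0, 0)  # Agriculture minus Examination
tstat = sum(cc * beta) / (s * sqrt(drop(t(cc) %*% xtxinv %*% cc)))
2 * pt(-abs(tstat), n - p)  # two-sided p-value
\end{verbatim}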
\subsection{F statistic}
\label{sec:ftest}
Consider testing the hypothesis $H_0:\bK \bbeta = \bzero$ versus the alternative
$H_a:\bK \bbeta \neq \bzero$, where $\bK$ has full row rank $q$.
Notice that $\bK \hat \bbeta \sim N(\bK \bbeta, \bK (\bfX' \bfX)^{-1} \bK' \sigma^2)$ and thus
$$
(\bK \hat \bbeta - \bK \bbeta)' \{\bK (\bfX' \bfX)^{-1} \bK' \sigma^2 \}^{-1} (\bK \hat \bbeta - \bK \bbeta)
$$
is Chi-squared with $q$ degrees of freedom. Furthermore, being a function of
$\hat \bbeta$, it is independent of $\be$ and hence of $s^2$. Thus:
$$
(\bK \hat \bbeta - \bK \bbeta)' \{\bK (\bfX' \bfX)^{-1} \bK'\}^{-1} (\bK \hat \bbeta - \bK \bbeta) / (q s^2)
$$
is the ratio of two independent Chi-squared random variables, each divided by its
degrees of freedom, and therefore follows an F distribution with $q$ and $n-p$ degrees of freedom.
\subsection{Coding example}
Consider the \texttt{swiss} dataset. Let's first make sure that we can replicate
the coefficient table obtained by R.
\begin{verbatim}
> ## First let's see the coefficient table
> fit = lm(Fertility ~ ., data = swiss)
> round(summary(fit)$coef, 3)
Estimate Std. Error t value Pr(>|t|)
(Intercept) 66.915 10.706 6.250 0.000
Agriculture -0.172 0.070 -2.448 0.019
Examination -0.258 0.254 -1.016 0.315
Education -0.871 0.183 -4.758 0.000
Catholic 0.104 0.035 2.953 0.005
Infant.Mortality 1.077 0.382 2.822 0.007
> # Now let's do it more manually
> x = cbind(1, as.matrix(swiss[,-1]))
> y = swiss$Fertility
> beta = solve(t(x) %*% x, t(x) %*% y)
> e = y - x %*% beta
> n = nrow(x); p = ncol(x)
> s = sqrt(sum(e^2) / (n - p))
> #Compare with lm
> c(s, summary(fit)$sigma)
[1] 7.165369 7.165369
> #calculate the t statistics
> betaVar = solve(t(x) %*% x) * s ^ 2
> ## Show that standard errors agree with lm
> cbind(summary(fit)$coef[,2], sqrt(diag(betaVar)))
[,1] [,2]
(Intercept) 10.70603759 10.70603759
Agriculture 0.07030392 0.07030392
Examination 0.25387820 0.25387820
Education 0.18302860 0.18302860
Catholic 0.03525785 0.03525785
Infant.Mortality 0.38171965 0.38171965
> # Show that the tstats agree
> tstat = beta / sqrt(diag(betaVar))
> cbind(summary(fit)$coef[,3], tstat)
[,1] [,2]
(Intercept) 6.250229 6.250229
Agriculture -2.448142 -2.448142
Examination -1.016268 -1.016268
Education -4.758492 -4.758492
Catholic 2.952969 2.952969
Infant.Mortality 2.821568 2.821568
> # Show that the P-values agree
> cbind(summary(fit)$coef[,4], 2 * pt(- abs(tstat), n - p))
[,1] [,2]
(Intercept) 1.906051e-07 1.906051e-07
Agriculture 1.872715e-02 1.872715e-02
Examination 3.154617e-01 3.154617e-01
Education 2.430605e-05 2.430605e-05
Catholic 5.190079e-03 5.190079e-03
Infant.Mortality 7.335715e-03 7.335715e-03
> # Get the F statistic
> # Set K to grab everything except the intercept
> k = cbind(0, diag(rep(1, p - 1)))
> kvar = k %*% solve(t(x) %*% x) %*% t(k)
> fstat = t(k %*% beta) %*% solve(kvar) %*% (k %*% beta) / (p - 1) / s ^ 2
> #Show that it's equal to what lm is giving
> c(summary(fit)$fstat[1], fstat)
> #Calculate the p-value
> pf(fstat, p - 1, n - p, lower.tail = FALSE)
[,1]
[1,] 5.593799e-10
> summary(fit)
## ... only showing the one relevant line ...
F-statistic: 19.76 on 5 and 41 DF, p-value: 5.594e-10
\end{verbatim}
\subsection{Confidence ellipsoids}
A hyper-ellipsoid with center $\bv$ is defined as the set of solutions in $\bx$ of
$(\bx - \bv)' \bA (\bx - \bv)=1$. The eigenvalues of $\bA$ determine the lengths
of the axes of the ellipsoid. The set of points $\{\bx ~|~ (\bx - \bv)' \bA (\bx - \bv) \leq 1\}$
lies in the interior of the hyper-ellipsoid.
Now consider our F statistic from Section \ref{sec:ftest}:
$$
(\bK \hat \bbeta - \bm)' \{\bK \xtxinv \bK'\}^{-1} (\bK \hat \bbeta - \bm) / (q s^2).
$$
We would fail to reject $H_0:\bK \bbeta = \bm$ if this statistic is less than the
appropriate cutoff from an $F$ distribution, say $F_{1-\alpha}$. So, the set of points
$$
\{\bm ~|~
(\bK \hat \bbeta - \bm)' \{\bK \xtxinv \bK'\}^{-1} (\bK \hat \bbeta - \bm) / (q s^2 F_{1-\alpha}) \leq 1
\}
$$
forms a $100(1-\alpha)\%$ confidence set. From the discussion above, we see that this is a
hyper-ellipsoid. This multivariate form of a confidence interval is called a confidence
ellipsoid. These are of course most useful when the dimension is 2 or 3, so that we can
visualize the set as an actual ellipse or ellipsoid.
\subsubsection{Coding example}
\begin{verbatim}
library(rgl)  # for open3d, plot3d, ellipse3d
fit = lm(mpg ~ disp + hp, data = mtcars)
open3d()
plot3d(ellipse3d(fit), col = "red", alpha = .5, aspect = TRUE)
## Doing it directly
beta = coef(fit)
Sigma = vcov(fit)
n = nrow(mtcars); p = length(beta)
## q = 3 parameters, so the cutoff is 3 F_{.95}(3, n - p)
A = Sigma * (3 * qf(.95, 3, n - p))
nms = names(beta)
open3d()
## Using the definition of an ellipsoid; ellipse3d treats its matrix
## argument as a covariance, plotting (x - b)' A^{-1} (x - b) = t^2
plot3d(ellipse3d(A, centre = beta, t = 1),
       color = "blue", alpha = .5, aspect = TRUE,
       xlab = nms[1], ylab = nms[2], zlab = nms[3])
## Using the more statistical version: provide ellipse3d
## with the variance-covariance matrix directly
plot3d(ellipse3d(Sigma, centre = beta, t = sqrt(3 * qf(.95, 3, n - p))),
       color = "green", alpha = .5, aspect = TRUE,
       xlab = nms[1], ylab = nms[2], zlab = nms[3], add = TRUE)
\end{verbatim}
\subsection{Confidence intervals}
Consider creating a confidence interval for $\bfc' \bfbet$. Begin by recalling that
\begin{eqnarray*}
\frac{\bfc' \hat \bfbet - \bfc'\bfbet}{s \sqrt{\bfc' (\bfX' \bfX)^{-1} \bfc }} \sim t_{n-p}.
\end{eqnarray*}
Thus, we can show that a $100(1-\alpha)\%$ confidence interval for $\bfc' \bfbet$ is given by:
\begin{eqnarray*}
\bfc' \hat \bfbet \pm t_{1-\alpha/2, n-p} \, s \sqrt{\bfc' (\bfX' \bfX)^{-1} \bfc }.
\end{eqnarray*}
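As a quick check, the interval for a single coefficient can be computed manually and
compared with R's \texttt{confint}. Here is a minimal sketch using the \texttt{swiss}
fit from the earlier coding example (the choice of the Agriculture coefficient is
arbitrary):
\begin{verbatim}
fit = lm(Fertility ~ ., data = swiss)
x = cbind(1, as.matrix(swiss[,-1]))
n = nrow(x); p = ncol(x)
xtxinv = solve(t(x) %*% x)
s = summary(fit)$sigma
cc = c(0, 1, 0, 0, 0, 0)  # picks out the Agriculture coefficient
sum(cc * coef(fit)) +
  qt(c(.025, .975), n - p) * s * sqrt(drop(t(cc) %*% xtxinv %*% cc))
## compare with R's built-in intervals
confint(fit)["Agriculture", ]
\end{verbatim}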
\subsection{Prediction intervals}
It's worth discussing prediction intervals. The obvious prediction
at a new set of covariates, $\bx_0$, is $\bx_0' \hat \bbeta$. This is
just a linear contrast of $\bbeta$, and so the interval for the mean response would be
$$
\bx_0' \hat \bbeta
\pm t_{n-p, 1-\alpha/2} s \sqrt{\bx_0' \xtxinv \bx_0}.
$$
If you've taken an introductory regression class, you will have noted
the difference between a prediction interval and a confidence interval
for the mean at a new value of $\bx$. For a prediction interval, we
want to estimate a range of possible values for $y$ at that value
of $\bx$, a different statement than trying to estimate the average
value of $y$ at that value of $\bx$. As we collect more and more data,
we can estimate the average value exactly. However, predicting a new
value involves intrinsic variability that can't go away no matter
how much data we use to build our model.
As an example, consider the following two tasks:
(i) guessing the sales price of a diamond given its weight; and (ii)
guessing the average sales price of diamonds
given a particular weight. With enough data, we should be able to estimate the average
sale price very precisely. However, we still won't know the exact sales
price of an individual diamond of that weight.
To account for this, we develop prediction intervals. These
are not confidence intervals, because they are trying to estimate
something random, not a fixed parameter.
Consider predicting a new response $y_0$ at covariate value $\bx_0$.
Since $y_0$ is a new observation, it is independent of $\hat \bbeta$, so that
$$
\Var(y_0 - \bx_0' \hat \bbeta) = \Var(y_0) + \Var(\bx_0' \hat \bbeta)
= (1 + \bx_0' (\bfX' \bfX)^{-1} \bx_0)\sigma^2.
$$
For homework,
show that
$$
\frac{y_0 - \bx_0' \hat \bbeta}{s\sqrt{1 + \bx_0' (\bfX' \bfX)^{-1} \bx_0}}
$$
follows a $T$ distribution with $n-p$ degrees of freedom. Finish by showing that
$$
P\{y_0 \in [\bx_0' \hat \bbeta
\pm t_{n-p, 1-\alpha/2} s \sqrt{1 + \bx_0' (\bfX' \bfX)^{-1} \bx_0}]\} = 1 - \alpha.
$$
This is called a prediction interval. Notice that the variability under consideration
contains two pieces: $\bx_0' (\bfX' \bfX)^{-1} \bx_0$, which goes to 0 as we accumulate
more $\bX$ variability, and $1$, which represents the intrinsic part of the variability
that doesn't go away.
\subsection{Coding example}
Let's try to predict a car's MPG from other characteristics.
\begin{verbatim}
> fit = lm(mpg ~ hp + wt, data = mtcars)
> newcar = data.frame(hp = 90, wt = 2.2)
> predict(fit, newdata = newcar)
1
25.83648
> predict(fit, newdata = newcar, interval = "confidence")
fit lwr upr
1 25.83648 24.46083 27.21212
> predict(fit, newdata = newcar, interval = "prediction")
fit lwr upr
1 25.83648 20.35687 31.31609
> #Doing it manually
> library(dplyr)
> y = mtcars$mpg
> x = as.matrix(cbind(1, select(mtcars, hp, wt)))
> n = length(y)
> p = ncol(x)
> xtxinv = solve(t(x) %*% x)
> beta = xtxinv %*% t(x) %*% y
> x0 = c(1, 90, 2.2)
> yhat = x %*% beta
> e = y - yhat
> s = sqrt(sum(e^2) / (n - p))
> yhat0 = sum(x0 * beta)
> # confidence interval
> yhat0 + qt(c(0.025, .975), n - p) * s * sqrt(drop(t(x0) %*% xtxinv %*% x0))
[1] 24.46083 27.21212
> # prediction interval
> yhat0 + qt(c(0.025, .975), n - p) * s * sqrt(1 + drop(t(x0) %*% xtxinv %*% x0))
[1] 20.35687 31.31609
\end{verbatim}
\subsection{Confidence interval for the variance}
We can use the fact that $\frac{(n-p) s^2}{\sigma^2}$ is Chi-squared with $n-p$
degrees of freedom to develop a confidence interval
for the variance. Let $\chi^2_{n-p, \alpha}$ be the $\alpha$ quantile
of the Chi-squared distribution with $n-p$ degrees of freedom.
Then inverting the probability statement
$$
1 - \alpha =
P\left(
\chi^2_{n-p, \alpha/2}
\leq \frac{(n-p) s^2}{\sigma^2}
\leq
\chi^2_{n-p,1 - \alpha/2}
\right)
$$
yields the $100(1-\alpha)\%$ confidence interval
$$
\frac{(n-p)s^2}{\chi^2_{n-p, 1-\alpha/2}}
\leq \sigma^2
\leq
\frac{(n-p)s^2}{\chi^2_{n-p, \alpha/2}}.
$$
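Continuing the \texttt{swiss} example from earlier (our choice of data and of
$\alpha = 0.05$, for illustration), a minimal sketch of this calculation:
\begin{verbatim}
fit = lm(Fertility ~ ., data = swiss)
n = nrow(swiss); p = length(coef(fit))
s2 = summary(fit)$sigma^2
## note: the upper quantile gives the lower endpoint and vice versa
(n - p) * s2 / qchisq(c(.975, .025), n - p)
\end{verbatim}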