# Expectations and Variance
Almost all quantities of interest in statistics, ranging from
parameter estimates to event probabilities and forecasts, can be
expressed as expectations of random variables. Expectation ties
directly to simulation because expectations are computed as averages
of samples of those random variables.
Variance is a measure of the variation of a random variable. It's
also defined as an expectation. Curiously, it is on the quadratic
scale---it's defined in terms of squared differences from expected
values. This chapter will show why this is the natural measure of
variation and how it relates to expectations.
## The expectation of a random variable
Suppose we have a discrete random variable $Y$. Its *expectation* is
defined as its average value, which is a weighted average of its
values, where the weights are the probabilities.
$$
\mathbb{E}\!\left[ Y \right]
\ = \
\sum_{y \in Y} \, y \times p_Y(y).
$$
The notation $y \in Y$ is meant to indicate the summation is over all
possible values $y$ that the random variable $Y$ may take on.
For example, if $Y$ is the result of a fair coin flip, then
$$
\mathbb{E}\left[ Y \right]
\ = \
0 \times \frac{1}{2} + 1 \times \frac{1}{2}
\ = \
\frac{1}{2}.
$$
Now suppose $Y$ is the result of a fair six-sided die roll. The
expected value of $Y$ is
$$
\mathbb{E} \left[ Y \right]
\ = \
1 \times \frac{1}{6}
+ 2 \times \frac{1}{6}
+ \cdots
+ 6 \times \frac{1}{6}
\ = \
\frac{21}{6}
\ = \
3.5.
$$
Now suppose $U$ is the result of a fair twenty-sided die roll. Using
the same formula as above, the expectation works out to
$\frac{210}{20} = 10.5$.^[In general, the expectation of a die roll is
half the number of faces plus 0.5, because $$\sum_{n=1}^N n \times
\frac{1}{N} \ = \ \frac{1}{N} \sum_{n=1}^N n \ = \ \frac{1}{N}\frac{N \times (N +
1)}{2} \ = \frac{N + 1}{2}.$$]
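The weighted-sum definition is easy to evaluate directly in R. The
following sketch defines an illustrative helper, `die_expectation`
(not part of the text above), that computes the expectation of a fair
$N$-sided die and reproduces the $(N + 1) / 2$ formula.
```{r}
# expectation of a fair N-sided die as a probability-weighted sum of its values
die_expectation <- function(N) sum((1:N) * (1 / N))
die_expectation(6)    # 3.5
die_expectation(20)   # 10.5, matching (N + 1) / 2
```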
## Expectations as simulation averages
We have been computing expectations all along for event
probabilities. Expectations are the natural calculation for
simulations, because
$$
\mathbb{E}\left[ Y \right]
\ = \
\lim_{M \rightarrow \infty} \frac{1}{M} \sum_{m = 1}^M y^{(m)},
$$
where the $y^{(m)}$ are individual simulations of $Y$.
For any finite $M$, averaging the simulated values gives an estimate of
the expectation,
$$
\mathbb{E}\left[ Y \right]
\ \approx \
\frac{1}{M} \sum_{m = 1}^M y^{(m)}.
$$
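For example, here is a minimal sketch in R that estimates the expected
value of a fair six-sided die by averaging $M$ simulated rolls (using
the same `sample`-based simulation that appears later in this chapter).
```{r}
# estimate E[Y] for a fair six-sided die by averaging M simulated rolls
M <- 1e5
y <- sample(6, size = M, replace = TRUE)
mean(y)   # should be close to the exact value 3.5
```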
## Event probabilities as expectations of indicators
The event probability estimates from sampling can be viewed as
calculating the expectation of the value of an indicator function.
For example,^[The expectation is implicitly the expectation of
a new random variable defined as $Z = \mathrm{I}[Y = 1].$]
$$
\begin{array}{rcl}
\mathrm{Prob}\!\left[ Y = 1 \right]
& = &
\mathbb{E}\left[\mathrm{I}\left[Y = 1\right]\right]
\\[6pt]
& \approx &
\displaystyle \frac{1}{M} \sum_{m = 1}^M \, \mathrm{I}\!\left[y^{(m)} = 1\right].
\end{array}
$$
The indicator function converts an event to a numerical 1 or 0, and then
the expectation does the averaging. The second line provides an
estimate of the expectation based on a finite number of simulations
$y^{(m)}$ of the random variable $Y$ used to define the event.
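As a concrete check, the following sketch estimates
$\mathrm{Prob}\!\left[ Y = 1 \right]$ for a fair coin by averaging the
indicator over simulated flips (here simulated with R's built-in `rbinom`).
```{r}
# estimate Prob[Y = 1] for a fair coin as the average of an indicator
M <- 1e5
y <- rbinom(M, size = 1, prob = 0.5)   # M simulated fair coin flips
mean(y == 1)                           # should be close to 0.5
```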
## The variance of a random variable
Variance is a measure of how much a random variable can be expected to
vary around its expected value. For example, consider a random
variable $Y$ representing a fair throw of a six-sided die. The
expected value is $\mathbb{E}[Y] = 3.5$, but the result may vary
between 1 and 6.
Variance measures the expected square difference between a random
variable and its expected value.
$$
\mathrm{var}[Y]
\ = \
\mathbb{E}\!\left[
\left( Y - \mathbb{E}[Y] \right)^2
\right].
$$
The nested expectation, $\mathbb{E}[Y]$, is just the expectation
of $Y$, and as such it's a constant. The expectation operator
$\mathbb{E}[\, \cdot \,]$ captures its random variables. Thus we
could have written this as
$$
\mathbb{E}\!\left[ \left(Y - c\right)^2 \right],
$$
where the constant $c = \mathbb{E}[Y]$.
In our example of the six-sided die, the expectation is the average
roll, $\mathbb{E}[Y] = 3.5$. As is often the case with averages of
discrete random variables, 3.5 is not itself a possible value of the
variable. That is, it's not possible to roll a 3.5 on a six-sided
die.
We can continue working out the expectation notation for our example,
expanding
$$
\begin{array}{rcl}
\mathbb{E}\!\left[ \left(Y - 3.5\right)^2 \right]
& = &
\sum_{y \in Y} p_Y(y) \times \left(y - 3.5\right)^2
\\[6pt]
& = &
\sum_{y \in 1:6} \frac{1}{6} \times \left(y - 3.5\right)^2
\\[6pt]
& = &
\frac{1}{6}
\left(
(1 - 3.5)^2
+ (2 - 3.5)^2
+ (3 - 3.5)^2
+ (4 - 3.5)^2
+ (5 - 3.5)^2
+ (6 - 3.5)^2
\right)
\\[6pt]
& = & \frac{1}{6} \left(
6.25 + 2.25 + 0.25 + 0.25 + 2.25 + 6.25
\right)
\\[6pt]
& \approx & 2.92.
\end{array}
$$
We can, of course, use simulation to calculate variances.
```
sum_sq_diffs = 0
for (m in 1:M) {
  y = uniform_rng(1:6)
  sum_sq_diffs += (y - 3.5)^2
}
print 'estimated var[Y] = ' sum_sq_diffs / M
```
And running that for $M = 1\,000\,000$ iterations gives us
```{r}
M <- 1e6
printf('estimated var[Y] = %3.2f\n',
       sum((sample(6, size = M, replace = TRUE) - 3.5)^2) / M)
```
We see that the extreme values have an outsized influence on the sum
of squared errors. That's because we're dealing with squared error,
which reduces small errors (e.g., $0.5^2 = 0.25$) and magnifies large
errors (e.g., $5^2 = 25$). We can see a plot of the squared
differences from the mean that would be involved in calculating
the variance of a 20-sided die.
```{r fig.cap="Squared differences from the mean of 10.5 for a 20-sided die roll. The mean is indicated with a dashed vertical line."}
d20_var_df <- data.frame(roll = 1:20, sqe = (1:20 - 10.5)^2)
d20_var_plot <-
ggplot(d20_var_df, aes(x = roll, y = sqe)) +
geom_line() +
geom_point() +
geom_vline(xintercept = 10.5, size = 0.3, linetype = "dashed",
color = "#333333") +
scale_x_continuous(breaks=c(1, 5, 10, 11, 15, 20)) +
ylab("squared difference from mean") +
ggtheme_tufte()
d20_var_plot
```
## Why squared error?
Why are we calculating average squared difference from the expectation
rather than, say, average absolute difference? The answer has to do
with the nature of averages. It turns out the average is the $x$
that minimizes the sum of squared differences,
$$
\mathrm{sqe}(x, y) = \sum_{n \in 1:N} (y_n - x)^2,
$$
or equivalently, minimizes the average (or mean) squared difference,
$$
\mathrm{msqe}(x, y) = \frac{1}{N} \sum_{n \in 1:N} (y_n - x)^2.
$$
That is,
$$
\mathrm{avg}(y)
\ = \
\frac{1}{N} \sum_{n = 1}^N y_n
\ = \
\mathrm{argmin}_x \ \mathrm{msqe}(x, y).
$$
The notation $\mathrm{argmin}_x \ f(x)$ is meant to return the $x$
at which $f(x)$ takes on its minimum value. For example,
$\mathrm{argmin}_u (u - 1)^2 + 7 = 1$, whereas the minimum value
is given by $\mathrm{min}_u (u - 1)^2 + 7 = 7$.
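The claim that the average minimizes mean squared error is easy to
check numerically. The sketch below uses R's built-in `optimize` on an
arbitrary, made-up data vector and compares the minimizer to the mean.
```{r}
# check numerically that the mean minimizes mean squared error
y <- c(1.2, 3.7, 2.5, 8.1, 4.4)
msqe <- function(x) mean((y - x)^2)
optimize(msqe, interval = range(y))$minimum   # approximately mean(y)
mean(y)
```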
Averages themselves are important because expectations are defined in
terms of averages, and event probabilities are defined in terms of
expectations. Furthermore, the central limit theorem governs the
convergence of averages as estimates of quantities of interest in
terms of squared error.
## Standard deviations for scale correction
The difficulty in managing variance is that it has different units
than the variable it is measuring the variation of. For example, if
$Y$ represents the fill of a bottle with units in milliliters, then
$\mathrm{var}[Y]$ has units of squared milliliters. Squared
milliliters are not natural. As we saw even with the calculation for
a single six-sided die, values of 1 and 6 had 6.25 squared difference
from the expected value of 3.5, whereas 3 and 4 had only 0.25 squared
difference from the expected value. Small values (below one) get
smaller ($0.5^2 = 0.25$) and large values are amplified ($2.5^2 =
6.25$).
Instead of dealing with variance, with its quadratic scale and
mismatched units, it is common to convert back at the end to ordinary
units by taking a square root. The *standard deviation* of a random
variable is defined as the square root of its variance,
$$
\mathrm{sd}[Y] = \sqrt{\mathrm{var}\left[ Y \right]}.
$$
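Continuing the six-sided die example, a sketch mirroring the earlier
variance simulation shows that taking the square root brings the
measure of variation back to the original scale of the die.
```{r}
# estimate sd[Y] for a fair six-sided die as the square root of the variance
M <- 1e6
y <- sample(6, size = M, replace = TRUE)
sqrt(sum((y - 3.5)^2) / M)   # should be close to sqrt(35 / 12), roughly 1.71
```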
While it is not known why Karl Pearson coined the term "standard
deviation",^[Pearson, K., 1893. Contributions to the mathematical
theory of evolution. *Philosophical Transactions of the Royal Society
of London, Series A*. Nov 16: 329--333, which on page 330 coins the
term stating only, "standard deviations---a term used in the memoir
for what corresponds in frequency curves to the error of mean
square."] the key consideration here is that $\mathrm{sd}[Y]$ has the
same units as the variable $Y$ itself and is thus much easier to
interpret.