```{r, echo = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
```
# What is This Stuff Called Probability?
> Inferential statistical techniques assign precise measures to our uncertainty about possibilities. Uncertainty is measured in terms of *probability*, and therefore we must establish the properties of probability before we can make inferences about it. This chapter introduces the basic ideas of probability. [@kruschkeDoingBayesianData2015, p. 71, *emphasis* in the original]
## The set of all possible events
This snip from page 72 is important (*emphasis* in the original):
> Whenever we ask about how likely an outcome is, we always ask with a set of possible outcomes in mind. This set exhausts all possible outcomes, and the outcomes are all mutually exclusive. This set is called the *sample space*.
## Probability: Outside or inside the head
It's worthwhile to quote this section in full.
> Sometimes we talk about probabilities of outcomes that are "out there" in the world. The face of a flipped coin is such an outcome: We can observe the flip, and the probability of coming up heads can be estimated by observing several flips.
>
> But sometimes we talk about probabilities of things that are not so clearly "out there," and instead are just possible beliefs "inside the head." Our belief about the fairness of a coin is an example of something inside the head. The coin may have an intrinsic physical bias, but now I am referring to our *belief* about the bias. Our beliefs refer to a space of mutually exclusive and exhaustive possibilities. It might be strange to say that we randomly sample from our beliefs, like we randomly sample from a sack of coins. Nevertheless, the mathematical properties of probabilities outside the head and beliefs inside the head are the same in their essentials, as we will see. (pp. 73--74, *emphasis* in the original)
### Outside the head: Long-run relative frequency.
> For events outside the head, it's intuitive to think of probability as being the long-run relative frequency of each possible outcome...
>
> We can determine the long-run relative frequency by two different ways. One way is to approximate it by actually sampling from the space many times and tallying the number of times each event happens. A second way is by deriving it mathematically. These two methods are now explored in turn. (p. 74)
#### Simulating a long-run relative frequency.
Before we try coding the simulation, we'll first load the **tidyverse**.
```{r p_unload, echo = F, warning = F, message = F}
# solves difficulties while publishing
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```
```{r, message = F, warning = F}
library(tidyverse)
```
Now run the simulation.
```{r end_prop, message = F, warning = F}
n <- 500 # specify the total number of flips
p_heads <- 0.5 # specify underlying probability of heads
# Kruschke reported this was the seed he used at the top of page 94
set.seed(47405)
# here we use that seed to flip a coin n times and compute the running proportion of heads at each flip.
# we generate a random sample of n flips (heads = 1, tails = 0)
d <-
tibble(flip_sequence = sample(x = c(0, 1),
prob = c(1 - p_heads, p_heads),
size = n,
replace = T)) %>%
mutate(n = 1:n,
r = cumsum(flip_sequence)) %>%
mutate(run_prop = r / n)
end_prop <-
d %>%
select(run_prop) %>%
slice(n()) %>%
round(digits = 3) %>%
pull()
```
Now we're ready to make Figure 4.1.
```{r, message = F, warning = F, fig.width = 4.5, fig.height = 4}
d %>%
filter(n < 1000) %>% # with n = 500 this filters nothing, but it keeps the plot manageable if you increase n
ggplot(aes(x = n, y = run_prop)) +
geom_hline(yintercept = .5, color = "white") +
geom_line(color = "grey50") +
geom_point(color = "grey50", alpha = 1/4) +
scale_x_log10(breaks = c(1, 2, 5, 10, 20, 50, 200, 500)) +
coord_cartesian(xlim = c(1, 500),
ylim = c(0, 1)) +
labs(title = "Running proportion of heads",
subtitle = paste("Our end proportion =", end_prop),
x = "Flip number",
y = "Proportion of heads") +
theme(panel.grid = element_blank())
```
#### Deriving a long-run relative frequency.
> Sometimes, when the situation is simple enough mathematically, we can derive the exact long-run relative frequency. The case of the fair coin is one such simple situation. The sample space of the coin consists of two possible outcomes, head and tail. By the assumption of fairness, we know that each outcome is equally likely. Therefore, the long-run relative frequency of heads should be exactly one out of two, i.e., $1/2$, and the long-run relative frequency of tails should also be exactly $1/2$. (p. 76)
### Inside the head: Subjective belief.
> To specify our subjective beliefs, we have to specify how likely we think each possible outcome is. It can be hard to pin down mushy intuitive beliefs. In the next section, we explore one way to "calibrate" subjective beliefs, and in the subsequent section we discuss ways to mathematically describe degrees of belief. (p. 76)
### Probabilities assign numbers to possibilities.
> In general, a probability, whether it's outside the head or inside the head, is just a way of assigning numbers to a set of mutually exclusive possibilities. The numbers, called "probabilities," merely need to satisfy three properties [@kolmogorovFoundationsTheoryProbability1956]:
>
> 1. A probability value must be nonnegative (i.e., zero or positive).
> 2. The sum of the probabilities across all events in the entire sample space must be $1.0$ (i.e., one of the events in the space must happen, otherwise the space does not exhaust all possibilities).
> 3. For any two mutually exclusive events, the probability that one *or* the other occurs is the *sum* of their individual probabilities. For example, the probability that a fair six-sided die comes up $3$-dots *or* $4$-dots is $1/6 + 1/6 = 2/6$.
>
> Any assignment of numbers to events that respects those three properties will also have all the properties of probabilities that we will discuss below. (pp. 77--78, *emphasis* in the original)
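It only takes a couple of lines of code to confirm that the fair-die example from property 3 satisfies all three properties.

```{r}
# a fair six-sided die assigns probability 1/6 to each of its six outcomes
die <- tibble(outcome = 1:6, p = 1 / 6)

die %>%
  summarise(all_nonnegative = all(p >= 0),                  # property 1
            total_mass      = sum(p),                       # property 2: should be 1
            p_3_or_4        = sum(p[outcome %in% c(3, 4)])) # property 3: 1/6 + 1/6
```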
## Probability distributions
"A probability *distribution* is simply a list of all possible outcomes and their corresponding probabilities" (p. 78, *emphasis* in the original)
### Discrete distributions: Probability mass.
> When the sample space consists of discrete outcomes, then we can talk about the probability of each distinct outcome. For example, the sample space of a flipped coin has two discrete outcomes, and we talk about the probability of head or tail...
>
> For continuous outcome spaces, we can *discretize* the space into a finite set of mutually exclusive and exhaustive "bins." (p. 78, *emphasis* in the original)
In order to recreate Figure 4.2, we need to generate the heights data with Kruschke's `HtWtDataGenerator()` function. You can find the original code in Kruschke's `HtWtDataGenerator.R` script, and we first used this function in [Section 2.3][The steps of Bayesian data analysis]. Once again, here's how to make the function.
```{r}
HtWtDataGenerator <- function(n_subj, rndsd = NULL, male_prob = 0.50) {
# Random height, weight generator for males and females. Uses parameters from
# Brainard, J. & Burmaster, D. E. (1992). Bivariate distributions for height and
# weight of men and women in the United States. Risk Analysis, 12(2), 267-275.
# Kruschke, J. K. (2011). Doing Bayesian data analysis:
# A Tutorial with R and BUGS. Academic Press / Elsevier.
# Kruschke, J. K. (2014). Doing Bayesian data analysis, 2nd Edition:
# A Tutorial with R, JAGS and Stan. Academic Press / Elsevier.
# require(MASS)
# Specify parameters of multivariate normal (MVN) distributions.
# Men:
HtMmu <- 69.18
HtMsd <- 2.87
lnWtMmu <- 5.14
lnWtMsd <- 0.17
Mrho <- 0.42
Mmean <- c(HtMmu, lnWtMmu)
Msigma <- matrix(c(HtMsd^2, Mrho * HtMsd * lnWtMsd,
Mrho * HtMsd * lnWtMsd, lnWtMsd^2), nrow = 2)
# Women cluster 1:
HtFmu1 <- 63.11
HtFsd1 <- 2.76
lnWtFmu1 <- 5.06
lnWtFsd1 <- 0.24
Frho1 <- 0.41
prop1 <- 0.46
Fmean1 <- c(HtFmu1, lnWtFmu1)
Fsigma1 <- matrix(c(HtFsd1^2, Frho1 * HtFsd1 * lnWtFsd1,
Frho1 * HtFsd1 * lnWtFsd1, lnWtFsd1^2), nrow = 2)
# Women cluster 2:
HtFmu2 <- 64.36
HtFsd2 <- 2.49
lnWtFmu2 <- 4.86
lnWtFsd2 <- 0.14
Frho2 <- 0.44
prop2 <- 1 - prop1
Fmean2 <- c(HtFmu2, lnWtFmu2)
Fsigma2 <- matrix(c(HtFsd2^2, Frho2 * HtFsd2 * lnWtFsd2,
Frho2 * HtFsd2 * lnWtFsd2, lnWtFsd2^2), nrow = 2)
# Randomly generate data values from those MVN distributions.
if (!is.null(rndsd)) {set.seed(rndsd)}
data_matrix <- matrix(0, nrow = n_subj, ncol = 3)
colnames(data_matrix) <- c("male", "height", "weight")
maleval <- 1; femaleval <- 0 # arbitrary coding values
for (i in 1:n_subj) {
# Flip coin to decide sex
sex = sample(c(maleval, femaleval), size = 1, replace = TRUE,
prob = c(male_prob, 1 - male_prob))
if (sex == maleval) {datum <- MASS::mvrnorm(n = 1, mu = Mmean, Sigma = Msigma)}
if (sex == femaleval) {
Fclust = sample(c(1, 2), size = 1, replace = TRUE, prob = c(prop1, prop2))
if (Fclust == 1) {datum <- MASS::mvrnorm(n = 1, mu = Fmean1, Sigma = Fsigma1)}
if (Fclust == 2) {datum <- MASS::mvrnorm(n = 1, mu = Fmean2, Sigma = Fsigma2)}
}
data_matrix[i, ] = c(sex, round(c(datum[1], exp(datum[2])), 1))
}
return(data_matrix)
} # end function
```
Now that we have the `HtWtDataGenerator()` function, all we need to do is determine how many values to generate and what proportion of those values should be drawn from the distributions for men. These are controlled by the `n_subj` and `male_prob` parameters, respectively.
```{r, message = F, warning = F}
set.seed(4)
d <-
HtWtDataGenerator(n_subj = 10000, male_prob = .5) %>%
data.frame() %>%
mutate(person = 1:n())
d %>%
head()
```
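As a quick sanity check (not in the text), the proportion of simulated cases coded `male` should be close to the `male_prob = .5` value we set.

```{r}
d %>%
  summarise(prop_male = mean(male))
```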
For Figure 4.2, we'll discretize the continuous `height` values into bins with the `case_when()` function, which you can learn more about from [hrbrmstr](https://twitter.com/hrbrmstr)'s blog post, [*Making a case for case_when*](https://rud.is/b/2017/03/10/making-a-case-for-case_when/).
```{r, fig.height = 2, fig.width = 7, message = F}
d_bin <-
d %>%
mutate(bin = case_when(
height < 51 ~ 51,
between(height, 51, 53) ~ 53,
between(height, 53, 55) ~ 55,
between(height, 55, 57) ~ 57,
between(height, 57, 59) ~ 59,
between(height, 59, 61) ~ 61,
between(height, 61, 63) ~ 63,
between(height, 63, 65) ~ 65,
between(height, 65, 67) ~ 67,
between(height, 67, 69) ~ 69,
between(height, 69, 71) ~ 71,
between(height, 71, 73) ~ 73,
between(height, 73, 75) ~ 75,
between(height, 75, 77) ~ 77,
between(height, 77, 79) ~ 79,
between(height, 79, 81) ~ 81,
between(height, 81, 83) ~ 83,
height > 83 ~ 85)
) %>%
group_by(bin) %>%
summarise(n = n()) %>%
mutate(height = bin - 1)
d %>%
ggplot(aes(x = height, y = person)) +
geom_point(size = 3/4, color = "grey67", alpha = 1/2) +
geom_vline(xintercept = seq(from = 51, to = 83, by = 2),
linetype = 3, color = "grey33") +
geom_text(data = d_bin,
aes(y = 5000, label = n),
size = 3.25) +
scale_y_continuous(breaks = c(0, 5000, 10000)) +
labs(title = "Total N = 10,000",
x = "Height (inches)",
y = "Person #") +
theme(panel.grid = element_blank())
```
Because we're simulating and we don't know what seed number Kruschke used for his plot, ours will differ a little from his. But the overall pattern is the same.
One way to make a version of the histogram in Kruschke's Figure 4.2.b would be to input the `d` data directly into `ggplot()` and set `x = height` and `y = stat(density)` within the `aes()`. Then you could set `binwidth = 2` within `geom_histogram()` to make the bins within the histogram behave like the bins in the plot above (we'll sketch that approach below). However, since we have already discretized the `height` values into bins in our `d_bin` data, it might make more sense to plot those bins with `geom_col()`. The only other step we need is to manually compute the density values using the formula Kruschke showed in Figure 4.2.b. Here's how:
```{r, fig.height = 1.75, fig.width = 7}
d_bin %>%
# density is the probability mass divided by the bin width
mutate(density = (n / sum(n)) / 2) %>%
ggplot(aes(x = height, y = density, fill = bin == 65)) +
geom_col() +
scale_fill_manual(values = c("gray67", "gray30"), breaks = NULL) +
scale_y_continuous("Probability density", breaks = c(0, 0.04, 0.08)) +
xlab("Height (inches)") +
coord_cartesian(xlim = c(51, 83)) +
theme(panel.grid = element_blank())
```
In the text, Kruschke singled out the bin for the values between 63 and 65 with an arrow. In our plot, we highlighted that bin with shading, instead. Here's how we computed the exact density value for that bin.
```{r}
d_bin %>%
mutate(density = (n / sum(n)) / 2) %>%
filter(bin == 65) %>%
select(n, density)
```
Due to sampling variation, our density value is a little different from the one in the text.
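If you'd like to see the `geom_histogram()` alternative mentioned above in action, here's a minimal sketch. The `boundary = 51` argument is our own addition; it just lines the histogram bins up with the 2-inch bins we used for `d_bin`.

```{r, fig.height = 1.75, fig.width = 7, warning = F}
d %>%
  ggplot(aes(x = height, y = stat(density))) +
  # binwidth = 2 mimics the 2-inch bins from above
  geom_histogram(binwidth = 2, boundary = 51,
                 fill = "grey67", color = "grey92") +
  coord_cartesian(xlim = c(51, 83)) +
  labs(x = "Height (inches)",
       y = "Probability density") +
  theme(panel.grid = element_blank())
```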
Our data-binning approach for Figure 4.2.c will be a little different from the one we used above. Here we'll make our bins with the `round()` function.
```{r, fig.height = 2, fig.width = 7, message = F}
d_bin <-
d %>%
mutate(bin = round(height, digits = 0)) %>%
group_by(bin) %>%
summarise(n = n()) %>%
mutate(height = bin - 0.5)
d %>%
ggplot(aes(x = height, y = person)) +
geom_point(size = 3/4, color = "grey67", alpha = 1/2) +
geom_vline(xintercept = seq(from = 51, to = 83, by = 1),
linetype = 3, color = "grey33") +
geom_text(data = d_bin,
aes(y = 5000, label = n, angle = 90),
size = 3.25) +
scale_y_continuous(breaks = c(0, 5000, 10000)) +
labs(title = "Total N = 10,000",
x = "Height (inches)",
y = "Person #") +
theme(panel.grid = element_blank())
```
However, our method for Figure 4.2.d will be like the one we used before.
```{r, fig.height = 1.75, fig.width = 7}
d_bin %>%
# density is the probability mass divided by the bin width
mutate(density = (n / sum(n)) / 1) %>%
ggplot(aes(x = height, y = density, fill = bin == 64)) +
geom_col() +
scale_fill_manual(values = c("gray67", "gray30"), breaks = NULL) +
scale_y_continuous("Probability density", breaks = c(0, 0.04, 0.08)) +
xlab("Height (inches)") +
coord_cartesian(xlim = c(51, 83)) +
theme(panel.grid = element_blank())
```
Here's the hand-computed density value for the focal bin.
```{r}
d_bin %>%
mutate(density = (n / sum(n)) / 1) %>%
filter(bin == 64) %>%
select(n, density)
```
> The probability of a discrete outcome, such as the probability of falling into an interval on a continuous scale, is referred to as a probability *mass.* Loosely speaking, the term "mass" refers the amount of stuff in an object. When the stuff is probability and the object is an interval of a scale, then the mass is the proportion of the outcomes in the interval. (p. 80, *emphasis* in the original)
### Continuous distributions: Rendezvous with density.
> If you think carefully about a continuous outcome space, you realize that it becomes problematic to talk about the probability of a specific value on the continuum, as opposed to an interval on the continuum... Therefore, what we will do is make the intervals infinitesimally narrow, and instead of talking about the infinitesimal probability mass of each infinitesimal interval, we will talk about the ratio of the probability mass to the interval width. That ratio is called the probability *density*.
>
> Loosely speaking, density is the amount of stuff per unit of space it takes up. Because we are measuring amount of stuff by its mass, then density is the mass divided by the amount space it occupies. (p. 80, *emphasis* in the original)
To make Figure 4.3, we'll need new data.
```{r, message = F, warning = F}
set.seed(4)
d <-
tibble(height = rnorm(1e4, mean = 84, sd = .1)) %>%
mutate(door = 1:n())
d %>%
head()
```
To make the bins for our version of Figure 4.3.a, we could use the `case_when()` approach from above. However, that would require some tedious code. Happily, we have an alternative in the [**santoku** package](https://github.com/hughjonesd/santoku) [@R-santoku], which I learned about with help from the great [Mara Averick](https://twitter.com/dataandme), [Tyson Barrett](https://twitter.com/healthandstats), and [Omar Wasow](https://twitter.com/owasow). We can use the `santoku::chop()` function to discretize our `height` values. Here we'll walk through the first part.
```{r, eval = F}
library(santoku)
d_bin <-
d %>%
mutate(bin = chop(height,
breaks = seq(from = 83.6, to = 84.4, length.out = 32),
# label the boundaries with 3 decimal places, separated by a dash
labels = lbl_dash(fmt = "%.3f")))
head(d_bin)
```
```{r, warning = F, message = F, echo = F}
d_bin <-
d %>%
mutate(bin = santoku::chop(height,
breaks = seq(from = 83.6, to = 84.4, length.out = 32),
# label the boundaries with 3 decimal places, separated by a dash
labels = santoku::lbl_dash(fmt = "%.3f")))
head(d_bin)
```
With this format, the lower-limit for each level of `bin` is the left side of the dash and the upper-limit is on the right. Though the cut points are precise to many decimal places, the `lbl_dash(fmt = "%.3f")` part of the code rounded the numbers to three decimal places in the `bin` labels. The width of each bin is just a bit over 0.0258.
```{r}
(84.4 - 83.6) / (32 - 1)
```
Now to make use of the `d_bin` data in a plot, we'll have to summarize the bin counts and separate the values from the `bin` names to compute the midway points. Here's one way to do it.
```{r}
d_bin <- d_bin %>%
group_by(bin) %>%
summarise(n = n()) %>%
separate(bin, c("min", "max"), sep = "—", remove = F, convert = T) %>%
mutate(height = (min + max) / 2)
head(d_bin)
```
Now we plot.
```{r, fig.height = 2, fig.width = 7}
d %>%
ggplot(aes(x = height, y = door)) +
geom_point(size = 3/4, color = "grey67", alpha = 1/2) +
geom_vline(xintercept = seq(from = 83.6, to = 84.4, length.out = 32),
linetype = 3, color = "grey33") +
geom_text(data = d_bin,
aes(y = 5000, label = n, angle = 90),
size = 3.25) +
scale_y_continuous(breaks = c(0, 5000, 10000)) +
labs(title = "Total N = 10,000",
x = "Height (inches)",
y = "Door #") +
theme(panel.grid = element_blank())
```
The only tricky thing about Figure 4.3.b is getting the denominator in the density equation correct.
```{r, fig.height = 1.75, fig.width = 7}
d_bin %>%
# density is the probability mass divided by the bin width
mutate(density = (n / sum(n)) / ((84.4 - 83.6) / (32 - 1))) %>%
ggplot(aes(x = height, y = density, fill = bin == "83.910—83.935")) +
geom_col() +
scale_fill_manual(values = c("gray67", "gray30"), breaks = NULL) +
scale_y_continuous("Probability density", breaks = 0:4) +
xlab("Height (inches)") +
coord_cartesian(xlim = c(83.6, 84.4)) +
theme(panel.grid = element_blank())
```
Here's the density value for that focal bin.
```{r}
d_bin %>%
# density is the probability mass divided by the bin width
mutate(density = (n / sum(n)) / ((84.4 - 83.6) / (32 - 1))) %>%
filter(bin == "83.910—83.935") %>%
select(bin, n, density)
```
As Kruschke remarked: "There is nothing mysterious about probability densities larger than 1.0; it means merely that there is a high concentration of probability mass relative to the scale" (p. 82).
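We can see the same thing analytically. These heights were drawn from a normal distribution with a mean of 84 and a standard deviation of 0.1, and the density of that distribution at its own mean is well above 1.

```{r}
# the normal density at its mean is 1 / (sigma * sqrt(2 * pi))
dnorm(84, mean = 84, sd = 0.1)
```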
#### Properties of probability density functions.
> In general, for any continuous value that is split up into intervals, the sum of the probability masses of the intervals must be $1$, because, by definition of making a measurement, some value of the measurement scale must occur. (p. 82)
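We can check that property with the `d_bin` data from above: multiplying each bin's density by the common bin width recovers the probability masses, and those masses should sum to 1.

```{r}
bin_width <- (84.4 - 83.6) / (32 - 1)

d_bin %>%
  mutate(density = (n / sum(n)) / bin_width) %>%
  summarise(total_mass = sum(density * bin_width))
```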
#### The normal probability density function.
"Perhaps the most famous probability density function is the normal distribution, also known as the Gaussian distribution" (p. 83). We'll use `dnorm()` again to make our version of Figure 4.4.
```{r, fig.width = 4.5, fig.height = 3}
tibble(x = seq(from = -.8, to = .8, by = .02)) %>%
mutate(p = dnorm(x, mean = 0, sd = .2)) %>%
ggplot(aes(x = x)) +
geom_line(aes(y = p),
color = "grey50", linewidth = 1.25) +
geom_linerange(aes(ymin = 0, ymax = p),
linewidth = 1/3) +
labs(title = "Normal probability density",
subtitle = expression(paste(mu, " = 0 and ", sigma, " = 0.2")),
y = "p(x)") +
coord_cartesian(xlim = c(-.61, .61)) +
theme(panel.grid = element_blank())
```
The equation for the normal probability density follows the form
$$
p(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left( - \frac{1}{2} \left[ \frac{x - \mu}{\sigma} \right]^2 \right),
$$
where $\mu$ governs the mean and $\sigma$ governs the standard deviation.
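If you'd like to convince yourself that `dnorm()` is just this formula, you can compute the density by hand for a few $x$ values, using the same $\mu = 0$ and $\sigma = 0.2$ from the figure above, and compare the results.

```{r}
tibble(x = c(-0.2, 0, 0.2)) %>%
  mutate(by_hand  = 1 / (0.2 * sqrt(2 * pi)) * exp(-0.5 * ((x - 0) / 0.2)^2),
         by_dnorm = dnorm(x, mean = 0, sd = 0.2))
```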
### Mean and variance of a distribution.
The mean of a probability distribution is also called the *expected value*, which follows the form
$$E[x] = \sum_x p(x) x$$
when $x$ is discrete. For continuous $x$ values, the formula is
$$E[x] = \int \text d x \; p(x) x.$$
The variance is defined as the mean squared deviation from the mean,
$$\text{var}_x = \int \text d x \; p(x) (x - E[x])^2.$$
If you take the square root of the variance, you get the standard deviation.
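Here's a quick discrete example of these formulas, using a fair six-sided die (our example, not one from the text). The expected value works out to 3.5 and the standard deviation to about 1.71.

```{r}
tibble(x = 1:6, p = 1 / 6) %>%
  summarise(e_x   = sum(p * x),                       # expected value
            var_x = sum(p * (x - sum(p * x))^2)) %>%  # variance
  mutate(sd_x = sqrt(var_x))                          # standard deviation
```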
### Highest density interval (HDI).
> The HDI indicates which points of a distribution are most credible, and which cover most of the distribution. Thus, the HDI summarizes the distribution by specifying an interval that spans most of the distribution, say $95\%$ of it, such that every point inside the interval has higher credibility than any point outside the interval. (p. 87)
In Chapter 10 (p. 294), Kruschke briefly mentioned his `HDIofICDF()` function, the code for which you can find in his `DBDA2E-utilities.R` file. It's a handy function which we'll put to use from time to time. Here's a mild reworking of his code.
```{r}
hdi_of_icdf <- function(name, width = .95, tol = 1e-8, ... ) {
# Arguments:
# `name` is R's name for the inverse cumulative density function
# of the distribution.
# `width` is the desired mass of the HDI region.
# `tol` is passed to R's optimize function.
# Return value:
# Highest density interval (HDI) limits in a vector.
# Example of use: For determining HDI of a beta(30, 12) distribution, type
# `hdi_of_icdf(qbeta, shape1 = 30, shape2 = 12)`
# Notice that the parameters of the `name` must be explicitly stated;
# e.g., `hdi_of_icdf(qbeta, 30, 12)` does not work.
# Adapted and corrected from Greg Snow's TeachingDemos package.
incredible_mass <- 1.0 - width
interval_width <- function(low_tail_prob, name, width, ...) {
name(width + low_tail_prob, ...) - name(low_tail_prob, ...)
}
opt_info <- optimize(interval_width, c(0, incredible_mass),
name = name, width = width,
tol = tol, ...)
hdi_lower_tail_prob <- opt_info$minimum
return(c(name(hdi_lower_tail_prob, ...),
name(width + hdi_lower_tail_prob, ...)))
}
```
We already know from the text, and perhaps from prior experience, what the 95% HDI is for the unit normal. But it's nice to be able to confirm that with a function.
```{r}
h <-
hdi_of_icdf(name = qnorm,
mean = 0,
sd = 1)
h
```
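Because the normal distribution is symmetric, its 95% HDI coincides with the familiar 95% equal-tailed interval, so the values in `h` should match what `qnorm()` returns directly.

```{r}
qnorm(c(.025, .975), mean = 0, sd = 1)
```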
Now that we've saved those values in `h`, we can use them to make our version of Figure 4.5.a.
```{r, fig.width = 4.5, fig.height = 2, warning = F}
tibble(x = seq(from = -3.5, to = 3.5, by = .05)) %>%
mutate(d = dnorm(x, mean = 0, sd = 1)) %>%
ggplot(aes(x = x, y = d)) +
geom_area(fill = "grey75") +
geom_area(data = . %>% filter(x >= h[1] & x <= h[2]),
fill = "grey50") +
geom_line(data = tibble(x = c(h[1] + .02, h[2] - .02),
d = c(.059, .059)),
arrow = arrow(length = unit(.2, "cm"),
ends = "both",
type = "closed"),
color = "grey92") +
annotate(geom = "text", x = 0, y = .09,
label = "95% HDI", color = "grey92") +
xlim(-3.1, 3.1) +
ylab("p(x)") +
theme(panel.grid = element_blank())
```
As far as I could tell, Figure 4.5.b is of a beta distribution, which Kruschke covered in greater detail starting in [Chapter 6][Inferring a Binomial Probability via Exact Mathematical Analysis]. I got the `shape1` and `shape2` values from playing around. If you have a more principled approach, [do share](https://github.com/ASKurz/Doing-Bayesian-Data-Analysis-in-brms-and-the-tidyverse/issues). But anyway, we can use our `hdi_of_icdf()` function to get the correct values.
```{r}
h <-
hdi_of_icdf(name = qbeta,
shape1 = 15,
shape2 = 4)
h
```
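Note that, unlike with the symmetric normal above, these limits need not match the 95% equal-tailed interval, because the beta(15, 4) distribution is skewed. You can compare for yourself with `qbeta()`.

```{r}
# the 95% equal-tailed interval, for comparison
qbeta(c(.025, .975), shape1 = 15, shape2 = 4)
```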
Let's put those `h` values to work.
```{r, fig.width = 4.5, fig.height = 2, warning = F}
tibble(x = seq(from = 0, to = 1, by = .01)) %>%
mutate(d = dbeta(x, shape1 = 15, shape2 = 4)) %>%
ggplot(aes(x = x, y = d)) +
geom_area(fill = "grey75") +
geom_area(data = . %>% filter(x >= h[1] & x <= h[2]),
fill = "grey50") +
geom_line(data = tibble(x = c(h[1] + .01, h[2] - .002),
d = c(.75, .75)),
arrow = arrow(length = unit(.2, "cm"),
ends = "both",
type = "closed"),
color = "grey92") +
annotate(geom = "text", x = .8, y = 1.1,
label = "95% HDI", color = "grey92") +
xlim(.4, 1) +
ylab("p(x)") +
theme(panel.grid = element_blank())
```
Figure 4.5.c was also a lot of trial and error. It seemed the easiest way to reproduce the shape was to mash two Gaussians together. After playing around with `rnorm()`, I ended up with this.
```{r}
set.seed(4)
d <-
tibble(x = c(rnorm(6e5, mean = 1.50, sd = .5),
rnorm(4e5, mean = 4.75, sd = .5)))
glimpse(d)
```
As you'll see, it's not exactly right, but it's close enough to give you a sense of what's going on. Since we're working with simulated data rather than an analytic solution, we'll want to use one of the powerful convenience functions from the **tidybayes** package.
```{r, warning = F, message = F}
library(tidybayes)
```
Kay's **tidybayes** package
> provides a family of functions for generating point summaries and intervals from draws in a tidy format. These functions follow the naming scheme `[median|mean|mode]_[qi|hdi]`, for example, `median_qi()`, `mean_qi()`, `mode_hdi()`, and so on. The first name (before the `_`) indicates the type of point summary, and the second name indicates the type of interval. `qi` yields a quantile interval (a.k.a. equi-tailed interval, central interval, or percentile interval) and `hdi` yields a highest (posterior) density interval. [@kayExtractingVisualizingTidy2021, "Point summaries and intervals"]
Here we'll use `mode_hdi()` to compute the HDIs and put them in a tibble. We'll be using a lot of `mode_hdi()` in this project.
```{r}
h <-
d %>%
mode_hdi()
h
```
Usually, `mode_hdi()` will return a tibble with just one row. But in this case, since we had a bimodal distribution, it returned two rows—one for each of the two distinct regions. Oh, and in case it wasn't clear, that first column `x` is the measure of central tendency—the mode, in this case. Though I acknowledge it's a little odd to speak of central tendency in a bimodal distribution. Again, this won't happen much.
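For comparison, an equal-tailed interval cannot split like that. Had we used `median_qi()` instead, we would get a single interval that spans the low-density trough between the two modes.

```{r}
d %>%
  median_qi()
```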
In order to fill the bimodal density with the split HDIs, you need to use the `density()` function to transform the `d` data to a tibble with the values for the $x$-axis in an `x` vector and the corresponding density values in a `y` vector.
```{r}
dens <-
d$x %>%
density() %>%
with(tibble(x, y))
head(dens)
```
We're finally ready to plot. Forgive me. It's a monster.
```{r, fig.width = 4.5, fig.height = 2, warning = F, message = F}
ggplot(data = dens,
aes(x = x, y = y)) +
geom_area(fill = "grey75") +
# note the use of `pull()`, which extracts the values, rather than return a tibble
geom_area(data = dens %>% filter(x > h[1, 2] %>% pull() &
x < h[1, 3] %>% pull()),
fill = "grey50") +
geom_area(data = dens %>% filter(x > h[2, 2] %>% pull() &
x < h[2, 3] %>% pull()),
fill = "grey50") +
geom_line(data = tibble(x = c(h[1, 2] %>% pull(), h[1, 3] %>% pull()),
y = c(.06, .06)),
arrow = arrow(length = unit(.2,"cm"),
ends = "both",
type = "closed"),
color = "grey92") +
geom_line(data = tibble(x = c(h[2, 2] %>% pull(), h[2, 3] %>% pull()),
y = c(.06, .06)),
arrow = arrow(length = unit(.2,"cm"),
ends = "both",
type = "closed"),
color = "grey92") +
annotate(geom = "text", x = c(1.5, 4.75), y = .1,
label = "95% HDI", color = "grey92") +
scale_x_continuous(breaks = 0:6, limits = c(0, 6.3)) +
scale_y_continuous("p(x)", breaks = c(0, .1, .2, .3, .4, .5)) +
theme(panel.grid = element_blank())
```
> When the distribution refers to credibility of values, then the width of the HDI is another way of measuring uncertainty of beliefs. If the HDI is wide, then beliefs are uncertain. If the HDI is narrow, then beliefs are relatively certain. (p. 89)
## Two-way distributions
In the note below Table 4.1, Kruschke indicated the data came from @sneeGraphicalDisplayTwoway1974, [*Graphical display of two-way contingency tables*](https://www.researchgate.net/publication/243769696_Graphical_Display_of_Two-Way_Contingency_Tables). Kruschke has those data saved as the `HairEyeColor.csv` file.
```{r, warning = F, message = F}
d <- read_csv("data.R/HairEyeColor.csv")
glimpse(d)
```
We'll need to transform `Hair` and `Eye` a bit to ensure our output matches the order in Table 4.1.
```{r}
d <-
d %>%
mutate(Hair = if_else(Hair == "Brown", "Brunette", Hair) %>%
factor(., levels = c("Black", "Brunette", "Red", "Blond")),
Eye = factor(Eye, levels = c("Brown", "Blue", "Hazel", "Green")))
```
Here we'll use the `tabyl()` and `adorn_totals()` functions from the [**janitor** package](https://CRAN.R-project.org/package=janitor) [@R-janitor] to help make the table of proportions by `Eye` and `Hair`.
```{r, warning = F, message = F}
library(janitor)
d <-
d %>%
uncount(weights = Count, .remove = F) %>%
tabyl(Eye, Hair) %>%
adorn_totals(c("row", "col")) %>%
data.frame() %>%
mutate_if(is.double, ~ . / 592)
d %>%
mutate_if(is.double, round, digits = 2)
```
### Conditional probability.
> We often want to know the probability of one outcome, given that we know another outcome is true. For example, suppose I sample a person at random from the population referred to in Table 4.1. Suppose I tell you that this person has blue eyes. Conditional on that information, what is the probability that the person has blond hair (or any other particular hair color)? It is intuitively clear how to compute the answer: We see from the blue-eye row of Table 4.1 that the total (i.e., marginal) amount of blue-eyed people is $0.36$, and that $0.16$ of the population has blue eyes and blond hair. (p. 91)
Kruschke then showed how to compute such conditional probabilities by hand in Table 4.2. Here's a slightly reformatted version of that information.
```{r}
d %>%
filter(Eye == "Blue") %>%
pivot_longer(Black:Blond,
names_to = "Hair",
values_to = "proportion") %>%
rename(`p(Eyes = "Blue")` = Total) %>%
mutate(`conditional probability` = proportion / `p(Eyes = "Blue")`) %>%
select(Eye, Hair, `p(Eyes = "Blue")`, proportion, `conditional probability`)
```
The only reason our values differ from those in Table 4.2 is that Kruschke rounded.
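If you'd rather compute a single conditional probability by hand, it's just the joint proportion divided by the marginal proportion. For blond hair given blue eyes, that's the blue-and-blond cell divided by the blue-eye total. Using the unrounded values in our `d` data:

```{r}
d %>%
  filter(Eye == "Blue") %>%
  summarise(`p(Blond | Blue)` = Blond / Total)
```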
## Session info {-}
```{r}
sessionInfo()
```
```{r, eval = F, echo = F}
# remove our objects
rm(n, p_heads, d, end_prop, HtWtDataGenerator, d_bin, hdi_of_icdf, h, dens)
```
```{r, echo = F, message = F, warning = F, results = "hide"}
ggplot2::theme_set(ggplot2::theme_grey())
bayesplot::color_scheme_set("blue")
pacman::p_unload(pacman::p_loaded(), character.only = TRUE)
```