```{r, echo = F}
knitr::opts_chunk$set(fig.retina = 2.5)
knitr::opts_chunk$set(fig.align = "center")
```
```{r, echo = F, eval = F}
# (PART) ALL THE FUNDAMENTALS APPLIED TO INFERRING A BINOMIAL PROBABILITY {-}
# > In the next few chapters, we will develop all the foundational concepts and methods of Bayesian data analysis, which are applied to the simplest type of data. Because of the simplicity of the data, we can focus on the Bayesian methods and scaffold the concepts clearly and efficiently. The subsequent part of the book applies the methods developed in this part to more complex data structures. [@kruschkeDoingBayesianData2015, p. 121]
```
# Inferring a Binomial Probability via Exact Mathematical Analysis
> This chapter presents an example of how to do Bayesian inference using pure analytical mathematics without any approximations. Ultimately, we will not use the pure analytical approach for complex applications, but this chapter is important for two reasons. *First*, the relatively simple mathematics in this chapter nicely reveal the underlying concepts of Bayesian inference on a continuous parameter. The simple formulas show how the continuous allocation of credibility changes systematically as data accumulate. The examples provide an important conceptual foundation for subsequent approximation methods, because the examples give you a clear sense of what is being approximated. *Second*, the distributions introduced in this chapter, especially the beta distribution, will be used repeatedly in subsequent chapters. [@kruschkeDoingBayesianData2015, 123, *emphasis* added]
## The likelihood function: The Bernoulli distribution
If we denote a set of possible outcomes as $\{y_i\}$, Kruschke's Bernoulli likelihood function for a set of $N$ trials follows the form
$$p(\{y_i\} | \theta) = \theta^z \cdot (1 - \theta) ^ {N - z},$$
where $z$ is the number of 1's in the data (i.e., heads in a series of coin flips) and the sole parameter $\theta$ is the probability a given observation will be a 1. Otherwise put, the Bernoulli function gives us $p(y_i = 1 | \theta)$.
If you follow that equation closely, here is how we might express it in **R**.
```{r}
bernoulli_likelihood <- function(theta, data) {
# `theta` = success probability parameter ranging from 0 to 1
# `data` = the vector of data (i.e., a series of 0s and 1s)
n <- length(data)
z <- sum(data)
  return(theta^z * (1 - theta)^(n - z))
}
```
This will come in handy in just a bit.
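To get a quick sense of how the function behaves, here's the likelihood of $\theta = .5$ for a toy data set of three heads in four flips (the values are mine, chosen just for illustration).
```{r}
bernoulli_likelihood(theta = .5, data = c(1, 0, 1, 1))
```
That's just $.5^3 \cdot (1 - .5)^1 = 0.0625$.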
## A description of credibilities: The beta distribution
> In this chapter, we use purely mathematical analysis, with no numerical approximation, to derive the mathematical form of the posterior credibilities of parameter values. To do this, we need a mathematical description of the prior allocation of credibilities...
>
> In principle, we could use any probability density function supported on the interval $[0, 1]$. When we intend to apply Bayes' rule (Equation 5.7, p. 106), however, there are two desiderata for mathematical tractability. First, it would be convenient if the product of $p(y | \theta)$ and $p(\theta)$, which is in the numerator of Bayes' rule, results in a function of the same form as $p(\theta)$... Second, we desire the denominator of Bayes' rule (Equation 5.9, p. 107), namely $\int \text d \theta\ p(y | \theta) p(\theta)$, to be solvable analytically. This quality also depends on how the form of the function $p(\theta)$ relates to the form of the function $p(y | \theta)$. When the forms of $p(y | \theta)$ and $p(\theta)$ combine so that the posterior distribution has the same form as the prior distribution, then $p(\theta)$ is called a *conjugate prior* for $p(y | \theta)$. (p. 127 *emphasis* in the original)
When we want a conjugate prior for $\theta$ of the Bernoulli likelihood, the *beta distribution* is a handy choice. Beta has two parameters, $a$ and $b$ (also sometimes called $\alpha$ and $\beta$), and the density is defined as
\begin{align*}
p(\theta | a, b) & = \operatorname{Beta} (\theta | a, b) \\
& = \frac{\theta^{(a - 1)} \; (1 - \theta)^{(b - 1)}}{B(a, b)},
\end{align*}
where $B(a, b)$ is a normalizing constant, keeping the results in a probability metric, and $B(\cdot)$ is the Beta function. Kruschke then clarified that the beta distribution and the Beta function are not the same. In **R**, we use the beta density with the `dbeta()` function, whereas we use the Beta function with `beta()`. In this project, we'll primarily use `dbeta()`. But to give a sense, notice that when given the same input for $a$ and $b$, the two functions return very different values.
```{r}
theta <- .5
a <- 3
b <- 3
dbeta(theta, a, b)
beta(a, b)
```
The $a$ and $b$ parameters are also called *shape* parameters. And indeed, if we look at the parameters of the `dbeta()` function in **R**, we'll see that $a$ is called `shape1` and $b$ is called `shape2`.
```{r}
print(dbeta)
```
You can learn more about the `dbeta()` function [here](https://www.rdocumentation.org/packages/stats/versions/3.5.2/topics/Beta).
Before we make Figure 6.1, we'll need some data.
```{r, fig.width = 6, fig.height = 6, warning = F, message = F}
library(tidyverse)
length <- 1e2
d <-
crossing(shape1 = c(.1, 1:4),
shape2 = c(.1, 1:4)) %>%
expand_grid(x = seq(from = 0, to = 1, length.out = length)) %>%
mutate(a = str_c("a = ", shape1),
b = str_c("b = ", shape2)) %>%
mutate(density = dbeta(x, shape1 = shape1, shape2 = shape2))
head(d)
```
Now we're ready for our Figure 6.1.
```{r, fig.width = 6, fig.height = 6, warning = F, message = F}
d %>%
ggplot(aes(x = x, y = density)) +
geom_area(fill = "grey50") +
scale_x_continuous(expression(theta), breaks = c(0, .5, 1)) +
coord_cartesian(ylim = c(0, 3)) +
labs(title = "Examples of the beta distribution",
y = expression(p(theta*"|"*a*", "*b))) +
theme(panel.grid = element_blank()) +
facet_grid(b ~ a)
```
> Notice that as $a$ gets bigger (left to right across columns of Figure 6.1), the bulk of the distribution moves rightward over higher values of $\theta$, but as $b$ gets bigger (top to bottom across rows of Figure 6.1), the bulk of the distribution moves leftward over lower values of $\theta$. Notice that as $a$ and $b$ get bigger together, the beta distribution gets narrower. (p. 127).
We have a lot of practice with the beta distribution waiting for us in the chapters to come. If you like informal tutorials, you might also check out [Karin Knudson](https://twitter.com/karinknudson)'s nice blog post, [*Beta distributions, Dirichlet distributions and Dirichlet processes*](https://karinknudson.com/dirichletprocesses.html).
### Specifying a beta prior.
Though the $a$ and $b$ parameters in the beta distribution might initially seem confusing, they have a nice connection to the kinds of binary coin-flipping-type data we explored with the Bernoulli distribution in the last chapter. As Kruschke wrote: "You can think of $a$ and $b$ in the prior as if they were previously observed data, in which there were $a$ heads and $b$ tails in a total of $n = a + b$ flips" (pp. 127--128). For example, a popular default minimally-informative prior would be $\operatorname{Beta}(1, 1)$, which is like the sum of two coin flips of one tails and one heads. That produces a uniform distribution like so:
```{r, fig.width = 2.5, fig.height = 2.5}
d %>%
filter(shape1 == 1 & shape2 == 1) %>%
ggplot(aes(x = x, y = density)) +
geom_area(fill = "grey50") +
scale_x_continuous(expression(theta), breaks = c(0, .5, 1)) +
coord_cartesian(ylim = c(0, 3)) +
labs(title = "Beta(1, 1)",
y = expression(p(theta*"|"*a*", "*b))) +
theme(panel.grid = element_blank())
```
We read further:
> It is useful to know the central tendency and spread of the beta distribution expressed in terms of $a$ and $b$. It turns out that the mean of the $\operatorname{beta}(\theta | a, b)$ distribution is $\mu = a / (a + b)$ and the mode is $\omega = (a - 1) / (a + b - 2)$ for $a > 1$ and $b > 1$ ($\mu$ is Greek letter mu and $\omega$ is Greek letter omega). (p. 129)
In the case of our $\operatorname{Beta}(1, 1)$, the mean is then $1 / (1 + 1) = 0.5$ and the mode is undefined (which hopefully makes sense given the uniform distribution). Kruschke continued:
> The spread of the beta distribution is related to the "concentration" $\kappa = a + b$ ($\kappa$ is Greek letter kappa). You can see from Figure 6.1 that as $\kappa = a + b$ gets larger, the beta distribution gets narrower or more concentrated. (p. 129)
In the case of our $\operatorname{Beta}(1, 1)$, the concentration is $1 + 1 = 2$, which is also like the hypothetical sample size we've been using, $n = 2$.
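If you'd like a numerical check of those formulas, here's a quick sketch using $\operatorname{Beta}(5.6, 1.4)$, one of the densities we'll meet in Figure 6.2. The grid-based approximations are my own addition, not computations from the text.
```{r}
a <- 5.6
b <- 1.4
theta <- seq(from = 0, to = 1, length.out = 1e5)

# the analytic mean and mode
a / (a + b)            # mu
(a - 1) / (a + b - 2)  # omega

# grid-based approximations of the same values
sum(theta * dbeta(theta, a, b)) / sum(dbeta(theta, a, b))  # approximate mean
theta[which.max(dbeta(theta, a, b))]                       # approximate mode
```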
If we further explore Kruschke's formulas, we learn you can specify a beta distribution in terms of $\omega$ and $\kappa$ by
$$\operatorname{Beta} \big (\alpha = \omega (\kappa - 2) + 1, \beta = (1 - \omega) \cdot (\kappa - 2) + 1 \big ),$$
as long as $\kappa > 2$. Kruschke further clarified:
> The value we choose for the prior $\kappa$ can be thought of this way: It is the number of new flips of the coin that we would need to make us teeter between the new data and the prior belief about $\mu$. If we would only need a few new flips to sway our beliefs, then our prior beliefs should be represented by a small $\kappa$. If we would need a large number of new flips to sway us away from our prior beliefs about $\mu$, then our prior beliefs are worth a very large $\kappa$. (p. 129)
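To make that parameterization concrete, here's the $\omega$ and $\kappa$ conversion computed by hand for $\omega = .8$ and $\kappa = 7$. It returns $a = 5$ and $b = 2$, two of the shape values we're about to use for Figure 6.2.
```{r}
omega <- .8
kappa <- 7

omega * (kappa - 2) + 1        # a
(1 - omega) * (kappa - 2) + 1  # b
```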
He went on to clarify why we might prefer the mode to the mean when discussing the central tendency of a beta distribution.
> The mode can be more intuitive than the mean, especially for skewed distributions, because the mode is where the distribution reaches its tallest height, which is easy to visualize. The mean in a skewed distribution is somewhere away from the mode, in the direction of the longer tail. (pp. 129--130)
Figure 6.2 contrasts the mean and mode for the beta distribution. We'll use the same process as for Figure 6.1 and create the data first.
```{r}
d <-
tibble(shape1 = c(5.6, 17.6, 5, 17),
shape2 = c(1.4, 4.4, 2, 5)) %>%
mutate(a = str_c("a = ", shape1),
b = str_c("b = ", shape2),
kappa = rep(c("kappa==7", "kappa==22"), times = 2),
mu_omega = rep(c("mu==0.8", "omega==0.8"), each = 2)) %>%
mutate(kappa = factor(kappa, levels = c("kappa==7", "kappa==22")),
label = str_c(a, ", ", b)) %>%
expand_grid(x = seq(from = 0, to = 1, length.out = 1e3)) %>%
mutate(density = dbeta(x, shape1 = shape1, shape2 = shape2))
head(d)
```
Here's Figure 6.2.
```{r, fig.width = 6, fig.height = 4}
d %>%
ggplot(aes(x = x)) +
geom_area(aes(y = density),
fill = "grey50") +
geom_vline(xintercept = .8, color = "grey92", linetype = 2) +
geom_text(data = . %>% group_by(label) %>% slice(1),
aes(x = .025, y = 4.75, label = label),
hjust = 0, size = 3) +
scale_x_continuous(expression(theta), breaks = c(0, .8, 1)) +
ylab(expression(p(theta*"|"*a*", "*b))) +
coord_cartesian(ylim = c(0, 5)) +
theme(panel.grid = element_blank()) +
facet_grid(mu_omega ~ kappa, labeller = label_parsed)
```
It's also possible to define the beta distribution in terms of the mean $\mu$ and standard deviation $\sigma$. In this case,
$$
\begin{align*}
\alpha & = \mu \left ( \frac{\mu(1 - \mu)}{\sigma^2} - 1\right), \text{and} \\
\beta & = (1 - \mu) \left ( \frac{\mu(1 - \mu)}{\sigma^2} - 1\right).
\end{align*}
$$
In lines 264 to 290 in his `DBDA2E-utilities.R` file, Kruschke provided a series of `betaABfrom...()` functions that will allow us to compute the $a$ and $b$ parameters from measures of central tendency (i.e., mean and mode) and of spread (i.e., $\kappa$ and $\sigma$). Here are those bits of his code.
```{r}
# Shape parameters from central tendency and scale:
betaABfromMeanKappa <- function(mean, kappa) {
if (mean <= 0 | mean >= 1) stop("must have 0 < mean < 1")
if (kappa <= 0) stop("kappa must be > 0")
a <- mean * kappa
b <- (1.0 - mean) * kappa
return(list(a = a, b = b))
}
betaABfromModeKappa <- function(mode, kappa) {
if (mode <= 0 | mode >= 1) stop("must have 0 < mode < 1")
if (kappa <= 2) stop("kappa must be > 2 for mode parameterization")
a <- mode * (kappa - 2) + 1
b <- (1.0 - mode) * (kappa - 2) + 1
return(list(a = a, b = b))
}
betaABfromMeanSD <- function(mean, sd) {
if (mean <= 0 | mean >= 1) stop("must have 0 < mean < 1")
if (sd <= 0) stop("sd must be > 0")
kappa <- mean * (1 - mean)/sd^2 - 1
if (kappa <= 0) stop("invalid combination of mean and sd")
a <- mean * kappa
b <- (1.0 - mean) * kappa
return(list(a = a, b = b))
}
```
You can use them like so.
```{r}
betaABfromMeanKappa(mean = .25, kappa = 4)
betaABfromModeKappa(mode = .25, kappa = 4)
betaABfromMeanSD(mean = .5, sd = .1)
```
You can also save the results as an object, which can then be indexed by parameter using the base-**R** `$` syntax.
```{r}
beta_param <- betaABfromModeKappa(mode = .25, kappa = 4)
beta_param$a
beta_param$b
```
We'll find this trick quite handy in the sections to come.
## The posterior beta
I'm not going to reproduce all of Formula 6.8. But this is a fine opportunity to re-express Bayes' rule in terms of $z$ and $N$,
$$p(\theta | z, N) = \frac{p(z, N | \theta) \; p(\theta)}{p(z, N)}.$$
A key insight from the equations Kruschke worked through in this section is: "If the prior distribution is $\operatorname{beta}(a, b)$, and the data have $z$ heads in $N$ flips, then the posterior distribution is $\operatorname{beta}(\theta | z + a, N - z + b)$" (p. 132).
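Here's an informal check of that result (my sketch, not code from the text). Using a $\operatorname{Beta}(5, 5)$ prior and the $z = 1$, $N = 10$ data we're about to meet in Figure 6.3, the grid-style posterior from Chapter 5 lines up with the analytic $\operatorname{beta}(\theta | z + a, N - z + b)$ density.
```{r}
a <- 5
b <- 5
z <- 1
n <- 10

theta <- seq(from = 0, to = 1, length.out = 1e4)

# prior times likelihood, normalized so the values approximate a density
posterior <- dbeta(theta, a, b) * theta^z * (1 - theta)^(n - z)
posterior <- posterior / (sum(posterior) * (theta[2] - theta[1]))

# the largest discrepancy from the analytic beta posterior is tiny
max(abs(posterior - dbeta(theta, z + a, n - z + b)))
```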
### Posterior is compromise of prior and likelihood.
You might wonder how Kruschke computed the HDI values for Figure 6.3. Remember our `hdi_of_icdf()` function from back in Chapter 4? Yep, that's how. Here's that code, again.
```{r}
hdi_of_icdf <- function(name, width = .95, tol = 1e-8, ... ) {
# Arguments:
# `name` is R's name for the inverse cumulative density function
# of the distribution.
# `width` is the desired mass of the HDI region.
# `tol` is passed to R's optimize function.
# Return value:
  # Highest density interval (HDI) limits in a vector.
# Example of use: For determining HDI of a beta(30, 12) distribution, type
# `hdi_of_icdf(qbeta, shape1 = 30, shape2 = 12)`
# Notice that the parameters of the `name` must be explicitly stated;
# e.g., `hdi_of_icdf(qbeta, 30, 12)` does not work.
# Adapted and corrected from Greg Snow's TeachingDemos package.
incredible_mass <- 1.0 - width
interval_width <- function(low_tail_prob, name, width, ...) {
name(width + low_tail_prob, ...) - name(low_tail_prob, ...)
}
opt_info <- optimize(interval_width, c(0, incredible_mass),
name = name, width = width,
tol = tol, ...)
hdi_lower_tail_prob <- opt_info$minimum
return(c(name(hdi_lower_tail_prob, ...),
name(width + hdi_lower_tail_prob, ...)))
}
```
Recall it's based off of the `HDIofICDF()` function from Kruschke's `DBDA2E-utilities.R` file. I've altered Kruschke's formatting a little bit, but the guts of the code are unchanged. Our `hdi_of_icdf()` function will take the `name` of an "inverse cumulative density function" and its parameters and then return an HDI range, as defined by the `width` parameter. Since the prior at the top of Figure 6.3 is $\operatorname{Beta}(5, 5)$, we can use `hdi_of_icdf()` to calculate the HDI like so.
```{r}
hdi_of_icdf(name = qbeta,
shape1 = 5,
shape2 = 5,
width = .95)
```
Here they are for the posterior distribution at the bottom of the figure.
```{r}
hdi_of_icdf(name = qbeta,
shape1 = 6,
shape2 = 14)
```
Note that since we set `width = .95` as the default, we can leave it out if we want to stick with the conventional 95% intervals.
Here are the mean calculations from the last paragraph on page 134.
```{r}
n <- 10
z <- 1
a <- 5
b <- 5
(prior_mean <- a / (a + b))
(proportion_heads <- z / n)
(posterior_mean <- (z + a) / (n + a + b))
```
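Kruschke also pointed out that the posterior mean is always a weighted average of the prior mean and the data proportion, with weights determined by the relative sample sizes. Here's a quick check of that identity with our values (the computation is mine, not from the text); it reproduces `posterior_mean` exactly.
```{r}
# the data proportion gets weight n / (n + a + b);
# the prior mean gets weight (a + b) / (n + a + b)
(z / n) * (n / (n + a + b)) + (a / (a + b)) * ((a + b) / (n + a + b))
```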
In order to make the plots for Figure 6.3, we'll want to compute the prior, likelihood, and posterior density values across a densely-packed range of $\theta$ values.
```{r}
trial_data <- c(rep(0, 9), 1)
d <-
tibble(theta = seq(from = 0, to = 1, length.out = 100)) %>%
mutate(`Prior (beta)` = dbeta(theta,
shape1 = a,
shape2 = b),
`Likelihood (Bernoulli)` = bernoulli_likelihood(theta = theta,
data = trial_data),
`Posterior (beta)` = dbeta(theta,
shape1 = 6,
shape2 = 14))
glimpse(d)
```
To make things easier on ourselves, we'll also make two additional data objects to annotate the plots with lines and text.
```{r}
# save the levels
levels <- c("Prior (beta)", "Likelihood (Bernoulli)", "Posterior (beta)")
# the data for the in-plot lines
line <-
tibble(theta = c(.212 + .008, .788 - .008, .114 + .004, .497 - .005),
value = rep(c(.51, .66), each = 2),
xintercept = c(.212, .788, .114, .497),
name = rep(c("Prior (beta)", "Posterior (beta)"), each = 2)) %>%
mutate(name = factor(name, levels = levels))
# the data for the annotation
text <-
tibble(theta = c(.5, .3),
value = c(.8, 1.125),
label = "95% HDI",
name = c("Prior (beta)", "Posterior (beta)")) %>%
mutate(name = factor(name, levels = levels))
```
Finally, here's our Figure 6.3.
```{r, warning = F, message = F, fig.width = 5, fig.height = 5}
library(cowplot)
d %>%
pivot_longer(-theta) %>%
mutate(name = factor(name, levels = levels)) %>%
  ggplot(aes(x = theta, y = value)) +
# densities
geom_area(fill = "steelblue") +
# dashed vertical lines
geom_vline(data = line,
aes(xintercept = xintercept),
linetype = 2, color = "white") +
# arrows
geom_line(data = line,
arrow = arrow(length = unit(.15,"cm"),
ends = "both",
type = "closed"),
color = "white") +
# text
geom_text(data = text,
aes(label = label),
color = "white") +
labs(x = expression(theta),
y = NULL) +
facet_wrap(~ name, scales = "free_y", ncol = 1) +
theme_cowplot()
```
Note how we loaded the [**cowplot** package](https://wilkelab.org/cowplot) [@R-cowplot]. We played around a bit with plotting conventions in the previous chapters. From this chapter onward we'll explore plotting conventions in a more deliberate fashion. One quick way to alter the look and feel of a plot is by altering its theme, and the **cowplot** package includes several theme options. In this chapter, we'll focus on making simple and conventional-looking plots with the `theme_cowplot()` function.
## Examples
### Prior knowledge expressed as a beta distribution.
If you flip an unaltered freshly-minted coin 20 times and end up with 17 heads, 85% of those trials are heads.
```{r}
100 * (17 / 20)
```
In the first paragraph of this section, Kruschke suggested we consider a beta prior with a mode of $\omega = .5$ and an effective sample size $\kappa = 500$. Why? Because even in the face of 17 heads out of 20 flips, our default prior assumption should still be that freshly-minted coins are fair. To compute the $a$ and $b$ parameters that correspond to $\omega = .5$ and $\kappa = 500$, we might use Kruschke's `betaABfromModeKappa()` function.
```{r}
betaABfromModeKappa(mode = .5, kappa = 500)
```
Confusingly, Kruschke switched from $\operatorname{Beta}(250, 250)$ in the prose to $\operatorname{Beta}(100, 100)$ in Figure 6.4.a, which he acknowledged in his [Corrigenda](https://sites.google.com/site/doingbayesiandataanalysis/corrigenda). We'll stick with $\operatorname{Beta}(100, 100)$, which corresponds to $\omega = .5$ and $\kappa = 200$.
```{r}
betaABfromModeKappa(mode = .5, kappa = 200)
```
Here's how to use those values and some of the equations from above to make the data necessary for the left column of Figure 6.4.
```{r}
# define the prior
beta_param <- betaABfromModeKappa(mode = .5, kappa = 200)
# compute the corresponding HDIs
prior_hdi <-
hdi_of_icdf(name = qbeta,
shape1 = beta_param$a,
shape2 = beta_param$b,
width = .95)
# define the data
n <- 20
z <- 17
trial_data <- c(rep(0, times = n - z), rep(1, times = z))
# compute the HDIs for the posterior
post_hdi <-
hdi_of_icdf(name = qbeta,
shape1 = z + beta_param$a,
shape2 = n - z + beta_param$b,
width = .95)
# use the above to compute the prior, the likelihood, and the posterior
# densities using the grid approximation approach
d <-
tibble(theta = seq(from = 0, to = 1, length.out = 1e3)) %>%
mutate(prior = dbeta(theta,
shape1 = beta_param$a,
shape2 = beta_param$b),
likelihood = bernoulli_likelihood(theta = theta,
data = trial_data),
posterior = dbeta(theta,
shape1 = z + beta_param$a,
shape2 = n - z + beta_param$b))
# what have we done?
glimpse(d)
```
We're finally ready to plot the prior, the likelihood, and the posterior for the left column of Figure 6.4.
```{r, fig.width = 5, fig.height = 2.25}
## Figure 6.4, left column
# prior
d %>%
ggplot(aes(x = theta, y = prior)) +
geom_area(fill = "steelblue", alpha = 1/2) +
geom_area(data = . %>% filter(theta > prior_hdi[1] & theta < prior_hdi[2]),
fill = "steelblue") +
geom_segment(x = prior_hdi[1] + .005, xend = prior_hdi[2] - .005,
y = 1.8, yend = 1.8,
arrow = arrow(length = unit(.15,"cm"),
ends = "both",
type = "closed"),
color = "white") +
annotate(geom = "text", x = .5, y = 3.5,
label = "95% HDI") +
labs(title = "Prior (beta)",
x = expression(theta),
y = expression(dbeta(theta*"|"*100*", "*100))) +
coord_cartesian(ylim = c(0, 12)) +
theme_cowplot()
# likelihood
d %>%
ggplot(aes(x = theta, y = likelihood)) +
geom_area(fill = "steelblue") +
labs(title = "Likelihood (Bernoulli)",
x = expression(theta),
y = expression(p(D*"|"*theta))) +
theme_cowplot()
# posterior
d %>%
ggplot(aes(x = theta, y = posterior)) +
geom_area(fill = "steelblue", alpha = 1/2) +
geom_area(data = . %>% filter(theta > post_hdi[1] & theta < post_hdi[2]),
fill = "steelblue") +
geom_segment(x = post_hdi[1] + .005, xend = post_hdi[2] - .005,
y = 2, yend = 2,
arrow = arrow(length = unit(.15, "cm"),
ends = "both",
type = "closed"),
color = "white") +
annotate(geom = "text", x = .532, y = 3.5,
label = "95% HDI") +
labs(title = "Posterior (beta)",
x = expression(theta),
y = expression(dbeta(theta*"|"*117*", "*103))) +
coord_cartesian(ylim = c(0, 12)) +
theme_cowplot()
```
Here are the exact HDI values for the prior and posterior densities.
```{r}
prior_hdi
post_hdi
```
If you double back to page 129 in the text, you'll see Kruschke defined the mode of a beta density as
$$\omega_\text{beta} = (a - 1) / (a + b - 2)$$
whenever $a > 1$ and $b > 1$. Thus we can compute the modes for our prior and posterior densities like this.
```{r}
(beta_param$a - 1) / (beta_param$a + beta_param$b - 2)
(z + beta_param$a - 1) / (z + beta_param$a + n - z + beta_param$b - 2)
```
For the next example, we consider the probability a professional basketball player will make a free throw. We have the same likelihood based on 17 successes out of 20 trials, but this time our prior is based on $\omega = .75$ and $\kappa = 25$. Here we update those values and our `d` data for the plot.
```{r}
# update the beta parameters for the prior
beta_param <- betaABfromModeKappa(mode = .75, kappa = 25)
# update the HDIs
prior_hdi <-
hdi_of_icdf(name = qbeta,
shape1 = beta_param$a,
shape2 = beta_param$b,
width = .95)
post_hdi <-
hdi_of_icdf(name = qbeta,
shape1 = z + beta_param$a,
shape2 = n - z + beta_param$b,
width = .95)
# update the data
d <-
d %>%
mutate(prior = dbeta(theta,
shape1 = beta_param$a,
shape2 = beta_param$b),
posterior = dbeta(theta,
shape1 = z + beta_param$a,
shape2 = n - z + beta_param$b))
```
With our updated values in hand, we're ready to make our versions of the middle column of Figure 6.4.
```{r, fig.width = 5, fig.height = 2.25}
## plot Figure 6.4, middle column!
# prior
d %>%
ggplot(aes(x = theta, y = prior)) +
geom_area(fill = "steelblue", alpha = 1/2) +
geom_area(data = . %>% filter(theta > prior_hdi[1] & theta < prior_hdi[2]),
fill = "steelblue") +
geom_segment(x = prior_hdi[1] + .005, xend = prior_hdi[2] - .005,
y = 0.75, yend = 0.75,
arrow = arrow(length = unit(.15,"cm"),
ends = "both",
type = "closed"),
color = "white") +
annotate(geom = "text", x = .75, y = 1.5,
label = "95% HDI", color = "white") +
labs(title = "Prior (beta)",
x = expression(theta),
y = expression(dbeta(theta*"|"*18.25*", "*6.75))) +
coord_cartesian(ylim = c(0, 7)) +
theme_cowplot()
# likelihood, which is the same as the last time
d %>%
ggplot(aes(x = theta, y = likelihood)) +
geom_area(fill = "steelblue") +
labs(title = "Likelihood (Bernoulli)",
x = expression(theta),
y = expression(p(D*"|"*theta))) +
theme_cowplot()
# posterior
d %>%
ggplot(aes(x = theta, y = posterior)) +
geom_area(fill = "steelblue", alpha = 1/2) +
geom_area(data = . %>% filter(theta > post_hdi[1] & theta < post_hdi[2]),
fill = "steelblue") +
geom_segment(x = post_hdi[1] + .005, xend = post_hdi[2] - .005,
y = 1, yend = 1,
arrow = arrow(length = unit(.15, "cm"),
ends = "both",
type = "closed"),
color = "white") +
annotate(geom = "text", x = .797, y = 2,
label = "95% HDI", color = "white") +
labs(title = "Posterior (beta)",
x = expression(theta),
y = expression(dbeta(theta*"|"*35.25*", "*9.75))) +
coord_cartesian(ylim = c(0, 7)) +
theme_cowplot()
```
Here are the exact HDI values for the prior and posterior densities.
```{r}
prior_hdi
post_hdi
```
Here are the modes for our prior and posterior densities.
```{r}
(beta_param$a - 1) / (beta_param$a + beta_param$b - 2)
(z + beta_param$a - 1) / (z + beta_param$a + n - z + beta_param$b - 2)
```
For our final example, we consider the tendency of a newly discovered substance on a distant planet to be blue versus green. Just as in the previous two examples, we discover 17 out of 20 trials come up positive (i.e., blue). This time we have a noncommittal uniform prior, $\operatorname{Beta}(1, 1)$. Here's how to plot the results, as shown in the right column of Figure 6.4.
```{r, fig.width = 5, fig.height = 2.25}
# update beta_param
beta_param$a <- 1
beta_param$b <- 1
# update the HDIs
prior_hdi <-
hdi_of_icdf(name = qbeta,
shape1 = beta_param$a,
shape2 = beta_param$b,
width = .95)
post_hdi <-
hdi_of_icdf(name = qbeta,
shape1 = z + beta_param$a,
shape2 = n - z + beta_param$b,
width = .95)
# update the data
d <-
d %>%
mutate(prior = dbeta(theta,
shape1 = beta_param$a,
shape2 = beta_param$b),
posterior = dbeta(theta,
shape1 = z + beta_param$a,
shape2 = n - z + beta_param$b))
## plot Figure 6.4, rightmost column!
# prior
d %>%
ggplot(aes(x = theta, y = prior)) +
geom_area(fill = "steelblue") +
labs(title = "Prior (beta)",
x = expression(theta),
y = expression(dbeta(theta*"|"*1*", "*1))) +
coord_cartesian(ylim = c(0, 5)) +
theme_cowplot()
# likelihood, which is the same as the last two examples
d %>%
ggplot(aes(x = theta, y = likelihood)) +
geom_area(fill = "steelblue") +
labs(title = "Likelihood (Bernoulli)",
x = expression(theta),
y = expression(p(D*"|"*theta))) +
theme_cowplot()
# posterior
d %>%
ggplot(aes(x = theta, y = posterior)) +
geom_area(fill = "steelblue", alpha = 1/2) +
geom_area(data = . %>% filter(theta > post_hdi[1] & theta < post_hdi[2]),
fill = "steelblue") +
geom_segment(x = post_hdi[1] + .005, xend = post_hdi[2] - .005,
y = 0.8, yend = 0.8,
arrow = arrow(length = unit(.15, "cm"),
ends = "both",
type = "closed"),
color = "white") +
annotate(geom = "text", x = (post_hdi[1] + post_hdi[2]) / 2, y = 1.5,
label = "95% HDI", color = "white") +
labs(title = "Posterior (beta)",
x = expression(theta),
y = expression(dbeta(theta*"|"*18*", "*4))) +
coord_cartesian(ylim = c(0, 5)) +
theme_cowplot()
```
Here are the exact HDI values for the posterior density.
```{r}
post_hdi
```
Because both the $a$ and $b$ parameters for our beta prior are 1, we can't use the formula from above to compute the mode. I hope this makes sense if you look back at the plot. The density for $\operatorname{Beta}(1, 1)$ is uniform and has no mode. We can, at least, compute the mode for the posterior.
```{r}
(z + beta_param$a - 1) / (z + beta_param$a + n - z + beta_param$b - 2)
```
### Prior knowledge that cannot be expressed as a beta distribution.
> The beauty of using a beta distribution to express prior knowledge is that the posterior distribution is again exactly a beta distribution, and therefore, no matter how much data we include, we always have an exact representation of the posterior distribution and a simple way of computing it. But not all prior knowledge can be expressed by a beta distribution, because the beta distribution can only be in the forms illustrated by Figure 6.1. If the prior knowledge cannot be expressed as a beta distribution, then we must use a different method to derive the posterior. In particular, we might revert to grid approximation as was explained in Section 5.5 (p. 116).
For such a small section in the text, the underlying code is a bit of a beast. For kicks, we'll practice two ways. First we'll follow the code Kruschke used in the text. Our second attempt will be in a more **tidyverse** sort of way.
#### Figure 6.5 in Kruschke style.
```{r, fig.width = 4, fig.height = 6}
# Fine teeth for Theta
theta <- seq(0, 1, length = 1000)
# Two triangular peaks on a small non-zero floor
p_theta <-
c(rep(1, 200),
seq(1, 100, length = 50),
seq(100, 1, length = 50),
rep(1, 200)) %>%
rep(., times = 2)
# Make p_theta sum to 1.0
p_theta <- p_theta / sum(p_theta)
```
Here's Kruschke's `BernGrid()` code in all its glory.
```{r}
BernGrid = function( Theta , pTheta , Data , plotType=c("Points","Bars")[2] ,
showCentTend=c("Mean","Mode","None")[3] ,
showHDI=c(TRUE,FALSE)[2] , HDImass=0.95 ,
showpD=c(TRUE,FALSE)[2] , nToPlot=length(Theta) ) {
# Theta is vector of values between 0 and 1.
# pTheta is prior probability mass at each value of Theta
# Data is vector of 0's and 1's.
# Check for input errors:
if ( any( Theta > 1 | Theta < 0 ) ) {
stop("Theta values must be between 0 and 1")
}
if ( any( pTheta < 0 ) ) {
stop("pTheta values must be non-negative")
}
if ( !isTRUE(all.equal( sum(pTheta) , 1.0 )) ) {
stop("pTheta values must sum to 1.0")
}
if ( !all( Data == 1 | Data == 0 ) ) {
stop("Data values must be 0 or 1")
}
# Create summary values of Data
z = sum( Data ) # number of 1's in Data
N = length( Data )
# Compute the Bernoulli likelihood at each value of Theta:
pDataGivenTheta = Theta^z * (1-Theta)^(N-z)
# Compute the evidence and the posterior via Bayes' rule:
pData = sum( pDataGivenTheta * pTheta )
pThetaGivenData = pDataGivenTheta * pTheta / pData
# Plot the results.
layout( matrix( c( 1,2,3 ) ,nrow=3 ,ncol=1 ,byrow=FALSE ) ) # 3x1 panels
par( mar=c(3,3,1,0) , mgp=c(2,0.7,0) , mai=c(0.5,0.5,0.3,0.1) ) # margins
cexAxis = 1.33
cexLab = 1.75
# convert plotType to notation used by plot:
if ( plotType=="Points" ) { plotType="p" }
if ( plotType=="Bars" ) { plotType="h" }
dotsize = 5 # how big to make the plotted dots
barsize = 5 # how wide to make the bar lines
# If the comb has a zillion teeth, it's too many to plot, so plot only a
# thinned out subset of the teeth.
nteeth = length(Theta)
if ( nteeth > nToPlot ) {
    thinIdx = round( seq( 1, nteeth , length=nToPlot ) )
} else {
thinIdx = 1:nteeth
}
# Plot the prior.
yLim = c(0,1.1*max(c(pTheta,pThetaGivenData)))
plot( Theta[thinIdx] , pTheta[thinIdx] , type=plotType ,
pch="." , cex=dotsize , lwd=barsize ,
xlim=c(0,1) , ylim=yLim , cex.axis=cexAxis ,
xlab=bquote(theta) , ylab=bquote(p(theta)) , cex.lab=cexLab ,
main="Prior" , cex.main=1.5 , col="skyblue" )
if ( showCentTend != "None" ) {
if ( showCentTend == "Mean" ) {
meanTheta = sum( Theta * pTheta )
if ( meanTheta > .5 ) {
textx = 0 ; textadj = c(0,1)
} else {
textx = 1 ; textadj = c(1,1)
}
text( textx , yLim[2] ,
bquote( "mean=" * .(signif(meanTheta,3)) ) ,
cex=2.0 , adj=textadj )
}
if ( showCentTend == "Mode" ) {
modeTheta = Theta[ which.max( pTheta ) ]
if ( modeTheta > .5 ) {
textx = 0 ; textadj = c(0,1)
} else {
textx = 1 ; textadj = c(1,1)
}
text( textx , yLim[2] ,
bquote( "mode=" * .(signif(modeTheta,3)) ) ,
cex=2.0 , adj=textadj )
}
}
# Mark the highest density interval. HDI points are not thinned in the plot.
if ( showHDI ) {
HDIinfo = HDIofGrid( pTheta , credMass=HDImass )
points( Theta[ HDIinfo$indices ] ,
rep( HDIinfo$height , length( HDIinfo$indices ) ) ,
pch="-" , cex=1.0 )
text( mean( Theta[ HDIinfo$indices ] ) , HDIinfo$height ,
bquote( .(100*signif(HDIinfo$mass,3)) * "% HDI" ) ,
adj=c(0.5,-1.5) , cex=1.5 )
# Mark the left and right ends of the waterline.
# Find indices at ends of sub-intervals:
inLim = HDIinfo$indices[1] # first point
for ( idx in 2:(length(HDIinfo$indices)-1) ) {
if ( ( HDIinfo$indices[idx] != HDIinfo$indices[idx-1]+1 ) | # jumps on left, OR
( HDIinfo$indices[idx] != HDIinfo$indices[idx+1]-1 ) ) { # jumps on right
inLim = c(inLim,HDIinfo$indices[idx]) # include idx
}
}
inLim = c(inLim,HDIinfo$indices[length(HDIinfo$indices)]) # last point
# Mark vertical lines at ends of sub-intervals:
for ( idx in inLim ) {
lines( c(Theta[idx],Theta[idx]) , c(-0.5,HDIinfo$height) , type="l" , lty=2 ,
lwd=1.5 )
text( Theta[idx] , HDIinfo$height , bquote(.(round(Theta[idx],3))) ,
adj=c(0.5,-0.1) , cex=1.2 )
}
}
# Plot the likelihood: p(Data|Theta)
plot( Theta[thinIdx] , pDataGivenTheta[thinIdx] , type=plotType ,
pch="." , cex=dotsize , lwd=barsize ,
xlim=c(0,1) , ylim=c(0,1.1*max(pDataGivenTheta)) , cex.axis=cexAxis ,
xlab=bquote(theta) , ylab=bquote( "p(D|" * theta * ")" ) , cex.lab=cexLab ,
main="Likelihood" , cex.main=1.5 , col="skyblue" )
if ( z > .5*N ) { textx = 0 ; textadj = c(0,1) }
else { textx = 1 ; textadj = c(1,1) }
text( textx ,1.0*max(pDataGivenTheta) ,cex=2.0
,bquote( "Data: z=" * .(z) * ",N=" * .(N) ) ,adj=textadj )
if ( showCentTend != "None" ) {
if ( showCentTend == "Mean" ) {
meanTheta = sum( Theta * pDataGivenTheta )
if ( meanTheta > .5 ) {
textx = 0 ; textadj = c(0,1)
} else {
textx = 1 ; textadj = c(1,1)
}
text( textx , 0.7*max(pDataGivenTheta) ,
bquote( "mean=" * .(signif(meanTheta,3)) ) ,
cex=2.0 , adj=textadj )
}
if ( showCentTend == "Mode" ) {
modeTheta = Theta[ which.max( pDataGivenTheta ) ]
if ( modeTheta > .5 ) {
textx = 0 ; textadj = c(0,1)
} else {
textx = 1 ; textadj = c(1,1)
}
text( textx , 0.7*max(pDataGivenTheta) ,
bquote( "mode=" * .(signif(modeTheta,3)) ) ,
cex=2.0 , adj=textadj )
}
}
# Plot the posterior.
yLim = c(0,1.1*max(c(pTheta,pThetaGivenData)))
plot( Theta[thinIdx] , pThetaGivenData[thinIdx] , type=plotType ,
pch="." , cex=dotsize , lwd=barsize ,
xlim=c(0,1) , ylim=yLim , cex.axis=cexAxis ,
xlab=bquote(theta) , ylab=bquote( "p(" * theta * "|D)" ) , cex.lab=cexLab ,
main="Posterior" , cex.main=1.5 , col="skyblue" )
if ( showCentTend != "None" ) {
if ( showCentTend == "Mean" ) {
meanTheta = sum( Theta * pThetaGivenData )
if ( meanTheta > .5 ) {
textx = 0 ; textadj = c(0,1)
} else {
textx = 1 ; textadj = c(1,1)
}
text( textx , yLim[2] ,
bquote( "mean=" * .(signif(meanTheta,3)) ) ,
cex=2.0 , adj=textadj )
}
if ( showCentTend == "Mode" ) {
modeTheta = Theta[ which.max( pThetaGivenData ) ]
if ( modeTheta > .5 ) {
textx = 0 ; textadj = c(0,1)
} else {
textx = 1 ; textadj = c(1,1)
}
text( textx , yLim[2] ,
bquote( "mode=" * .(signif(modeTheta,3)) ) ,
cex=2.0 , adj=textadj )
}
}
# Plot marginal likelihood pData:
if ( showpD ) {
meanTheta = sum( Theta * pThetaGivenData )
if ( meanTheta > .5 ) {
textx = 0 ; textadj = c(0,1)
} else {
textx = 1 ; textadj = c(1,1)
}
text( textx , 0.75*max(pThetaGivenData) , cex=2.0 ,
bquote( "p(D)=" * .(signif(pData,3)) ) ,adj=textadj )
}
# Mark the highest density interval. HDI points are not thinned in the plot.
if ( showHDI ) {
HDIinfo = HDIofGrid( pThetaGivenData , credMass=HDImass )
points( Theta[ HDIinfo$indices ] ,
rep( HDIinfo$height , length( HDIinfo$indices ) ) ,
pch="-" , cex=1.0 )
text( mean( Theta[ HDIinfo$indices ] ) , HDIinfo$height ,
bquote( .(100*signif(HDIinfo$mass,3)) * "% HDI" ) ,
adj=c(0.5,-1.5) , cex=1.5 )
# Mark the left and right ends of the waterline.
# Find indices at ends of sub-intervals:
inLim = HDIinfo$indices[1] # first point
for ( idx in 2:(length(HDIinfo$indices)-1) ) {
if ( ( HDIinfo$indices[idx] != HDIinfo$indices[idx-1]+1 ) | # jumps on left, OR
( HDIinfo$indices[idx] != HDIinfo$indices[idx+1]-1 ) ) { # jumps on right
inLim = c(inLim,HDIinfo$indices[idx]) # include idx
}
}
inLim = c(inLim,HDIinfo$indices[length(HDIinfo$indices)]) # last point
# Mark vertical lines at ends of sub-intervals:
for ( idx in inLim ) {
lines( c(Theta[idx],Theta[idx]) , c(-0.5,HDIinfo$height) , type="l" , lty=2 ,
lwd=1.5 )
text( Theta[idx] , HDIinfo$height , bquote(.(round(Theta[idx],3))) ,
adj=c(0.5,-0.1) , cex=1.2 )
}
}
# return( pThetaGivenData )
} # end of function
```
You can plot using Kruschke's method like so.
```{r, fig.width = 5, fig.height = 6, warning = F, message = F}
Data <- c(rep(0, 13), rep(1, 14))
BernGrid(theta, p_theta, Data, plotType = "Bars",
showCentTend = "None", showHDI = FALSE, showpD = FALSE)
```
The method works fine, but I'm not a fan. It's clear Kruschke put a lot of thought into the `BernGrid()` function. However, its inner workings are too opaque for me, which leads to our next section...
#### Figure 6.5 in **tidyverse** style.
Here we'll be plotting with **ggplot2**. But let's first get the data into a tibble.
```{r}
# we need these to compute the likelihood
n <- 27
z <- 14
trial_data <- c(rep(0, times = n - z), rep(1, times = z)) # (i.e., Data)
d <-
tibble(theta = seq(from = 0, to = 1, length.out = 1000), # (i.e., Theta)
Prior = c(rep(1, 200), # (i.e., pTheta)
seq(1, 100, length = 50),
seq(100, 1, length = 50),
rep(1, 200)) %>%
rep(., times = 2)) %>%
mutate(Prior = Prior / sum(Prior),
Likelihood = bernoulli_likelihood(theta = theta, # (i.e., pDataGivenTheta)
data = trial_data)) %>%
mutate(evidence = sum(Likelihood * Prior)) %>% # (i.e., pData)
mutate(Posterior = Likelihood * Prior / evidence) # (i.e., pThetaGivenData)