section2_modelling.jl
### A Pluto.jl notebook ###
# v0.19.46
using Markdown
using InteractiveUtils
# ╔═╡ 69b19a54-1c79-11ed-2a22-ebe65ccbfcdb
begin
using PlutoUI
using Distributions
using StatsPlots
using PlutoTeachingTools
using Plots; default(fontfamily="Computer Modern", framestyle=:box) # LaTex-style
using Random
using LaTeXStrings
using SpecialFunctions
using Logging; Logging.disable_logging(Logging.Warn);
end;
# ╔═╡ 904c5f8d-e5a9-4f18-a7d7-b4561ad37655
TableOfContents()
# ╔═╡ 085021d6-4565-49b2-b6f5-10e2fdfff15b
md"
[**↩ Home**](https://lf28.github.io/BayesianModelling/)
[**↪ Next Chapter**](./section3_mcmc.html)
"
# ╔═╡ 613a65ce-9384-46ed-ab0d-abfda374f73c
md"""
# More on Bayesian modelling
"""
# ╔═╡ 5b4c3b01-bd19-4085-8a83-781086c85825
md"""
## Prior choice
"""
# ╔═╡ 0c123e4e-e0ff-4c47-89d6-fdf514f9e9d0
md"""
As part of Bayesian modelling, we need to choose suitable priors for the unknown variables. In this section, we introduce some important concepts about prior specification.
### Priors with matching support
**First and foremost**, when choosing priors for the unknowns, the modeller needs to make sure the prior distribution has the correct *support* for the unknown parameters.
*Example.* For the coin-flipping example, the unknown bias ``\theta`` has support ``\theta \in [0,1].`` A standard Gaussian distribution is therefore not suitable: a Gaussian sample can take any value on the real line, not just values between 0 and 1. Suitable choices for the bias ``\theta`` include the uniform distribution over ``[0,1]``, a truncated Gaussian (truncated to ``[0,1]``), or a Beta distribution. The density functions of the first two are listed below together with their plots.
```math
\begin{align}
p(\theta) &= \texttt{TruncNormal}(\mu, \sigma^2) \propto \begin{cases} \mathcal N(\mu, \sigma^2) & {0\leq \theta \leq 1} \\ 0 & \text{otherwise} \end{cases} \\
p(\theta) &= \texttt{Uniform}(0,1) = \begin{cases} 1 & {0\leq \theta \leq 1} \\ 0 & \text{otherwise} \end{cases}
\end{align}
```
"""
# ╔═╡ f479a1f3-a008-4622-82f0-ab154a431a33
let
σ = sqrt(0.1)
# Plots.plot(TruncatedNormal(0.5, σ, 0., 1), lw=2, label=L"\texttt{TrunNormal}" *"(0.5, $(round((σ^2);digits=2)))", legend=:outerright, xlabel=L"\theta", ylabel=L"p(\theta)", title="Priors for "*L"θ\in [0,1]")
Plots.plot(truncated(Normal(0.5, σ), 0.0, 1), lw=2, label=L"\texttt{TrunNormal}" *"(0.5, $(round((σ^2);digits=2)))", legend=:outerright, xlabel=L"\theta", ylabel=L"p(\theta)", title="Priors for "*L"θ\in [0,1]")
# Plots.plot!(TruncatedNormal(0.25, σ, 0., 1), lw=2, label=L"\texttt{TrunNormal}"*"(0.25, $(round((σ^2);digits=2))))")
Plots.plot!(truncated(Normal(0.25, σ), 0., 1), lw=2, label=L"\texttt{TrunNormal}"*"(0.25, $(round((σ^2);digits=2))))")
# Plots.plot!(TruncatedNormal(0.75, σ, 0., 1), lw=2, label=L"\texttt{TrunNormal}"*"(0.75, $(round((σ^2);digits=2))))")
Plots.plot!(truncated(Normal(0.75, σ), 0., 1), lw=2, label=L"\texttt{TrunNormal}"*"(0.75, $(round((σ^2);digits=2))))")
Plots.plot!(Uniform(0,1), lw=2, label=L"\texttt{Uniform}(0, 1)")
end
# ╔═╡ 2dfafb7b-773a-4c51-92e7-1f192fa354ea
md"""
*Example.* A Gaussian distribution's variance ``\sigma^2`` is a positive number: ``\sigma^2 >0``. Suitable priors on the positive real line include the Exponential distribution and the Half-Cauchy distribution.
"""
# ╔═╡ e574a570-a76b-4ab2-a395-ca40dc383e5e
let
xpltlim1= -1
xpltlim2 = 6
Plots.plot(xpltlim1:0.01:xpltlim2, Exponential(1),lw=1.5, label=L"\texttt{Exponential}(1.0)", legend=:best, xlabel=L"\theta", ylabel=L"p(\theta)", title="Priors for "*L"σ^2\in (0,∞)")
Plots.plot!(xpltlim1:0.01:xpltlim2, Exponential(2),lw=1.5, label=L"\texttt{Exponential}(2.0)")
Plots.plot!(xpltlim1:0.01:xpltlim2, Exponential(5), lw=1.5, label=L"\texttt{Exponential}(5.0)")
Plots.plot!(xpltlim1:0.01:xpltlim2, truncated(Cauchy(0, 1), lower= 0), lw=1.5, label=L"\texttt{HalfCauchy}(0, 1)")
# Plots.plot!(0:0.1:10, truncated(Cauchy(0, 2), lower= 0), label="HalfCauchy(0, 2)")
# Plots.plot!(0:0.1:10, truncated(Cauchy(0, 5), lower= 0), label="HalfCauchy(0, 5)")
end
# ╔═╡ 9bb06170-34d0-4528-bd9b-988ecf952089
md"""
### Conjugate prior
A conjugate prior is a prior distribution such that the resulting posterior belongs to the same distributional family. When conjugacy holds, Bayesian computation becomes very simple: one only needs to update the prior's parameters to find the posterior. Unfortunately, conjugate priors exist only for some very simple Bayesian models, and for most real-world applications no conjugate prior can be found. Nevertheless, conjugate models provide insight into the role of the prior in Bayesian computation.
It is easiest to study the concept through a few concrete examples.
#### Example: Beta-Bernoulli model
For the coin flipping example, the conjugate prior is
```math
p(\theta) = \texttt{Beta}(\theta; a_0, b_0) = \frac{1}{\text{B}(a_0, b_0)} \theta^{a_0-1}(1-\theta)^{b_0-1},
```
where ``a_0,b_0 >0`` are the prior's parameters and ``\mathrm{B}(a_0,b_0)``, the beta function, is the normalising constant of the Beta distribution: *i.e.* ``\mathrm{B}(a_0, b_0) = \int_0^1 \theta^{a_0-1}(1-\theta)^{b_0-1}\mathrm{d}\theta``.
*Remarks. A few Beta distributions with different parameterisations are plotted below. Note that when ``a_0=b_0=1``, the prior reduces to a uniform distribution. Also note that when ``a_0 > b_0``, e.g. ``\texttt{Beta}(5,2)``, the prior density has its peak, or mode, greater than 0.5, which implies the prior believes the coin is biased towards heads; and vice versa.*
"""
# ╔═╡ 13cb967f-879e-49c2-b077-e9ac87569d87
let
plot(Beta(1,1), xlims=[0,1], ylim=[0,3], label=L"\texttt{Beta}(a_0=1,b_0=1)", linewidth=2, legend=:outerright, size=(600,300))
plot!(Beta(0.5,0.5), xlims=[0,1], ylim=[0,3], label=L"\texttt{Beta}(a_0=.5,b_0=.5)", linewidth=2)
plot!(Beta(5,5), xlims=[0,1], ylim=[0,3], label=L"\texttt{Beta}(a_0=5,b_0=5)", linewidth=2)
plot!(Beta(5,2), xlims=[0,1], ylim=[0,3], label=L"\texttt{Beta}(a_0=5,b_0=2)", linewidth=2)
plot!(Beta(2,5), xlims=[0,1], ylim=[0,3], label=L"\texttt{Beta}(a_0=2,b_0=5)", linewidth=2)
end
# ╔═╡ 0437436b-16b9-4c90-b554-f7cf5ea8b4c0
md"""
Applying Bayes' rule, we find the posterior is still of a Beta form (therefore the conjugacy holds):
```math
p(\theta|\mathcal D) = \texttt{Beta}(\theta; a_N, b_N) = \frac{1}{\text{B}(a_N, b_N)} \theta^{a_N-1}(1-\theta)^{b_N-1},
```
where ``a_N= a_0 + N_h`` and ``b_N = b_0 + N - N_h``; and ``N_h = \sum_i d_i`` is the total number of heads observed in the ``N`` tosses.
"""
# ╔═╡ 13f7968f-fde3-4204-8237-ab33a2a5cfd0
Foldable("Details about Beta-Binomial model's conjugacy", md"
Applying Bayes' rule, we have
```math
\begin{align}
p(\theta|\mathcal D) &\propto p(\theta) p(\mathcal D|\theta) \\
&= \frac{1}{\text{B}(a_0, b_0)} \theta^{a_0-1}(1-\theta)^{b_0-1} \theta^{\sum_n d_n} (1-\theta)^{N-\sum_n d_n}\\
&= \theta^{a_0+ \sum_{n} d_n -1}(1-\theta)^{b_0+N-\sum_n d_n -1}\\
&= \theta^{a_N -1}(1-\theta)^{b_N -1},
\end{align}
```
where ``a_N= a_0 + N_h`` and ``b_N = b_0 + N - N_h``.
Next we need to normalise the unnormalised posterior to find the exact posterior distribution. The normalising constant is:
$$\int_{0}^1 \theta^{a_N -1}(1-\theta)^{b_N -1} d\theta$$
We recognise the unnormalised distribution as a Beta distribution with the updated parameters ``a_N= a_0 + N_h`` and ``b_N = b_0 + N - N_h``; the normalising constant must be ``\mathrm{B}(a_N, b_N)``. Therefore,
$$p(\theta|\mathcal D) =\text{Beta}(a_N, b_N).$$
")
# ╔═╡ 3c5b6566-6e4e-41df-9865-fff8a839a70e
md"""
For our example, we have observed ``N_h = 7`` heads in the 10 tosses. Therefore, if we choose a prior of
``p(\theta)= \texttt{Beta}(1, 1),``
the updated posterior is
$$p(\theta|\mathcal D)= \texttt{Beta}(1+7, 1+3).$$
The prior and posterior distributions are plotted below. The posterior now peaks at 0.7, which is in agreement with the observed data. Also note the uncertainty (spread of the density) around the peak.
"""
# ╔═╡ f10d3513-77ee-426a-aa5c-d2bf887572d9
let
nh, nt = 7, 3
plot(Beta(1,1), xlims=[0,1], label=L"p(\theta)= \texttt{Beta}(1,1)", linewidth=1, xlabel=L"\theta", ylabel="density" ,fill= true, lw=2, alpha=0.2, legend=:outerright, color=1, title="Conjugate posterior update")
vline!([mean(Beta(1,1))], label="prior mean", lw=2, lc=1, ls=:dash)
plot!(Beta(1+nh,1+nt), xlims=[0,1], fill= true, lw=2, alpha=0.2, color=2, label=L"p(\theta|\mathcal{D})= \texttt{Beta}(8,4)", linewidth=2)
vline!([mean(Beta(1+nh,1+nt))], label="posterior mean", lw=2, lc=2, ls=:dash)
end
# ╔═╡ 8c6233e0-14fb-4202-92b9-3eddaea3e107
md"""
**Interpretation of the prior parameters.**
Conjugate priors usually lead to very convenient posterior computation. For the coin-flipping example, we only need to update the prior's parameters with the counts of heads and tails respectively:
$$a_N= a_0+ N_h,\;\; b_N= b_0+N_t.$$ If we chose a truncated Gaussian prior instead, the posterior distribution would be much harder to compute. For more general prior choices, we need to resort to advanced methods such as Markov Chain Monte Carlo (MCMC).
The updated parameters ``a_N`` and ``b_N`` are the posterior counts of heads and tails, and together they determine the shape of the posterior (check the plot above). For our case, the posterior peaks at ``0.7``, which is in agreement with the observed data, and the posterior mean is ``\frac{8}{8+4} \approx 0.67``.
As a result, we can interpret the prior parameters ``a_0, b_0`` as pseudo observations contained in the prior distribution. For example, ``a_0=b_0=1`` implies the prior contains one pseudo count of head and tail each (therefore, the prior is flat).
"""
# ╔═╡ 7fd2a003-bed6-45d4-8335-f8d4f149c8d8
md"""
*Remark. One benefit of Bayesian inference arises when we have observed little data. Assume we have only tossed the coin twice and observed two heads: i.e. ``N_h=2, N_t=0``. The frequentist estimate of the bias is the observed frequency ``\hat{\theta}=\frac{N_h}{N_h+N_t}=1``: the coin always lands heads. This is clearly an over-confident conclusion (a.k.a. over-fitting); after all, we have only observed two data points. The Bayesian estimate, on the other hand, makes better sense. The posterior is ``p(\theta|N_h=2, N_t=0) = \texttt{Beta}(3, 1),`` which is plotted below. The posterior peaks at 1.0 but there is a great amount of uncertainty about the bias, which leads to a posterior mean ``E[\theta|\mathcal D] =\frac{3}{3+1} = 0.75``. One should expect 0.75 (the posterior-mean estimator) to make better sense here. However, as stated before, the true Bayesian answer is to report the whole posterior distribution.*
$(begin
plot(Beta(1,1), label=L"p(\theta)", fill=true, alpha=0.3, xlabel=L"\theta", ylabel=L"p(\theta|\mathcal{D})", legend=:outerright)
plot!(Beta(3,1), label=L"p(\theta|N_h=2, N_t=0)", xlabel=L"\theta", ylabel=L"p(\theta|\mathcal{D})", legend=:outerright, fill=true, alpha=0.3, lw=2, title="Bayesian inference with little data")
vline!([mean(Beta(3,1))], ylim = [0,3],label="posterior mean", lw=2, lc=2, ls=:dash)
end)
"""
# ╔═╡ f5dfcd1f-9d10-49d1-b68b-eafdf10baaec
begin
# simulate 200 tosses of a fair coin
N_tosses = 200
true_θ = 0.5
Random.seed!(100)
coin_flipping_data = rand(N_tosses) .< true_θ
Nh = sum(coin_flipping_data)
Nt = N_tosses- Nh
end;
# ╔═╡ 754c5fe2-3899-4467-8fd4-828fb0ec5040
md"""
**Sequential update.** The conjugacy also provides us with a computationally cheap way to sequentially update the posterior. Due to the independence assumption, we do not need to calculate the posterior in one go, but can update it incrementally as data arrive.
```math
\begin{align}
p(\theta|\{d_1, d_2, \ldots, d_N\})&\propto p(\theta) \prod_{n=1}^N p(d_n|\theta)\\
&= \underbrace{p(\theta) \prod_{n=1}^{N-1} p(d_n|\theta)}_{p(\theta|\mathcal D_{N-1})}\cdot p(d_N|\theta)\\
&= \underbrace{p(\theta|\mathcal D_{N-1})}_{\text{new prior}} p(d_N|\theta).
\end{align}
```
Suppose we have observed ``N-1`` coin tosses, and the posterior so far is ``p(\theta|\mathcal D_{N-1})``. This posterior now serves as a *new prior* to be updated with the next observation ``d_N``. It can be shown that the posterior sequentially updated this way is the same as the off-line posterior computed with all the data at once.
To be more specific, the sequential update algorithm is:
Initialise with a prior ``p(\theta|\emptyset)= \texttt{Beta}(a_0, b_0)``
For ``n = 1,2,\ldots, N``:
* update
$$a_n = a_{n-1} + \mathbf{1}(d_n=\texttt{head}), \;\; b_n = b_{n-1} + \mathbf{1}(d_n=\texttt{tail})$$
* report the posterior at ``n`` if needed
Note that the function ``\mathbf{1}(\cdot)`` returns 1 if the test result of the argument is true and 0 otherwise.
*Demonstration.* An animation that demonstrates the sequential update idea is listed below. $(N_tosses) tosses of a fair coin are first simulated to be used for the demonstration.
"""
# ╔═╡ 2ba73fd3-ae2b-4347-863f-28e6c21e7a91
md"""
We apply the sequential learning algorithm to the simulated data. The posterior update starts from a vague flat prior. As more data are observed and absorbed into the posterior, the posterior distribution becomes more and more informative and finally recovers the off-line posterior.
"""
# ╔═╡ 56e80a69-6f84-498d-b04e-5be59c1488eb
let
a₀, b₀ = 1, 1
prior_θ = Beta(a₀, b₀)
plot(prior_θ, xlim = [0, 1], lw=2, fill=(0, 0.1), label="Prior "* L"p(\theta)", xlabel=L"\theta", ylabel="density", title="Sequential update N=0", legend=:topleft)
plot!(Beta(a₀ + Nh, b₀+ Nt), lw=2, fill=(0, 0.1), label="Posterior "* L"p(\theta|\mathcal{D})")
vline!([true_θ], label="true "*L"θ", lw=4, lc=3)
an = a₀
bn = b₀
anim=@animate for n in 1:N_tosses
# if the toss is head, update an
if coin_flipping_data[n]
an += 1
# otherwise
else
bn += 1
end
poster_n = Beta(an, bn)
# plot every 5-th frame
if (n % 5) == 0
plot!(poster_n, lw=1, label="", title="Sequential update with observation N=$(n)")
end
end
gif(anim, fps=15)
end
# ╔═╡ 65f0dc31-c2bd-464f-983a-9f4ca2c04a35
md"""
#### Example: Gamma-Gaussian model
> Assume we have made ``N`` independent observations from a Gaussian distribution with known mean ``\mu`` but unknown variance ``\sigma^2>0``. What is the observation variance ``\sigma^2``?
It is more convenient to model the precision ``\phi\triangleq 1/\sigma^2``. A conjugate prior for the precision parameter is the Gamma distribution, which has support on the positive real line, matching the precision's value range.
A Gamma distribution, parameterised with a shape ``a_0>0``, and a rate parameter ``b_0>0``, has a probability density function:
```math
p(\phi; a_0, b_0) = \texttt{Gamma}(\phi; a_0, b_0)=\frac{b_0^{a_0}}{\Gamma(a_0)} \phi^{a_0-1} e^{-b_0\phi}.
```
"""
# ╔═╡ d2c3bd6d-fb8c-41df-8112-c7bfbc180b0a
let
as_ = [1,2,3,5,9,7.5,0.5]
bs_ = [1/2, 1/2, 1/2, 1, 1/0.5, 1, 1]
plt= plot(xlim =[0,20], ylim = [0, 0.8], xlabel=L"\phi", ylabel="density")
for i in 1:length(as_)
plot!(Gamma(as_[i], 1/bs_[i]), fill=(0, .1), lw=1.5, label=L"\texttt{Gamma}"*"($(as_[i]),$(bs_[i]))")
end
plt
end
# ╔═╡ 76c641dd-c790-4f51-abc0-e157c00e3ba7
md"""
**Conjugacy.** The data follow a Gaussian distribution. That is for ``n = 1,2,\ldots, N``:
```math
d_n \sim \mathcal N(\mu, 1/\phi);
```
The likelihood, therefore, is a product of Gaussian likelihoods:
```math
p(\mathcal D|\mu, \phi) = \prod_{n=1}^N p(d_n|\mu, \phi) = \prod_{n=1}^N \mathcal N(d_n; \mu, 1/\phi) .
```
It can be shown that the posterior formed according to Bayes' rule
```math
p(\phi|\mathcal D, \mu) \propto p(\phi) p(\mathcal D|\mu, \phi)
```
is still of a Gamma form (therefore conjugacy is established):
```math
p(\phi|\mathcal D, \mu) = \texttt{Gamma}(a_N, b_N),
```
where
$$a_N= a_0 +\frac{N}{2}, \;\;b_N = b_0 + \frac{\sum_{n} (d_n- \mu)^2}{2}.$$
The Bayesian computation reduces to a hyperparameter update again. Note the posterior mean is
$$\mathbb{E}[\phi|\mathcal D] = \frac{a_N}{b_N} = \frac{a_0 + N/2}{b_0 + {\sum_n (d_n -\mu)^2}/{2}}.$$
If we assume ``a_0= b_0=0`` (a flat prior), we recover the regular maximum likelihood estimator for ``{\sigma^2}``:
$$\hat \sigma^2 =1/\hat{\phi} = \frac{\sum_n (d_n-\mu)^2}{N}.$$
Based on the above result, we can intuitively interpret the hyperparameters as
* ``a_N``: the shape parameter is the total count of observations contained in the posterior (pseudo count ``a_0`` plus ``N/2`` from the real observations);
* ``b_N``: the rate parameter is the prior's ``b_0`` plus half the sum of squared deviations.
"""
# ╔═╡ cb3191ef-3b54-48ff-996c-d4993c876063
Foldable("Derivation details on the Gamma-Gaussian conjugacy.", md"
```math
\begin{align}
p(\phi|\mathcal D)&\propto p(\phi) p(\mathcal D|\mu, \phi)\\
&= \underbrace{\frac{b_0^{a_0}}{\Gamma(a_0)} \phi^{a_0-1} e^{-b_0 \phi}}_{p(\phi)} \underbrace{\frac{1}{ (\sqrt{2\pi})^N}\phi^{\frac{N}{2}}e^{\frac{-\phi\cdot \sum_{n} (d_n-\mu)^2}{2}}}_{p(\mathcal D|\phi, \mu)}\\
&\propto \phi^{a_0+\frac{N}{2}-1} \exp\left \{-\left (b_0 + \frac{\sum_{n} (d_n- \mu)^2}{2}\right )\phi\right \} \\
&= \phi^{a_N-1} e^{- b_N \phi},
\end{align}
```
where ``a_N= a_0 +\frac{N}{2}, \;\;b_N = b_0 + \frac{\sum_{n} (d_n- \mu)^2}{2}.`` Note this is an unnormalised Gamma distribution (whose normalising constant can be read off directly from the Gamma density), therefore
$$p(\phi|\mathcal D)= \text{Gamma}(a_N, b_N).$$
")
# ╔═╡ edbf5fa3-a064-47a4-9ff4-a93dbd7b9112
md"""
**Demonstration:** We first simulate ``N=100`` Gaussian observations with known mean ``\mu=0`` and true precision ``\phi= 1/\sigma^2 = 2``.
"""
# ╔═╡ 958d6028-e38f-4a56-a606-47b6f8ee86f1
begin
σ² = 0.5
true_ϕ = 1/σ²
N = 100
Random.seed!(100)
# Gaussian's density in Distributions.jl is implemented with standard deviation σ rather than σ²
gaussian_data = rand(Normal(0, sqrt(σ²)), N)
plot(Normal(0, sqrt(σ²)), xlabel=L"d", ylabel="density", label=L"\mathcal{N}(0, 1/2)", title="Simulated Gaussian observations")
scatter!(gaussian_data, 0.01 .* ones(N), markershape=:vline, c=1, label=L"d_n")
end
# ╔═╡ 59975c2b-c537-4950-aeea-985377f22d93
md"""
We use a relatively vague Gamma prior with ``a_0=b_0=0.5``; the posterior can then be calculated simply by updating the hyperparameters: ``\texttt{Gamma}(a_0+ 100/2,\; b_0+ \texttt{sse}/2)``, where ``\texttt{sse} = \sum_n (d_n-\mu)^2`` is the sum of squared errors. The update is implemented below in Julia:
"""
# ╔═╡ cb9ad8b5-7975-4d54-bb94-afc9f4953a67
begin
a₀ = 0.5
b₀ = 0.5
# Gamma in Distributions.jl is implemented with shape and scale parameters where the second parameter is 1/b
prior_ϕ = Gamma(a₀, 1/b₀)
posterior_ϕ = Gamma(a₀+ N/2, 1/(b₀+ sum(gaussian_data.^2)/2))
end;
# ╔═╡ 03a650cb-7610-4d20-8ae7-2c6e308770f6
md"""
We can plot the prior and posterior distributions to visually check the effect of the Bayesian update. To plot a distribution in Julia, one can simply use the `plot()` function from the `StatsPlots.jl` package. Both the prior and posterior are plotted below.
"""
# ╔═╡ cb37f8a8-83aa-44f6-89a5-43fe0e4a5fa8
let
plot(prior_ϕ, xlim = [0, 5], ylim=[0, 1.5], lw=2, fill=(0, 0.2), label="Prior "* L"p(\phi)", xlabel=L"\phi", ylabel="density", title="Conjugate inference of a Gaussian's precision")
plot!(posterior_ϕ, lw=2,fill=(0, 0.2), label="Posterior "* L"p(\phi|\mathcal{D})")
vline!([true_ϕ], label="true "*L"\phi", lw=4)
vline!([mean(prior_ϕ)], label="prior mean", lw=2, lc=1, ls=:dash)
vline!([mean(posterior_ϕ)], label="posterior mean", lw=2, lc=2, ls=:dash)
# vline!([1/σ²], label="true "*L"λ", lw=2)
# end
end
# ╔═╡ 2d7792cd-5b32-4228-bfef-e1ab250724f3
md"""
**Sequential update.** The conjugacy also provides us with a computationally cheap way to sequentially update the posterior. Due to the independence assumption, we do not need to calculate the posterior in one go.
```math
\begin{align}
p(\phi|\{d_1, d_2, \ldots, d_N\})&\propto p(\phi) \prod_{n=1}^N p(d_n|\phi)\\
&= \underbrace{p(\phi) \prod_{n=1}^{N-1} p(d_n|\phi)}_{p(\phi|\mathcal D_{N-1})}\cdot p(d_N|\phi)\\
&= \underbrace{p(\phi|\mathcal D_{N-1})}_{\text{new prior}} p(d_N|\phi)
\end{align}
```
With observations up to ``N-1``, we obtain a posterior ``p(\phi|\mathcal D_{N-1})``.
The posterior now serves as the new prior waiting to be updated with the next observation ``d_N``. It can be shown that the final posterior sequentially updated this way is the same as the off-line posterior.
To be more specific, the sequential update algorithm for the precision example is:
Initialise with a prior ``p(\phi|\varnothing)``;
for ``n = 1,2,\ldots, N``
* update ``a_n = a_{n-1} + 0.5 ``, ``b_n = b_{n-1} + 0.5 \cdot (d_n-\mu)^2``
* report the posterior at ``n`` if needed
To demonstrate the idea, check the following animation. The posterior update starts from a vague prior. As more data are observed and absorbed into the posterior, the posterior distribution becomes more and more informative and finally concentrates around the ground truth.
"""
# ╔═╡ 7ccd4caf-ef8f-4b78-801f-b000f5aad430
let
plot(prior_ϕ, xlim = [0, 5], ylim=[0, 1.5], lw=2, fill=(0, 0.1), label="Prior "* L"p(\phi)", xlabel=L"\phi", ylabel="density", title="Sequential update N=0")
plot!(posterior_ϕ, lw=2, fill=(0, 0.1), label="Posterior "* L"p(\phi|\mathcal{D})")
vline!([true_ϕ], label="true "*L"\phi", lw=2, lc=3)
anim=@animate for n in [(1:9)..., (10:10:N)...]
an = a₀ + n/2
bn = b₀ + sum(gaussian_data[1:n].^2)/2
poster_n = Gamma(an, 1/bn)
plot!(poster_n, lw=1, label="", title="Sequential update N=$(n)")
end
gif(anim, fps=3)
end
# ╔═╡ 87b411d1-a62f-4462-b81b-ef0e8ac97d7e
md"""
### Informative vs Non-informative
All priors can be largely classified into two groups: **informative** priors and **non-informative** priors.
A **non-informative** prior, as the name suggests, contains no information about the unknown [^1] and *lets the data speak for itself*. For our coin-flipping case, a possible choice is a flat uniform prior: *i.e.* ``p(\theta) \propto 1`` for ``\theta\in[0,1]``.
An **informative** prior, on the other hand, encodes the modeller's subjective prior judgement.
For example, if we believe our coin should be fair, we can impose an informative prior such as ``p(\theta) =\texttt{Beta}(n,n)``. As ``n`` gets larger, e.g. ``n=5``, the prior becomes more concentrated around ``\theta=0.5``, implying a stronger prior belief that the coin is fair.
$(begin
plot(Beta(1,1), lw=2, xlabel=L"\theta", ylabel="density", label=L"\texttt{Unif}(0,1)", title="Priors with different level of information", legend=:topleft)
plot!(Beta(2,2), lw=2,label= L"\texttt{Beta}(2,2)")
plot!(Beta(3,3), lw=2,label= L"\texttt{Beta}(3,3)")
plot!(Beta(5,5), lw=2,label= L"\texttt{Beta}(5,5)")
# vline!([0.5], lw=2, ls=:dash, label= "")
end)
After we observe ``N_h=7, N_t=3``, the posteriors can be calculated by updating the pseudo counts: ``\texttt{Beta}(n+7, n+3)``. The corresponding posteriors together with their posterior means (thick dashed lines) are plotted below.
$(begin
nh, nt= 7, 3
plot(Beta(1+7,1+3), lw=2,xlabel=L"\theta", ylabel="density", label=L"\texttt{Beta}(1+7,1+3)", title="Posteriors with different priors",legend=:topleft)
plot!(Beta(2+7,2+3), lw=2,label= L"\texttt{Beta}(2+7,2+3)")
plot!(Beta(3+7,3+3), lw=2,label= L"\texttt{Beta}(3+7,3+3)")
plot!(Beta(5+7,5+3), lw=2,label= L"\texttt{Beta}(5+7,5+3)")
vline!([mean(Beta(5+7, 5+3))], color=4, ls=:dash, lw=2, label="")
vline!([mean(Beta(1+7, 1+3))], color=1, ls=:dash, lw=2, label="")
vline!([mean(Beta(2+7, 2+3))], color=2, ls=:dash, lw=2, label="")
vline!([mean(Beta(3+7, 3+3))], color=3, ls=:dash, lw=2, label="")
end)
"""
# ╔═╡ 58d7878a-f9d6-46b4-9ac4-1944f7508f03
md"""
It can be observed that the posteriors now peak at different locations. The more informative the prior (that the coin is fair), the more it shrinks the posterior towards the centre ``\theta=0.5``.
"""
# ╔═╡ 3cdf9875-9c42-46b3-b88c-74e0f363bc4a
md"""
Historically, **non-informative** priors were the dominant choice due to their *objectiveness*. However, it should be noted that it is almost impossible to be completely non-informative. Indeed, for our coin-flipping example, the flat uniform prior still shrinks the posterior mean towards 0.5, and is therefore neither non-informative nor objective.
The flexibility of introducing priors that reflect a modeller's local expert knowledge, however, is now considered more of an *advantage* of the Bayesian approach. After all, modelling itself is a subjective matter, and a modeller needs to take responsibility for their modelling choices. The preferred approach is to impose subjective (possibly weakly) informative priors and carefully check one's prior assumptions via methods such as prior/posterior predictive checks.
"""
# ╔═╡ 3ca3ee53-c744-4474-a8da-6209ec5e6904
md"""
## Likelihood model variants
"""
# ╔═╡ 5cbace78-8a92-43aa-9d3a-45054be48532
md"""
Bayesian modelling is more an art than a science: ideally, each problem should have its own bespoke model. So far we have only seen problems where the observed data are assumed to be independently and identically distributed (**i.i.d.**). For example, in the coin-flipping example, the ten coin tosses ``\{d_1, d_2, \ldots, d_{10}\}`` are *independent* realisations of tossing the *same* coin, so the i.i.d. assumption makes sense: each ``d_i`` follows the same Bernoulli distribution and they are mutually independent.
However, there are problems in which more elaborate modelling assumptions are required. To demonstrate the idea, we consider the following inference problem.
"""
# ╔═╡ 2cb31d23-a542-4f8d-a056-025fa574f0d7
md"""
!!! question "Seven scientist problem"
[*The question is adapted from [^1]*] Seven scientists (A, B, C, D, E, F, G) with widely-differing experimental skills measure some signal ``\mu``. You expect some of them to do accurate work (*i.e.* to have small observation variance ``\sigma^2``), and some of them to turn in wildly inaccurate answers (*i.e.* to have enormous measurement error). What is the unknown signal ``\mu``?
The seven measurements are:
| Scientists | A | B | C | D | E | F | G |
| -----------| ---|---|---|--- | ---|---|---|
| ``d_n`` | -27.020| 3.570| 8.191| 9.898| 9.603| 9.945| 10.056|
"""
# ╔═╡ 31afbad6-e447-44d2-94cf-c8f99c7fa64a
begin
scientist_data = [-27.020, 3.57, 8.191, 9.898, 9.603, 9.945, 10.056]
μ̄ = mean(scientist_data)
end
# ╔═╡ de5c3254-1fbe-46ec-ab09-77889405510d
md"""
The measurements are also plotted below.
"""
# ╔═╡ ff49403c-8e0e-407e-8141-6ccf178c152b
let
ylocations = range(0, 2, length=7) .+ 0.5
plt = plot(ylim = [0., 3.0], xminorticks =5, yticks=false, showaxis=:x, size=(600,200))
scientists = 'A':'G'
δ = 0.1
for i in 1:7
plot!([scientist_data[i]], [ylocations[i]], label="", markershape =:circle, markersize=5, markerstrokewidth=1, st=:sticks, c=1)
annotate!([scientist_data[i]].+7*(-1)^i * δ, [ylocations[i]].+ δ, scientists[i], 8)
end
vline!([μ̄], lw=2, ls=:dash, label="sample mean", legend=:topleft)
# density!(scientist_data, label="")
plt
end
# ╔═╡ 024ce64f-29ef-49f2-a66f-87c2d4eb67a7
md"""
*Remarks. Based on the plot, scientists C, D, E, F, and G all made similar measurements, while scientists A and B's experimental skills seem questionable. This is a problem that the frequentist method finds challenging. If all 7 measurements had been made by one scientist, or by scientists with similar levels of experimental skill, the sample mean
$$\frac{\sum_n d_n}{N} \approx 3.46$$
would have been a good estimator.
An ad hoc remedy is to treat the first two observations as outliers and average the remaining five measurements. This remedy lacks formal justification and does not scale well to larger datasets.*
"""
# ╔═╡ 52fd1028-05aa-4b37-b57c-daabf4d77f50
md"""
### A bad Bayesian model
**Modelling**:
One possible model is to ignore the subtleties and reuse the coin-flipping model's assumptions. Since the observed data are real-valued, we only need to replace the Bernoulli likelihood with a Gaussian. We then assume the observations ``d_n`` are i.i.d. distributed with a Gaussian
$$d_n \overset{\mathrm{i.i.d}}{\sim} \mathcal N(\mu, \sigma^2),$$
where the mean is the unknown signal ``\mu`` and ``\sigma^2`` is a shared observation variance. The model implies each scientist's observation is the true signal ``\mu`` plus some Gaussian-distributed observation noise.
To specify a Bayesian model, we need to continue to specify a prior model for the two unknowns ``\mu``, ``\sigma^2``. For computational convenience, we assume a Gaussian prior for the signal ``\mu``:
$$\mu \sim \mathcal N(m_0, v_0),$$
* ``m_0`` is our prior guess of the signal's centre
* ``v_0`` represents our prior belief strength;
To show our ignorance, we can set ``m_0=0`` (or the sample average) and ``v_0`` to a very large positive number, say 10,000. The prior then becomes a very flat vague distribution.
It is more convenient to model the observation precision ``\phi \triangleq 1/\sigma^2`` instead of variance ``\sigma^2``. Here we assume a Gamma prior for the precision parameter:
$$\phi \sim \texttt{Gamma}(a_0, b_0)$$
Again, to show our ignorance, we can set ``a_0, b_0`` such that the distribution is as flat and vague as possible. A possible parameterisation is ``a_0=b_0=0.5.`` Note that Gamma is a distribution on the positive real line which has matching support for the precision parameter.
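As a sanity check on how vague this prior is, we can inspect its moments (a sketch assuming `Distributions.jl`, which this notebook already uses; note that Julia's `Gamma` takes shape and *scale*, so the rate ``b_0`` enters as ``1/b_0``):

```julia
using Distributions

a₀, b₀ = 0.5, 0.5
ϕ_prior = Gamma(a₀, 1/b₀)      # shape-scale parameterisation: scale = 1/rate
# analytic moments: mean = a₀/b₀ = 1, variance = a₀/b₀² = 2
mean(ϕ_prior), var(ϕ_prior)    # a heavy-tailed, weakly informative prior
```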
To put them together, the full Bayesian model is:
```math
\begin{align}
\text{prior}: \mu &\sim \mathcal N(m_0=0, v_0=10000)\\
\phi &\sim \texttt{Gamma}(a_0=0.5, b_0=0.5)\\
\text{likelihood}: d_n &\overset{\mathrm{i.i.d}}{\sim} \mathcal N(\mu, 1/\phi) \;\; \text{for } n = 1,2,\ldots, 7.
\end{align}
```
**Computation:**
After specifying the model, we apply Bayes' rule to compute the posterior distribution:
```math
\begin{align}
p(\mu, \phi|\mathcal D) &\propto p(\mu, \phi) p(\mathcal D|\mu, \phi)
\\
&= p(\mu)p(\phi) p(\mathcal D|\mu, \phi) \\
&= p(\mu)p(\phi) \prod_{n=1}^N p(d_n|\mu, \phi);
\end{align}
```
where we have assumed the priors for ``\mu`` and ``\phi`` are independent. Substituting in the definitions of the prior and likelihood, we can plot the posterior.
"""
# ╔═╡ d6056188-9a46-4b5f-801b-e028b6eb0b7f
let
m₀, v₀ = 0, 10000
a₀, b₀ = 0.5, 0.5
function ℓπ(μ, ϕ; data)
σ² = 1/ϕ
logprior = logpdf(Normal(m₀, v₀), μ) + logpdf(Gamma(a₀, 1/b₀), ϕ)
logLik = sum(logpdf.(Normal(μ, sqrt(σ²)), data))
return logprior + logLik
end
plot(-13:0.05:20, 0:0.001:0.019, (x, y) -> exp(ℓπ(x, y; data=scientist_data)), st=:contour,fill=true, ylim=[0.001, 0.015], xlabel=L"μ", ylabel=L"\phi", title="Contour plot of "*L"p(\mu, \phi|\mathcal{D})")
end
# ╔═╡ bbbb73c8-0254-463a-8e81-5acdd08583ac
md"""
**Marginal posterior ``p(\mu|\mathcal D)``:**
The posterior distribution shows the posterior peaks around ``\mu = 3.5``, which is roughly the same as the sample average. However, to better answer the question, we should treat ``\phi`` as a *nuisance* parameter, and integrate it out to find the marginal posterior for ``\mu`` only. After some algebra, we find the unnormalised marginal posterior is of the form:
```math
p(\mu|\mathcal D) \propto p(\mu)\cdot \Gamma\left (a_0+ \frac{N}{2}\right )\left (b_0+\frac{\sum_n (d_n-\mu)^2}{2}\right )^{- (a_0+ \frac{N}{2})},
```
where ``N=7`` for our problem.
"""
# ╔═╡ 546cfb29-6f60-44be-9450-b4d33c8238e6
Foldable("Details on the posterior marginal distribution of μ.", md"
```math
\begin{align}
p(\mu|\mathcal D) &= \int_0^\infty p(\mu, \phi|\mathcal D) \mathrm{d}\phi \\
&= \frac{1}{p(\mathcal D)} \int_0^\infty p(\mu)p(\phi) p(\mathcal D|\mu, \phi)\mathrm{d}\phi\\
&\propto p(\mu)\int_0^\infty p(\phi) p(\mathcal D|\mu, \phi)\mathrm{d}\phi\\
&= p(\mu)\int_0^\infty \frac{b_0^{a_0}}{\Gamma(a_0)} \phi^{a_0-1} e^{-b_0 \phi} \frac{1}{ (\sqrt{2\pi})^N}\phi^{\frac{N}{2}}e^{\frac{-\phi\cdot \sum_{n} (d_n-\mu)^2}{2}}\mathrm{d}\phi\\
&\propto p(\mu)\int_0^\infty\phi^{a_0+\frac{N}{2}-1} \exp\left \{-\left (b_0 + \frac{\sum_{n} (d_n- \mu)^2}{2}\right )\phi\right \}\mathrm{d}\phi \\
&= p(\mu)\int_0^\infty\phi^{a_N-1} e^{- b_N \phi}\mathrm{d}\phi \\
&= p(\mu)\frac{\Gamma(a_N)}{b_N^{a_N}},
\end{align}
```
where ``a_N= a_0 +\frac{N}{2}`` and ``b_N = b_0 + \frac{\sum_{n} (d_n- \mu)^2}{2}``. Note that we have used the normalising constant trick of the Gamma distribution in the second-to-last step, where we recognise ``\phi^{a_N-1} e^{- b_N \phi}`` as the unnormalised part of a Gamma distribution with parameters ``a_N, b_N``; the corresponding Gamma density must then integrate to one:
```math
\int_{0}^\infty \frac{b_N^{a_N}}{\Gamma(a_N)}\phi^{a_N-1} e^{- b_N \phi}\mathrm{d}\phi =1,
```
which leads to ``\int_{0}^\infty \phi^{a_N-1} e^{- b_N \phi}\mathrm{d}\phi= \frac{\Gamma(a_N)}{b_N^{a_N}}.``
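As a quick numerical sanity check of this identity, a midpoint Riemann sum with illustrative values ``a_N=3`` and ``b_N=2`` (standard library only) reproduces ``\Gamma(3)/2^3 = 2/8 = 0.25``:

```julia
# midpoint Riemann sum check of the Gamma normalising-constant identity,
# with illustrative values aN = 3, bN = 2
aN, bN = 3, 2
dϕ = 1e-4
ϕs = dϕ/2:dϕ:50                        # tail beyond 50 is negligible here
integral = sum(ϕ -> ϕ^(aN - 1) * exp(-bN * ϕ), ϕs) * dϕ
abs(integral - factorial(aN - 1) / bN^aN) < 1e-6   # true: both sides ≈ 0.25
```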
")
# ╔═╡ 2f2e3e34-76c0-4f35-8c9b-21184f86cf66
md"""
We can implement the (log) un-normalised density in Julia (see the `log_marginal_μ_wrong` function below). It is more common to compute log probabilities to avoid numerical over/underflow issues. For reference, the log posterior density is:
```math
\ln p(\mu|\mathcal D) = \ln p(\mu) + \ln\Gamma\left (a_0+ \frac{N}{2}\right )- \left (a_0+ \frac{N}{2}\right )\cdot \ln \left (b_0+\frac{\sum_n (d_n-\mu)^2}{2}\right )
```
"""
# ╔═╡ 9732358b-f592-4c66-9d8a-fdc221249a56
function log_marginal_μ_wrong(μ; data, a₀=0.5, b₀=0.5, m₀=0, v₀=10000)
N = length(data)
logprior = logpdf(Normal(m₀, sqrt(v₀)), μ)
logLik = loggamma(a₀ + N/2) - (a₀+ N/2)* log(b₀+ 0.5 * sum((data .- μ).^2))
return logprior + logLik
end
# ╔═╡ 9d737cc5-c533-4e86-8654-b1adba943fc0
md"""
The unnormalised marginal posterior is plotted below. It shows the most likely estimate for the signal is about 3.46, which is counter-intuitive.
"""
# ╔═╡ 1d7783bf-a8d0-47d6-812d-b31ae0c0b138
let
μs = -30:0.01:30
ℓs = log_marginal_μ_wrong.(μs; data= scientist_data)
_, maxμ = findmax(ℓs)
plot(μs, (ℓs), color=2, alpha=0.5, label="", fillrange = -100, ylim=[-39, -28],
fillalpha = 0.5,
fillcolor = 2, xlabel=L"μ", yaxis=false, ylabel="log density",title="Unnormalised marginal posterior "*L"\ln p(μ|\mathcal{D})")
vline!([μs[maxμ]], lw=2, color=2, ls=:dash, label="Mode")
xticks!([(-30:10:30)...; μs[maxμ]])
end
# ╔═╡ 73bb12a3-db59-4e71-bf05-5a4b4e018f51
md"""
### A better Bayesian model
"""
# ╔═╡ 8eb641f0-a01f-43f9-a362-e7d54c26411a
md"""
**Modelling:**
The i.i.d. assumption does not reflect the subtleties of the data generation process. A better model should assume each observation follows an independent but *not* identical Gaussian distribution. In particular, we should introduce an observation precision for each scientist, ``\phi_n \triangleq 1/\sigma^2_n``, to reflect their different levels of experimental skill.
The improved model now includes 7+1 unknown parameters: the unknown signal ``\mu`` and the observation precisions ``\{\phi_n\}_{n=1}^7`` for each of the seven scientists:
```math
\begin{align}
\mu &\sim \mathcal N(m_0, v_0)\\
\phi_n &\sim \texttt{Gamma}(a_0, b_0)\;\; \text{for }n = 1,2,\ldots,7\\
d_n &\sim \mathcal N(\mu, 1/\phi_n)\;\; \text{for }n = 1,2,\ldots,7\\
\end{align}
```
Intuitively, the precision ``\phi_n`` reflects a scientist's experimental skill level. Higher precision (lower observation variance) implies better skill.
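To make the generative story concrete, the hierarchy can be forward-simulated to produce one pseudo dataset of 7 observations (a sketch assuming `Distributions.jl`; the seed and hyperparameters are illustrative):

```julia
using Distributions, Random

Random.seed!(123)
m₀, v₀, a₀, b₀ = 0, 10000, 0.5, 0.5
μ = rand(Normal(m₀, sqrt(v₀)))                   # draw the common signal
ϕ = rand(Gamma(a₀, 1/b₀), 7)                     # one precision per scientist
d = [rand(Normal(μ, sqrt(1/ϕₙ))) for ϕₙ in ϕ]    # independent, non-identical observations
```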
"""
# ╔═╡ a06f8785-fbbc-4973-b3d0-5a6db967b3cc
md"""
**Computation:**
Similarly, we want to find out the marginal posterior distribution. After integrating out the nuisance parameters ``\{\phi_n\}`` from the posterior, we can find the marginal posterior is of the following form:
```math
p(\mu|\mathcal D) \propto p(\mu)\prod_{n=1}^N p(d_n|\mu) \propto p(\mu)\prod_{n=1}^N \frac{\Gamma{(a_n)}}{b_n^{a_n}},
```
where for ``n= 1,2\ldots, 7``:
$$a_n = a_0 + \frac{1}{2}, b_n = b_0+ \frac{1}{2} (d_n-\mu)^2.$$
"""
# ╔═╡ 15b09064-2c48-4013-8a08-c32e32d1f4df
Foldable("Derivation details on the marginal distribution.", md"
To find the marginal posterior for ``\mu``, we need to find the following marginal likelihood for observation ``d_n``:
$p(d_n|\mu) = \int p(\phi_n) p(d_n|\mu, \phi_n)\mathrm{d}\phi_n,$
where we have assumed ``\mu, \phi_n`` are independent, i.e. ``p(\phi_n|\mu) = p(\phi_n)``.
The marginal likelihood is the normalising constant of the conditional posterior ``p(\phi_n|d_n, \mu)``, since
$$p(\phi_n|d_n, \mu) = \frac{p(\phi_n) p(d_n|\mu, \phi_n)}{p(d_n|\mu)}.$$
Due to conjugacy, it can be shown that the conditional posterior of ``\phi_n`` is of Gamma form
$p(\phi_n|\mu, d_n) = \text{Gamma}(a_n, b_n),$ where
$$a_n = a_0 + \frac{1}{2}, b_n = b_0+ \frac{1}{2} (d_n-\mu)^2.$$
The normalising constant of the unnormalised posterior is therefore the corresponding Gamma distribution's normalising constant:
$$p(d_n|\mu) \propto \frac{b_0^{a_0}}{\Gamma(a_0)} \frac{\Gamma{(a_n)}}{b_n^{a_n}}.$$
")
# ╔═╡ 5fc1d8aa-5835-4559-860b-b73031d3bfe7
md"""
We can implement the log posterior distribution in Julia and plot the density.
"""
# ╔═╡ da7f1485-a532-4d3c-96db-d1d50f2bdee6
function log_marginal_μ(μ; data, a₀=0.5, b₀=0.5, m₀=0, v₀=10000)
aₙ = a₀ + 0.5
bₙ = b₀ .+ 0.5 .* (data .-μ).^2
logpdf(Normal(m₀, sqrt(v₀)), μ) + length(data) * loggamma(aₙ) - sum(aₙ * log.(bₙ))
end
# ╔═╡ ee8fce9a-0a5b-40c2-a4d9-f124b40188f4
md"""
The posterior distribution looks more complicated now but makes much better sense:
* the mode now correctly sits around 10 (circa 9.7);
* note also a few small bumps near -27 and 3.5.
The posterior not only correctly identifies the consensus of the majority of the scientists (*i.e.* scientists C, D, E, ..., G) but also accounts for scientists A and B's noisy observations.
"""
# ╔═╡ d6341bf9-7a95-4a10-8a26-9495b724b907
let
μs = -30:0.01:30
ℓs = log_marginal_μ.(μs; data= scientist_data)
_, maxμ = findmax(ℓs)
plot(μs, ℓs, xlabel=L"μ", ylabel="log density", title="Unnormalised marginal log posterior: "*L"\ln p(\mu|\mathcal{D})", lw=2, label="", lc=2)
vline!([μs[maxμ]], lw=2, color=2, ls=:dash, label="Mode")
# xticks!([(-30:10:30)...; μs[maxμ]])
end
# ╔═╡ 0a8a5aae-3567-41a3-9f15-d07235d07da2
md"""
## Check your models: Predictive checks
Bayesian inference provides a great amount of modelling freedom to its user. The modeller, for example, can incorporate his or her expert knowledge in the prior distribution and also choose a suitable likelihood that best matches the data generation process. However, greater flexibility comes with a price. The modeller also needs to take full responsibility for the modelling decisions. To be more specific, for a Bayesian model, one at least needs to check whether
* the prior makes sense;
* the generative model as a whole (*i.e.* prior plus likelihood) matches the observed data.
In other words, the user needs to **validate the model** before making any final inference decisions. **Predictive checks** are a great way to empirically validate a model's assumptions.
"""
# ╔═╡ 8fe3ed2f-e826-4faa-ace0-2286b944991f
md"""
### Posterior predictive check
The idea of predictive checks is to generate future *pseudo observations* based on the assumed model's (posterior) **predictive distribution**:
$$\mathcal{D}^{(r)} \sim p(\mathcal D_{\textit{pred}}|\mathcal D, \mathcal M), \;\; \text{for }r= 1\ldots, R$$
where ``\mathcal D`` is the observed data and ``\mathcal M`` denotes the Bayesian model. Note that the posterior predictive distribution indicates what future data might look like, given the observed data and our model. If the model assumptions (both the prior and likelihood) are reasonable, we should expect the generated pseudo data "agree with" the observed.
Comparing vector data is not easy (note that a sample ``\mathcal D`` is usually high-dimensional). In practice, we compute the predictive distribution of some summary statistics, say mean, variance, median, or any meaningful statistic instead, and visually check whether the observed statistic falls within the predictions' possible ranges.
The credible ranges can be calculated with the Monte Carlo method: after simulating ``R`` pseudo samples,
$$\tilde{\mathcal D}^{(1)}, \tilde{\mathcal D}^{(2)}, \tilde{\mathcal D}^{(3)},\ldots, \tilde{\mathcal D}^{(R)} \sim p(\mathcal D_{pred}|\mathcal D, \mathcal M),$$ the predictive distribution of a summary statistic ``t(\cdot)``:
$$p(t(D_{pred})|\mathcal D, \mathcal M)$$
can be approximated by the empirical distribution of ``\{t(\tilde{\mathcal D}^{(1)}), t(\tilde{\mathcal D}^{(2)}), \ldots, t(\tilde{\mathcal D}^{(R)})\}``. One can check whether the observed statistic ``t(\mathcal{D})`` falls within a credible region of the empirical distribution ``\{t(\tilde{\mathcal D}^{(1)}), t(\tilde{\mathcal D}^{(2)}), \ldots, t(\tilde{\mathcal D}^{(R)})\}``. We will see some examples soon to demonstrate the idea.
### Prior predictive check
Alternatively, one can use a **prior predictive distribution** to simulate the *future* data: *i.e.*
$$\mathcal{D}^{(r)} \sim p(\mathcal D_{pred}|\mathcal M), \;\; \text{for }r= 1\ldots, R$$
and the visual check based on the sample is called **prior predictive check**. The prior predictive distribution is a distribution of possible data sets given the priors and the likelihood, *before any real observations are taken into account*. Since the distribution relies on the prior information only, a prior predictive check is particularly useful to validate a prior's specification.
Note that the two predictive distributions are very much related. By setting ``\mathcal D=\emptyset``, we recover the prior predictive distribution from the posterior predictive distribution.
"""
# ╔═╡ 79b85624-a68d-41df-856a-7a7670ff4d0e
md"""
### Approximate predictive distributions
The predictive distribution, in general, can be found by applying the sum rule,
```math
p(\mathcal D_{pred}|\mathcal D, \mathcal M) = \int p(\mathcal D_{pred}, \theta|\mathcal D, \mathcal M) \mathrm{d}\theta= \int p(\mathcal D_{pred} |\theta, \mathcal D, \mathcal M) p(\theta|\mathcal D, \mathcal M) \mathrm{d}\theta
```
in which the unknown parameters ``\theta`` are integrated out. Assuming that past and future observations are conditionally independent given ``\theta``, *i.e.* ``p(\mathcal D_{pred} |\theta, \mathcal D, \mathcal M) = p(\mathcal D_{pred} |\theta, \mathcal M)``, the above equation can be written as:
```math
p(\mathcal D_{pred}|\mathcal D, \mathcal M) = \int p(\mathcal D_{pred} |\theta, \mathcal M) p(\theta|\mathcal D, \mathcal M) \mathrm{d}\theta
```
The integration in general is intractable. However, we can rely on sampling-based methods to approximate the integration.
!!! information ""
Repeat the following many times:
1. Draw one sample from the posterior (or the prior for prior predictive): $$\tilde \theta \sim p(\theta|\mathcal D)$$
2. Conditioning on ``\tilde \theta``, simulate pseudo observations: ``\tilde{\mathcal D}\sim p(\mathcal D|\tilde{\theta}) ``
The Monte Carlo approximation method can also be viewed as an ensemble method:
$$p(\mathcal D_{pred}|\mathcal D, \mathcal M) \approx \frac{1}{R} \sum_{r} p(\mathcal D_{pred}|\theta^{(r)}).$$
We are making predictions of future data with an ensemble of ``R`` models by averaging, where each member of the ensemble is indexed by one posterior (or prior) sample.
The ensemble view highlights a key difference between the Bayesian and frequentist methods. When it comes to prediction, the frequentist usually applies the plug-in principle: use a point estimator of ``\theta`` (e.g. the maximum likelihood estimator) and that singleton model to predict:
$$p(\mathcal D_{pred}|\hat{\theta}, \mathcal M).$$
On the other hand, the Bayesian adopts a more democratic approach: an ensemble of models is consulted.
*Remarks. In simulating a pseudo sample ``\tilde{\mathcal D}``, we need to generate the same number of observations as ``\mathcal D``. For example, in the coin flipping example, for each ``\tilde\theta``, we need to simulate 10 tosses.*
"""
# ╔═╡ 7b4d6471-59f5-4167-bc35-a61549aaaef9
Foldable("More explanation on predictive distribution approximation.", md"
Note that the sampling procedure starts with
```math
\tilde{\theta} \sim p(\theta|\mathcal D, \mathcal M)
```
Then conditional on ``\theta``, simulate pseudo observation:
```math
\tilde{\mathcal{D}} \sim p(\mathcal D_{pred}|\tilde{\theta}, \mathcal M)
```
Due to the independence assumption, the joint can be factored as:
$$p(\theta, \mathcal D_{pred}|\mathcal D, \mathcal M)= p( \mathcal D_{pred}|\theta, \cancel{\mathcal{D}},\mathcal M)p(\theta|\mathcal D, \mathcal M).$$
As a result, the tuple ``\tilde\theta, \tilde{\mathcal{D}}`` is actually drawn from the joint distribution:
$$\tilde{\theta}, \tilde{\mathcal{D}} \sim p(\theta, \mathcal D_{pred}|\mathcal D,\mathcal M),$$
we can therefore retain the marginal sample ``\{\tilde{\mathcal{D}}\}`` to approximate the marginal distribution ``p(\mathcal{D}_{pred}|\mathcal D, \mathcal M)``, which provides another way to show the correctness of the approximation method.
")
# ╔═╡ e9a87ae2-f906-4be9-a969-7b681c2cff7b
md"""
### Example: coin-flipping model
For the conjugate coin-flipping model, the parameter sampling step is straightforward since the prior ``p(\theta|\mathcal M)`` and the posterior ``p(\theta|\mathcal D, \mathcal M)`` are both Beta distributed.
It is also easy to simulate coin-flip observations conditional on the bias ``\theta``. For example, one can simulate a tuple ``(\theta, \mathcal D_{pred})`` in Julia by
```julia
# draw a coin's bias from a Beta distribution
θ = rand(Beta(a, b))
# draw a pseudo coin-flipping sample of 10 tosses
D = rand(Bernoulli(θ), 10)
```
One should repeat the above two steps ``R`` (e.g. 2000) times to obtain an approximation of the predictive distribution of future data.
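Putting the two steps together, here is a minimal standard-library sketch of this repetition: a prior predictive simulation under the flat ``\text{Beta}(1,1)`` prior, summarised by a central 95% predictive range of the number of heads. (``\text{Beta}(1,1)`` is just ``\text{Uniform}(0,1)``, so no extra packages are needed.)

```julia
using Random, Statistics

Random.seed!(42)
R, N = 2000, 10
t = map(1:R) do _
    θ = rand()               # θ ~ Beta(1,1), i.e. Uniform(0,1)
    sum(rand(N) .< θ)        # summary statistic: number of heads N_h
end
lo, hi = quantile(t, [0.025, 0.975])   # central 95% predictive range for N_h
```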
"""
# ╔═╡ ebdb9711-20cc-458c-9474-8e3ace69797e
function predictive_simulate(a, b; N=10 , R=2000)
θ = rand(Beta(a,b), R)
D = rand(N, R) .< θ'
return D
end;
# ╔═╡ 13fae558-d12e-4e27-b36f-ec59eabb65cf
let
Random.seed!(100)
D = predictive_simulate(1, 1; N=10 , R=5000)
histogram(sum(D, dims=1)[:], bins = 20, xticks=0:10, normed=true, label="Prior predictive on "*L"N_h", legend=:outerbottom, xlabel="number of heads "*L"N_h", title="Prior predictive check with "*L"a_0=b_0=1")
vline!([7], lw=4, lc=2, label="Observed "*L"N_h")
end
# ╔═╡ 5bc82f81-1c49-47e3-b9c8-6c5b79b4e88a
let
Random.seed!(100)
D = predictive_simulate(8, 4; N=10 , R=5000)
histogram(sum(D, dims=1)[:], normed=true, xticks = 0:10, label="Posterior predictive on "*L"N_h", legend=:outerbottom, xlabel="number of heads "*L"N_h", title="Posterior predictive check with "*L"a_0=b_0=1")
# density!(sum(D, dims=1)[:], xlim=[0,10], lc=1, lw=2, label="")
vline!([7], lw=4, lc=2, label="Observed "*L"N_h")
end
# ╔═╡ 858d3a5f-053d-4282-83f7-ff33ad8b5f58
md"""
**A model with misspecified prior.** Predictive checks can identify potential problems in a misspecified model. Suppose the modeller has mistakenly used a very strongly informative prior
$$\theta \sim \text{Beta}(50, 1),$$
which contradicts the observation. As a result, the posterior is dominated by the prior:
$$\theta \sim \text{Beta}(50+7, 1+3).$$
The prior and posterior plots are shown below.
"""
# ╔═╡ 4c1f3725-ddfa-4a19-adaf-4b48ef47c79b
let
nh, nt = 7, 3
a₀, b₀ = 50, 1
plot(Beta(a₀,b₀), xlims=[0.6,1], label=L"p(\theta)= \mathcal{Beta}(50,1)", linewidth=1, xlabel=L"\theta", ylabel=L"p(\theta|\cdot)" ,fill= true, lw=2, alpha=0.2, legend=:outerright, color=1, title="A mis-specified model")
vline!([mean(Beta(a₀,b₀))], label="prior mean", lw=2, lc=1, ls=:dash)
plot!(Beta(a₀+nh,b₀+nt), fill= true, lw=2, alpha=0.2, color=2, label=L"p(\theta|\mathcal{D})= \mathcal{Beta}(57,4)", linewidth=2)
vline!([mean(Beta(a₀+nh,b₀+nt))], label="posterior mean", lw=2, lc=2, ls=:dash)
end
# ╔═╡ 461dd52d-478e-4b1e-b0e0-55ecb98d4022