-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy pathpdf_cdf_review.Rmd
412 lines (323 loc) · 27.3 KB
/
pdf_cdf_review.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
---
title: "Introduction to R demos"
subtitle: "Probability density function review"
author: "Dr. Joseph P. Yurko"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
## Overview
This markdown quickly reviews important concepts related to working with probability density functions (pdfs) and cumulative distribution functions (CDFs). Just a single library is loaded in, `tidyverse`, which brings in multiple packages as shown in the code chunk display below.
```{r, load_packages}
library(tidyverse)
```
This markdown uses functions from the `dplyr` and `tibble` packages introduced in previous examples. However, it also makes use of functions from `tidyr` and `purrr`. Both of these packages are part of the wider `tidyverse` and are imported with `dplyr`.
## Gaussian distribution
We will start with the Gaussian or normal distribution. The density of a Gaussian variable $x$ with hyperparameters $\mu$ and $\sigma$ is:
$$
p\left( x|\mu,\sigma \right) = \frac{1}{\sigma \sqrt{2\pi}} \mathrm{exp}\left(-\frac{1}{2} \left(\frac{x-\mu}{\sigma} \right)^2 \right)
$$
The **standard** normal density corresponds to a Gaussian distributed variable with $\mu = 0$ and $\sigma = 1$:
$$
p\left(x|\mu=0,\sigma=1\right) = \frac{1}{\sqrt{2\pi}} \mathrm{exp} \left( -\frac{1}{2}x^2 \right)
$$
As we will discuss in lecture, if we define a variable $z = \sigma \times x + \mu$, and $x$ has a Gaussian distribution, the variable $z$ will have a standard normal distribution. In this document we will focus on the standard normal, but because of this transformation we can apply our learning to a general Gaussian distribution.
### R functions
We do not need to code the Gaussian distribution in order to evaluate it. The `*norm()` family of functions provides us several functions for evaluating and working with Gaussian random variables. If you type `?dnorm()` into the `R` console the help page displays 4 functions, `dnorm()`, `pnorm()`, `qnorm()`, and `rnorm()`. The last of these, `rnorm()` is the random number generator for Gaussian random variables. We will use `rnorm()` extensively throughout this course. However, within this markdown, we will focus on the other three functions.
The first of those functions, `dnorm()`, evaluates the density function of a Gaussian distribution with $\mu$ equal to the `mean` argument and $\sigma$ equal to the `sd` argument. By default, all of the `*norm()` family functions have `mean = 0` and `sd = 1`. So to evaluate the density of a standard normal variable at `x = 1.15` we only need to specify the first argument `x = ` in the `dnorm()` function call. The code to do so is shown in the code chunk below.
```{r, ex_dnorm_01}
dnorm(x = 1.15)
```
We get the same result if we explicitly set the `mean = 0` and `sd = 1`:
```{r, ex_dnorm_02}
dnorm(x = 1.15, mean = 0, sd = 1)
```
In the previous code chunks, we passed in scalars to the `x` argument of the `dnorm()` function. We can also pass in vectors in order to evaluate the density function at many different `x` variable values. For example, evaluate the standard normal density function at -2, -1, 0, 1, and 2:
```{r, ex_dnorm_03}
dnorm(x = -2:2, mean = 0, sd = 1)
```
### Visualize
Using this notation we will evaluate the standard normal density function over the interval -4.5 to 4.5. The code chunk below creates a `tibble` with the variable `x` which has 1001 elements from -4.5 to 4.5. That `tibble` is passted into the `mutate()` function where the variable `normal_pdf` is created by passing the variable `x` to the `dnorm()` function.
```{r, ex_dnorm_04}
df_gauss <- tibble::tibble(
x = seq(-4.5, 4.5, length.out = 1001)
) %>%
mutate(normal_pdf = dnorm(x))
```
The standard normal density function is visualized in the code chunk below using the `geom_line()` geometric object in `ggplot2`.
```{r, ex_dnorm_05}
df_gauss %>%
ggplot(mapping = aes(x = x, y = normal_pdf)) +
geom_line(mapping = aes(group = 1), size = 1.25) +
scale_x_continuous(breaks = -4:4) +
theme_bw() +
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 12))
```
Looking at the distribution's shape reveals some important properties of the Gaussian. First, it is symmetric. Second, it is unimodal, since there is just one local maximum density value. The center of the distribution is located at `x = 0`, which corresponds to the mean or expected value of the standard normal, $\mathbb{E}\left[x\right]=\mu$. As shown in the figure, the maximum density value (the mode) is also located at the `x = 0`. Thus, the **mode and mean are the same for a Gaussian**.
Although it may not be obvious from just look at the figure, **the mode and mean also correspond to the median for a Gaussian**. This property occurs for any symmetric unimodal distribution, not just the Gaussian (as long as the distribution has a mean). The median is the 0.5 quantile and corresponds to the $x$ value where the probability *mass* up to and including that value is 50%. That also means that 50% of the probability mass is located above (or greater) than the median. Notice the language used to the describe the median. The median does not correspond to the value that has 50% chance of occurring. Remember that the Gaussian distribution is continuous, and thus the probability that $x$ exactly equals a specific value is 0. In fact, the density function does not give the probability, it gives (as the name states) the *density*. In order to calculate the probability we must *integrate* the density function. Hence why the term *mass* was used. The interval we integrate over is the "volume" that contains the probability mass. The higher the density is within an interval, the more the mass in concentrated within that interval relative to other intervals.
Let's make this clear by visualizing the Cumulative Distribution Function (CDF). In the code chunk below, we pipe `df_gauss` into `mutate()` and call the `pnorm()` function. The result is piped into a `ggplot()` to visualize the CDF with respect to the `x` variable.
```{r, ex_pnorm_01}
df_gauss %>%
mutate(normal_cdf = pnorm(q = x)) %>%
ggplot(mapping = aes(x = x, y = normal_cdf)) +
geom_line(mapping = aes(group = 1), size = 1.25) +
scale_x_continuous(breaks = -4:4) +
theme_bw() +
theme(axis.text = element_text(size = 12),
axis.title = element_text(size = 12))
```
What's going on here? The x-axis is still the standard normal random variable, `x`. The y-axis is now between 0 and 1, because the y-axis gives the probability that the variable will take on a value less than or equal to the corresponding value on the x-axis. By integrating, we are "accumulating" the density to calculate the probability mass "contained" up to that point. That is why on the far left side of the x-axis (around `x = -4`) the CDF value displayed on the y-axis is essentially zero. At the extreme values of $x \leq -4$, the density is very low, so integrating over the interval yields a low probability. Why then does the right hand side of the figure, with $x \geq 4$, have CDF values all near 1? The density, as shown in the PDF figure, is also very low in that interval as well?
Let's put the PDF and CDF together into a single graphic with separate facets. In the figure below the CDF is displayed by the plot shown in the top row, while the PDF is shown by the plot in the bottom row. The dataset needed to be reformatted or *reshaped* to a **long** or **tall** format in order to create the figure below. The reshaping or *pivoting* operation is controlled by the `pivot_longer()` function. We will discuss `pivot_longer()` in more detail another time, but notice that the `starts_with()` helper function is used to identify the columns of interest. The syntax "reads" as "gather all columns that have names starting with" the pattern `'normal_'`. The reshaped long-format data are then passed to `ggplot()`. The `facet_grid()` function is used to create separated subplots for the CDF and PDF curves.
```{r, combine_pdf_cdf}
df_gauss %>%
mutate(normal_cdf = pnorm(q = x)) %>%
tibble::rowid_to_column() %>%
pivot_longer(starts_with('normal_')) %>%
ggplot(mapping = aes(x = x, y = value)) +
geom_line(mapping = aes(group = name), size = 1.15) +
facet_grid(name ~ ., scales = "free_y") +
scale_x_continuous(breaks = -4:4) +
theme_bw()
```
Let's walk through the figure above. The CDF is calculated by integrating the PDF over the entire range of the `x` variable (known as the **support**). A Gaussian random variable is unbounded, spanning the whole real line. Thus, we must integrate from $-\infty$ to $+\infty$. The PDF shows that for $\left| x\right|>3$ the density is very low. Thus integrating from $-\infty$ up to -3 yields very little "accumulated density" or *mass*. As we continue to move from -3 to -2, the density increases, yielding an increasing slope in the CDF, though it is still rather small. Continuing to from -2 to -1 however, we that the slope of the CDF increases sharply. The slope from -1 to 0 appears roughly constant in the CDF. Moving from 0 to +1 reveals that the CDF continues to increase, even though the PDF reaches its maximum at $x=0$. As mentioned already, the CDF is "accumulating" or building up mass via integration. The probability mass continues to increase at a slower rate than previously because the density is decreasing, especially between +1 and +2. At approximately $x=2.5$, the probability mass essentially saturates. Continuing to integrate over $x$ yields only a small increase in the cumulative probably. Thus, even though the Gaussian random variable exists over the entire real line almost all of the probability mass is contained within the interval of -3 to +3. This statement carries over to a general Gaussian variable in the form of -3$\sigma$ to +3$\sigma$.
### Quantiles
Let's calculate various quantiles to quantify the statements we made from visually studying the PDF and CDF. The `qnorm()` function allows us to calculate a quantile based on a desired probability as specified by the `p = ` argument. For example, the median or 0.5 quantile of the standard normal distribution is:
```{r, ex_qnorm_01}
qnorm(p = 0.5)
```
The result is 0 because, as mentioned already the mean, mode, median are the same for the Gaussian. The code chunk below adds a horizontal line on the y-axis of the normal CDF subplot at a probability value of 0.5. Both subplots have a vertical blue line located at median. Notice that the horizontal line and vertical lines intersect. Hopefully this makes it clear what the 50th quantile (the median) represents. After integrating from $-\infty$ to `x = 0`, the accumulating probability mass is 50% of the total probability mass.
```{r, ex_qnorm_02}
df_gauss %>%
mutate(normal_cdf = pnorm(q = x)) %>%
tibble::rowid_to_column() %>%
pivot_longer(starts_with('normal_')) %>%
ggplot(mapping = aes(x = x, y = value)) +
geom_line(mapping = aes(group = name), size = 1.15) +
geom_hline(data = data.frame(prob_val = 0.5, name = c("normal_cdf")),
mapping = aes(yintercept = prob_val),
color = "red") +
geom_vline(xintercept = qnorm(0.5), color = "navyblue") +
facet_grid(name ~ ., scales = "free_y") +
scale_x_continuous(breaks = -4:4) +
theme_bw()
```
We can make a similar figure to show the 0.95 quantile. The horizontal red line on the CDF plot is now placed at the y-axis value of 0.95. Likewise, the vertical blue line is located at `qnorm(0.95)` on the x-axis in both subplots. For the 95th quantile, we integrate from $-\infty$ to `r signif(qnorm(0.95), 4)`.
```{r, ex_qnorm_03}
df_gauss %>%
mutate(normal_cdf = pnorm(q = x)) %>%
tibble::rowid_to_column() %>%
pivot_longer(starts_with('normal_')) %>%
ggplot(mapping = aes(x = x, y = value)) +
geom_line(mapping = aes(group = name), size = 1.15) +
geom_hline(data = data.frame(prob_val = 0.95, name = c("normal_cdf")),
mapping = aes(yintercept = prob_val),
color = "red") +
geom_vline(xintercept = qnorm(0.95), color = "navyblue") +
facet_grid(name ~ ., scales = "free_y") +
scale_x_continuous(breaks = -4:4) +
theme_bw()
```
### Intervals
You have probably heard of the "2-sigma" rule which states approximately 95% of the probability is contained between $-2\sigma$ and $+2\sigma$. Notice that in the previous figure, the 0.95 quantile is not located at 2...it is located at `r signif(qnorm(0.95, 4))`. That is because the 0.95 quantile is calculated by integrating starting from $-\infty$. The "2-sigma" rule however, defines an interval **centered around** the mean. The left hand side of the integral used to calculate the probability starts at the 0.025 quantile and the integration ends at the 0.975 quantile. The interval therefore spans 95% of the probability. We could use a different interval which captures 95% of the probability mass. For example, the 0.01 quantile to the 0.96 quantile would also cover 95% of the probability mass. Spanning the 0.025 through 0.975 quantiles though encompasses a central interval, with the median/mean/mode at its center. The central 95% interval therefore represents the uncertainty around the most likely value.
Let's first calculate the probability mass contained between -2 and +2 (for the standard normal). We already know that the CDF is evaluated by the `pnorm()` function, but since the left hand side starts at $`-\infty$, we need to setup the calculation slightly different. We need to integrate from $-\infty$ to +2, and then subtract out the integral from $-\infty$ to -2. We can perform the calculation quite simply in `R`, as shown in the code chunk below:
```{r, ex_pnorm_02}
pnorm(2) - pnorm(-2)
```
The result `r signif(pnorm(2) - pnorm(-2), 4)` is pretty close to 95%. This shows that the "2-sigma" is an approximation, a pretty good approximation, but still an approximation. The actual 0.025 and 0.975 quantile values are:
```{r, ex_qnorm_04}
qnorm(p = c(0.025, 0.975))
```
If we plug in the actual quantiles to the integral calculation, we obtain 95% exactly:
```{r, ex_pnorm_03}
pnorm(qnorm(0.975)) - pnorm(qnorm(0.025))
```
To visualize the relationship between the integral of the PDF to the central 95% probability interval, the figure below shades in the area under the PDF curve between the two quantiles.
```{r, ex_qnorm_05}
df_gauss %>%
mutate(normal_cdf = pnorm(q = x)) %>%
tibble::rowid_to_column() %>%
pivot_longer(starts_with('normal_')) %>%
mutate(show_area = ifelse(name == "normal_cdf",
NA,
between(x, qnorm(0.025), qnorm(0.975)))) %>%
ggplot(mapping = aes(x = x, y = value)) +
geom_area(mapping = aes(fill = show_area)) +
geom_line(mapping = aes(group = name), size = 1.15) +
geom_hline(data = data.frame(prob_val = c(0.025, 0.975), name = c("normal_cdf")),
mapping = aes(yintercept = prob_val),
color = "red") +
geom_vline(xintercept = qnorm(c(0.025, 0.975)), color = "navyblue") +
facet_grid(name ~ ., scales = "free_y") +
scale_fill_manual("Area under the density curve for probability interval",
values = c("TRUE" = "red",
"FALSE" = "grey",
"NA" = NA),
labels = c("TRUE" = "included in integral",
"FALSE" = "outside integral",
"NA" = NA),
na.value = NA) +
scale_x_continuous(breaks = -4:4) +
theme_bw() +
theme(legend.position = "top")
```
## Beta distribution
Many of the concepts described for the Gaussian can be carried over to other distributions. Especially how to interpret the CDF and relate the PDF to the CDF through integration. We will now apply those understandings to the Beta distribution.
The Beta distribution is a continuous distribution, but unlike the Gaussian it is bounded between 0 and 1. Also, the Beta distribution can take a wide variety of shapes based upon its hyperparameters. The Bishop book denotes these hyperparameters as $a$ and $b$, but in `R` they are referred to as `shape1` and `shape2`, respectively.
As you will see in lecture, the Beta distribution can be used as the prior distribution for an unknown event probability, $\mu$. Within this markdown we will denote that variable simply as `x`. Doing so allows us to be consistent with the input arguments to the `*beta` family of functions. This document therefore represents a more general notation consistent with many references discussing probability distributions. The Beta density function on the variable $x$ given the hyperparameters $a$ and $b$ is proportional to:
$$
p\left( x|a,b \right) \propto x^{a-1}\left( 1 - x\right)^{b-1}
$$
### R functions
The `*beta()` family of functions are `dbeta()`, `pbeta()`, `qbeta()`, and `rbeta()`. The `rbeta()` is the random number generation function for beta distribution values bounded between 0 and 1. Compared with the `*norm()` family of functions, the hyperparameters, `shape1` and `shape2` do not have default argument values. The user **must** supply the hyperparameters for the functions to execute.
### Visualize
Let's start out evaluating the density function `dbeta()` using the $a$ and $b$ hyperparameters set to 2 and 3, respectively. We will discuss how to interpret these hyperparameters in lecture. The mean for this Beta is `r 2/(2+3)` since the expected value of the Beta distribution has the following formula:
$$
\mathbb{E}\left[x|a,b\right] = \frac{a}{a+b}
$$
The code chunk below stores a vector of values between 0 and 1 and then calculates the Beta density function at each element.
```{r, ex_dbeta_01}
df_beta_01 <- tibble::tibble(
x = seq(0, 1, length.out = 201)
) %>%
mutate(beta_pdf = dbeta(x, shape1 = 2, shape2 = 3))
```
Plot the Beta density function using are given hyperparameters.
```{r, ex_dbeta_02}
df_beta_01 %>%
ggplot(mapping = aes(x = x, y = beta_pdf)) +
geom_line(mapping = aes(group = 1), size = 1.25) +
scale_x_continuous(breaks = seq(0, 1, by = 0.2)) +
theme_bw()
```
Notice how in the figure above, the particular Beta distribution is **not** symmetric, but it is unimodal. Because of this, the mean will not correspond to the mode. To make this clear the code chunk below plots the same Beta distribution, but marks the mean value with a vertical blue line. The figure below also displays the median value with a dashed vertical orange line. The median is calculated using the `qbeta()` function (which has similar behavior as the `qnorm()` function, except for hyperparameter changes).
```{r, ex_dbeta_03}
df_beta_01 %>%
ggplot(mapping = aes(x = x, y = beta_pdf)) +
geom_line(mapping = aes(group = 1), size = 1.25) +
geom_vline(data = tibble::tibble(stat_val = c((2 / (2+3)),
qbeta(0.5, shape1 = 2, shape2 = 3)),
stat_name = c("mean", "median")),
mapping = aes(xintercept = stat_val,
color = stat_name,
linetype = stat_name),
size = 1.15) +
scale_color_manual("Statistic",
values = c("mean" = "navyblue",
"median" = "darkorange")) +
scale_linetype_manual("Statistic",
values = c("mean" = "solid",
"median" = "dashed")) +
theme_bw() +
theme(legend.position = "top")
```
Let's demonstrate the impact of the hyperparameters on the Beta distribution shape. The code chunk below sets the hyperparameters to be $a = 1.5$ and $b = 4.5$ and then plots the Beta density function. Colored vertical lines give the mean and median, as in the previous figure. Try out a different set of hyperparameters and include the mean and median as colored vertical lines.
```{r, ex_dbeta_04}
tibble::tibble(
x = seq(0, 1, length.out = 201)
) %>%
mutate(beta_pdf = dbeta(x, shape1 = 1.5, shape2 = 4.5)) %>%
ggplot(mapping = aes(x = x, y = beta_pdf)) +
geom_line(mapping = aes(group = 1), size = 1.25) +
geom_vline(data = tibble::tibble(stat_val = c((1.5 / (1.5+4.5)),
qbeta(0.5, shape1 = 1.5, shape2 = 4.5)),
stat_name = c("mean", "median")),
mapping = aes(xintercept = stat_val,
color = stat_name,
linetype = stat_name),
size = 1.15) +
scale_color_manual("Statistic",
values = c("mean" = "navyblue",
"median" = "darkorange")) +
scale_linetype_manual("Statistic",
values = c("mean" = "solid",
"median" = "dashed")) +
theme_bw() +
theme(legend.position = "top")
```
Rather than plotting the distributions in separate figues, let's use a separate subplot for each. To accomplish this, we will need to combine the two datasets used to create the two previous figures. That is an easy operation to do, but let's generalize our setup such that we can easily try out arbitrary hyperparameter values. We will first define a function, `make_beta_pdf()`, which creates a `tibble` consisting of the beta density evaluated at specific `x` values, based on the user supplied hyperparameter values. Notice that in `make_beta_pdf()` I did not use the `return()` function. By default, `R` will therefore return the last object within the function, which in this case is a `tibble` consisting of `x`, the Beta density function value, `beta_pdf`, and the two hyperparameters which defined the distribution.
```{r, ex_func_01}
make_beta_pdf <- function(shape_1, shape_2, vector_length)
{
tibble::tibble(
x = seq(0, 1, length.out = vector_length),
beta_pdf = dbeta(x = x,
shape1 = shape_1,
shape2 = shape_2)
) %>%
mutate(a = shape_1,
b = shape_2)
}
```
The `make_beta_pdf()` function can be called just like any function in `R`. We just pass in the arguments we wish to use. To demonstrate, call `make_beta_pdf()` with the following arguments `shape_1 = 2`, `shape_2 = 3`, and `vector_length = 3`. Pipe the result of the function call to the `glimpse()` function. As shown below, the resulting dataset consists of 3 observations of 4 variables. The values of those variables are printed below. The hyperparameters are simply repeated over all of the rows.
```{r, ex_func_02}
make_beta_pdf(shape_1 = 2, shape_2 = 3, vector_length = 3) %>%
glimpse()
```
Next, we will evaluate the Beta density function for several sets of hyperparameters. We could use a for-loop to accomplish this task, but instead I use the use the `purrr` package to perform the iterations for me. You may not be familar with `purrr` or functional programming in general. We will cover `purrr` later in this course. For now, it's ok to view the use of `purrr` as managing the for-loop for us.
Before iterating though, we have to specify what we will iterate over. The code chunk below creates a `tibble` which stores three sets of hyperparameters, named `hyper_params`.
```{r, ex_dbeta_05}
hyper_params <- tibble::tibble(
a = c(2, 1.5, 8),
b = c(3, 4.5, 4)
)
```
Now iterate over the three different hyperparameter sets using the `purrr::map2_dfr()` function. Store the result as `beta_results`.
```{r, ex_dbeta_06}
beta_results <- purrr::map2_dfr(hyper_params$a,
hyper_params$b,
make_beta_pdf,
vector_length = 201)
```
Pipe `beta_results` into the `glimpse()` function to get an idea about the structure of the dataset. As shown below, there are `r nrow(beta_results)` rows, which equals three times the specified value of the `vector_length` argument. Thus, the `purrr::map2_dfr()` function "stacked" the three generated `tibble`s on top of each other.
```{r, ex_dbeta_07}
beta_results %>% glimpse()
```
We can now plot each of the three separate Beta distributions by specifying the facets based on the combination of the `a` and `b` variables in the `beta_results` object.
```{r, ex_dbeta_08}
beta_results %>%
ggplot(mapping = aes(x = x, y = beta_pdf)) +
geom_line(mapping = aes(group = interaction(a, b)),
size = 1.15) +
facet_grid(. ~ a + b, labeller = "label_both") +
theme_bw()
```
### CDF
As we did with the Gaussian distribution, let's now relate the Beta density function with its Cumulative Distribution Function (CDF). We will evaluate the Beta CDF with the `pbeta()` function in much the same way that we evaluated the `pnorm()` function. To facilitate comparing the three different sets of hyperparameters, we will create a function `make_beta_cdf()` which calls `pbeta()` at the same conditions we evaluated `dbeta()` with. First, create the function:
```{r, ex_func_03}
make_beta_cdf <- function(shape_1, shape_2, vector_length)
{
tibble::tibble(
x = seq(0, 1, length.out = vector_length),
beta_cdf = pbeta(q = x,
shape1 = shape_1,
shape2 = shape_2)
) %>%
mutate(a = shape_1,
b = shape_2)
}
```
Next, iterate over the same set of Beta hyperparameters we used before and save the results to `beta_integrate`.
```{r, ex_dbeta_09}
beta_integrate <- purrr::map2_dfr(hyper_params$a,
hyper_params$b,
make_beta_cdf,
vector_length = 201)
```
We need to merge or join the `beta_results` and `beta_integrate` datasets in order to plot the Beta density functions and CDFs as subplots of a single graphic. Although we have not walked through joining datasets before, the operation is straightforward to do in this case. We will use the `left_join()` function from `dplyr` and instruct the function to join `by`: `x`, `a`, and `b`. The merged dataset is then reshaped to long-format using `pivot_longer()`. Finally the reshaped dataset is passed into `ggplot()` to give the CDF on the top row of subplots, and the density functions on the bottom row.
```{r, ex_dbeta_10}
beta_results %>%
left_join(beta_integrate,
by = c("x", "a", "b")) %>%
pivot_longer(starts_with('beta_')) %>%
ggplot(mapping = aes(x = x, y = value)) +
geom_line(mapping = aes(group = interaction(name, a, b)),
size = 1.15) +
facet_grid(name ~ a + b,
labeller = "label_both",
scales = "free_y") +
theme_bw()
```
We can walk through the figure above, just as we walked through the Gaussian CDFs and PDFs. A key difference between the Beta and Gaussian though is the Beta is bounded between 0 and 1. Thus, the integration of the Beta density function **starts** at $x = 0$, not at $-\infty$. The Beta density is integrated to $x = 1$ and not $+\infty$ because by definition the Beta distribution "stops" at $x = 1$. Thus, at $x=0$ the Beta CDF equals 0 and at $x=1$ the Beta CDF equals 1.
Comparing the horizontal subplots allows us to compare how the rate of increase in the accumulated probability mass based on the density shape. In the right column of subplots, the density is roughly at zero until $x \approx 0.25$. The CDF therefore remains at essentially zero since there is so little mass being accumulated integrating over $x=0$ up to $x \approx 0.25$. Moving to the middle and then finally the left column of subplots, the density increases closer and closer to zero. This causes the probability mass to accumulate at a faster rate integrating from $x=0$ to $x \approx 0.25$. That is why the slope of the CDF of the left most distribution is higher than the slope of the middle and right distributions.