-
Notifications
You must be signed in to change notification settings - Fork 1
/
Binomial_Model.qmd
256 lines (180 loc) · 12.3 KB
/
Binomial_Model.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
---
title: "Binomial Model"
author: "Adrien Osakwe"
format: html
editor: visual
---
## Binomial Model
Here I will explore the binomial model, where each random variable represents a success (1) or failure (0) of a trial. The usual example is coin tossing, which, unless an unfair coin is used, always has a probability of 0.5 for landing a head or tail. It is therefore more interesting to explore a problem such as the thumbtack toss, where the ground truth probability of success (landing with the point up) is less well defined.
We can assume *n* trials are done and count the total number of trials where the point lands up ($s_n$ ) to find an approximation of the ground truth for the probability of success $\theta$ . A feasible approach to estimating $\theta$ would be to divide the number of success $s_n$ by the number of trials $n$:
$$
\theta = \frac{s_n}{n}
$$
However, this estimation is only meaningful for large n and provides no estimate of the *uncertainty* of the measurement. Hence, the field of *statistical inference* becomes relevant as the latter can be achieved by creating a model for the problem.
#### Generate Data
To create the model, we first generate data by tossing the thumbtack say, 25 times ($n = 25$). We can then present the problem as a binomial model where:
$$
s_n \sim Bin(n,\theta)
$$
which takes the form:
$$
f(s_n;n,\theta) = \binom{n}{s_n}\theta^{s_n}(1-\theta)^{n-s_n}
$$
```{r}
#| echo: false
# Plotting Examples
n <- 25
sn <- 0:n
theta <- c(5,10,15,20)/n
plot(sn,dbinom(sn,n,theta[1]),col = 'red',type = 'b', ylab = 'P(SN = sn)')
lines(sn,dbinom(sn,n,theta[2]),col = 'blue',type = 'b')
lines(sn,dbinom(sn,n,theta[3]),col = 'green',type = 'b')
lines(sn,dbinom(sn,n,theta[4]),col = 'black',type = 'b')
legend('topleft',inset = .02,legend = c('0.2','0.4','0.6','0.8'), title = 'Theta Value',col = c('red','blue','green','black'),lwd =2 )
```
The above plot shows how the choice of $\theta$ affects the overall distribution. From here, we can try to infer the true value of theta using a *frequentist* or *Bayesian* approach.
### Maximum Likelihood
The frequentist approach relies on what is called the likelihood: the probability that the distribution has a certain parameter given the data we collected. From this equation we try to find the maximum likelihood. That is, we try to find the parameter that *maximizes* the likelihood for the data we collected. For our binomial model, we then have:
$$
\hat{\theta}(s_n) = \underset{\theta}{argmax}L(\theta;s_n)
$$
where
$$
L(\theta;s_n)
$$
is the likelihood function we are maximizing.
For convenience, the likelihood is often expressed as a *log-likelihood* which is usually easier to compute. The first step is to ignore the normalizing constant, as it does not depend on $\theta$ . We then have:
$$
l(\theta;s_n) = logL(\theta;s_n)
$$
$$
= logf(s_n;n,\theta) \propto log(\theta^{s_n}(1-\theta)^{n-s_n})
$$
$$
= s_nlog\theta + (n-s_n)log(1-\theta)
$$
Differentiating the equation helps us identify the extreme values:
$$
l'(\theta;s_n) = \frac{s_n}{\theta} - \frac{n-s_n}{1-\theta} = 0
$$
$$
= s_n-s_n\theta - n\theta + s_n\theta = s_n -n\theta
$$
$$
\theta = \frac{s_n}{n}
$$
```{r}
#| echo: false
## Checking if this is indeed the maximum
n <- 25
sn <- 15
thetas <- seq(0,1,0.1)
loglik_slope <- sn/thetas - (n-sn)/(1-thetas)
plot(thetas,loglik_slope, xlab = 'Theta Value',ylab = 'Log-Likelihood Slope')
lines(thetas,rep(0,length(thetas)), col = 'red')
```
We can also use the second derivative which is always negative, indicating that the log-likelihood is concave and that the extreme value is a global maximum.
Although this approach gives us an estimate of the model parameters, there is no indication of our *confidence* in the result and on the likelihood of other parameter values. This is due to the frequentist approach treating $\theta$ as a constant as opposed to an unknown random variable.
This is circumvented by calculating the maximum likelihood estimates from all observable data sets. In this case, we are observing the random variable $S_n$ as opposed to the constant $s_n$ and have the following \*maximum likelihood estimate:
$$ \hat{\theta}(Y) = \frac{Y}{n}$$
We can then estimate the standard error and construct confidence intervals for our parameter values. Hypothesis testing is also a possibility.
*Note: fill out an example showing how to get the above.*
### Bayesian Model
In the Bayesian framework, we treat the distribution parameters as random variables. We are first interested in knowing what the probability of the thumbtack landing upwards is *prior* to tossing any. This prior essentially defines the distribution from which our parameter $\theta$ is sampled from. For example, we can set this to be a beta distribution:
$$
\theta \sim Beta(\alpha,\beta)
$$
We can also formulate the *sampling distribution* which is the distribution of the data *given* the parameter. It is denoted as follows:
$$
f_{S_n|\Theta}(s_n | \theta)
$$ This notation emphasizes that the sampling distribution is a *conditional* probability distribution. We therefore have a model composed of the following prior and sampling distributions: $$
S_n|\theta \sim Bin(n,\theta)
$$
$$
\theta \sim Beta(\alpha,\beta)
$$ With this tools, we can now focus on updating our parameter estimates based on observed data. This step is what has us compute the so-called *posterior* distribution of the parameter. We start with:
$$
f_{\theta,\S_n}(\theta,s_n) = f_{\theta|S_n}(\theta|s_n)f_{S_n}(s_n)
$$ $$
f_{\theta|S_n}(\theta|s_n) = f_{\theta,\S_n}(\theta,s_n)/f_{S_n}(s_n)
$$ $$
f_{\theta|S_n}(\theta|s_n) = \frac{f_{\theta}(\theta)f_{\S_n|\theta}(s_n|\theta)}{f_{S_n}(s_n)}
$$ As the denominator on the RHS has no dependence on $\theta$, it can be ignored and the posterior can be seen as proportional to the joint distribution (the numerator):
$$
f_{\theta|S_n}(\theta|s_n) \propto f_{\theta}(\theta)f_{\S_n|\theta}(s_n|\theta) = \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\Gamma(\beta)}\theta^{\alpha - 1}(1-\theta)^{\beta - 1} * \binom{n}{s_n}\theta^{s_n}(1-\theta)^{n-s_n}
$$ $$
\propto \theta^{s_n + \alpha_0 - 1}(1-\theta)^{n-s_n + \beta_0 - 1}
$$
This derivation is similar in form to the beta distribution given the following, new representation of the posterior:
$$
\Theta|S_n \sim Beta(s_n + \alpha_0, n - s_n + \beta_0)
$$ We can now compare the prior and posterior:
```{r}
n <- 25
sn <- 15
a <- 2
b <- 3
a_n <- a + sn
b_n <- n - sn + b
thetas <- seq(0,1,0.01)
prior <- dbeta(thetas,a,b)
posterior <- dbeta(thetas,a_n,b_n)
plot(thetas,prior, xlab = 'Theta Value',ylab = 'f(Theta)',
col = 'blue',ylim = c(0,max(posterior)),
type = 'l')
lines(thetas,posterior, col = 'red')
legend('topleft',inset = 0.02,legend = c('Beta(a,b)','Beta(a_n,b_n)'),
col = c('blue','red'),
lwd = 2)
```
This reveals that the true value of theta (given the observed trials) is most likely between \~ 0.4 and 0.8.
```{r}
pbeta(0.8,a_n,b_n) - pbeta(0.4,a_n,b_n)
```
We can also calculate the *posterior mean* by finding the mean of the beta distribution we derived for the posterior
$$
E(\theta|S_n = s_n) = \frac{\alpha_n}{\alpha_n + \beta_n} = \frac{s_n + \alpha_0}{\alpha_0 + n + \beta_0} = \frac{17}{30}
$$
### Components of Bayesian Inference
We can now generalize the example to highlight the key concepts for Bayesian inference.
We work on a problem with *observed* data coming from *n* observations which we can then use to create and update a model representing the *generative process* behind the data. Here, we will focus on parametric inference, where the observed data helps us better estimate the model parameters.
#### Sampling Distribution and the Likelihood function
The conditional distribution of the data given the parameter values, $$f_{S_n|\Theta}(s_n|\theta)$$ , can be referred to as the sampling distribution or the likelihood function. The be exact, the sampling distribution presents the above as a function of the observed data, whereas the likelihood presents it as a function of the model parameters.
If we assume that our data is independent, given $$\theta$$, we can present the joint sampling distribution as a product of the sampling distributions for each random variable or trial:
$$
f_{S_N|\Theta}(s|\theta) = \prod_{i =1}^{N}f_{S_i|\Theta}(s_i|\theta)
$$
Which becomes even simpler if the observations ar i.i.d (not only independent, but coming from the same distribution):
$$
f_{S_N|\Theta}(s|\theta) = \prod_{i =1}^{N}f(s_i|\theta)
$$
Many different models can be used for the sampling distribution, but it is usually recommended to use models whose behavior best describes the observed phenomena (e.g. our thumbtack example is a clear success or failure scenario, which makes the binomial distribution suitable).
#### Prior distribution
The prior distribution represents the distribution from which the parameter values of the sampling distribution are generated. It represents our beliefs in the parameter values *prior* to observing any data.
In situations where there is little confidence in the possible values for the parameter, it is best to just a vague prior distribution to reduce its influence on the model and put a greater emphasis on the observed data (*uninformative prior*). We could also have an *informative prior* which has a greater influence on the model which can enforce sparsity on values we believe should be zero. We can also call the prior distribution a parametric distribution, which depends on *hyperparameters* (in the thumbtack case, these would be $\alpha$ and $\beta$ ).
**Posterior distribution**
For the posterior, we make use of Bayes' theorem to formulate it. The normalizing constant, which has no dependence on the parameter can be treated as a constant w.r.t. the parameter values and is often ignored. We can therefore compute the posterior as being *proportional* to the joint distribution.
#### Marginal Likelihood
The normalizing constant is also described as the *marginal likelihood*, as it represents the joint probability of the observed data after marginalizing out the parameter \$\\theta\$. This presents itself as an integral in continuous cases and as a sum in discrete cases (for the parameter). This can also be seen as the expectation of the sampling distribution, where we average the values over the full sample space for $\theta$ .
### Prediction
Beyond estimating the true parameter values, we are also interested in predicting the unobserved data (n + m). The simplest way to achieve this would be to use the MLE for the model parameters and predict the new data. This however, can lead to estimates that are biased if the observed data used for the MLE came from a small sample size (e.g.: only three tosses and all fell down). We can instead go the Bayesian route and compute a probability for the new observations *given* the observed data
$$
f_{\tilde{S}|S}(\tilde{s}|s)
$$
The parameter values aren't present in the function, but it is important to realize that they are need to derive the predictive distribution. In essence, the posterior predictive distribution is calculated as the product of the sampling distribution for the new data and the *posterior* distribution from the observed data, which acts as un *updated prior distribution*. In essence, the same way we updated $\alpha$ and $\beta$ for the observed data, we do it again by updating their updated values.
$$
f_{\tilde{S}|S_n}(\tilde{s}|s_n) = \int_{\Theta}
f_{\tilde{S}|\Theta}(\tilde{s}|\theta)f_{\theta|S_n}(\theta|s_n)
d\theta
$$
$$
f_{\tilde{S}|S_n}(\tilde{s}|s_n) = \int_{\Theta}\binom{m}{\tilde{s}}
\theta^{\tilde{s}}(1-\theta)^{m - \tilde{s}}
\frac{1}{B(\alpha_n,\beta_n)}
\theta^{\alpha_n - 1}(1-\theta)^{\beta_n - 1}d\theta
$$
Removing constants and recognizing the new formulation of a beta function, as in the earlier example, allows us to define the new posterior predictive function given the updated prior:
$$
f_{\tilde{S}|S_n}(\tilde{s}|s_n) = \binom{m}{\tilde{s}}\frac{B(\tilde{s} + \alpha_n,m + \beta_n - \tilde{s})}{B(\alpha_n,\beta_n)}
$$
Which represents the **beta-binomial distribution**. With this formulation we can therefore iteratively update our prior distribution as we collect more samples, leading to a progressively more accurate posterior predictive distribution.