# Section 20. Notes on 'Ch 22. Finite mixture models'
2022-01-18
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, dpi = 300, comment = "#>")
```
> These are just notes on a single chapter of *BDA3* that were not part of the course.
## Chapter 22. Finite mixture models
- "when measurements of a random variable are taken under two different conditions" (pg. 519)
- where the data contain multiple subpopulations, each with a different, relatively simple model
- basic mixture modeling principle is to introduce *unobserved indicators* $z$ to specify the mixture component for an observation
- can think of a mixture indicator as missing data
### 22.1 Setting up and interpreting mixture models {-}
#### Finite mixtures {-}
- want to model the distribution of $y = (y_1, \dots, y_n)$ or $y|x$ as a mixture of $H$ components
- for each component $h \in (1, \dots, H)$, the distribution $f_h(y_i | \theta_h)$ depends on a parameter vector $\theta_h$
- $\lambda_h$ denotes the proportion of the population in component $h$
- $\sum_{h=1}^{H} \lambda_h = 1$
- common to assume all mixture components have the same parametric form
- thus, the sampling distribution of $y$ is:
$$
p(y_i | \theta, \lambda) = \lambda_1 f(y_i | \theta_1) + \dots + \lambda_H f(y_i | \theta_H)
$$
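
Below is a minimal R sketch (not from the book) that evaluates this marginal mixture density for $H = 3$ normal components; the weights, means, and standard deviations are made-up illustration values.

```{r mixture-density-sketch}
# Made-up illustration values for an H = 3 normal mixture.
lambda <- c(0.5, 0.3, 0.2)   # mixture proportions, sum to 1
mu     <- c(-2, 0, 3)        # component means
sigma  <- c(1, 0.5, 1.5)     # component standard deviations

# p(y | theta, lambda) = sum_h lambda_h * Normal(y | mu_h, sigma_h)
mixture_density <- function(y) {
  rowSums(sapply(seq_along(lambda),
                 function(h) lambda[h] * dnorm(y, mu[h], sigma[h])))
}

curve(mixture_density(x), from = -6, to = 8, n = 500,
      xlab = "y", ylab = "p(y | theta, lambda)")
```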
- can think of the mixture distribution probabilities $\lambda$ as priors over the parameters $\theta_h$
- or as a description of the variation in $\theta$ in a population
- akin to a hierarchical model
- introduce the indicator variables $z_{ih}$ where $z_{ih} = 1$ if the $i$th data point is drawn from component $h$ and 0 otherwise
- the $\lambda$ values are used to determine $z$
- can think of $\lambda$ as a hyperprior over $z$
- the joint distribution of the observed data $y$ and the unobserved indicators $z$, conditional on the model parameters, is:
- only one $z_{ih}$ can be 1 for each $i$
$$
\begin{aligned}
p(y, z | \theta, \lambda) &= p(z | \lambda) p(y | z, \theta) \\
&= \prod_{i=1}^n \prod_{h=1}^H (\lambda_h f(y_i | \theta_h))^{z_{i,h}}
\end{aligned}
$$
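
A small follow-on sketch (again not from the book, reusing `lambda`, `mu`, and `sigma` from the previous chunk) simulates $(z, y)$ from this joint distribution: draw each indicator from a multinomial with probabilities $\lambda$, then draw $y_i$ from the indicated normal component.

```{r mixture-indicator-sketch}
set.seed(1)
n <- 1000
# z_i ~ Multinomial(1, lambda); stored as a component index rather than
# the full indicator vector (z_{i,1}, ..., z_{i,H})
z <- sample(seq_along(lambda), size = n, replace = TRUE, prob = lambda)
# y_i | z_i = h  ~  Normal(mu_h, sigma_h)
y_sim <- rnorm(n, mean = mu[z], sd = sigma[z])
hist(y_sim, breaks = 50, freq = FALSE,
     main = "Simulated draws from the finite mixture")
```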
#### Continuous mixtures {-}
- generalize the finite mixture by allowing the probability that an observation belongs to a class to vary continuously
- hierarchical models are a form of continuous mixture model
- each observed value $y_i$ is modeled as coming from a mixture of models defined by the probability of values for $\theta$
- in the book, the focus is on finite mixtures and "minor modifications" are generally required to form a continuous distribution
#### Identifiability of the mixture likelihood {-}
- **all finite mixture models are nonidentifiable: the distribution is unchanged if the group labels are permuted**
- in many cases, purposeful, informative priors can solve the issue
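
A quick numerical check of the label-switching problem (my own sketch, not from the book, reusing the illustration values from the earlier chunk): permuting the component labels leaves the mixture density, and hence the likelihood, unchanged.

```{r label-switching-sketch}
# Mixture density written as an explicit function of its parameters so the
# component labels can be permuted.
dens <- function(y, lam, m, s) {
  rowSums(mapply(function(l, mm, ss) l * dnorm(y, mm, ss), lam, m, s))
}

perm <- c(3, 1, 2)                      # one of the H! = 6 relabelings
y_grid <- seq(-6, 8, length.out = 200)

# TRUE: the density (and so the likelihood) is identical under relabeling.
all.equal(dens(y_grid, lambda, mu, sigma),
          dens(y_grid, lambda[perm], mu[perm], sigma[perm]))
```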
#### Prior distributions {-}
- the prior for a finite mixture model's parameters $\theta$ and $\lambda$ is usually the product of independent priors on $\theta$ and $\lambda$
- because the vector of mixture indicators $z_i = (z_{i,1}, \dots, z_{i,H})$ is multinomial with parameter $\lambda$, a common prior for $\lambda$ is the Dirichlet
- $\lambda \sim \text{Dirichlet}(\alpha_1, \dots, \alpha_H)$
- $\theta = (\theta_1, \dots, \theta_H)$ is the vector of parameters for each component's sub-model
- some can be shared across components (e.g., equal variance for a group of normal distributions)
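
As a small base-R sketch (not from the book), $\lambda$ can be drawn from a Dirichlet prior by normalizing independent gamma draws; the $\alpha$ values below are made-up.

```{r dirichlet-prior-sketch}
# Dirichlet(alpha_1, ..., alpha_H) draw via the normalized-gamma construction.
rdirichlet_one <- function(alpha) {
  g <- rgamma(length(alpha), shape = alpha, rate = 1)
  g / sum(g)
}

set.seed(2)
alpha <- c(1, 1, 1)                       # symmetric prior over H = 3 components
(lambda_draw <- rdirichlet_one(alpha))    # nonnegative and sums to 1
```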
#### Number of mixture components {-}
- can model $H$ as unknown but is computationally expensive
- usually can just build models with different $H$ and compare their goodness of fit
- compare the posterior predictive distributions with a "suitably chosen" test quantity
#### Mixtures as true models or approximating distributions {-}
- two classes of thought^[I came up with the names of the two schools of thought as descriptive titles; they do not appear in the book.]:
1. *theoretical:* a mixture model is "a realistic characterization of the true data-generating mechanism" (pg. 522)
2. *pragmatic:* "trying to infer latent subpopulations is an intrinsically ill-defined statistical problem, but finite mixture models are nonetheless useful" (pg. 523)
### 22.4 Unspecified number of mixture components {-}
- can assign a Poisson distribution as a prior on $H$ (the number of groups/components in the mixture model)
- computationally intensive
- more common to just fit the model with different $H$ and compare with some statistic and a penalty for model complexity
- WAIC is theoretically justified, but ignores the uncertainty over $H$
- LOO-CV may be even better
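
To make the comparison concrete, here is a rough, illustrative sketch (not from the book) of comparing an $H = 1$ and an $H = 2$ model with PSIS-LOO via the `loo` package. In practice the pointwise log-likelihood matrices would be extracted from real MCMC fits; the "posterior draws" below are crude jittered stand-ins just so the chunk runs on its own.

```{r loo-comparison-sketch}
# Illustrative only: the "posterior draws" are stand-ins for real MCMC output.
library(loo)

set.seed(3)
n_obs <- 200
y_obs <- c(rnorm(n_obs / 2, -2, 1), rnorm(n_obs / 2, 2, 1))  # toy data, true H = 2
S <- 400                                                     # number of fake posterior draws

# Stand-in draws for an H = 1 normal model and an H = 2 mixture with known sds.
mu_h1  <- rnorm(S, mean(y_obs), 0.1)
mu_h2a <- rnorm(S, -2, 0.1)
mu_h2b <- rnorm(S, 2, 0.1)

# S x n pointwise log-likelihood matrices, one column per observation.
log_lik_h1 <- sapply(y_obs, function(yi) dnorm(yi, mu_h1, sd(y_obs), log = TRUE))
log_lik_h2 <- sapply(y_obs, function(yi)
  log(0.5 * dnorm(yi, mu_h2a, 1) + 0.5 * dnorm(yi, mu_h2b, 1)))

# PSIS-LOO for each model; the H = 2 mixture should come out on top here.
loo_compare(loo(log_lik_h1), loo(log_lik_h2))
```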