forked from lauken13/Beginners_Bayes_Workshop
-
Notifications
You must be signed in to change notification settings - Fork 0
/
stancon2019-intro_nonHier.Rmd
662 lines (510 loc) · 21.2 KB
/
stancon2019-intro_nonHier.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
---
title: "StanCon 2019"
author: "Cambridge, UK"
date: "August 20-23"
output:
html_document:
toc: true
toc_depth: 2
---
## Setup
```{r setup, results="hide", message=FALSE}
knitr::opts_chunk$set(
echo = TRUE,
dev = "png",
dpi = 150,
fig.align = "center",
comment = NA
)
library(rstan)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)
theme_set(bayesplot::theme_default())
# seed for R's pseudo-RNGs, not Stan's
set.seed(1123)
# load data
pest_data <- readRDS('data/pest_data.RDS')
standata_hier <- readRDS('data/standata_hier.RDS')
```
## The problem
### Background
Imagine that you are a statistician or data scientist working as an independent
contractor. One of your clients is a company that owns many residential buildings
throughout New York City. The property manager explains that they are concerned about the number
of cockroach complaints that they receive from their buildings. Previously
the company has offered monthly visits from a pest inspector as a solution to
this problem. While this is the default solution of many property managers in
NYC, the tenants are rarely home when the inspector visits, and so the manager
reasons that this is a relatively expensive solution that is currently not very
effective.
One alternative to this problem is to deploy long term bait stations. In this
alternative, child and pet safe bait stations are installed throughout the
apartment building. Cockroaches obtain quick acting poison from these stations
and distribute it throughout the colony. The manufacturer of these bait stations
provides some indication of the space-to-bait efficacy, but the manager suspects
that this guidance was not calculated with NYC roaches in mind. NYC roaches, the
manager rationalizes, have more hustle than traditional roaches; and NYC
buildings are built differently than other common residential buildings in the
US. This is particularly important as the uni§t cost for each bait station per
year is quite high.
### The goal
The manager wishes to employ your services to help them to find the optimal
number of roach bait stations they should place in each of their buildings in
order to minimize the number of cockroach complaints while also keeping
expenditure on pest control affordable.
A subset of the company's buildings have been randomly selected for an experiment:
* At the beginning of each month, a pest inspector randomly places a number of
bait stations throughout the building, without knowledge of the current
cockroach levels in the building
* At the end of the month, the manager records
the total number of cockroach complaints in that building.
* The manager would like to determine the optimal number of traps ($\textrm{traps}$) that
balances the lost revenue ($R$) that complaints ($\textrm{complaints}$) generate
with the all-in cost of maintaining the traps ($\textrm{TC}$).
Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.
Formally, we are interested in finding
$$
\arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[R(\textrm{complaints}(\textrm{traps})) - \textrm{TC}(\textrm{traps})]
$$
The property manager would also, if possible, like to learn how these results
generalize to buildings they haven't treated so they can understand the
potential costs of pest control at buildings they are acquiring as well as for
the rest of their building portfolio.
As the property manager has complete control over the number of traps set, the
random variable contributing to this expectation is the number of complaints
given the number of traps. We will model the number of complaints as a function
of the number of traps.
## The data
The data provided to us is in a file called `pest_data.RDS`. Let's
load the data and see what the structure is:
```{r load-data}
pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
```
We have access to the following fields:
* `complaints`: Number of complaints per building per month
* `building_id`: The unique building identifier
* `traps`: The number of traps used per month per building
* `date`: The date at which the number of complaints are recorded
* `live_in_super`: An indicator for whether the building as a live-in super
* `age_of_building`: The age of the building
* `total_sq_foot`: The total square footage of the building
* `average_tenant_age`: The average age of the tenants per building
* `monthly_average_rent`: The average monthly rent per building
* `floors`: The number of floors per building
First, let's see how many buildings we have in the data:
```{r describe-data}
length(unique(pest_data$building_id))
```
And make some plots of the raw data:
```{r data-plot-1}
ggplot(pest_data, aes(x = complaints)) +
geom_bar()
```
```{r data-plot-2}
ggplot(pest_data, aes(x = traps, y = complaints, color = live_in_super == TRUE)) +
geom_jitter()
```
```{r data-plot-ts, fig.width = 6, fig.height = 8}
ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) +
geom_line(aes(linetype = "Number of complaints")) +
geom_point(color = "black") +
geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) +
facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) +
scale_x_date(name = "Month", date_labels = "%b") +
scale_y_continuous(name = "", limits = range(pest_data$complaints)) +
scale_linetype_discrete(name = "") +
scale_color_discrete(name = "Live-in super")
```
The first question we'll look at is just whether the number of complaints per
building per month is associated with the number of bait stations per building
per month, ignoring the temporal and across-building variation (we'll get to
that later). This requires only two variables, $\textrm{complaints}$ and
$\textrm{traps}$. How should we model the number of complaints?
## Modeling count data : Poisson distribution
We already know some rudimentary information about what we should expect. The
number of complaints over a month should be either zero or an integer. The
property manager tells us that it is possible but unlikely that number of
complaints in a given month is zero. Occasionally there are a very large number
of complaints in a single month. A common way of modeling this sort of skewed,
single bounded count data is as a Poisson random variable. One concern about
modeling the outcome variable as Poisson is that the data may be
over-dispersed, but we'll start with the Poisson model and then check
whether over-dispersion is a problem by comparing our model's predictions
to the data.
### Model
Given that we have chosen a Poisson regression, we define the likelihood to be
the Poisson probability mass function over the number bait stations placed in
the building, denoted below as `traps`. This model assumes that the mean and
variance of the outcome variable `complaints` (number of complaints) is the
same. We'll investigate whether this is a good assumption after we fit the
model.
For building $b = 1,\dots,10$ at time (month) $t = 1,\dots,12$, we have
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$
### Prior predictive checks
Before we fit the model to real data, we should check that our priors and model
can generate plausible simulated data.
In R we can simulate from the prior predictive distribution using a
function like this:
```{r simulate-poisson-data}
# using normal distributions for priors on alpha and beta
simple_poisson_dpg <- function(traps, alpha_mean, alpha_sd, beta_mean, beta_sd) {
N <- length(traps)
alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd);
beta <- rnorm(1, mean = beta_mean, sd = beta_sd);
complaints <- rpois(N, lambda = exp(alpha + beta * traps))
return(complaints)
}
```
Try with $N(0,10)$ priors on both parameters:
```{r prior-pred-really-bad}
# you can run this chunk multiple times to keep generating different datasets
simple_poisson_dpg(
pest_data$traps,
alpha_mean = 0,
alpha_sd = 10,
beta_mean = 0,
beta_sd = 10
)
```
Try with $N(0,1)$ priors on both parameters:
```{r prior-pred-bad}
simple_poisson_dpg(
pest_data$traps,
alpha_mean = 0,
alpha_sd = 1,
beta_mean = 0,
beta_sd = 1
)
```
Let's calculate some summary statistics of 1000 data sets generated according
to the $N(0,1)$ priors:
```{r prior-pred-check}
prior_preds <- t(replicate(1000, expr = {
simple_poisson_dpg(
traps = pest_data$traps,
alpha_mean = 0,
alpha_sd = 1,
beta_mean = 0,
beta_sd = 1
)
}))
dim(prior_preds)
```
```{r prior-pred-mean}
summary(apply(prior_preds, 1, mean))
```
```{r prior-pred-min}
summary(apply(prior_preds, 1, min))
```
```{r prior-pred-max}
summary(apply(prior_preds, 1, max))
```
A more reasonable (though still quite flexible) prior:
```{r}
simple_poisson_dpg(
traps = pest_data$traps,
alpha_mean = log(7),
alpha_sd = 0.5,
beta_mean = -0.25,
beta_sd = 0.5
)
```
Simulate 1000 times and calculate summary stats:
```{r prior-pred-better}
prior_preds <- t(replicate(10000, expr = {
simple_poisson_dpg(
traps = pest_data$traps,
alpha_mean = log(7),
alpha_sd = 0.5,
beta_mean = -0.25,
beta_sd = 0.5
)
}))
```
```{r}
summary(apply(prior_preds, 1, mean))
summary(apply(prior_preds, 1, min))
summary(apply(prior_preds, 1, max))
```
### Writing and fitting our first Stan program
Let's encode the model in a Stan program.
* Write `simple_poisson_regression.stan` together, put in `stan_programs` directory.
```{r compile-simple-poisson}
comp_simple <- stan_model('stan_programs/simple_poisson_regression.stan')
```
To fit the model to the data we'll first create a list to pass
to Stan using the variables in the `pest_data` data frame. The names of the
list elements must match the names used in the `data` block of the Stan program.
```{r stan-data}
standata_simple <- list(
N = nrow(pest_data),
complaints = pest_data$complaints,
traps = pest_data$traps,
alpha_mean = log(7),
alpha_sd = log(2),
beta_mean = log(2),
beta_sd = log(2)
)
str(standata_simple)
```
We have already compiled the model, so we can jump straight to sampling from it.
```{r fit_simple, cache=TRUE}
fit_simple <- sampling(comp_simple, data = standata_simple,
# these are the defaults but specifying them anyway
# so you can see how to use them:
# posterior sample size = chains * (iter-warmup)
chains = 4, iter = 2000, warmup = 1000, cores=4)
```
Print the summary of the intercept and slope parameters:
```{r results_simple_P}
print(fit_simple, pars = c('alpha','beta'))
```
We can also plot the posterior distributions:
```{r mcmc_hist}
# https://mc-stan.org/bayesplot/reference/MCMC-distributions
draws <- as.matrix(fit_simple, pars = c('alpha','beta'))
mcmc_hist(draws) # marginal posteriors of alpha and beta
mcmc_scatter(draws, alpha = 0.2, size = 1) # posterior of (alpha,beta)
```
And compare them to draws from the prior:
```{r compare-prior-posterior}
alpha_prior_post <- cbind(alpha_prior = rnorm(4000, log(7), 1),
alpha_posterior = draws[, "alpha"])
mcmc_hist(alpha_prior_post, facet_args = list(nrow = 2), binwidth = 0.1) +
xlim(range(alpha_prior_post))
beta_prior_post <- cbind(beta_prior = rnorm(4000, -0.25, 0.5),
beta_posterior = draws[, "beta"])
mcmc_hist(beta_prior_post, facet_args = list(nrow = 2), binwidth = 0.05) +
xlim(range(beta_prior_post))
```
From the posterior of `beta`, it appears the number of bait stations set in a
building is associated with the number of complaints about cockroaches that were
made in the following month. However, we still need to consider how well the
model fits.
### Posterior predictive checking
```{r y_rep_simple}
# see http://mc-stan.org/rstan/articles/stanfit_objects.html for various ways
# of extracting contents from stanfit objects
y_rep <- as.matrix(fit_simple, pars = "pred_complaints_rep")
```
```{r marginal_PPC}
# https://mc-stan.org/bayesplot/reference/PPC-distributions#plot-descriptions
ppc_dens_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])
```
In the plot above we have the kernel density estimate of the observed data ($y$,
thicker curve) and 200 simulated data sets ($y_{rep}$, thin curves) from the
posterior predictive distribution. Here the posterior predictive
simulations are not as dispersed as the real data and don't seem to capture the
rate of zeros well at all. This suggests the Poisson model may not be sufficiently
flexible for this data.
Let's explore this further by looking directly at the proportion of zeros in the
real data and predicted data.
```{r ppc_stat-prop_zero}
# calculate the proportion of zeros in a vector
prop_zero <- function(x) mean(x == 0)
# https://mc-stan.org/bayesplot/reference/PPC-test-statistics#plot-descriptions
ppc_stat(
y = standata_simple$complaints,
yrep = y_rep,
stat = "prop_zero",
binwidth = .01
)
```
The plot above shows the observed proportion of zeros (thick vertical line) and
a histogram of the proportion of zeros in each of the simulated data sets. It is
clear that the model does not capture this feature of the data well at all.
The rootogram is another useful plot to compare the observed vs expected number
of complaints. This is a plot of the square root of the expected counts
(continuous line) vs the observed counts (blue histogram)
```{r rootogram}
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_rootogram(standata_simple$complaints, yrep = y_rep)
```
If the model was fitting well these would be relatively similar, however in this
figure we can see the number of complaints is underestimated if there are few
complaints, over-estimated for medium numbers of complaints, and underestimated
if there are a large number of complaints.
The `ppc_bars()` function will make a bar plot of the observed values and
overlay prediction intervals but not on the square root scale (unlike the
rootogram):
```{r ppc_bars}
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_bars(standata_simple$complaints, yrep = y_rep)
```
We can also view how the predicted number of complaints varies with the number
of traps:
```{r intervals}
ppc_intervals(y = standata_simple$complaints, yrep = y_rep,
# jitter number of traps since multiple observations at some values
x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)) +
labs(x = "Number of traps", y = "Number of complaints")
```
## Expanding the model: multiple predictors
Currently, our model's mean parameter is a rate of complaints per 30 days, but
we're modeling a process that occurs over an area as well as over time. We have
the square footage of each building, so if we add that information into the
model, we can interpret our parameters as a rate of complaints per square foot
per 30 days.
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$
The term $\text{sq_foot}$ is called an exposure term. If we log the term, we can
put it in $\eta_{b,t}$:
$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b
\end{align*}
$$
We will also include whether there is a live-in super or not as a predictor
for the number of complaints, which gives us gives us:
$$
\eta_{b,t} = \alpha + \beta \, \textrm{traps}_{b,t} +
\beta_{\rm super} \textrm{live_in_super}_{b,t} +
\textrm{log_sq_foot}_b
$$
Add these new variables to the data list for Stan.
```{r add-predictors-to-standata}
standata_simple$log_sq_foot <- pest_data$log_sq_foot_1e4 # log(total_sq_foot/1e4)
standata_simple$live_in_super <- pest_data$live_in_super
standata_simple$beta_super_mean <- log(1)
standata_simple$beta_super_sd <- log(2)
```
### Stan program for Poisson multiple regression
Now we need a new Stan program that uses multiple predictors.
* Write `multiple_poisson_regression.stan`
```{r compile-multi-poisson}
comp_multi <- stan_model('stan_programs/multiple_poisson_regression.stan')
```
### Fit the Poisson multiple regression
```{r fit_multi}
fit_multi <- sampling(comp_multi, data = standata_simple,
refresh = 500) # turn off printed progress updates
print(fit_multi, pars = c("alpha", "beta", "beta_super"))
```
```{r ppc_dens_overlay-random-subset}
y_rep <- as.matrix(fit_multi, pars = "pred_complaints_rep")
ppc_dens_overlay(standata_simple$complaints, y_rep[1:200,])
# in this case we have very high effective sample sizes, but if there is
# nontrivial autocorrelation then it's better to take a random sample of the draws,
# for example:
ids <- sample(nrow(y_rep), size = 200)
ppc_dens_overlay(standata_simple$complaints, y_rep[ids, ])
```
This again looks like we haven't captured the smaller counts very well, nor
have we captured the larger counts.
We're still severely underestimating the proportion of zeros in the data:
```{r}
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = standata_simple$complaints, yrep = y_rep,
stat = "prop_zero", binwidth = 0.01)
```
Ideally this vertical line would fall somewhere within the histogram.
We can also plot uncertainty intervals for the predicted complaints for different
numbers of traps.
```{r}
ppc_intervals(
y = standata_simple$complaints,
yrep = y_rep,
x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) +
labs(x = "Number of traps", y = "Number of complaints")
```
We can see that we've increased the tails a bit more at the larger numbers of traps
but we still have some large observed numbers of complaints that the model
would consider extremely unlikely events.
## Modeling count data with the Negative Binomial
When we considered modeling the data using a Poisson, we saw that the model
didn't appear to fit as well to the data as we would like. In particular the
model underpredicted low and high numbers of complaints, and overpredicted the
medium number of complaints. This is one indication of over-dispersion, where
the variance is larger than the mean. A Poisson model doesn't fit over-dispersed
count data very well because the same parameter $\lambda$, controls both the
expected counts and the variance of these counts. The natural alternative to
this is the negative binomial model:
$$
\begin{align*}
\text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b}
\end{align*}
$$
In Stan the negative binomial mass function we'll use is called
$\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)$
in Stan. Like the `poisson_log` function, this negative binomial mass function
that is parameterized in terms of its log-mean, $\eta$, but it also has a
precision $\phi$ such that
$$
\mathbb{E}[y] \, = \lambda = \exp(\eta)
$$
$$
\text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi.
$$
As $\phi$ gets larger the term $\lambda^2 / \phi$ approaches zero and so
the variance of the negative-binomial approaches $\lambda$, i.e., the
negative-binomial gets closer and closer to the Poisson.
### Stan program for negative-binomial regression
```{r add-phi-predictors-to-standata}
standata_simple$phi_mean <- 0.5
standata_simple$phi_sd <- 0.5
```
* Write `multiple_NB_regression.stan` together
```{r compile-multi-NB, cache=TRUE, results="hide", message=FALSE}
comp_multi_NB <- stan_model('stan_programs/multiple_NB_regression.stan')
```
### Fit to data and check the fit
```{r fit_multi_NB}
fit_multi_NB <- sampling(comp_multi_NB, data = standata_simple, refresh = 250)
# to demonstrate extracting as a list instead of using as.matrix
samps_NB <- rstan::extract(fit_multi_NB)
```
Let's look at our predictions vs. the data.
```{r ppc_dens_overlay_NB}
ppc_dens_overlay(y = standata_simple$complaints, yrep = samps_NB$pred_complaints_rep[1:200,])
```
It appears that our model now captures both the number of small counts better
as well as the tails.
Let's check the proportion of zeros:
```{r prop_zero_NB}
ppc_stat(y = standata_simple$complaints, yrep = samps_NB$pred_complaints_rep,
stat = "prop_zero", binwidth = 0.01)
```
Check predictions by number of traps:
```{r}
ppc_intervals(
y = standata_simple$complaints,
yrep = samps_NB$pred_complaints_rep,
x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) +
labs(x = "Number of traps (jittered)", y = "Number of complaints")
```
We haven't used the fact that the data are clustered by building yet. A posterior
predictive check might elucidate whether it would be a good idea to add the building
information into the model.
```{r ppc-group-means, fig.width=7, fig.height=6}
ppc_stat_grouped(
y = standata_simple$complaints,
yrep = samps_NB$pred_complaints_rep,
group = pest_data$building_id,
stat = 'mean',
binwidth = 0.2
)
```
We're getting plausible predictions for most building means but some are
estimated better than others and some have larger uncertainties than we might
expect. If we explicitly model the variation across buildings we may be able to
get much better estimates.