stancon2019-intro_nonHier.Rmd

---
title: "StanCon 2019"
author: "Cambridge, UK"
date: "August 20-23"
output:
  html_document:
    toc: true
    toc_depth: 2
---

## Setup

```{r setup, results="hide", message=FALSE}
knitr::opts_chunk$set(
  echo = TRUE, 
  dev = "png",
  dpi = 150,
  fig.align = "center",
  comment = NA
)
library(rstan)
library(dplyr)
library(lubridate)
library(ggplot2)
library(bayesplot)

theme_set(bayesplot::theme_default())

# seed for R's pseudo-RNGs, not Stan's
set.seed(1123) 

# load data
pest_data <- readRDS('data/pest_data.RDS')
standata_hier <- readRDS('data/standata_hier.RDS')
```

## The problem

### Background

Imagine that you are a statistician or data scientist working as an independent
contractor. One of your clients is a company that owns many residential buildings 
throughout New York City. The property manager explains that they are concerned about the number
of cockroach complaints that they receive from their buildings. Previously
the company has offered monthly visits from a pest inspector as a solution to
this problem. While this is the default solution of many property managers in
NYC, the tenants are rarely home when the inspector visits, and so the manager
reasons that this is a relatively expensive solution that is currently not very
effective.

One alternative to this problem is to deploy long term bait stations. In this
alternative, child and pet safe bait stations are installed throughout the
apartment building. Cockroaches obtain quick acting poison from these stations
and distribute it throughout the colony. The manufacturer of these bait stations
provides some indication of the space-to-bait efficacy, but the manager suspects
that this guidance was not calculated with NYC roaches in mind. NYC roaches, the
manager rationalizes, have more hustle than traditional roaches; and NYC
buildings are built differently than other common residential buildings in the
US. This is particularly important as the uni§t cost for each bait station per
year is quite high.

### The goal

The manager wishes to employ your services to help them to find the optimal
number of roach bait stations they should place in each of their buildings in
order to minimize the number of cockroach complaints while also keeping
expenditure on pest control affordable. 

A subset of the company's buildings have been randomly selected for an experiment: 

* At the beginning of each month, a pest inspector randomly places a number of
bait stations throughout the building, without knowledge of the current
cockroach levels in the building
* At the end of the month, the manager records
the total number of cockroach complaints in that building. 
* The manager would like to determine the optimal number of traps ($\textrm{traps}$) that
balances the lost revenue ($R$) that complaints ($\textrm{complaints}$) generate
with the all-in cost of maintaining the traps ($\textrm{TC}$). 

Fortunately, Bayesian data analysis provides a coherent framework for us to tackle this problem.

Formally, we are interested in finding

$$
\arg\max_{\textrm{traps} \in \mathbb{N}} \mathbb{E}_{\text{complaints}}[R(\textrm{complaints}(\textrm{traps})) - \textrm{TC}(\textrm{traps})]
$$

The property manager would also, if possible, like to learn how these results 
generalize to buildings they haven't treated so they can understand the
potential costs of pest control at buildings they are acquiring as well as for
the rest of their building portfolio.

As the property manager has complete control over the number of traps set, the
random variable contributing to this expectation is the number of complaints
given the number of traps. We will model the number of complaints as a function
of the number of traps.


## The data

The data provided to us is in a file called `pest_data.RDS`. Let's
load the data and see what the structure is:

```{r load-data}
pest_data <- readRDS('data/pest_data.RDS')
str(pest_data)
```

We have access to the following fields: 

* `complaints`: Number of complaints per building per month
* `building_id`: The unique building identifier
* `traps`: The number of traps used per month per building
* `date`: The date at which the number of complaints are recorded
* `live_in_super`: An indicator for whether the building as a live-in super
* `age_of_building`: The age of the building
* `total_sq_foot`: The total square footage of the building
* `average_tenant_age`: The average age of the tenants per building
* `monthly_average_rent`: The average monthly rent per building
* `floors`: The number of floors per building

First, let's see how many buildings we have in the data:

```{r describe-data}
length(unique(pest_data$building_id))
```

And make some plots of the raw data: 

```{r data-plot-1}
ggplot(pest_data, aes(x = complaints)) + 
  geom_bar()
```

```{r data-plot-2}
ggplot(pest_data, aes(x = traps, y = complaints, color = live_in_super == TRUE)) + 
  geom_jitter()
```

```{r data-plot-ts, fig.width = 6, fig.height = 8}
ggplot(pest_data, aes(x = date, y = complaints, color = live_in_super == TRUE)) + 
  geom_line(aes(linetype = "Number of complaints")) + 
  geom_point(color = "black") + 
  geom_line(aes(y = traps, linetype = "Number of traps"), color = "black", size = 0.25) + 
  facet_wrap(~building_id, scales = "free", ncol = 2, labeller = label_both) + 
  scale_x_date(name = "Month", date_labels = "%b") + 
  scale_y_continuous(name = "", limits = range(pest_data$complaints)) + 
  scale_linetype_discrete(name = "") + 
  scale_color_discrete(name = "Live-in super")
```


The first question we'll look at is just whether the number of complaints per
building per month is associated with the number of bait stations per building
per month, ignoring the temporal and across-building variation (we'll get to
that later). This requires only two variables, $\textrm{complaints}$ and
$\textrm{traps}$. How should we model the number of complaints?


## Modeling count data : Poisson distribution

We already know some rudimentary information about what we should expect. The
number of complaints over a month should be either zero or an integer. The
property manager tells us that it is possible but unlikely that number of
complaints in a given month is zero. Occasionally there are a very large number
of complaints in a single month. A common way of modeling this sort of skewed,
single bounded count data is as a Poisson random variable. One concern about
modeling the outcome variable as Poisson is that the data may be
over-dispersed, but we'll start with the Poisson model and then check 
whether over-dispersion is a problem by comparing our model's predictions 
to the data.

### Model 

Given that we have chosen a Poisson regression, we define the likelihood to be
the Poisson probability mass function over the number bait stations placed in
the building, denoted below as `traps`. This model assumes that the mean and
variance of the outcome variable `complaints` (number of complaints) is the
same. We'll investigate whether this is a good assumption after we fit the 
model.

For building $b = 1,\dots,10$ at time (month) $t = 1,\dots,12$, we have

$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$


### Prior predictive checks

Before we fit the model to real data, we should check that our priors and model
can generate plausible simulated data. 

In R we can simulate from the prior predictive distribution using a 
function like this:

```{r simulate-poisson-data}
# using normal distributions for priors on alpha and beta 
simple_poisson_dpg <- function(traps, alpha_mean, alpha_sd, beta_mean, beta_sd) {
  N <- length(traps)
  alpha <- rnorm(1, mean = alpha_mean, sd = alpha_sd);
  beta <- rnorm(1, mean = beta_mean, sd = beta_sd);
  complaints <- rpois(N, lambda = exp(alpha + beta * traps))
  return(complaints)
}
```

Try with $N(0,10)$ priors on both parameters: 

```{r prior-pred-really-bad}
# you can run this chunk multiple times to keep generating different datasets
simple_poisson_dpg(
  pest_data$traps,
  alpha_mean = 0,
  alpha_sd = 10,
  beta_mean = 0,
  beta_sd = 10
)
```

Try with $N(0,1)$ priors on both parameters: 

```{r prior-pred-bad}
simple_poisson_dpg(
  pest_data$traps,
  alpha_mean = 0,
  alpha_sd = 1,
  beta_mean = 0,
  beta_sd = 1
)
```

Let's calculate some summary statistics of 1000 data sets generated according
to the $N(0,1)$ priors:

```{r prior-pred-check}
prior_preds <- t(replicate(1000, expr = {
  simple_poisson_dpg(
    traps = pest_data$traps,
    alpha_mean = 0,
    alpha_sd = 1,
    beta_mean = 0,
    beta_sd = 1
  )
}))

dim(prior_preds)
```

```{r prior-pred-mean}
summary(apply(prior_preds, 1, mean))
```
```{r prior-pred-min}
summary(apply(prior_preds, 1, min))
```
```{r prior-pred-max}
summary(apply(prior_preds, 1, max))
```


A more reasonable (though still quite flexible) prior:

```{r}
simple_poisson_dpg(
  traps = pest_data$traps,
  alpha_mean = log(7),
  alpha_sd = 0.5,
  beta_mean = -0.25,
  beta_sd = 0.5
)
```

Simulate 1000 times and calculate summary stats:

```{r prior-pred-better}
prior_preds <- t(replicate(10000, expr = {
  simple_poisson_dpg(
    traps = pest_data$traps,
    alpha_mean = log(7),
    alpha_sd = 0.5,
    beta_mean = -0.25,
    beta_sd = 0.5
  )
}))
```

```{r}
summary(apply(prior_preds, 1, mean))

summary(apply(prior_preds, 1, min))

summary(apply(prior_preds, 1, max))
```


### Writing and fitting our first Stan program

Let's encode the model in a Stan program. 

* Write `simple_poisson_regression.stan` together, put in `stan_programs` directory.

```{r compile-simple-poisson}
comp_simple <- stan_model('stan_programs/simple_poisson_regression.stan')
```

To fit the model to the data we'll first create a list to pass
to Stan using the variables in the `pest_data` data frame. The names of the 
list elements must match the names used in the `data` block of the Stan program.

```{r stan-data}
standata_simple <- list(
  N = nrow(pest_data), 
  complaints = pest_data$complaints,
  traps = pest_data$traps,
  alpha_mean = log(7),
  alpha_sd = log(2),
  beta_mean = log(2),
  beta_sd = log(2)
)
str(standata_simple)
```

We have already compiled the model, so we can jump straight to sampling from it.

```{r fit_simple, cache=TRUE}
fit_simple <- sampling(comp_simple, data = standata_simple,
                       # these are the defaults but specifying them anyway
                       # so you can see how to use them: 
                       # posterior sample size = chains * (iter-warmup)
                       chains = 4, iter = 2000, warmup = 1000, cores=4)
```

Print the summary of the intercept and slope parameters: 

```{r results_simple_P}
print(fit_simple, pars = c('alpha','beta'))
```

We can also plot the posterior distributions: 

```{r mcmc_hist}
# https://mc-stan.org/bayesplot/reference/MCMC-distributions
draws <- as.matrix(fit_simple, pars = c('alpha','beta'))
mcmc_hist(draws) # marginal posteriors of alpha and beta
mcmc_scatter(draws, alpha = 0.2, size = 1) # posterior of (alpha,beta)
```

And compare them to draws from the prior:


```{r compare-prior-posterior}
alpha_prior_post <- cbind(alpha_prior = rnorm(4000, log(7), 1), 
                          alpha_posterior = draws[, "alpha"])
mcmc_hist(alpha_prior_post, facet_args = list(nrow = 2), binwidth = 0.1) + 
  xlim(range(alpha_prior_post))


beta_prior_post <- cbind(beta_prior = rnorm(4000, -0.25, 0.5), 
                         beta_posterior = draws[, "beta"])
mcmc_hist(beta_prior_post, facet_args = list(nrow = 2), binwidth = 0.05) + 
  xlim(range(beta_prior_post))
```


From the posterior of `beta`, it appears the number of bait stations set in a
building is associated with the number of complaints about cockroaches that were
made in the following month. However, we still need to consider how well the
model fits.


### Posterior predictive checking

```{r y_rep_simple}
# see http://mc-stan.org/rstan/articles/stanfit_objects.html for various ways
# of extracting contents from stanfit objects
y_rep <- as.matrix(fit_simple, pars = "pred_complaints_rep")
```

```{r marginal_PPC}
# https://mc-stan.org/bayesplot/reference/PPC-distributions#plot-descriptions
ppc_dens_overlay(y = standata_simple$complaints, yrep = y_rep[1:200,])
```

In the plot above we have the kernel density estimate of the observed data ($y$,
thicker curve) and 200 simulated data sets ($y_{rep}$, thin curves) from the
posterior predictive distribution. Here the posterior predictive
simulations are not as dispersed as the real data and don't seem to capture the
rate of zeros well at all. This suggests the Poisson model may not be sufficiently
flexible for this data.

Let's explore this further by looking directly at the proportion of zeros in the
real data and predicted data.
```{r ppc_stat-prop_zero}
# calculate the proportion of zeros in a vector
prop_zero <- function(x) mean(x == 0)

# https://mc-stan.org/bayesplot/reference/PPC-test-statistics#plot-descriptions
ppc_stat(
  y = standata_simple$complaints,
  yrep = y_rep,
  stat = "prop_zero",
  binwidth = .01
)
```
The plot above shows the observed proportion of zeros (thick vertical line) and
a histogram of the proportion of zeros in each of the simulated data sets. It is
clear that the model does not capture this feature of the data well at all.


The rootogram is another useful plot to compare the observed vs expected number
of complaints. This is a plot of the square root of the expected counts
(continuous line) vs the observed counts (blue histogram)

```{r rootogram}
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_rootogram(standata_simple$complaints, yrep = y_rep)
```

If the model was fitting well these would be relatively similar, however in this
figure we can see the number of complaints is underestimated if there are few
complaints, over-estimated for medium numbers of complaints, and underestimated
if there are a large number of complaints.

The `ppc_bars()` function will make a bar plot of the observed values and
overlay prediction intervals but not on the square root scale (unlike the
rootogram):

```{r ppc_bars}
# https://mc-stan.org/bayesplot/reference/PPC-discrete#plot-descriptions
ppc_bars(standata_simple$complaints, yrep = y_rep)
```


We can also view how the predicted number of complaints varies with the number
of traps:

```{r intervals}
ppc_intervals(y = standata_simple$complaints, yrep = y_rep,
              # jitter number of traps since multiple observations at some values
              x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)) +  
  labs(x = "Number of traps", y = "Number of complaints")
```


## Expanding the model: multiple predictors

Currently, our model's mean parameter is a rate of complaints per 30 days, but
we're modeling a process that occurs over an area as well as over time. We have
the square footage of each building, so if we add that information into the
model, we can interpret our parameters as a rate of complaints per square foot
per 30 days.

$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\textrm{sq_foot}_b\,\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t}
\end{align*}
$$

The term $\text{sq_foot}$ is called an exposure term. If we log the term, we can 
put it in $\eta_{b,t}$:

$$
\begin{align*}
\textrm{complaints}_{b,t} & \sim \textrm{Poisson}(\lambda_{b,t}) \\
\lambda_{b,t} & = \exp{(\eta_{b,t} )} \\
\eta_{b,t} &= \alpha + \beta \, \textrm{traps}_{b,t} + \textrm{log_sq_foot}_b
\end{align*}
$$


We will also include whether there is a live-in super or not as a predictor 
for the number of complaints, which gives us gives us: 

$$
\eta_{b,t} = \alpha + \beta \, \textrm{traps}_{b,t} +
\beta_{\rm super} \textrm{live_in_super}_{b,t} +
\textrm{log_sq_foot}_b
$$
 
Add these new variables to the data list for Stan.

```{r add-predictors-to-standata}
standata_simple$log_sq_foot <- pest_data$log_sq_foot_1e4 # log(total_sq_foot/1e4)
standata_simple$live_in_super <- pest_data$live_in_super
standata_simple$beta_super_mean <- log(1)
standata_simple$beta_super_sd <- log(2)
```


### Stan program for Poisson multiple regression

Now we need a new Stan program that uses multiple predictors.

* Write `multiple_poisson_regression.stan`

```{r compile-multi-poisson}
comp_multi <- stan_model('stan_programs/multiple_poisson_regression.stan')
```

### Fit the Poisson multiple regression

```{r fit_multi}
fit_multi <- sampling(comp_multi, data = standata_simple, 
                      refresh = 500) # turn off printed progress updates
print(fit_multi, pars = c("alpha", "beta", "beta_super"))
```


```{r ppc_dens_overlay-random-subset}
y_rep <- as.matrix(fit_multi, pars = "pred_complaints_rep")
ppc_dens_overlay(standata_simple$complaints, y_rep[1:200,])

# in this case we have very high effective sample sizes, but if there is
# nontrivial autocorrelation then it's better to take a random sample of the draws, 
# for example:
ids <- sample(nrow(y_rep), size = 200)
ppc_dens_overlay(standata_simple$complaints, y_rep[ids, ])
```

This again looks like we haven't captured the smaller counts very well, nor
have we captured the larger counts.

We're still severely underestimating the proportion of zeros in the data:

```{r}
prop_zero <- function(x) mean(x == 0)
ppc_stat(y = standata_simple$complaints, yrep = y_rep, 
         stat = "prop_zero", binwidth = 0.01)
```

Ideally this vertical line would fall somewhere within the histogram.

We can also plot uncertainty intervals for the predicted complaints for different
numbers of traps.

```{r}
ppc_intervals(
  y = standata_simple$complaints, 
  yrep = y_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) + 
  labs(x = "Number of traps", y = "Number of complaints")
```

We can see that we've increased the tails a bit more at the larger numbers of traps
but we still have some large observed numbers of complaints that the model
would consider extremely unlikely events. 
 

## Modeling count data with the Negative Binomial

When we considered modeling the data using a Poisson, we saw that the model
didn't appear to fit as well to the data as we would like. In particular the
model underpredicted low and high numbers of complaints, and overpredicted the
medium number of complaints. This is one indication of over-dispersion, where
the variance is larger than the mean. A Poisson model doesn't fit over-dispersed
count data very well because the same parameter $\lambda$, controls both the
expected counts and the variance of these counts. The natural alternative to
this is the negative binomial model:

$$
\begin{align*}
\text{complaints}_{b,t} & \sim \text{Neg-Binomial}(\lambda_{b,t}, \phi) \\
\lambda_{b,t} & = \exp{(\eta_{b,t})} \\
\eta_{b,t} &= \alpha + \beta \, {\rm traps}_{b,t} + \beta_{\rm super} \, {\rm super}_{b} + \text{log_sq_foot}_{b}
\end{align*}
$$

In Stan the negative binomial mass function we'll use is called 
$\texttt{neg_binomial_2_log}(\text{ints} \, y, \text{reals} \, \eta, \text{reals} \, \phi)$ 
in Stan. Like the `poisson_log` function, this negative binomial mass function
that is parameterized in terms of its log-mean, $\eta$, but it also has a
precision $\phi$ such that

$$
\mathbb{E}[y] \, = \lambda = \exp(\eta)
$$

$$
\text{Var}[y] = \lambda + \lambda^2/\phi = \exp(\eta) + \exp(\eta)^2 / \phi.
$$ 

As $\phi$ gets larger the term $\lambda^2 / \phi$ approaches zero and so 
the variance of the negative-binomial approaches $\lambda$, i.e., the
negative-binomial gets closer and closer to the Poisson.

### Stan program for negative-binomial regression

```{r add-phi-predictors-to-standata}
standata_simple$phi_mean <- 0.5
standata_simple$phi_sd <- 0.5
```

* Write `multiple_NB_regression.stan` together

```{r compile-multi-NB, cache=TRUE, results="hide", message=FALSE}
comp_multi_NB <- stan_model('stan_programs/multiple_NB_regression.stan')
```


### Fit to data and check the fit

```{r fit_multi_NB}
fit_multi_NB <- sampling(comp_multi_NB, data = standata_simple, refresh = 250)

# to demonstrate extracting as a list instead of using as.matrix
samps_NB <- rstan::extract(fit_multi_NB) 
```

Let's look at our predictions vs. the data.

```{r ppc_dens_overlay_NB}
ppc_dens_overlay(y = standata_simple$complaints, yrep = samps_NB$pred_complaints_rep[1:200,])
```

It appears that our model now captures both the number of small counts better
as well as the tails. 

Let's check the proportion of zeros:

```{r prop_zero_NB}
ppc_stat(y = standata_simple$complaints, yrep = samps_NB$pred_complaints_rep, 
         stat = "prop_zero", binwidth = 0.01)
```


Check predictions by number of traps: 

```{r}
ppc_intervals(
  y = standata_simple$complaints, 
  yrep = samps_NB$pred_complaints_rep,
  x = standata_simple$traps + rnorm(standata_simple$N, 0, 0.2)
) + 
  labs(x = "Number of traps (jittered)", y = "Number of complaints")
```

We haven't used the fact that the data are clustered by building yet. A posterior 
predictive check might elucidate whether it would be a good idea to add the building
information into the model.

```{r ppc-group-means, fig.width=7, fig.height=6}
ppc_stat_grouped(
  y = standata_simple$complaints, 
  yrep = samps_NB$pred_complaints_rep, 
  group = pest_data$building_id, 
  stat = 'mean',
  binwidth = 0.2
)
```

We're getting plausible predictions for most building means but some are
estimated better than others and some have larger uncertainties than we might
expect. If we explicitly model the variation across buildings we may be able to
get much better estimates.