mrc-ide · pwinskill · May 24, 2023 · Feb 15, 2023 · Feb 15, 2023 · Feb 15, 2023
diff --git a/vignettes/Variation.Rmd b/vignettes/Variation.Rmd
@@ -1,91 +1,151 @@
 ---
-title: "Variation"
-output: html_document
+title: "Stochastic Variation"
+output: rmarkdown::html_vignette
 vignette: >
   %\VignetteIndexEntry{Variation}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-```{r setup, include=FALSE}
+```{r setup}
 library(malariasimulation)
-library(cowplot)
-library(ggplot2)
 ```
 
-## Variation in outputs
+`malariasimulation` is a stochastic, individual-based model and, as such, simulations run with identical parameterisations will generate non-identical, sometimes highly variable, outputs. To illustrate this, we will compare the prevalence and incidence of malaria over a year in simulations with a small and a larger population, where a smaller population and the incidence measure are both more likely to have wide variations between simulation runs.
 
-Malariasimulation is a stochastic model, so there will be be variation in model outputs. For example, we could look at detected malaria cases over a year...
+### Plotting functions
+First, we will create a few plotting functions to visualise outputs. 
+```{r}
+cols <- c("#E69F00", "#56B4E9", "#009E73", 
+          "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
+
+plot_prev <- function(output, ylab = TRUE){
+  if (ylab == TRUE) {
+    ylab = "Prevalence in children aged 2-10 years"
+    } else {ylab = ""}
+  plot(x = output$timestep, y = output$n_detect_730_3650 / output$n_730_3650, 
+       type = "l", col = cols[3], ylim = c(0, 1),
+       xlab = "Time (days)", ylab = ylab, lwd = 2,
+       xaxs = "i", yaxs = "i", bty = "l")
+  grid(lty = 2, col = "grey80", lwd = 1)
+}
+
+plot_inci <- function(output, ylab = TRUE){
+  if (ylab == TRUE) {
+    ylab = "Incidence in children aged 0-5 years"
+    } else {ylab = ""}
+  plot(x = output$timestep, y = output$n_inc_clinical_0_1825 / output$n_0_1825, 
+       type = "l", col = cols[5], ylim = c(0, 0.1),
+       xlab = "Time (days)", ylab = ylab, lwd = 2,
+       xaxs = "i", yaxs = "i", bty = "l")
+  grid(lty = 2, col = "grey80", lwd = 1)
+}
+```
 
+## Parameterisation  
+First, we will use the `get_parameters()` function to generate a list of parameters, accepting the default values, for two different population sizes and use the `set_equilibrium()` function to initialise the model at a given entomological inoculation rate (EIR). The only parameter which changes between the two parameter sets is the argument for `human_population`.  
 ```{r}
-# A small population
-p <- get_parameters(list(
+# A small population 
+simparams_small <- get_parameters(list(
   human_population = 1000,
+  clinical_incidence_rendering_min_ages = 0,
+  clinical_incidence_rendering_max_ages = 5 * 365,
   individual_mosquitoes = FALSE
 ))
-p <- set_equilibrium(p, 2)
-small_o <- run_simulation(365, p)
-p <- get_parameters(list(
+
+simparams_small <- set_equilibrium(parameters = simparams_small, init_EIR = 50)
+
+# A larger population 
+simparams_big <- get_parameters(list(
   human_population = 10000,
+  clinical_incidence_rendering_min_ages = 0,
+  clinical_incidence_rendering_max_ages = 5 * 365,
   individual_mosquitoes = FALSE
 ))
-p <- set_equilibrium(p, 2)
-big_o <- run_simulation(365, p)
-
-plot_grid(
-  ggplot(small_o) + geom_line(aes(timestep, n_detect_730_3650)) + ggtitle('n_detect at n = 1,000'),
-  ggplot(small_o) + geom_line(aes(timestep, p_detect_730_3650)) + ggtitle('p_detect at n = 1,000'),
-  ggplot(big_o) + geom_line(aes(timestep, n_detect_730_3650)) + ggtitle('n_detect at n = 10,000'),
-  ggplot(big_o) + geom_line(aes(timestep, p_detect_730_3650)) + ggtitle('p_detect at n = 10,000')
-)
+
+simparams_big <- set_equilibrium(parameters = simparams_big, init_EIR = 50)
 ```
 
-The n_detect output shows the result of sampling individuals who would be detected by microscopy.
 
-While the p_detect output shows the sum of probabilities of detection in the population.
+## Simulations
 
-Notice that the p_detect output is slightly smoother. That's because it forgoes the sampling step. At low population sizes, p_detect will be smoother than the n_detect counterpart.
+The `n_detect_730_3650` output below shows the total number of individuals in the age group rendered (here, 730-3650 timesteps or 2-10 years) who have microscopy-detectable malaria. Notice that the output is smoother at a higher population.  
 
-## Estimating variation 
+Some outcomes will be more sensitive than others to stochastic variation. In the plots below, prevalence is smoother than incidence even at the same population. 
 
-We can estimate the variation in the number of detectable cases by repeating the simulation several times...
+```{r, fig.align="center",fig.height=6, fig.width=8}
+# A small population
+output_small_pop <- run_simulation(timesteps = 365, parameters = simparams_small)
 
-```{r}
+# A larger population
+output_big_pop <- run_simulation(timesteps = 365, parameters = simparams_big)
+
+# Plotting 
+par(mfrow = c(2,2))
+plot_prev(output_small_pop); title("Prevalence at n = 1,000")
+plot_inci(output_small_pop); title("Incidence at n = 1,000")
+plot_prev(output_big_pop, ylab = FALSE); title("Prevalence at n = 10,000")
+plot_inci(output_big_pop); title("Incidence at n = 10,000")
+```
+
+## Estimating stochastic variation 
+
+One way to estimate the variation in prevalence is by repeating the simulation several times with the function `run_simulation_with_repetitions()`, then plotting the IQR and 95% confidence intervals. Running multiple simulations is particularly important for simulations of a small population size, due to the disproportionate impact of stochasticity in small populations. 
+
+__However, it is important to note that while running a simulation with repetitions is an option that we illustrate in this example, for the majority of use cases, it is much preferable to increase population size to minimise stochastic variation or to use parameter draws, while also using a large enough population, to estimate uncertainty (see the Parameter Draws vignette for more information). The required population size might depend on the outcome of interest. For example, a larger population size would be needed if measuring the impact of an intervention given only to children, versus an intervention given to the entire population. Additionally, as demonstrated above, some outcomes will be more sensitive than others to stochastic variation.__
+
+```{r, fig.align="center",fig.height=5, fig.width=10}
 outputs <- run_simulation_with_repetitions(
-  timesteps = 365,
-  repetitions = 10,
-  overrides = p,
-  parallel=TRUE
+  timesteps = 365, # Each repetition will be run for 365 days.
+  repetitions = 10, # The simulation will be run 10 times.
+  overrides = simparams_big, # Using the parameter list with the larger population that we created earlier.
+  parallel = TRUE # The runs will be done in parallel.
 )
 
-df <- aggregate(
-  outputs$n_detect_730_3650,
-  by=list(outputs$timestep),
+# Calculate IQR and 95% confidence intervals for each of the repetitions
+output <- aggregate(
+  outputs$n_detect_730_3650 / outputs$n_730_3650,
+  by = list(outputs$timestep),
   FUN = function(x) {
     c(
       median = median(x),
       lowq = unname(quantile(x, probs = .25)),
       highq = unname(quantile(x, probs = .75)),
       mean = mean(x),
-      lowci = mean(x) + 1.96*sd(x),
-      highci = mean(x) - 1.96*sd(x)
+      lowci = mean(x) + 1.96 * sd(x),
+      highci = mean(x) - 1.96 * sd(x)
     )
   }
 )
 
-df <- data.frame(cbind(t = df$Group.1, df$x))
-
-plot_grid(
-  ggplot(df)
-  + geom_line(aes(x = t, y = median, group = 1))
-  + geom_ribbon(aes(x = t, ymin = lowq, ymax = highq), alpha = .2)
-  + xlab('timestep') + ylab('n_detect') + ggtitle('IQR spread'),
-  ggplot(df)
-  + geom_line(aes(x = t, y = mean, group = 1))
-  + geom_ribbon(aes(x = t, ymin = lowci, ymax = highci), alpha = .2)
-  + xlab('timestep') + ylab('n_detect') + ggtitle('95% confidence interval')
-)
-
-```
-
-These are useful techniques for comparing the expected variance from model outputs to outcomes from trials with lower population sizes.
+output <- data.frame(cbind(timestep = output$Group.1, output$x))
+
+# Plot the IQR and 95% CI
+par(mfrow = c(1, 2))
+xx <- c(output$timestep, rev(output$timestep))
+yy <- c(output$lowq, rev(output$highq))
+plot(x = output$timestep, y = output$median, 
+       type = "n", xlim = c(-1, 365), ylim = c(0.75, 0.85),
+       xlab = "Time (days)", ylab = "Prevalence in children aged 2-10 years",
+       xaxs = "i", yaxs = "i", bty = "l")
+polygon(xx, yy,
+        col = "grey90", border= "grey95")
+lines(x = output$timestep, y = output$median, col = cols[3], lwd = 2)
+grid(lty = 2, col = "grey80", lwd = 1)
+title("IQR spread")
+legend("bottomleft", legend = c("Median","IQR Spread"), col = c(cols[3], "grey90"),
+       lwd = 5, xpd = TRUE, horiz = TRUE, cex = 1, seg.len = 1, bty = "n")
+
+xx <- c(output$timestep, rev(output$timestep))
+yy <- c(output$lowci, rev(output$highci))
+plot(x = output$timestep, y = output$mean, 
+       type = "n", xlim = c(-1, 365), ylim = c(0.75, 0.85),
+       xlab = "Time (days)", ylab = "",
+       xaxs = "i", yaxs = "i", bty = "l")
+polygon(xx, yy, col = "grey90", border= "grey95")
+lines(x = output$timestep, y = output$mean, col = cols[3], lwd = 2)
+grid(lty = 2, col = "grey80", lwd = 1)
+title("95% Confidence interval")
+legend("bottomleft", inset = 0, legend = c("Mean", "95% CI"), col = c(cols[3], "grey90"),
+       lwd = 5, xpd = TRUE, horiz = TRUE, cex = 1, seg.len = 1, bty = "n")
+```