Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Vignettes/variation #225

Merged
merged 21 commits into from
May 24, 2023
Merged
Changes from 17 commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
00e3b9b
First edits to add plotting functions in base R
kellymccain28 Feb 15, 2023
71e78f8
Updated some plots to base R
kellymccain28 Feb 15, 2023
ff529e6
More plot modifications to avoid ggplot2
kellymccain28 Feb 15, 2023
db6a683
Removing cowplot dependency and changing last plot to base r
kellymccain28 Feb 15, 2023
8095e07
Updating plot rendering
kellymccain28 Feb 21, 2023
b212c28
Improved explanations and plots
kellymccain28 Feb 23, 2023
45b0d87
Fixing formatting, edits to explanations
kellymccain28 Feb 27, 2023
557c033
adding clarification/explanation
kellymccain28 Feb 28, 2023
ea354a6
fix y axis label
kellymccain28 Feb 28, 2023
5283b19
Updating plots, editing text, removed p_detect references
kellymccain28 Mar 2, 2023
0000e12
Last minutes updates to the text
kellymccain28 Mar 2, 2023
edb9336
updating to rmarkdown::html_vignette to match other vignettes
kellymccain28 Mar 2, 2023
bb25014
added section on parameter variation
Mar 9, 2023
3a5af92
Edits from pete/hillary's comments; added incidence plot, cleaned up …
kellymccain28 Mar 10, 2023
fe02151
Merge branch 'vignettes/variation' of https://github.com/mrc-ide/mala…
kellymccain28 Mar 10, 2023
fd42b12
Removed Lydia's section - she will make new vignette
kellymccain28 Mar 10, 2023
2b057d0
changed last plots to plot prevalence / fixed y axes
kellymccain28 Mar 10, 2023
1583772
remove Estimating stochastic variation, add stochastic elimination
kellymccain28 Mar 28, 2023
77b371c
Removing unclear sentence
kellymccain28 Mar 30, 2023
dbd2323
last edits
kellymccain28 Mar 30, 2023
05da94c
Changed plot so that beginning of sim is not 0% prev
kellymccain28 Apr 21, 2023
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
164 changes: 112 additions & 52 deletions vignettes/Variation.Rmd
Original file line number Diff line number Diff line change
@@ -1,91 +1,151 @@
---
title: "Variation"
output: html_document
title: "Stochastic Variation"
output: rmarkdown::html_vignette
vignette: >
%\VignetteIndexEntry{Variation}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
```{r setup}
library(malariasimulation)
library(cowplot)
library(ggplot2)
```

## Variation in outputs
`malariasimulation` is a stochastic, individual-based model and, as such, simulations run with identical parameterisations will generate non-identical, sometimes highly variable, outputs. To illustrate this, we will compare the prevalence and incidence of malaria over a year in simulations with a small and a larger population, where a smaller population and the incidence measure are both more likely to have wide variations between simulation runs.
kellymccain28 marked this conversation as resolved.
Show resolved Hide resolved

Malariasimulation is a stochastic model, so there will be be variation in model outputs. For example, we could look at detected malaria cases over a year...
### Plotting functions
First, we will create a few plotting functions to visualise outputs.
```{r}
cols <- c("#E69F00", "#56B4E9", "#009E73",
"#F0E442", "#0072B2", "#D55E00", "#CC79A7")

plot_prev <- function(output, ylab = TRUE){
if (ylab == TRUE) {
ylab = "Prevalence in children aged 2-10 years"
} else {ylab = ""}
plot(x = output$timestep, y = output$n_detect_730_3650 / output$n_730_3650,
type = "l", col = cols[3], ylim = c(0, 1),
xlab = "Time (days)", ylab = ylab, lwd = 2,
xaxs = "i", yaxs = "i", bty = "l")
grid(lty = 2, col = "grey80", lwd = 1)
}

plot_inci <- function(output, ylab = TRUE){
if (ylab == TRUE) {
ylab = "Incidence in children aged 0-5 years"
} else {ylab = ""}
plot(x = output$timestep, y = output$n_inc_clinical_0_1825 / output$n_0_1825,
type = "l", col = cols[5], ylim = c(0, 0.1),
xlab = "Time (days)", ylab = ylab, lwd = 2,
xaxs = "i", yaxs = "i", bty = "l")
grid(lty = 2, col = "grey80", lwd = 1)
}
```

## Parameterisation
First, we will use the `get_parameters()` function to generate a list of parameters, accepting the default values, for two different population sizes and use the `set_equilibrium()` function to initialise the model at a given entomological inoculation rate (EIR). The only parameter which changes between the two parameter sets is the argument for `human_population`.
```{r}
# A small population
p <- get_parameters(list(
# A small population
simparams_small <- get_parameters(list(
human_population = 1000,
clinical_incidence_rendering_min_ages = 0,
clinical_incidence_rendering_max_ages = 5 * 365,
individual_mosquitoes = FALSE
))
p <- set_equilibrium(p, 2)
small_o <- run_simulation(365, p)
p <- get_parameters(list(

simparams_small <- set_equilibrium(parameters = simparams_small, init_EIR = 50)

# A larger population
simparams_big <- get_parameters(list(
human_population = 10000,
clinical_incidence_rendering_min_ages = 0,
clinical_incidence_rendering_max_ages = 5 * 365,
individual_mosquitoes = FALSE
))
p <- set_equilibrium(p, 2)
big_o <- run_simulation(365, p)

plot_grid(
ggplot(small_o) + geom_line(aes(timestep, n_detect_730_3650)) + ggtitle('n_detect at n = 1,000'),
ggplot(small_o) + geom_line(aes(timestep, p_detect_730_3650)) + ggtitle('p_detect at n = 1,000'),
ggplot(big_o) + geom_line(aes(timestep, n_detect_730_3650)) + ggtitle('n_detect at n = 10,000'),
ggplot(big_o) + geom_line(aes(timestep, p_detect_730_3650)) + ggtitle('p_detect at n = 10,000')
)

simparams_big <- set_equilibrium(parameters = simparams_big, init_EIR = 50)
```

The n_detect output shows the result of sampling individuals who would be detected by microscopy.

While the p_detect output shows the sum of probabilities of detection in the population.
## Simulations

Notice that the p_detect output is slightly smoother. That's because it forgoes the sampling step. At low population sizes, p_detect will be smoother than the n_detect counterpart.
The `n_detect_730_3650` output below shows the total number of individuals in the age group rendered (here, 730-3650 timesteps or 2-10 years) who have microscopy-detectable malaria. Notice that the output is smoother at a higher population.

## Estimating variation
Some outcomes will be more sensitive than others to stochastic variation. In the plots below, prevalence is smoother than incidence even at the same population.
kellymccain28 marked this conversation as resolved.
Show resolved Hide resolved

We can estimate the variation in the number of detectable cases by repeating the simulation several times...
```{r, fig.align="center",fig.height=6, fig.width=8}
# A small population
output_small_pop <- run_simulation(timesteps = 365, parameters = simparams_small)

```{r}
# A larger population
output_big_pop <- run_simulation(timesteps = 365, parameters = simparams_big)

# Plotting
kellymccain28 marked this conversation as resolved.
Show resolved Hide resolved
par(mfrow = c(2,2))
plot_prev(output_small_pop); title("Prevalence at n = 1,000")
plot_inci(output_small_pop); title("Incidence at n = 1,000")
plot_prev(output_big_pop, ylab = FALSE); title("Prevalence at n = 10,000")
plot_inci(output_big_pop); title("Incidence at n = 10,000")
kellymccain28 marked this conversation as resolved.
Show resolved Hide resolved
```

## Estimating stochastic variation
kellymccain28 marked this conversation as resolved.
Show resolved Hide resolved

One way to estimate the variation in prevalence is by repeating the simulation several times with the function `run_simulation_with_repetitions()`, then plotting the IQR and 95% confidence intervals. Running multiple simulations is particularly important for simulations of a small population size, due to the disproportionate impact of stochasticity in small populations.

__However, it is important to note that while running a simulation with repetitions is an option that we illustrate in this example, for the majority of use cases, it is much preferable to increase population size to minimise stochastic variation or to use parameter draws, while also using a large enough population, to estimate uncertainty (see the Parameter Draws vignette for more information). The required population size might depend on the outcome of interest. For example, a larger population size would be needed if measuring the impact of an intervention given only to children, versus an intervention given to the entire population. Additionally, as demonstrated above, some outcomes will be more sensitive than others to stochastic variation.__

```{r, fig.align="center",fig.height=5, fig.width=10}
outputs <- run_simulation_with_repetitions(
timesteps = 365,
repetitions = 10,
overrides = p,
parallel=TRUE
timesteps = 365, # Each repetition will be run for 365 days.
repetitions = 10, # The simulation will be run 10 times.
overrides = simparams_big, # Using the parameter list with the larger population that we created earlier.
parallel = TRUE # The runs will be done in parallel.
)

df <- aggregate(
outputs$n_detect_730_3650,
by=list(outputs$timestep),
# Calculate IQR and 95% confidence intervals for each of the repetitions
output <- aggregate(
outputs$n_detect_730_3650 / outputs$n_730_3650,
by = list(outputs$timestep),
FUN = function(x) {
c(
median = median(x),
lowq = unname(quantile(x, probs = .25)),
highq = unname(quantile(x, probs = .75)),
mean = mean(x),
lowci = mean(x) + 1.96*sd(x),
highci = mean(x) - 1.96*sd(x)
lowci = mean(x) + 1.96 * sd(x),
highci = mean(x) - 1.96 * sd(x)
)
}
)

df <- data.frame(cbind(t = df$Group.1, df$x))

plot_grid(
ggplot(df)
+ geom_line(aes(x = t, y = median, group = 1))
+ geom_ribbon(aes(x = t, ymin = lowq, ymax = highq), alpha = .2)
+ xlab('timestep') + ylab('n_detect') + ggtitle('IQR spread'),
ggplot(df)
+ geom_line(aes(x = t, y = mean, group = 1))
+ geom_ribbon(aes(x = t, ymin = lowci, ymax = highci), alpha = .2)
+ xlab('timestep') + ylab('n_detect') + ggtitle('95% confidence interval')
)

```

These are useful techniques for comparing the expected variance from model outputs to outcomes from trials with lower population sizes.
output <- data.frame(cbind(timestep = output$Group.1, output$x))

# Plot the IQR and 95% CI
par(mfrow = c(1, 2))
xx <- c(output$timestep, rev(output$timestep))
yy <- c(output$lowq, rev(output$highq))
plot(x = output$timestep, y = output$median,
type = "n", xlim = c(-1, 365), ylim = c(0.75, 0.85),
xlab = "Time (days)", ylab = "Prevalence in children aged 2-10 years",
xaxs = "i", yaxs = "i", bty = "l")
polygon(xx, yy,
col = "grey90", border= "grey95")
lines(x = output$timestep, y = output$median, col = cols[3], lwd = 2)
grid(lty = 2, col = "grey80", lwd = 1)
title("IQR spread")
legend("bottomleft", legend = c("Median","IQR Spread"), col = c(cols[3], "grey90"),
lwd = 5, xpd = TRUE, horiz = TRUE, cex = 1, seg.len = 1, bty = "n")

xx <- c(output$timestep, rev(output$timestep))
yy <- c(output$lowci, rev(output$highci))
plot(x = output$timestep, y = output$mean,
kellymccain28 marked this conversation as resolved.
Show resolved Hide resolved
type = "n", xlim = c(-1, 365), ylim = c(0.75, 0.85),
xlab = "Time (days)", ylab = "",
xaxs = "i", yaxs = "i", bty = "l")
polygon(xx, yy, col = "grey90", border= "grey95")
lines(x = output$timestep, y = output$mean, col = cols[3], lwd = 2)
grid(lty = 2, col = "grey80", lwd = 1)
title("95% Confidence interval")
legend("bottomleft", inset = 0, legend = c("Mean", "95% CI"), col = c(cols[3], "grey90"),
lwd = 5, xpd = TRUE, horiz = TRUE, cex = 1, seg.len = 1, bty = "n")
```