Merge pull request #20 from nfidd/small-edits-from-reading

Read through up to forecast vis
nfidd · Nov 4, 2024 · 5b489c0 · 5b489c0
2 parents bf04dd5 + 90608e1
commit 5b489c0
Show file tree

Hide file tree

Showing 3 changed files with 226 additions and 13 deletions.
diff --git a/sessions/forecast-ensembles.qmd b/sessions/forecast-ensembles.qmd
@@ -77,11 +77,150 @@ set.seed(123)
 In this session we will use the forecasts from different models. There all shared the same basic renewal with delays structure but used different models for the evolution of the effective reproduction number over time. These were:
 
 - A random walk model (what we have looked at so far)
-- A differenced autoregressive model referred to as "More statistical"
 - A simple model of susceptible depletion referred to as "More mechanistic"
+- A differenced autoregressive model referred to as "More statistical"
 
 For the purposes of this session the precise details of the models are not critical to the concepts we are exploring.
 
+
+::: {.callout-note collapse="true"}
+
+## More information on the mechanistic model (optional)
+One way to potentially improve the renewal model is to add more mechanistic structure. In the [forecasting concepts session](forecasting-concepts), we saw that the renewal model was making unbiased forecasts when the reproduction number was constant but that it overestimated the number of cases when the reproduction number was reducing due to susceptible depletion.
+
+::: {.callout-warning}
+This is slightly cheating as we know the future of this outbreak and so can make a model to match. This is easy to do and important to watch for if wanting to make generalisable methods.
+:::
+
+This suggests that we should add a term to the renewal model which captures the depletion of susceptibles. One way to do this is to add a term which is proportional to the number of susceptibles in the population. This is the idea behind the _SIR model_ which is a simple compartmental model of infectious disease transmission. If we assume that susceptible depletion is the only mechanism which is causing the reproduction number to change, we can write the reproduction model as:
+
+$$
+R_t = \frac{S_{t-1}}{N} R_0
+$$
+
+::: {.callout-note}
+This approximates susceptible depletion as a linear function of the number of susceptibles in the population. This is a simplification but it is a good starting point.
+:::
+
+::: {.callout-note collapse="true"}
+
+## What behaviour would we expect from this model?
+
+```{r}
+n <- 100
+N <- 1000
+R0 <- 1.5
+S <- rep(NA, n)
+S[1] <- N
+Rt <- rep(NA, n) ## reproduction number
+Rt[1] <- R0
+I <- rep(NA, n)
+I[1] <- 1
+for (i in 2:n) {
+  Rt[i] <- (S[i-1]) / N * R0
+  I[i] <- I[i-1] * Rt[i]
+  S[i] <- S[i-1] - I[i]
+}
+
+data <- tibble(t = 1:n, Rt = Rt)
+
+ggplot(data, aes(x = t, y = Rt)) +
+  geom_line() +
+  labs(title = "Simulated data from an SIR model",
+       x = "Time",
+       y = "Rt")
+```
+:::
+
+The key assumptions we are making here are: 
+
+- The population is constant and we roughly know the size of the population.
+- The reproduction number only changes due to susceptible depletion
+- The number of new cases at each time is proportional to the number of susceptibles in the population.
+
+:::
+
+::: {.callout-note collapse="true"}
+## More information on the statistical model (optional)
+
+Adding more mechanistic structure is not always possible and, if we don't specify mechanisms correctly, might make forecasts worse.
+Rather than adding more mechanistic structure to the renewal model, we could add more statistical structure with the aim of improving performance.
+Before we do this, we need to think about what we want from a forecasting model.
+As we identified above, we want a model which is unbiased and which has good short-term forecasting properties.
+We know that we want it to be able to adapt to trends in the reproduction number and that we want it to be able to capture the noise in the data.
+A statistical term that can be used to describe a time series with a trend is saying that the time series is _non-stationary_.
+More specifically, a _stationary_ time series is defined as one whose statistical properties, such as mean and variance, do not change over time.
+In infectious disease epidemiology, this would only be expected for endemic diseases without external seasonal influence.
+
+The random walk model we used in the [forecasting visualisation session](forecast-visualisation) is a special case of a more general class of models called _autoregressive (AR) models_. 
+AR models are a class of models which predict the next value in a time series as a linear combination of the previous values in the time series.
+The random walk model is specifically a special case of an AR(1) model where the next value in the time series is predicted as the previous value, multiplied by a value between 1 and -1 , plus some noise. This becomes a random walk when the multipled value is 0.
+
+ For the log-transformed reproduction number ($log(R_t)$), the model is:
+
+$$
+log(R_t) = \phi log(R_{t-1}) + \epsilon_t
+$$
+
+where $\epsilon_t$ is a normally distributed error term with mean 0 and standard deviation $\sigma$ and $\phi$ is a parameter between -1 and 1. If we restrict $\phi$ to be between 0 and 1, we get a model which is biased towards a reproduction number of 1 but which can still capture trends in the data that decay over time.
+
+::: {.callout-note collapse="true"}
+## What behaviour would we expect from this model?
+
+```{r}
+n <- 100
+phi <- 0.4
+sigma <- 0.1
+log_R <- rep(NA, n)
+log_R[1] <- rnorm(1, 0, sigma)
+for (i in 2:n) {
+  log_R[i] <- phi * log_R[i-1] + rnorm(1, 0, sigma)
+}
+data <- tibble(t = 1:n, R = exp(log_R))
+
+ggplot(data, aes(x = t, y = R)) +
+  geom_line() +
+  labs(title = "Simulated data from an exponentiated AR(1) model",
+       x = "Time",
+       y = "R")
+```
+:::
+
+However, we probably don't want a model which is biased towards a reproduction number of 1 (unless we have good reason to believe that is the expected behaviour). So what should we do?
+
+Returning to the idea that the reproduction number is a _non-stationary_ time series, as we expect to have a trend in the reproduction numbers we want to capture, we can use a method from the field of time series analysis called _differencing_ to make the time series stationary. This involves taking the difference between consecutive values in the time series. For the log-transformed reproduction number, this would be:
+
+$$
+log(R_t) - log(R_{t-1}) = \phi (log(R_{t-1}) - log(R_{t-2})) + \epsilon_t
+$$
+
+::: {.callout-note collapse="true"}
+## What behaviour would we expect from this model?
+
+Again we look at an R function that implements this model:
+
+```{r geometric-diff-ar}
+geometric_diff_ar
+```
+
+We can use this function to simulate a differenced AR process:
+
+```{r}
+log_R <- geometric_diff_ar(init = 1, noise = rnorm(100), std = 0.1, damp = 0.1)
+
+data <- tibble(t = seq_along(log_R), R = exp(log_R))
+
+ggplot(data, aes(x = t, y = R)) +
+  geom_line() +
+  labs(title = "Simulated data from an exponentiated AR(1) model with differencing",
+       x = "Time",
+       y = "R")
+```
+
+:::
+
+:::
+
 As previously, we have fitted these models to a range of forecast dates so you don't have to wait for the models to fit.
 We will now evaluate the forecasts from the mechanistic and statistical models.
 

diff --git a/sessions/forecast-visualisation.qmd b/sessions/forecast-visualisation.qmd
@@ -6,7 +6,7 @@ bibliography: ueifid.bib
 
 # Introduction
 
-Epidemiological forecasts are probabilistic statements about what might happen in the future of population disease burden.
+Epidemiological forecasts are probabilistic statements about what might happen to population disease burden in the future.
 In this session we will make some simple forecasts using a commonly used infectious disease model, based on the renewal equation.
 We will see how we can visualise such forecasts, and visually compare them to what really happened.
 
@@ -56,16 +56,19 @@ set.seed(123)
 Forecasting is the process of making predictions about the future based on past and present data.
 In the context of infectious disease epidemiology, forecasting is usually the process of predicting the future course of some metric of infectious disease incidence or prevalence based on past and present data.
 Here we focus on forecasting observed data (the number of individuals with new symptom onset) but forecasts can also be made for other quantities of interest such as the number of infections, the reproduction number, or the number of deaths.
-Epidemiological forecasting is closely related to nowcasting and, when using mechanistic approaches, estimation of the reproduction number. In fact, the model we will use for forecasting is the same as the model we used for nowcasting and estimation of the reproduction number.
-The only difference is that we will extend the model into the future.
-In order to make things simpler we will remove the nowcasting part of the model, but in principle all these approaches could be combined in a single model.
+Epidemiological forecasting is closely related to nowcasting and, when using mechanistic approaches, estimation of the reproduction number. In fact, the model we will use for forecasting could also be used for [real-time estimation of the reproduction number](https://nfidd.github.io/nfidd/sessions/R-estimation-and-the-renewal-equation.html) and [nowcasting](https://nfidd.github.io/nfidd/sessions/joint-nowcasting.html).
+The only difference is that here we extend the model into the future.
 
 # Extending a model into the future
 
-We will a data set of infections we are providing in the `nfidd` R package, and look at forecasts made using a renewal equation model.
-This model simply assumes that number infectious people determines the number of future infectious people via the reproduction number $R_t$.
+We will use a data set of infections we generate using functions from the `nfidd` R package, and look at forecasts made using a renewal equation model.
+This model assumes that number infectious people determines the number of future infectious people via the reproduction number $R_t$.
+
+::: {.callout-info}
+# The reproduction number $R_t$
 
 The reproduction number $R_t$ (sometimes called the _effective_ reproduction number) more generally describes the average number of secondary infections caused by a single infectious individual and can vary in time and space as a function of differences in population level susceptibility, changes in behaviour, policy, seasonality etc. The reproduction number $R_t$ is therefore a more general concept than the _basic_ reproduction number $R_0$ which represents the average number of secondary infections caused by a single infectious individual in a completely susceptible population.
+:::
 
 ::: {.callout-tip collapse="true"}
 # The discrete renewal equation (optional)
@@ -80,14 +83,57 @@ $$
 
 Here, $I_t$ is the number of infected individuals on day $t$, $R_t$ is the current value of the reproduction number and $g_i$ is the probability of a secondary infection occurring $i$ days after the infector became infected themselves, with a maximum $g_\mathrm{max}$.
 
-The forecasting model we visualise in this session is based on $R_t$ doing a random walk in time.
-The step size of this random walk is estimated from data, and this is then used to project into the future.
+The forecasting model we visualise in this session is based on $R_t$ doing a geometric random walk in time.
+The step size of this geometric random walk is estimated from data, and this is then used to project into the future.
 For more information on this approach to forecasting see, for example, @funk2018real.
 
+::: {.callout-note collapse="true"}
+# The geometric random walk model (optional)
+
+We might expect the reproduction number to change smoothly over time (except in situations of drastic change such as a very effective intervention) and to be similar at adjacent time points. We can model this by assuming that the reproduction number at time $t$ is a random draw from a normal distribution with mean equal to the reproduction number at time $t-1$ and some standard deviation $\sigma$. This can be described as a random walk model for the reproduction number. In fact, rather than using this model directly, a better choice might be to use a model where the logarithm of the reproduction number does a random walk, as this will ensure that the reproduction number is always positive and that changes are multiplicative rather than additive (i.e as otherwise the same absolute change in the reproduction number would have a larger effect when the reproduction number is small which likely doesn't match your intuition for how outbreaks evolve over time). We can write this model as
+
+$$
+\sigma \sim HalfNormal(0, 0.05) \\
+$$
+$$
+\log(R_0) \sim \mathcal{Lognormal}(-0.1, 0.5)
+$$
+$$
+\log(R_t) \sim \mathcal{N}(\log(R_{t-1}), \sigma)
+$$
+
+Here we have placed a prior on the standard deviation of the random walk, which we have assumed to be half-normal (i.e., normal but restricted to being non-negative) with a mean of 0 and a standard deviation of 0.05. This is a so-called _weakly informative prior_ that allows for some variation in the reproduction number over time but not an unrealistic amount. We have also placed a prior on the initial reproduction number, which we have assumed to be log-normally distributed with a mean of -0.1 and a standard deviation of 0.5. This is a weakly informative prior that allows for a wide range of initial reproduction numbers but has a mean of approximately 1.
+The last line is the geometric random walk.
+
+## Simulating a geometric random walk
+
+You can have a look at an R function for performing the geometric random walk:
+
+```{r geometric-random-walk}
+geometric_random_walk
+```
+
+We can use this function to simulate a random walk:
+
+```{r simulate-geometric-walk}
+R <- geometric_random_walk(init = 1, noise = rnorm(100), std = 0.1)
+data <- tibble(t = seq_along(R), R = exp(R))
+
+ggplot(data, aes(x = t, y = R)) +
+  geom_line() +
+  labs(title = "Simulated data from a random walk model",
+       x = "Time",
+       y = "R")
+```
+
+You can repeat this multiple times either with the same parameters or changing some to get a feeling for what this does.
+
+:::
+
 :::
 
 We will now look at some data from an infectious disease outbreak and forecasts from a renewal equation model.
-First, we load the data from the `nfidd` package
+First, we simulate some data using functions from the `nfidd` package.
 
 ```{r, load-simulated-onset}
 gen_time_pmf <- make_gen_time_pmf()
@@ -102,9 +148,15 @@ This uses a data set of infection times that is included in the R package, and a
 Do not worry too much about the details of this.
 The important thing is that we now have a data set `onset_df` of symptom onset.
 
+::: {.callout-tip collapse="true"}
+# How does this work?
+
+If you want to see how this works look at the function code by running `simulate_onsets` in the console.
+:::
+
 ## Visualising the forecast
 
-We can now visualise a forecast made from a renewal equation model we used for forecasting.
+We can now visualise a forecast made from the renewal equation model we used for forecasting.
 Once again, this forecast is provided in the `nfidd` package which we loaded earlier.
 We will first extract the forecast and then plot the forecasted number of symptom onsets alongside the observed number of symptom onsets before the forecast was made.
 
@@ -261,6 +313,11 @@ Do they seem to under or over predict consistently?
   - It looks like the model is consistently over predicting on the natural scale but this is less clear on the log scale.
 :::
 
+# Going further
+
+- What other ways could we visualise the forecasts especially if we had forecasts from multiple models and locations or both?
+- Based on these visualisations how could we improve the model?
+
 # Wrap up
 
 # References

diff --git a/sessions/slides/introduction-to-the-course-and-the-instructors.qmd b/sessions/slides/introduction-to-the-course-and-the-instructors.qmd
@@ -29,16 +29,33 @@ whoarewe <- lapply(sort(instructors), \(instructor) {
 - no comprehensive training resource for infectious disease forecasting
 :::
 
+# Aim of the course
+
+::: {.incremental}
+- learn to visualize and interpret infectious disease forecasts
+- evaluate forecasts using multiple approaches
+- create forecast ensembles to improve predictions
+:::
+
+# Approach
+
+Each session in the course:
+::: {.incremental}
+- builds on the previous one so that participants will have an overview of forecast evaluation by the end of the course
+- starts with a short introductory talk
+- mainly consists of interactive content that participants will work through
+:::
+
 # Who are we?
 
 ```{r whoarewe, results='asis'}
 cat(":::: {.columns}\n")
 for (instructor in seq_along(whoarewe)) {
   cat(
     "::: {.column width=\"", floor(100 / length(whoarewe)), "%\"}\n",
-    "![](", whoarewe[[instructor]]$avatar_url, ")\n",
+    "![](", whoarewe[[instructor]]$avatar_url, ")\n\n",
     "[", whoarewe[[instructor]]$name, "](",
-    whoarewe[[instructor]]$html_url, ")\n",
+    whoarewe[[instructor]]$html_url, ")\n\n",
     ":::\n", sep = ""
   )
 }