Various fixes

athowes · Dec 31, 2023 · c2c857d · c2c857d
1 parent d60062e
commit c2c857d
Show file tree

Hide file tree

Showing 22 changed files with 153 additions and 156 deletions.
diff --git a/01-introduction.Rmd b/01-introduction.Rmd
@@ -80,7 +80,7 @@ knitr::include_graphics("figures/introduction/gbd.png")
 
 The data used were gathered in national household surveys or routinely collected from healthcare facilities providing HIV services.
 An important feature of these data are the location and time at which observations were recorded.
-Spatio-temporal data have important recurring commonalities which occur across a diverse range of application settings.
+Spatio-temporal data have important recurring commonalities across a diverse range of application settings.
 The work conducted in this thesis uses, and aspires to contribute to, techniques from spatio-temporal statistics.
 
 Computation is an essential part of modern statistical practice.
@@ -95,15 +95,16 @@ This thesis is structured as follows:
 * Chapter \@ref(bayes-st) introduces the statistical concepts and notation used throughout the thesis, focusing on Bayesian modelling and computation, spatio-temporal statistics, and survey methods.
 * Chapter \@ref(beyond-borders): The prevailing model for spatial structure used in small-area estimation [@besag1991bayesian] was intended to analyse a grid of pixels.
 In disease mapping, areas correspond to the administrative divisions of a country, which are typically not a grid.
-I used a simulation and survey data studies to evaluate the practical consequences of this concern 
+I used simulation and survey data studies to evaluate the practical consequences of this concern 
 * Chapter \@ref(multi-agyw): Adolescent girls and young women are a demographic group at disproportionate risk of acquiring HIV infection.
 The Global AIDS Strategy recommends prioritising interventions on the basis of behaviour to prevent the most new infections using available resources.
 I estimated the size of behavioural risk groups across priority countries to enable implementation of this strategy, and assessed the potential benefits in terms of numbers of new infections prevented.
 This work [@howes2023spatio] was included in the UNAIDS (Joint United Nations Programme on HIV/AIDS) Global AIDS Update 2022 and 2023.
 * Chapter \@ref(naomi-aghq): The Naomi small-area estimation model [@eaton2021naomi] is used by countries to estimate district-level HIV indicators.
 First, to allow for compatibility with Naomi, I implemented the integrated nested Laplace approximations using automatic differentiation, opening the door to a new class of fast, flexible, and accurate Bayesian inference algorithms.
+The implementation was using models for a clinical trial of an epilepsy drug, and for the prevalence of the parasitic worm Loa loa.
 Second, I developed an approximate Bayesian inference method combining adaptive Gauss-Hermite quadrature with principal components analysis.
 I applied these method to data from Malawi, and analysed the consequences of inference method choice for policy relevant outcomes.
-* Chapter \@ref(conclusions): Finally, I discuss avenues for future work, and my conclusions regarding the research, as well as its strengths and weaknesses.
+* Chapter \@ref(conclusions): Finally, I discuss contributions of the research, avenues for future work, and some broader reflections.
 
 Though chronological order is recommended, Chapters \@ref(beyond-borders), \@ref(multi-agyw) and \@ref(naomi-aghq) may be read in any order, or as stand-alone studies, if preferred.
diff --git a/03-bayesian.Rmd b/03-bayesian.Rmd
@@ -243,12 +243,6 @@ In combination, large data (big $n$) and models with a large number of parameter
 
 ### Small-area estimation {#small-area-estimation}
 
-Data always has some cost to collect.
-This cost can be significant and prohibitive, especially for data relating to people where collection is difficult to automate.
-In spatio-temporal statistics, there are a large number of possible locations in space and time.
-Given the cost of data collection, often no or limited direct observations may be available for any given space-time location.
-Direct estimates of indicators of interest are either impossible or inaccurate in this setting.
-
 (ref:zmb-maps) Simulation of a simple random sample $y_i \sim \text{Bin}(m, p_i)$ with varying sample size $m = 5, 25, 125$ in each of the $i = 1, \ldots, 156$ constituencies of Zambia. Direct estimates were obtained by the empirical ratio of data to sample size. Modelled estimates were obtained using a logistic regression with linear predictor given by an intercept and a spatial random effect. Estimates of HIV indicators for Zambia have previously been generated at the district-level, comprising 116 spatial units. Moving forward, there is interest in generating estimates at the higher-resolution constituency level, as program planning is devolved locally. The viridis colour palette, as implemented by the `viridis` R package [@viridis], was used in this figure. It is use often throughout this thesis because it is perceptually uniform and accessible to colourblind viewers [@smith2015better]. This figure was adapted from a presentation given for the Zambia HIV Estimates Technical Working Group, available from [`https://github.com/athowes/zambia-unaids`](https://github.com/athowes/zambia-unaids).
 
 ```{r zmb-maps, fig.cap="(ref:zmb-maps)"}
@@ -261,6 +255,13 @@ knitr::include_graphics("figures/bayesian/zmb-maps.png")
 knitr::include_graphics("figures/bayesian/zmb-scatter.png")
 ```
 
+Data always has some cost to collect.
+This cost can be significant and prohibitive.
+Especially for data relating to people, where collection is difficult to automate.
+In spatio-temporal statistics, there are a large number of possible locations in space and time.
+Given the cost of data collection, often no or limited direct observations may be available for any given space-time location.
+Direct estimates of indicators of interest are either impossible or inaccurate in this setting.
+
 Small-area estimation [SAE; @pfeffermann2013new] methods aim to overcome the limitations of small data by sharing information.
 In the spatio-temporal setting sharing of information occurs across space and time.
 Prior knowledge that observations in one spatio-temporal location are correlated with those at another (Section \@ref(correlation-structure)) can be used to improve estimates.

diff --git a/04-beyond-borders.Rmd b/04-beyond-borders.Rmd
@@ -315,12 +315,7 @@ This choice results in the centroid kernel
 \begin{equation}
     K(A_i, A_j) = k(c_i, c_j). (\#eq:ck)
 \end{equation}
-The centroid kernel has been used:
-
-* in environmental epidemiology [@wakefield1999spatial];
-* for US election modelling [@flaxman2015supported];
-* to model the reproduction number of COVID-19 [@teh2021efficient].
-
+The centroid kernel has been used in environmental epidemiology [@wakefield1999spatial], for US election modelling [@flaxman2015supported], and to model the reproduction number of COVID-19 [@teh2021efficient].
 In a model comparison study @best2005comparison (Section 3) simulated data representing heterogeneous exposure to air pollution, including elevated rates of exposure near two hypothetical point source locations, and found that the centroid kernel tended to over-smooth the high-risk areas.
 That said, it unsurprising that a stationary covariance function would struggle to recover non-stationary structure.
 
@@ -337,7 +332,7 @@ This covariance structure is equivalent to that obtained by aggregating a spatia
 In the machine learning literature, models of this kind have been studied under the name aggregated Gaussian processes [@law2018variational; @tanaka2019spatially; @yousefi2019multi; @hamelijnck2019multi; @chau2021deconditional].
 Examples of use of this model in statistical practice are rare.
 
-#### Accounting for heterogeneity
+#### Accounting for heterogeneity {#ik-hetrogeneity}
 
 Additional information accounting for heterogeneity over $A_i$ may be incorporated into the integrated kernel.
 This can be accomplished using weighting distributions $\{w_i\}$ which represent an unequal contribution of each point to the similarity measure.
@@ -501,7 +496,7 @@ The sensitivity analysis in Appendix \@ref(lengthscale-prior) illustrates the ex
 
 Approximate Bayesian inference was conducted using adaptive Gauss-Hermite quadrature [AGHQ; @stringer2022fast] with $k = 3$ quadrature points over a marginal Laplace approximation via the `aghq` package [@stringer2021implementing].
 Models were implemented using a Template Model Builder C++ template for the log-posterior via the `TMB` package [@kristensen2016tmb].
-Appendix \@ref(aghq-nuts) compares posterior mean and standard deviations from AGHQ to those obtained using the No-U-Turn Sampling (NUTS) Hamiltonian Monte Carlo (HMC) algorithm run using `Stan` [@carpenter2017stan] via the `tmbstan` package [@monnahan2018no].
+Appendix \@ref(aghq-nuts) compares posterior mean and standard deviations from AGHQ to those obtained using the No-U-Turn Sampler (NUTS) Hamiltonian Monte Carlo (HMC) algorithm run using `Stan` [@carpenter2017stan] via the `tmbstan` package [@monnahan2018no].
 
 ### Model assessment {#model-assessment}
 
@@ -513,11 +508,10 @@ Let $\omega$ be the true value of $\phi$ used in the simulation.
 The accuracy of latent field parameter and hyperparameter posterior marginals from each model were assessed using three methods.
 These were the mean squared error (MSE), the continuous ranked probability score [CRPS; @matheson1976scoring], and the calibration.
 
-The MSE is a simple and popular measure given by
+The MSE is a simple and popular measure, calculated using samples as
 \begin{equation}
 \text{MSE}(f, \omega) \approx \frac{1}{S} \sum_{s = 1}^S (\phi_s - \omega)^2.
 \end{equation}
-
 The CRPS is a strictly proper scoring rule (SPSR) which has favourable properties and is often regarded as a default choice [@gneiting2007strictly].
 Any scoring rule which is not strictly proper rewards a misrepresentation of beliefs.
 The CRPS is
@@ -741,6 +735,8 @@ The work in this chapter does not address the broader question of under which ci
 The models used for spatial structure in this chapter were all stationary.
 Although stationarity assumptions may be violated by HIV survey data, it remains challenging to estimate non-stationary spatial structure [@paciorek2006spatial].
 
+It would be of interest to implement the integrated kernel model with population-based weighting (Section \@ref(ik-hetrogeneity)).
+
 <!-- * What are my suggestions for new models to develop on the basis of this work -->
 <!-- * GMRFs perform smoothing more locally -->
 <!-- * BYM2 proportion parameter in the case where there is an outlier ends up too IID. Urban rural or other covariates. Heavier tail than Gaussian. -->

diff --git a/05-multi-agyw.Rmd b/05-multi-agyw.Rmd
@@ -26,7 +26,7 @@ I developed the statistical model, building upon an earlier version of the analy
 The model and results for 13 countries are presented in @howes2023spatio.
 Outputs are implemented in a spreadsheet tool ([`https://hivtools.unaids.org/pse/`](https://hivtools.unaids.org/pse/)) for use in national HIV response planning.
 The tool is being updated by inclusion of more countries to the analysis, and extension of the methodology, including to additional risk groups.
-Code for the analysis in this chapter is available from [`https://github.com/athowes/multi-agyw`](https://github.com/athowes/multi-agyw).
+Code for the analysis in this chapter is available from [`https://github.com/athowes/multi-agyw`](https://github.com/athowes/multi-agyw) and supported by the `multi.utils` R package [@howes2023multiutils].
 
 ## Background
 
@@ -535,51 +535,42 @@ To allow the log odds ratio for the highest risk group to vary based on general
 I ensured that log odds ratios for the FSW risk group were at least as large as those for the multiple or non-regular partner(s) risk group.
 
 Given the fitted log odds ratios, I disaggregated Naomi estimates of PLHIV $H_{ia}$ on the logit scale using numerical optimisation.
-I did this by by finding the values of $\theta_{ia}$ which minimise the equation
+To do so, I found the values of $\theta_{ia}$ which minimised the equation
 \begin{equation}
 f(\theta_{ia}) = \sum_{k = 1}^4 \left( \text{logistic}(\theta_{ia} + \log(\text{OR}_k)) \cdot N_{iak} \right) - H_{ia},
 \end{equation}
-where
-\begin{equation}
-\text{logistic}(x) = \exp(x) / (1 + \exp(x)),
-\end{equation}
-such that
-\begin{equation}
-\text{logistic}(\hat \theta_{ia} + \log(\text{OR}_k)) = \rho_{iak}.
-\end{equation}
-These values are given by
+where $\text{logistic}(x) = \exp(x) / (1 + \exp(x))$ such that $\text{logistic}(\hat \theta_{ia} + \log(\text{OR}_k)) = \rho_{iak}$.
+These values were given by
 \begin{equation}
 \hat \theta_{ia} = \arg\min_{\theta_{ia} \in [-10, 10]} f(\theta_{ia})^2.
 \end{equation}
 The number of PLHIV were obtained by $H_{iak} = \rho_{iak} N_{iak}$, where $N_{iak}$ is the risk group population size.
 
 ### Disaggregation of Naomi incidence estimates
 
-I calculated the number of new HIV infections by risk group using linear disaggregation
+I used linear disaggregation to calculate the number of new HIV infections by risk group
 \begin{align}
     I_{ia} &= \sum_k I_{iak} = \sum_k \lambda_{iak} (1 - \rho_{iak}) N_{iak} \\
     &= 0 + \lambda_{ia2} (1 - \rho_{ia2}) N_{ia2} + \lambda_{ia3} (1 - \rho_{ia3}) {ia3} + \lambda_{ia4} (1 - \rho_{ia4}) N_{ia4} \\
     &= \lambda_{ia2} \left((1 - \rho_{ia2}) N_{ia2}  + \text{RR}_{3} (1 - \rho_{ia3}) N_{ia3} + \text{RR}_4(\lambda_{ia}) (1 - \rho_{i4}) N_{ia4}  \right),
 \end{align}
 where $\text{RR}_{2}$, $\text{RR}_{3}$ and $\text{RR}_{4}(\cdot)$ are the HIV risk ratios given in Table \@ref(tab:risk-groups), and $(1 - \rho_{iak}) N_{iak}$ are the susceptible population sizes in each risk group.
 The risk ratio for FSW was defined as a function of district-level incidence in the general population $\lambda_{ia}$.
-
 Risk group specific HIV incidence estimates were then given by
 \begin{align}
     \lambda_{ia1} &= 0, \\
     \lambda_{ia2} &= \frac{I_{ia}}{(1 - \rho_{ia2}) N_{ia2} + \text{RR}_{3} (1 - \rho_{ia3}) N_{ia3} + \text{RR}_4(\lambda_{ia}) (1 - \rho_{ia4}) N_{ia4}}, \\
     \lambda_{ia3} &= \text{RR}_{3} \lambda_{ia2}, \\
     \lambda_{ia4} &= \text{RR}_4(\lambda_{ia}) \lambda_{ia2}.
 \end{align}
-I evaluated these equations using Naomi model estimates of the number of new HIV infections $I_{ia} = \lambda_{ia} N_{ia}$.
+These equations were evaluated using Naomi model estimates of the number of new HIV infections $I_{ia} = \lambda_{ia} N_{ia}$.
 The number of new HIV infections were $I_{iak} = \lambda_{iak} N_{iak}$.
 
 ### Expected new infections reached
 
-I calculated the number of new infections that would be reached prioritising according to each possible stratification of the population.
-That is, for all $2^3 = 8$ possible combinations of stratification by location, age, and risk group. 
-
-To illustrate this approach, consider stratification by age.
+The number of new infections that would be reached prioritising according to each possible stratification of the population were calculated.
+Each possible stratification refers to all $2^3 = 8$ possible combinations of stratification by location, age, and risk group.
+As an illustration, consider stratification by age.
 I first aggregated the number of new HIV infections and HIV incidence such that
 \begin{align}
     I_a &= \sum_{ik} I_{iak}, \\
@@ -703,11 +694,7 @@ fsw_iso3 <- df_3p1_subnational %>%
   summarise(estimate_smoothed = sum(estimate_smoothed * population_mean, na.rm = TRUE) / sum(population_mean, na.rm = TRUE))
 ```
 
-Figure \@ref(fig:aaa-variance-proportions) and Figure \@ref(fig:3p1-within-between-country-variation) show posterior mean estimates for the proportion in each risk group for the final model in 2018, the most recent year included in our analysis.
-I focused on the most recent estimates because they are the most relevant to inform ongoing HIV policy.
-In subsequent results, all estimates refer to 2018, unless otherwise indicated.
-
-(ref:3p1-continental-map) The spatial distribution (posterior mean) of the AGYW risk group proportions in 2018. Estimates are stratified by risk group (columns) and five-year age group (rows). Countries in grey were not included in the analysis. A limitation of this figure is that using a common colour scale, desirable for other reasons, makes it challenging to see spatial variation in the FSW risk group.
+(ref:3p1-continental-map) The posterior mean of the AGYW risk group proportions over space in 2018. Estimates are stratified by risk group (columns) and five-year age group (rows). Countries in grey were not included in the analysis. A limitation of this figure is that using a common colour scale, though desirable for other reasons, makes it challenging to see spatial variation in the FSW risk group.
 
 ```{r 3p1-continental-map, fig.cap="(ref:3p1-continental-map)"}
 knitr::include_graphics(paste0("resources/multi-agyw/", resource_version, "/depends/3p1-continental-map.png"))
@@ -719,6 +706,10 @@ knitr::include_graphics(paste0("resources/multi-agyw/", resource_version, "/depe
 knitr::include_graphics(paste0("resources/multi-agyw/", resource_version, "/depends/3p1-within-between-country-variation.png"))
 ```
 
+Figure \@ref(fig:aaa-variance-proportions) and Figure \@ref(fig:3p1-within-between-country-variation) show posterior mean estimates for the proportion in each risk group for the final model in 2018, the most recent year included in our analysis.
+I focused on the most recent estimates because they are the most relevant to inform ongoing HIV policy.
+In subsequent results, all estimates refer to 2018, unless otherwise indicated.
+
 The median national FSW proportion was `r fsw_age_Y015_019$median`\% (95\% CI `r fsw_age_Y015_019$lower`--`r fsw_age_Y015_019$upper`) for the 15-19 age group, `r fsw_age_Y020_024$median`\% (95\% CI `r fsw_age_Y020_024$lower`--`r fsw_age_Y020_024$upper`) for the 20-24 age group and `r fsw_age_Y025_029$median`\% (95\% CI `r fsw_age_Y025_029$lower`--`r fsw_age_Y025_029$upper`) for the 25-29 age group, in line with the results displayed in Figure \@ref(fig:age-disagg-fsw-line).
 
 In the 20-24 and 25-29 year age groups, the majority of women were either cohabiting or had non-regular or multiple partner(s).
@@ -733,6 +724,12 @@ The exception was Mozambique, where the majority (`r 100 - round(moz_Y015_019_no
 
 #### Coverage assessment
 
+(ref:coverage) Probability integral transform (PIT) histograms (top row) and empirical cumulative distribution function (ECDF) difference plots (bottom row) for the final selected model.
+
+```{r coverage, fig.cap="(ref:coverage)"}
+knitr::include_graphics(paste0("resources/multi-agyw/", resource_version, "/depends/coverage.png"))
+```
+
 To assess the calibration of the fitted model, I calculated the quantile $q$ of each observation within the posterior predictive distribution.
 For calibrated models, these quantiles, known as probability integral transform (PIT) values [@dawid1984present; @bosse2022scoringutils], should follow a uniform distribution $q \sim \mathcal{U}[0, 1]$.
 To generate samples from the posterior predictive distribution, I applied the multinomial likelihood to samples from the latent field, setting the sample size to be the floor of the Kish effective sample size.
@@ -743,12 +740,6 @@ To test for uniformity, I used the binomial distribution based simultaneous conf
 I found the only significant deviation from uniformity occurred in the right-hand tail of the one cohabiting partner risk group.
 That is to say, the proportion of the PIT values which were greater than 0.95 was significantly more than would be expected under a calibrated model.
 
-(ref:coverage) Probability integral transform (PIT) histograms (top row) and empirical cumulative distribution function (ECDF) difference plots (bottom row) for the final selected model.
-
-```{r coverage, fig.cap="(ref:coverage)"}
-knitr::include_graphics(paste0("resources/multi-agyw/", resource_version, "/depends/coverage.png"))
-```
-
 #### Variance decomposition {#variance-decomposition}
 
 ```{r}