-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathtutorial.Rmd
554 lines (336 loc) · 22.8 KB
/
tutorial.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
---
title: "Imputation"
author: "David Coleman"
date: "2023-05-30"
output: html_document
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
library(tidyverse)
```
## Gap-filling AusTraits data
Missingness is inherent to datasets in the biological sciences and the trait data assembled in the AusTraits plant trait database is no exception. Take for example this subset of numeric trait values below.
```{r echo = F}
aus <- read_csv("Assembling_trait_matrix/data/230403_summarised_austraits_traits.csv", show_col_types = FALSE)
missingness = aus %>% summarise(across(leaf_length:leaf_lifespan, list(count = ~ sum(is.na(.))))) %>%
pivot_longer(cols=everything(), names_to="trait_name", values_to="missing_observations") %>%
mutate(percent_missing = round(missing_observations/nrow(df)*100)) %>%
arrange(percent_missing)
knitr::kable(missingness)
```
The summary table shows us that a few of the traits have over 60% taxonomic coverage (plant height, leaf length and leaf width) but the majority are very sparsely represented across the complete species list of Australia's plant taxa.
Unfortunately, many trait metrics that are used to represent the functionality of the ecosystems require a complete taxon-trait matrix with no gaps (Nock et al., 2016; Pakeman, 2014).
This is where gap-filling techniques can help. Gap-filling is the statistical imputation of missing data.
The intention of imputing values is to imitate the qualities and metrics such as the distribution and summary statistics of a complete dataset (Van Buuren, 2018). Imputation is not the same as prediction. Although the gap-filled values should be plausible, using the gap-filled datasets for individual taxonomic analyses are generally not advisable (Johnson et al., 2021; Joswig et al., 2022).
## Gap-filling
Before beginning the process of data imputation, it is worth following the steps outlined below (many of which are explained in greater detail by Johnson (2021) or Joswig (2022)):
First, rule out the "complete cases" method – that is, removal of any missing values of the dataset - as a viable option for the data analysis. This is the simplest way to deal with missing data and can sometimes outperform imputation depending on the completeness of the dataset and the needs of the data analysis.
Secondly, it is advisable to maximise the amount of “real” data in the analysis wherever possible, either from published sources or other datasets. The degree of missingness is highly important to the success of the imputation and the error of the imputed values increases with increasing missingness. Although several thresholds of missingness have been put forward as a cut off for when imputation should not be carried out (e.g. 50% missing data) (Madley-Dowd et al., 2019), there is no fixed amount of missingness beyond which imputation should not be used. Rather, there are varying degrees of accuracy and confidence in the imputed values which may or may not be suitable for the analysis.
Thirdly, supplementary datasets can often assist in the imputation process when traits are poorly represented across a taxonomic dataset. For example, to gap-fill leaf area in the example above (77% missing), the more complete traits of leaf length and leaf width probably correlate to some degree with leaf area and may therefore be helpful. Alternatively, many plant traits are correlated with taxonomy and with environmental variables such as climate, so adding taxonomic and environmental data to the existing trait matrix may also assist with the imputation process.
## Gap-filling techniques
The choice of gap-filling methodologies has increased rapidly over recent years and there are now more than 160 different R packages to help researchers impute missing data (Johnson et al., 2021). We will use one here (Rphylopars, Goolsby et al., 2017) as a case study of how to go about creating and evaluating gap-filled datasets and provide example code to adapt the technique to a second package which has been used previously for plant trait datasets (BHPMF, Schrodt et al., 2015).
## Preparing the data
Rphylopars is an R package that uses phylogenetic information in combination with other trait information to estimate missing values across a taxon-trait matrix. The package is best suited to gap-filling traits that have a strong phylogenetic signal.
Rphylopars requires a phylogenetic tree of all taxa to impute values which can be generated using the V.PhyloMaker2 package (Jin & Qian, 2019). The following code includes a way to generate a taxonomic phylogeny with three levels: family, genus and species, as recorded in AusTraits. Due to the amount of time required to generate the large phylogenetic tree, the code to generate the tree object is commented out, and the tree has been stored in the github repository and can be reread into R.
```{r echo=FALSE}
# install and load the package
#install.packages("Rphylopars")
library(Rphylopars)
library(V.PhyloMaker2)
# Code to create the phylogeny - only needs to be done once
#aus_taxa = left_join(austraits$traits, austraits$taxa %>% select("taxon_name", "genus", "family"), by = "taxon_name")
# taxa <-
# data.frame(
# species = aus_taxa$taxon_name,
# genus = aus_taxa$genus,
# family = aus_taxa$family
# )
#
# taxa <- distinct(taxa)
# tree <- phylo.maker(taxa)
# saveRDS(tree,"Rphylopars/downloads/autraitstree.rds")
# read in the pylogeny
tree = readRDS("Rphylopars/downloads/autraitstree.rds")
phylo <- tree[[1]]
```
## Data transformation
Most numeric traits in AusTraits are positively skewed and require log transformation before analysis.
```{r}
aus = aus %>% mutate(across(leaf_length:leaf_C_per_dry_mass, ~log10(.x), .names="log10_{.col}"),
# a formatting step necessary for the Rphylopars package
species = gsub(" ", "_", taxon_name)) %>%
select(-taxon_name) %>%
# remove taxa not in the phylogentic tree
filter(species %in% phylo$tip.label)
#select some traits that will help predict leaf area
data_in = aus %>% select(
species,
log10_leaf_area,
log10_leaf_length,
log10_leaf_width,
log10_leaf_mass_per_area
) %>% arrange(species)
```
Its always best practice to look at the data before making any imputations to ensure there are not large parts of the phylogeny or trait distribution that are missing.
```{r}
#trait
hist(data_in$log10_leaf_area, main = "leaf area trait data", xlab = "log leaf area (mm2)")
#phylogeny gaps
test = aus %>% filter(is.na(log10_leaf_area),is.na(log10_leaf_length), is.na(log10_leaf_width), is.na(log10_leaf_mass_per_area))
test1 = aus %>% filter(!species %in% test$species) %>% group_by(genus) %>% summarise(present = n())
test2 = test %>% group_by(genus) %>% summarise(missing = n())
test3 = full_join(test1, test2, by = "genus") %>% mutate(ratio = present/missing)
```
## Now to the actual gap-filling
In order to assess the validity of the imputation, we will fit a linear model between the some imputed and observed data. This requires removing a third or so of leaf area observations from the dataset, running the imputation function on the remaining data and then assessing how closely the imputed values align with this subset of leaf area observations.
```{r}
# Create a vector of a 33% random sample of leaf areas
values_to_NA = sample(which(!is.na(data_in$log10_leaf_area)), length(which(!is.na(data_in$log10_leaf_area)))*0.33)
# replace these with NA
data_in$log10_leaf_area[values_to_NA] = NA
# Run the phylogenetic comparative analysis
phylopars_result <-
phylopars(trait_data = data_in,
tree = phylo,
model = "BM")
```
Time passes...
Once the imputation is complete, calculate some metrics to assess the success of the imputation. They are also useful to compare this particular imputation with other imputation replicates using Rphylopars or other methods.
```{r}
# Extract the estimated values
estimated_values <- phylopars_result$anc_recon
est = rownames_to_column(data.frame(estimated_values), var = "species") %>%
# The output computes values for all levels of the phylogeny, hence filter to the original species
filter(species %in% data_in$species)
# add back in the original log transformed leaf area
out = left_join(est, aus %>% rename("leaf_area_original"= "log10_leaf_area") %>% select(species, leaf_area_original)) %>% arrange(species)
# filter to just those that were removed
missing_data = out[values_to_NA,]
# Visually assess
missing_data %>%
ggplot(aes(x = leaf_area_original, y = log10_leaf_area)) + geom_point()
# R2
R2 = summary(lm(leaf_area_original ~ log10_leaf_area, data = missing_data))$adj.r.squared
#RMSE
RMSE = sqrt(mean((missing_data$leaf_area_original - missing_data$log10_leaf_area)^2))
```
Here we have assessed the success of the imputation using the linear relationship between observed data and imputed values, metrics that are appropriate for a broad scope imputation. A broad scope imputation is where one set of imputations may be used for many projects and analyses (Van Buuren, 2018), as opposed to a narrow scope imputation where a specific outcome is required from the complete dataset. In these cases, other metrics are required to assess the success of the imputation in these cases (Johnson et al., 2021).
## Variance
Rphylopars and other gap-filling packages also normally provide a measure of uncertainty (the Variance) for each imputed value which can also give a measure of the success of the imputation - lower mean uncertainty over all imputed values in a dataset suggests a more reliable imputation. Imputed values with the highest values of uncertainty are generally associated with taxa that have little or no input data for the species or within the phylogeny, have high diversity for this trait within the phylogeny or a combination of both factors.
```{r}
# Uncertainty
estimated_variance = phylopars_result$anc_var
rownames_to_column(data.frame(estimated_variance), var = "species") %>%
filter(species %in% data_in$species) %>% arrange(species) -> est1
# look at the mean and median Variance for the imputed observations
mean_var = mean(est1$log10_leaf_area[est1$log10_leaf_area != 0])
median_var = median(est1$log10_leaf_area[est1$log10_leaf_area != 0])
```
When imputing values from a sparse taxon-trait matrix, it will probably be necessary to exclude imputed values from the dataset once the process is complete. There are several protocols for selecting the imputed values for exclusion. These include the removal of imputed values outside the range of original values in the dataset, removal of observations over a certain threshold of uncertainty or removal of clades and taxa that have poor coverage in the trait dataset. Techniques will vary depending on the desired outcome of the analysis but it is always best practice to record and explain how this step was completed.
## Improving the result
There are many ways to optimise the imputation when using imputation tools like Rphylopars
# 1
Rerun the imputation process multiple times in a loop and store the metrics to assess their consistency over different sampling patterns of the data. Each run should produce slightly different metrics due to differences in which data is removed by random sampling when using the sample() function.
```{r}
out = data.frame()
for (i in 1:3){
# For each loop, a different but reproducible random sample of leaf area observations is removed.
set.seed(i)
values_to_NA = sample(which(!is.na(data_in$log10_leaf_area)), length(which(!is.na(data_in$log10_leaf_area)))*0.33)
data_in$log10_leaf_area[values_to_NA] = NA
# Run the phylogenetic comparative analysis
phylopars_result <-
phylopars(trait_data = data_in,
tree = phylo,
model = "BM")
# Extract the estimated values
estimated_values <- phylopars_result$anc_recon
est = rownames_to_column(data.frame(estimated_values), var = "species") %>%
# The output computes values for all levels of the phylogeny, hence filter to the original species
filter(species %in% data_in$species)
# add back in the original log transformed leaf area
est = left_join(est, aus %>% rename("leaf_area_original"= "log10_leaf_area") %>% select(species, leaf_area_original)) %>% arrange(species)
# filter to just those that were removed
missing_data = est[values_to_NA,]
# R2
r = summary(lm(leaf_area_original ~ log10_leaf_area, data = missing_data))$adj.r.squared
#RMSE
RMSE = sqrt(mean((missing_data$leaf_area_original - missing_data$log10_leaf_area)^2))
# Uncertainty
estimated_variance = phylopars_result$anc_var
rownames_to_column(data.frame(estimated_variance), var = "species") %>%
filter(species %in% data_in$species) %>% arrange(species) -> est1
# look at the mean and median Variance for the imputed observations
mean_var = mean(est1$log10_leaf_area[est1$log10_leaf_area != 0])
median_var = median(est1$log10_leaf_area[est1$log10_leaf_area != 0])
metrics = data.frame(rep = i, R2 = r, mean_variance = mean_var, median_variance = median_var)
# store the result
out = rbind(out, metrics)
}
```
# 2
Repeat the analysis with other external data and see if metrics improve. In the example below the same AusTraits data as above is used but with two extra climate variables, Mean Annual Temperature (MAT) and Mean Annual Precipitation (MAP).
```{r}
# read in the data
aus <- read_csv("Assembling_trait_matrix/data/climate_vars.csv", show_col_types = F)
aus = aus %>% mutate(across(leaf_length:leaf_C_per_dry_mass, ~log10(.x), .names="log10_{.col}"),
# a formatting step necessary for the Rphylopars package
species = gsub(" ", "_", taxon_name)) %>%
select(-taxon_name) %>%
# remove taxa not in the phylogentic tree
filter(species %in% phylo$tip.label)
#select some traits that will help predict leaf area
data_in = aus %>% select(
species,
log10_leaf_area,
log10_leaf_length,
log10_leaf_width,
log10_leaf_mass_per_area,
MAT,
MAP
) %>% arrange(species)
out_with_climate = data.frame()
for (i in 1:3){
# For each loop, a different but reproducible random sample of leaf area observations is removed.
set.seed(i)
values_to_NA = sample(which(!is.na(data_in$log10_leaf_area)), length(which(!is.na(data_in$log10_leaf_area)))*0.33)
data_in$log10_leaf_area[values_to_NA] = NA
# Run the phylogenetic comparative analysis
phylopars_result <-
phylopars(trait_data = data_in,
tree = phylo,
model = "BM")
# Extract the estimated values
estimated_values <- phylopars_result$anc_recon
est = rownames_to_column(data.frame(estimated_values), var = "species") %>%
# The output computes values for all levels of the phylogeny, hence filter to the original species
filter(species %in% data_in$species)
# add back in the original log transformed leaf area
est = left_join(est, aus %>% rename("leaf_area_original"= "log10_leaf_area") %>% select(species, leaf_area_original)) %>% arrange(species)
# filter to just those that were removed
missing_data = est[values_to_NA,]
# R2
R2 = summary(lm(leaf_area_original ~ log10_leaf_area, data = missing_data))$adj.r.squared
#RMSE
RMSE = sqrt(mean((missing_data$leaf_area_original - missing_data$log10_leaf_area)^2))
# Uncertainty
estimated_variance = phylopars_result$anc_var
rownames_to_column(data.frame(estimated_variance), var = "species") %>%
filter(species %in% data_in$species) %>% arrange(species) -> est1
# look at the mean and median Variance for the imputed observations
mean_var = mean(est1$log10_leaf_area[est1$log10_leaf_area != 0])
median_var = median(est1$log10_leaf_area[est1$log10_leaf_area != 0])
metrics = data.frame(rep = i, R2 = R2, mean_variance = mean_var, median_variance = median_var)
# store the result
out_with_climate = rbind(out_with_climate, metrics)
}
```
# 3
Use a different method such as BHPMF.
In order to run the following code, you will need to restart R in version 3.4.4 and install the BHPMF package. The example below uses Base R functions to wrangle the data rather than those from the tidyverse packages above, to avoid the need to install older versions of the packages compatible with R versions 3.4.4.
More details about the package and a similar vignette are on the github page: https://github.com/fisw10/BHPMF
```{r}
library(BHPMF)
# The output of the gap-filling is stored in a series of files in the directory below
tmp.dir <- dirname("BHPMF/output_data/leaf_area/tmp/")
# Read in the data
original_data = read.csv("BHPMF/input_data/climate_vars.csv", stringsAsFactors = F)
```
Unlike Rphylopars which requires all traits to have a strong phylogenetic signal, supplementary traits can be useful to include in the gap-filling process when using the BHPMF package, which serve to "stabalise" the imputed values (Joswig et al., 2022). Furthermore, BHPMF cannot gap-fill trait values for taxa without any trait data (i.e. blank cells in all columns), so including other trait data can maximise the numbers of taxa included in the gap-filling process (Joswig et al., 2022). In this example, I've added data from a numeric trait in AusTraits with high taxonomic coverage: plant height.
```{r}
# select the traits you'd like. In addition to the leaf dimensions traits used in the previous method, I've included plant height as a "stabilising" influence (See Joswig et al., 2022)
original_data = original_data[, c("X", "taxon_name", "genus", "family", "leaf_length", "leaf_width", "leaf_area", "plant_height")]
```
```{r}
# remove a third of the observations and store the original data before proceeding with the gap-filling.
original_data$original_leaf_area = original_data$leaf_area
values_to_NA = sample(which(!is.na(original_data$leaf_area)),
length(which(!is.na(original_data$leaf_area)))*0.33)
# record which values will be removed
original_data$values_to_NA = NA
original_data$values_to_NA[values_to_NA] = TRUE
original_data$leaf_area[values_to_NA] = NA
# remove taxa for which no trait values are known.
original_data = original_data[-which(is.na(original_data$leaf_area)&
is.na(original_data$leaf_length)&
is.na(original_data$leaf_width)&
is.na(original_data$plant_height)),]
data = original_data
```
These steps prepare the data for the BHPMF function. You need an index column and the taxonomic hierarchy in one object, (hierarchy.info) and the trait values in another (trait.info).
```{r}
hierarchy.info = data[,c(1, 2, 3, 4)]
trait.info.original = data[,c(5:length(data))]
trait.info = data[,c(5,6, 7, 8)]
# This step log transforms and normalises (produces Z-scores) of each trait.
back_trans_pars <- list()
rm_col <- c()
for(i in 1:ncol(trait.info)){
x <- trait.info[,i] # goes through the columns
min_x <- min(x, na.rm = T) # takes the min of each column
if(min_x < 0.00000000001){
x <- x - min_x + 1 # make this optional if min x is neg
}
logx <- log10(x)
mlogx <- mean(logx, na.rm = T)
slogx <- sd(logx, na.rm = T)
x <- (logx - mlogx)/slogx # Z transformation
# store the params
back_trans_pars[[i]] <- list(min_x = min_x,
mlogx = mlogx,
slogx = slogx)
trait.info[,i] <- x
}
```
Now run the gap-filling function.
```{r}
trait.info = as.matrix(trait.info)
hierarchy.info = as.matrix(hierarchy.info)
GapFilling(trait.info, hierarchy.info, tuning = F,
mean.gap.filled.output.path = paste0(tmp.dir,"/mean_gap_filled.txt"),
std.gap.filled.output.path= paste0(tmp.dir,"/std_gap_filled.txt"), tmp.dir=tmp.dir)
```
Read in the result to analyse and backtransform
```{r}
output = read.delim("BHPMF/output_data/leaf_area/mean_gap_filled.txt")
# back transform the data
for (i in 1:ncol(output) ){
x <- output[,i] # goes through the columns
y = back_trans_pars[[i]]
min_x = y[[1]]
mlogx = y[[2]]
slogx = y[[3]]
x = 10^(x*slogx + mlogx)
if(min_x < 0.00000000001){
x <- x + min_x - 1 # make this optional if min x is neg
}
output[,i] <- x
}
```
Generate the same metrics as for Rphylopars
```{r}
out = cbind(original_data[names(original_data) %in% (c("family", "genus", "taxon_name", "original_leaf_area", "values_to_NA"))], output)
out = out[!is.na(out$values_to_NA),]
# log the data to compare with previous methods
out$log_modelled = log10(out$leaf_area)
out$log_original = log10(out$original_leaf_area)
plot(out$log_original, out$log_modelled)
#R2
R2 = summary(lm(log_original ~ log_modelled, data = out))$adj.r.squared
#RMSE
RMSE = sqrt(mean((out$log_original - out$log_modelled)^2))
# Uncertainty (variance)
var = read.delim("BHPMF/output_data/leaf_area/std_gap_filled.txt")
mean(var$leaf_area)
median(var$leaf_area)
```
## References
Goolsby, E. W., Bruggeman, J., & Ané, C. (2017). Rphylopars: Fast multivariate phylogenetic comparative methods for missing data and within-species variation. Methods in Ecology and Evolution, 8(1), 22–27. https://doi.org/10.1111/2041-210X.12612
Jin, Y., & Qian, H. (2019). V.PhyloMaker: An R package that can generate very large phylogenies for vascular plants. Ecography, 42(8), 1353–1359. https://doi.org/10.1111/ecog.04434
Johnson, T. F., Isaac, N. J. B., Paviolo, A., & González-Suárez, M. (2021). Handling missing values in trait data. Global Ecology and Biogeography, 30(1), 51–62. https://doi.org/10.1111/geb.13185
Joswig, J. S., Kattge, J., Kraemer, G., Mahecha, M. D., Rüger, N., Schaepman, M. E., Schrodt, F., & Schuman, M. C. (2022). Imputing missing data in plant traits: A guide to improve gap-filling. Global Ecology and Biogeography, n/a(n/a). https://doi.org/10.1111/geb.13695
Madley-Dowd, P., Hughes, R., Tilling, K., & Heron, J. (2019). The proportion of missing data should not be used to guide decisions on multiple imputation. Journal of Clinical Epidemiology, 110, 63–73. https://doi.org/10.1016/j.jclinepi.2019.02.016
Nock, C. A., Vogt, R. J., & Beisner, B. E. (2016). Functional traits. ELS, 1–8.
Pakeman, R. J. (2014). Functional trait metrics are sensitive to the completeness of the species’ trait data? Methods in Ecology and Evolution, 5(1), 9–15. https://doi.org/10.1111/2041-210X.12136
Schrodt, F., Kattge, J., Shan, H., Fazayeli, F., Joswig, J., Banerjee, A., Reichstein, M., Bönisch, G., Díaz, S., Dickie, J., Gillison, A., Karpatne, A., Lavorel, S., Leadley, P., Wirth, C. B., Wright, I. J., Wright, S. J., & Reich, P. B. (2015). BHPMF – a hierarchical Bayesian approach to gap-filling and trait prediction for macroecology and functional biogeography. Global Ecology and Biogeography, 24(12), 1510–1521. https://doi.org/10.1111/geb.12335
Van Buuren, S. (2018). Flexible imputation of missing data. CRC press.