forked from microbiome/OMA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy path30_differential_abundance.Rmd
506 lines (418 loc) · 18.5 KB
/
30_differential_abundance.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
# Differential abundance {#differential-abundance}
```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```
## Differential abundance analysis
This section provides an overview and examples of *differential
abundance analysis (DAA)* based on one of the [openly available
datasets](https://microbiome.github.io/mia/reference/mia-datasets.html)
in mia to illustrate how to perform differential abundance analysis
(DAA). DAA identifies differences in the abundances of individual
taxonomic groups between two or more groups (e.g. treatment vs
control). This can be performed at any phylogenetic level.
We perform DAA to identify biomarkers and/or gain understanding of a
complex system by looking at its isolated components. For example,
identifying that a bacterial taxon is different between a patient
group with disease *X* vs a healthy control group might lead to
important insights into the pathophysiology. Changes in the microbiota
might be cause or a consequence of a disease. Either way, it can
help to understand the system as a whole. Be aware that this approach
has also been criticized recently [@Quinn2021].
### Examples and tools
There are many tools to perform DAA. The most popular tools, without going into
evaluating whether or not they perform well for this task, are:
- [ALDEx2](https://bioconductor.org/packages/release/bioc/html/ALDEx2.html)
- [ANCOM-BC](https://bioconductor.org/packages/release/bioc/html/ANCOMBC.html)
- [corncob](https://cran.r-project.org/web/packages/corncob/index.html)
- [DESeq2](https://bioconductor.org/packages/release/bioc/html/DESeq2.html)
- [edgeR](https://bioconductor.org/packages/release/bioc/html/edgeR.html)
- [LEFse](https://bioconductor.org/packages/release/bioc/html/lefser.html)
- [limma voom](https://bioconductor.org/packages/release/bioc/html/limma.html)
- [LinDA](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-022-02655-5)
- [MaAsLin2](https://www.bioconductor.org/packages/release/bioc/html/Maaslin2.html)
- [metagenomeSeq](https://www.bioconductor.org/packages/release/bioc/html/metagenomeSeq.html)
- [t-test](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/t.test)
- [Wilcoxon test](https://www.rdocumentation.org/packages/stats/versions/3.6.2/topics/wilcox.test)
We recommend to have a look at @Nearing2022
who compared all these listed methods across 38
different datasets. Because different methods use different approaches
(parametric vs non-parametric, different normalization techiniques, assumptions
etc.), results can differ between methods.
Unfortunately, as @Nearing2022 point out, they
can differ substantially. Therefore, it is highly recommended to pick several
methods to get an idea about how robust and potentially reproducible your
findings are depending on the method. In this section we demonstrate 4 methods
that can be recommended based on recent literature (ANCOM-BC, ALDEx2, Maaslin2
and LinDA) and we will compare the results between them.
Note that the purpose of this section is to show how to perform DAA in R, not
how to correctly do causal inference. Depending on your experimental setup
and your theory, you must determine how to specify any model exactly.
E.g., there might be confounding factors that might drive (the absence of)
differences between the shown groups that we ignore here for simplicity.
However, we will show how you could include covariates in those models.
Furthermore, we picked a dataset that merely has microbial abundances in a TSE
object as well as a grouping variable in the sample data. We simplify the
analysis by only including 2 of the 3 groups.
```{r load-pkg-data}
library(mia)
library(patchwork)
library(tidySummarizedExperiment)
library(ALDEx2)
library(Maaslin2)
library(MicrobiomeStat)
library(knitr)
library(tidyverse)
# For ANCOMBC we need development version
# Can be installed with:
# remotes::install_github("FrederickHuangLin/ANCOMBC")
library(ANCOMBC)
# we use the dmn_se dataset and restrict it to
# obese vs lean for easy illustration
data(dmn_se)
se <- dmn_se
# To enable all features and advantages of TreeSE, we convert the object from SE to TreeSE
tse <- as(se, "TreeSummarizedExperiment")
tse <- tse[ ,colData(tse)$pheno != "Overwt"]
colData(tse)$pheno <- fct_drop(colData(tse)$pheno, "Overwt")
# how many observations do we have per group?
count(as.data.frame(colData(tse)), pheno) %>% kable()
# set a seed because some tools can randomly vary and then produce
# different results:
set.seed(1)
```
### Prevalence Filtering
Before we jump to our analyses, we may want to perform prevalence filtering.
@Nearing2022 found that applying a 10% threshold
for the prevalence of the taxa generally resulted in more robust results.
Some tools have builtin arguments for that. By applying the threshold to our
input data, we can make sure it is applied for all tools. Below we show how to
do this in `mia`:
```{r}
tse <- subsetByPrevalentTaxa(tse, detection = 0, prevalence = 0.1)
```
### ALDEx2
In this section, we will show how to perform a simple ALDEx2 analysis.
If you wanted to pick a single method, this method could be recommended to use.
According to the developers experience, it tends to identify the common
features identified by other methods. This statement is in line with a recent
independent evaluation by @Nearing2022.
Please also have a look at the more extensive
[vignette](https://bioconductor.org/packages/release/bioc/vignettes/ALDEx2/inst/doc/ALDEx2_vignette.html)
that covers this flexible tool in more depth. ALDEx2 estimates technical
variation within each sample per taxon by utilizing the Dirichlet distribution.
It furthermore applies the centered-log-ratio transformation (or closely
related log-ratio transforms). Depending on the experimental setup, it will
perform a two sample Welch's T-test and Wilcoxon-test or a one-way ANOVA and
Kruskal-Wallis-test. For more complex study designs, there is a possibility to
utilize the `glm` functionality within ALDEx2. The Benjamini-Hochberg procedure
is applied in any case to correct for multiple testing. Below we show a simple
example that illustrates the workflow.
```{r}
# Generate Monte Carlo samples of the Dirichlet distribution for each sample.
# Convert each instance using the centred log-ratio transform.
# This is the input for all further analyses.
x <- aldex.clr(
reads = assay(tse),
conds = colData(tse)$pheno,
# 128 recommened for ttest, 1000 for rigorous effect size calculation
mc.samples = 128,
denom = "all",
verbose = FALSE
)
# calculates expected values of the Welch's t-test and Wilcoxon rank test on
# the data returned by aldex.clr
x_tt <- aldex.ttest(
x,
paired.test = FALSE,
verbose = FALSE)
# determines the median clr abundance of the feature in all samples and in
# groups, the median difference between the two groups, the median variation
# within each group and the effect size, which is the median of the ratio
# of the between group difference and the larger of the variance within groups
x_effect <- aldex.effect(x, CI = TRUE, verbose = FALSE)
# combine all outputs
aldex_out <- data.frame(x_tt, x_effect)
```
Now, we can create a so called Bland-Altman or MA plot (left). It shows the
association between the relative abundance and the magnitude of the difference
per sample. Next to that, we can also create a plot that shows the dispersion
on the x-axis instead of log-ratio abundance. Red dots represent genera that are
differentially abundant ($q \leq 0.1$) between the 2 groups. Black points are
rare taxa and grey ones are abundant taxa. The dashed line represent an effect
size of 1. See @Gloor2016 to learn more about these plots.
```{r}
par(mfrow = c(1, 2))
aldex.plot(
aldex_out,
type = "MA",
test = "welch",
xlab = "Log-ratio abundance",
ylab = "Difference",
cutoff = 0.05
)
aldex.plot(
aldex_out,
type = "MW",
test = "welch",
xlab = "Dispersion",
ylab = "Difference",
cutoff = 0.05
)
```
The evaluation as differential abundant in above plots is based on the
corrected pvalue. According to the ALDEx2 developers, the safest approach is to
identify those features where the 95% CI of the
effect size does not cross 0. As we can see in below table, this is not the
case for any of the identified
genera (see overlap column, which indicates the proportion of overlap). Also,
the authors recommend to focus on effect sizes and CIs rather than interpreting
the pvalue. To keep the comparison simple, we will here use the pvalue as
decision criterion. But please be aware that the effect size together with the
CI is a better answer to the question we are typically interested in
(see also [this article](https://www.nature.com/articles/d41586-019-00857-9)).
```{r}
rownames_to_column(aldex_out, "genus") %>%
filter(wi.eBH <= 0.05) %>% # here we chose the wilcoxon output rather than tt
select(genus, we.eBH, wi.eBH, effect, overlap) %>%
kable()
```
### ANCOM-BC
The analysis of composition of microbiomes with bias correction
(ANCOM-BC) [@Huang2020]
is a recently developed method for differential abundance testing. It is based
on an
earlier published approach [@Mandal2015].
The previous version of ANCOM was among the methods that produced the
most consistent results and is probably a conservative approach
[@Nearing2022].
However, the new ANCOM-BC method operates quite differently compared to the
former ANCOM method.
As the only method, ANCOM-BC incorporates the so called *sampling fraction*
into the model. The latter term could be empirically estimated by the ratio of
the library size to the microbial load. According to the authors, variations in
this sampling fraction would bias differential abundance analyses if ignored.
Furthermore, this method provides p-values and confidence intervals for each
taxon. It also controls the FDR and it is computationally simple to implement.
As we will see below, to obtain results, all that is needed is to pass
a tse object to the `ancombc()` function. Below, we first specify a formula.
In this formula, other covariates could potentially be included to adjust for
confounding. We show this further below.
Please check the
[function documentation](https://rdrr.io/github/FrederickHuangLin/ANCOMBC/man/ancombc.html)
to learn about the additional arguments that we specify below.
```{r warning = FALSE}
# perform the analysis
out <- ancombc(
data = tse,
formula = "pheno",
p_adj_method = "fdr",
prv_cut = 0, # no prev filtering necessary anymore
lib_cut = 0,
group = "pheno",
struc_zero = TRUE,
neg_lb = TRUE,
tol = 1e-5,
max_iter = 100,
conserve = TRUE,
alpha = 0.05,
global = TRUE # multi group comparison will be deactivated automatically
)
# store the results in res
res <- out$res
```
The object `out` contains all model output. Again, see the
[documentation of the function](https://rdrr.io/github/FrederickHuangLin/ANCOMBC/man/ancombc.html)
under **Value** for an explanation of all the output objects. Our question
whether taxa are differentially abundant can be answered by looking at the
`res` object, which now contains dataframes with the coefficients,
standard errors, p-values and q-values. Conveniently, there is a dataframe
`diff_abn`. Here, for each taxon it is indicated whether it is differentially
abundant between the groups (again, keep in mind that the answer is not
black-white). Below we show the first 6 entries of this dataframe:
```{r}
kable(head(res$diff_abn))
```
### MaAsLin2
Next, we will illustrate how to use MaAsLin2, which is the next generation of
MaAsLin. As it is based on generalized linear models, it is flexible for different study designs and covariate
structures. The official package tutorial can be found [here](https://github.com/biobakery/biobakery/wiki/maaslin2).
```{r results = "hide"}
# maaslin expects features as columns and samples as rows
# for both the asv/otu table as well as meta data
asv <- t(assay(tse))
meta_data <- data.frame(colData(tse))
# you can specifiy different GLMs/normalizations/transforms. We used similar
# settings as in Nearing et al. (2021) here:
fit_data <- Maaslin2(
asv,
meta_data,
output = "DAA example",
transform = "AST",
fixed_effects = "pheno",
# random_effects = c(...), # you can also fit MLM by specifying random effects
# specifying a ref is especially important if you have more than 2 levels
reference = "pheno,Lean",
normalization = "TSS",
standardize = FALSE,
min_prevalence = 0 # prev filterin already done
)
```
```{r}
# which genera are identified as differentially abundant? (leave out "head" to
# see all)
kable(head(filter(fit_data$results, qval <= 0.05)))
# A folder will be created that is called like the above specified output.
# It contains also figures to visualize the difference between genera
# for the significant ones.
```
### LinDA
Lastly, we cover linear models for differential abundance analysis of
microbiome compositional data (@Zhou2022). This tool is very
similar to ANCOMBC with few differences: 1) LinDA correct for the compositional
bias differently
using the mode of all regression coefficients. 2) The authors claim that it
runs 100-1000x faster than ANCOMBC and 3) it support hierarchical models.
The latter could be ignored as ANCOMBC will be supporting hierarchical models
with the next release. Nevertheless, LinDA seems a promising tool that achieves
the best power/fdr trade-off together with ANCOMBC according to the authors. The
speed might make it the choice for bigger datasets or datasets with a very high
number of features.
```{r}
otu.tab <- as.data.frame(assay(tse))
meta <- as.data.frame(colData(tse)) %>% select(pheno)
res <- linda(
otu.tab,
meta,
formula = '~pheno',
alpha = 0.05,
prev.filter = 0,
mean.abund.filter = 0)
# to scan the table for genera where H0 could be rejected:
kable(head(filter(as.data.frame(res$output), phenoObese.reject)))
```
### Comparison of the methods
When we compare the methods in the context of a research question, we could
look at e.g. at whether they agree based on the applied decision criterion
(e.g. adjusted p value < 0.05). That is what we illustrate here. First we will
look at how many taxa were identified by each method to begin with. In the next
step we will look at the intersection of identified taxa. To achieve that, we
first create a dataframe that summarises the decision criterion for each method
and shows a score from 0 to 3 indicating how many methods agreed on a particular
taxon.
```{r}
# Rename "taxon" column from ancombc results so that it match with others
colnames(out$res$diff_abn)[colnames(out$res$diff_abn) == "taxon"] <- "genus"
summ <- full_join(
rownames_to_column(aldex_out, "genus") %>%
select(genus, aldex2 = wi.eBH),
out$res$diff_abn %>%
select(genus, ancombc = phenoObese),
by = "genus") %>%
full_join(
select(fit_data$results, genus = feature, maaslin2 = qval),
by = "genus") %>%
full_join(
rownames_to_column(as.data.frame(res$output), "genus") %>%
select(genus, LinDA = phenoObese.reject),
by = "genus") %>%
mutate(
across(c(aldex2, maaslin2), ~ .x <= 0.05),
# the following line would be necessary without prevalence filtering
# as some methods output NA
#across(-genus, function(x) ifelse(is.na(x), FALSE, x)),
score = rowSums(across(c(aldex2, ancombc, maaslin2, LinDA)))
)
# This is how it looks like:
kable(head(summ))
```
Now we can answer our questions:
```{r}
# how many genera were identified by each method?
summarise(summ, across(where(is.logical), sum)) %>%
kable()
# which genera are identified by all methods?
filter(summ, score == 4) %>% kable()
```
We see that each method identified at least 9 genera as differentially
abundant. Eight of those that were identified by ALDEx2,
were also identified by the other methods. We could plot the data for
any method or for those taxa that were identified by all methods:
```{r}
plot_data <- data.frame(t(assay(tse)))
plot_data$pheno <- colData(tse)$pheno
# create a plot for each genus where the score is indicated in the title
plots <- pmap(select(summ, genus, score), function(genus, score) {
ggplot(plot_data, aes_string("pheno", genus)) +
geom_boxplot(aes(fill = pheno), outlier.shape = NA) +
geom_jitter(width = 0.2, alpha = 0.5) +
ggtitle(glue::glue("Score {score}")) +
theme_bw() +
theme(legend.position = "none")
})
# now we can show only those genera that have at least score 3 (or 2 or 1)
robust_plots <- plots[summ$score == 4]
# to display this nicely in the book we use patchwork here:
# (we show first 8)
robust_plots[[1]] +
robust_plots[[2]] +
robust_plots[[3]] +
robust_plots[[4]] +
robust_plots[[5]] +
robust_plots[[6]] +
robust_plots[[7]] +
robust_plots[[8]] +
plot_layout(nrow = 2)
# or if we have most trust in any specific method we can show genera that
# are differentially abundant according to that method and then look in the
# title how many methods also identified it (we only show first 6 here):
ancombc_plots <- plots[summ$ancombc]
ancombc_plots[[1]] +
ancombc_plots[[2]] +
ancombc_plots[[3]] +
ancombc_plots[[4]] +
ancombc_plots[[5]] +
ancombc_plots[[6]]
```
### Confounding variables
To perform causal inference, it is crucial that the method is able to include
covariates in the model. This is not possible with e.g. the Wilcoxon test.
Other methods such as both ANCOM methods, ALDEx2, LinDA, MaAsLin2 and others
allow this. Below we show how to include a covariate in ANCOM-BC.
It is very similar for all the methods that allow this. Since in this dataset
there are no covariates, I first simulate a new variable and add it to the TSE
object.
```{r}
# Add random age variable for demonstration
colData(tse)$age <- sample(18:64, ncol(tse), replace = TRUE)
out_cov = ancombc(
data = tse,
formula = "pheno + age", # here we add age to the model
p_adj_method = "fdr",
prv_cut = 0, # we did that already
lib_cut = 0,
group = "pheno",
struc_zero = TRUE,
neg_lb = TRUE,
tol = 1e-5,
max_iter = 100,
conserve = TRUE,
alpha = 0.05,
global = TRUE
)
# now the model answers the question: holding age constant, are
# bacterial taxa differentially abundant? Or, if that is of interest,
# holding phenotype constant, is age associated with bacterial abundance?
# Again we only show the first 6 entries.
kable(head(out_cov$res$diff_abn))
```
In the next section of this book chapter we cover methods that can also take
into account the phylogenetic information of bacterial taxa to perform
group-wise associations.
## Tree-based methods
### Group-wise associations testing based on balances
TO DO!!!
## Session Info {-}
```{r sessionInfo, echo=FALSE, results='asis'}
prettySessionInfo()
```