forked from microbiome/OMA
-
Notifications
You must be signed in to change notification settings - Fork 0
/
10_manipulation.Rmd
356 lines (271 loc) · 10.8 KB
/
10_manipulation.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
# (PART) Focus Topics {-}
# Data Manipulation {#datamanipulation}
```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```
## Tidying and subsetting
### Tidy data
For several custom analysis and visualization packages, such as those from the
`tidyverse`, the `SE` data can be converted to long data.frame format with
`meltAssay`.
```{r include=FALSE}
# REMOVE THIS CHUNK ################################################################
# when bioc devel version has mergeSE
if( !require(devtools) ){
install.packages("devtools")
library(devtools)
}
devtools::install_github("microbiome/mia", upgrade = "never")
library(mia)
####################################
```
```{r}
library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformSamples(tse, method="relabundance")
molten_tse <- meltAssay(tse,
add_row_data = TRUE,
add_col_data = TRUE,
assay_name = "relabundance")
molten_tse
```
### Subsetting
**Subsetting** data helps to draw the focus of analysis on particular
sets of samples and / or features. When dealing with large data
sets, the subset of interest can be extracted and investigated
separately. This might improve performance and reduce the
computational load.
Load:
* mia
* dplyr
* knitr
* data `GlobalPatterns`
```{r include = FALSE}
# load libraries and data
library(mia)
library(dplyr)
library(knitr)
```
Let us store `GlobalPatterns` into `tse` and check its original number of features (rows) and samples (columns). **Note**: when subsetting by sample, expect the number of columns to decrease; when subsetting by feature, expect the number of rows to decrease.
```{r}
# store data into se and check dimensions
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns
# show dimensions (features x samples)
dim(tse)
```
#### Subset by sample (column-wise)
For the sake of demonstration, here we will extract a subset containing only the samples of human origin (feces, skin or tongue), stored as `SampleType` within `colData(tse)` and also in `tse`.
First, we would like to see all the possible values that `SampleType` can take on and how frequent those are:
```{r}
# inspect possible values for SampleType
unique(tse$SampleType)
```
```{r eval = FALSE}
# show recurrence for each value
tse$SampleType %>% table()
```
```{r echo = FALSE}
# show recurrence for each value
tse$SampleType %>% table() %>% kable() %>%
kableExtra::kable_styling("striped", latex_options="scale_down") %>%
kableExtra::scroll_box(width = "100%")
```
**Note**: after subsetting, expect the number of columns to equal the
sum of the recurrences of the samples that you are interested
in. For instance, `ncols = Feces + Skin + Tongue = 4 + 3 + 2 = 9`.
Next, we _logical index_ across the columns of `tse` (make sure to
leave the first index empty to select all rows) and filter for the
samples of human origin. For this, we use the information on the
samples from the meta data `colData(tse)`.
```{r}
# subset by sample
tse_subset_by_sample <- tse[ , tse$SampleType %in% c("Feces", "Skin", "Tongue")]
# show dimensions
dim(tse_subset_by_sample)
```
As a sanity check, the new object `tse_subset_by_sample` should have
the original number of features (rows) and a number of samples
(columns) equal to the sum of the samples of interest (in this case
9).
Several characteristics can be used to subset by sample:
* origin
* sampling time
* sequencing method
* DNA / RNA barcode
* cohort
#### Subset by feature (row-wise)
Similarly, here we will extract a subset containing only the features
that belong to the Phyla "Actinobacteria" and "Chlamydiae", stored as
`Phylum` within `rowData(tse)`. However, subsetting by feature implies
a few more obstacles, such as the presence of NA elements and the
possible need for agglomeration.
As previously, we would first like to see all the possible values that
`Phylum` can take on and how frequent those are:
```{r}
# inspect possible values for Phylum
unique(rowData(tse)$Phylum)
```
```{r eval = FALSE}
# show recurrence for each value
rowData(tse)$Phylum %>% table()
```
```{r echo = FALSE}
# show recurrence for each value
rowData(tse)$Phylum %>% table() %>% kable() %>%
kableExtra::kable_styling("striped", latex_options="scale_down") %>%
kableExtra::scroll_box(width = "100%")
```
**Note**: after subsetting, expect the number of columns to equal the
sum of the recurrences of the feature(s) that you are interested
in. For instance, `nrows = Actinobacteria + Chlamydiae = 1631 + 21 =
1652`.
Depending on your research question, you might need to or need not
agglomerate the data in the first place: if you want to find the
abundance of each and every feature that belongs to Actinobacteria and
Chlamydiae, agglomeration is not needed; if you want to find the total
abundance of all the features that belong to Actinobacteria or
Chlamydiae, agglomeration is recommended.
##### Non-agglomerated data
Next, we _logical index_ across the rows of `tse` (make sure to leave
the second index empty to select all columns) and filter for the
features that fall in either Actinobacteria or Chlamydiae. For this,
we use the information on the samples from the meta data
`rowData(tse)`.
The first term with the `%in%` operator are includes all the features
of interest, whereas the second term after the AND operator `&`
filters out all the features that present a NA in place of Phylum.
```{r}
# subset by feature
tse_subset_by_feature <- tse[rowData(tse)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse)$Phylum), ]
# show dimensions
dim(tse_subset_by_feature)
```
As a sanity check, the new object `tse_subset_by_feature` should have the original number of samples (columns) and a number of features (rows) equal to the sum of the features of interest (in this case 1652).
##### Agglomerated data
When total abundances of certain Phyla are of relevance, the data is initially agglomerated by Phylum. Then, similar steps as in the case of not agglomerated data are followed.
```{r}
# agglomerate by Phylum
tse_phylum <- tse %>% agglomerateByRank(rank = "Phylum")
# subset by feature and get rid of NAs
tse_phylum_subset_by_feature <- tse_phylum[rowData(tse_phylum)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse_phylum)$Phylum), ]
# show dimensions
dim(tse_phylum_subset_by_feature)
```
**Note**: as data was agglomerated, the number of rows equal the
number of Phyla used to index (in this case, just 2)
Alternatively:
```{r}
# store features of interest into phyla
phyla <- c("Phylum:Actinobacteria", "Phylum:Chlamydiae")
# subset by feature
tse_phylum_subset_by_feature <- tse_phylum[phyla, ]
# show dimensions
dim(tse_subset_by_feature)
```
The code above returns the not agglomerated version of the data.
Fewer characteristics can be used to subset by feature:
* Taxonomic rank
* Meta-taxonomic group
For subsetting by Kingdom, agglomeration does not apply, whereas for
the other ranks it can be applied if necessary.
#### Subset by sample and feature
Finally, we can subset data by sample and feature at once. The
resulting subset contains all the samples of human origin and all the
features of Phyla "Actinobacteria" or "Chlamydiae".
```{r}
# subset by sample and feature and get rid of NAs
tse_subset_by_sample_feature <- tse[rowData(tse)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse)$Phylum), tse$SampleType %in% c("Feces", "Skin", "Tongue")]
# show dimensions
dim(tse_subset_by_sample_feature)
```
**Note**: the dimensions of `tse_subset_by_sample_feature` agree with
those of the previous subsets (9 columns filtered by sample and 1652
rows filtered by feature).
If a study was to consider and quantify the presence of Actinobacteria
as well as Chlamydiae in different sites of the human body,
`tse_subset_by_sample_feature` might be a suitable subset to start
with.
#### Remove empty columns and rows
Sometimes data might contain, e.g., features that are not present in any of the samples.
This might occur after subsetting for instance. In certain analyses we might want to
remove those instances.
```{r}
# Agglomerate data at Genus level
tse_genus <- agglomerateByRank(tse, rank = "Genus")
# List bacteria that we want to include
genera <- c("Class:Thermoprotei", "Genus:Sulfolobus", "Genus:Sediminicola")
# Subset data
tse_genus_sub <- tse_genus[genera, ]
tse_genus_sub
```
```{r}
# List total counts of each sample
colSums(assay(tse_genus_sub, "counts"))
```
Now we can see that certain samples do not include any bacteria. We can remove those.
```{r}
# Remove samples that do not have present any bacteria
tse_genus_sub <- tse_genus_sub[ , colSums(assay(tse_genus_sub, "counts")) != 0 ]
tse_genus_sub
```
Same thing can be done for features.
```{r}
# Take only those samples that are collected from feces, skin, or tongue
tse_genus_sub <- tse_genus[ , colData(tse_genus)$SampleType %in% c("Feces", "Skin", "Tongue")]
tse_genus_sub
```
```{r}
# How many bacteria there are that are not present?
sum(rowSums(assay(tse_genus_sub, "counts")) == 0)
```
We can see that there are bacteria that are not present in these samples that we chose.
We can remove those bacteria from the data.
```{r}
# Take only those bacteria that are present
tse_genus_sub <- tse_genus_sub[rowSums(assay(tse_genus_sub, "counts")) > 0, ]
tse_genus_sub
```
### Splitting
You can split data base on variables. There are available functions `splitByRanks`
and `splitOn`.
`splitByRanks` splits the data based on rank. Since the elements of the output list
share columns, they can be stored into `altExp`.
```{r splitbyRanks}
altExps(tse) <- splitByRanks(tse)
altExps(tse)
```
If you want to split the data based on other variable than taxonomy rank, use
`splitOn`. It works for row-wise and column-wise splitting.
```{r splitOn}
splitOn(tse, "SampleType")
```
## Merge data
`mia` package has `mergeSEs` function that merges multiple `SummarizedExperiment`
objects. For example, it is possible to combine multiple `TreeSE` objects which each
includes one sample.
`mergeSEs` works like `dplyr` joining functions. In fact, there are available
`dplyr-like` aliases of `mergeSEs`, such as `full_join`.
```{r merge1}
# Take subsets for demonstrative purpose
tse1 <- tse[, 1]
tse2 <- tse[, 2]
tse3 <- tse[, 3]
tse4 <- tse[1:100, 4]
```
```{r merge2}
# With inner join, we want to include all shared rows. When using mergeSEs function
# all the samples are always preserved.
tse <- mergeSEs(list(tse1, tse2, tse3, tse4), join = "inner")
tse
```
```{r merge3}
# left join preserves all the rows of the 1st object
tse <- mia::left_join(tse1, tse4, missing_values = 0)
tse
```
### Additional functions
* [mapTaxonomy](https://microbiome.github.io/mia/reference/taxonomy-methods.html)
* [mergeRows/mergeCols](https://microbiome.github.io/mia/reference/merge-methods.html)