10_manipulation.Rmd

# (PART) Focus Topics {-}


# Data Manipulation {#datamanipulation}

```{r setup, echo=FALSE, results="asis"}
library(rebook)
chapterPreamble()
```


## Tidying and subsetting

### Tidy data

For several custom analysis and visualization packages, such as those from the 
`tidyverse`, the `SE` data can be converted to long data.frame format with 
`meltAssay`.    

```{r include=FALSE}
# REMOVE THIS CHUNK ################################################################
# when bioc devel version has mergeSE 
if( !require(devtools) ){
    install.packages("devtools")
    library(devtools)
}
devtools::install_github("microbiome/mia", upgrade = "never")
library(mia)
####################################
```

```{r}
library(mia)
data(GlobalPatterns, package="mia")
tse <- GlobalPatterns
tse <- transformSamples(tse, method="relabundance")

molten_tse <- meltAssay(tse,
                        add_row_data = TRUE,
                        add_col_data = TRUE,
                        assay_name = "relabundance")
molten_tse
```

### Subsetting

**Subsetting** data helps to draw the focus of analysis on particular
  sets of samples and / or features. When dealing with large data
  sets, the subset of interest can be extracted and investigated
  separately. This might improve performance and reduce the
  computational load.

Load:

* mia
* dplyr
* knitr
* data `GlobalPatterns`

```{r include = FALSE}
# load libraries and data
library(mia)
library(dplyr)
library(knitr)
```

Let us store `GlobalPatterns` into `tse` and check its original number of features (rows) and samples (columns). **Note**: when subsetting by sample, expect the number of columns to decrease; when subsetting by feature, expect the number of rows to decrease.

```{r}
# store data into se and check dimensions
data("GlobalPatterns", package="mia")
tse <- GlobalPatterns
# show dimensions (features x samples)
dim(tse) 
```

#### Subset by sample (column-wise)

For the sake of demonstration, here we will extract a subset containing only the samples of human origin (feces, skin or tongue), stored as `SampleType` within `colData(tse)` and also in `tse`.

First, we would like to see all the possible values that `SampleType` can take on and how frequent those are: 

```{r}
# inspect possible values for SampleType
unique(tse$SampleType)
```
```{r eval = FALSE}
# show recurrence for each value
tse$SampleType %>% table()
```
```{r echo = FALSE}
# show recurrence for each value
tse$SampleType %>% table() %>% kable() %>%
    kableExtra::kable_styling("striped", latex_options="scale_down") %>% 
    kableExtra::scroll_box(width = "100%")
```

**Note**: after subsetting, expect the number of columns to equal the
  sum of the recurrences of the samples that you are interested
  in. For instance, `ncols = Feces + Skin + Tongue = 4 + 3 + 2 = 9`.

Next, we _logical index_ across the columns of `tse` (make sure to
leave the first index empty to select all rows) and filter for the
samples of human origin. For this, we use the information on the
samples from the meta data `colData(tse)`.

```{r}
# subset by sample
tse_subset_by_sample <- tse[ , tse$SampleType %in% c("Feces", "Skin", "Tongue")]

# show dimensions
dim(tse_subset_by_sample)
```

As a sanity check, the new object `tse_subset_by_sample` should have
the original number of features (rows) and a number of samples
(columns) equal to the sum of the samples of interest (in this case
9).

Several characteristics can be used to subset by sample:

* origin
* sampling time
* sequencing method
* DNA / RNA barcode
* cohort

#### Subset by feature (row-wise)

Similarly, here we will extract a subset containing only the features
that belong to the Phyla "Actinobacteria" and "Chlamydiae", stored as
`Phylum` within `rowData(tse)`. However, subsetting by feature implies
a few more obstacles, such as the presence of NA elements and the
possible need for agglomeration.

As previously, we would first like to see all the possible values that
`Phylum` can take on and how frequent those are:
  
```{r}
# inspect possible values for Phylum
unique(rowData(tse)$Phylum)
```
```{r eval = FALSE}
# show recurrence for each value
rowData(tse)$Phylum %>% table()
```
```{r echo = FALSE}
# show recurrence for each value
rowData(tse)$Phylum %>% table() %>% kable() %>%
    kableExtra::kable_styling("striped", latex_options="scale_down") %>% 
    kableExtra::scroll_box(width = "100%")
```

**Note**: after subsetting, expect the number of columns to equal the
  sum of the recurrences of the feature(s) that you are interested
  in. For instance, `nrows = Actinobacteria + Chlamydiae = 1631 + 21 =
  1652`.

Depending on your research question, you might need to or need not
agglomerate the data in the first place: if you want to find the
abundance of each and every feature that belongs to Actinobacteria and
Chlamydiae, agglomeration is not needed; if you want to find the total
abundance of all the features that belong to Actinobacteria or
Chlamydiae, agglomeration is recommended.

##### Non-agglomerated data

Next, we _logical index_ across the rows of `tse` (make sure to leave
the second index empty to select all columns) and filter for the
features that fall in either Actinobacteria or Chlamydiae. For this,
we use the information on the samples from the meta data
`rowData(tse)`.

The first term with the `%in%` operator are includes all the features
of interest, whereas the second term after the AND operator `&`
filters out all the features that present a NA in place of Phylum.

```{r}
# subset by feature
tse_subset_by_feature <- tse[rowData(tse)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse)$Phylum), ]

# show dimensions
dim(tse_subset_by_feature)
```

As a sanity check, the new object `tse_subset_by_feature` should have the original number of samples (columns) and a number of features (rows) equal to the sum of the features of interest (in this case 1652).

##### Agglomerated data

When total abundances of certain Phyla are of relevance, the data is initially agglomerated by Phylum. Then, similar steps as in the case of not agglomerated data are followed.

```{r}
# agglomerate by Phylum
tse_phylum <- tse %>% agglomerateByRank(rank = "Phylum")

# subset by feature and get rid of NAs
tse_phylum_subset_by_feature <- tse_phylum[rowData(tse_phylum)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse_phylum)$Phylum), ]

# show dimensions
dim(tse_phylum_subset_by_feature)
```

**Note**: as data was agglomerated, the number of rows equal the
  number of Phyla used to index (in this case, just 2)

Alternatively:

```{r}
# store features of interest into phyla
phyla <- c("Phylum:Actinobacteria", "Phylum:Chlamydiae")
# subset by feature
tse_phylum_subset_by_feature <- tse_phylum[phyla, ]
# show dimensions
dim(tse_subset_by_feature)
```

The code above returns the not agglomerated version of the data.

Fewer characteristics can be used to subset by feature:

* Taxonomic rank
* Meta-taxonomic group

For subsetting by Kingdom, agglomeration does not apply, whereas for
the other ranks it can be applied if necessary.

#### Subset by sample and feature

Finally, we can subset data by sample and feature at once. The
resulting subset contains all the samples of human origin and all the
features of Phyla "Actinobacteria" or "Chlamydiae".

```{r}
# subset by sample and feature and get rid of NAs
tse_subset_by_sample_feature <- tse[rowData(tse)$Phylum %in% c("Actinobacteria", "Chlamydiae") & !is.na(rowData(tse)$Phylum), tse$SampleType %in% c("Feces", "Skin", "Tongue")]

# show dimensions
dim(tse_subset_by_sample_feature)
```

**Note**: the dimensions of `tse_subset_by_sample_feature` agree with
  those of the previous subsets (9 columns filtered by sample and 1652
  rows filtered by feature).

If a study was to consider and quantify the presence of Actinobacteria
as well as Chlamydiae in different sites of the human body,
`tse_subset_by_sample_feature` might be a suitable subset to start
with.

#### Remove empty columns and rows

Sometimes data might contain, e.g., features that are not present in any of the  samples.
This might occur after subsetting for instance. In certain analyses we might want to
remove those instances.

```{r}
# Agglomerate data at Genus level 
tse_genus <- agglomerateByRank(tse, rank = "Genus")
# List bacteria that we want to include
genera <- c("Class:Thermoprotei", "Genus:Sulfolobus", "Genus:Sediminicola")
# Subset data
tse_genus_sub <- tse_genus[genera, ]

tse_genus_sub
```

```{r}
# List total counts of each sample
colSums(assay(tse_genus_sub, "counts"))
```

Now we can see that certain samples do not include any bacteria. We can remove those.

```{r}
# Remove samples that do not have present any bacteria
tse_genus_sub <- tse_genus_sub[ , colSums(assay(tse_genus_sub, "counts")) != 0 ]
tse_genus_sub
```

Same thing can be done for features.

```{r}
# Take only those samples that are collected from feces, skin, or tongue
tse_genus_sub <- tse_genus[ , colData(tse_genus)$SampleType %in% c("Feces", "Skin", "Tongue")]

tse_genus_sub
```

```{r}
# How many bacteria there are that are not present?
sum(rowSums(assay(tse_genus_sub, "counts")) == 0)
```

We can see that there are bacteria that are not present in these samples that we chose.
We can remove those bacteria from the data. 

```{r}
# Take only those bacteria that are present
tse_genus_sub <- tse_genus_sub[rowSums(assay(tse_genus_sub, "counts")) > 0, ]

tse_genus_sub
```

### Splitting

You can split data base on variables. There are available functions `splitByRanks` 
and `splitOn`.

`splitByRanks` splits the data based on rank. Since the elements of the output list
share columns, they can be stored into `altExp`. 

```{r splitbyRanks}
altExps(tse) <- splitByRanks(tse)
altExps(tse)
```

If you want to split the data based on other variable than taxonomy rank, use 
`splitOn`. It works for row-wise and column-wise splitting.

```{r splitOn}
splitOn(tse, "SampleType")
```

## Merge data

`mia` package has `mergeSEs` function that merges multiple `SummarizedExperiment`
objects. For example, it is possible to combine multiple `TreeSE` objects which each
includes one sample. 

`mergeSEs` works like `dplyr` joining functions. In fact, there are available
`dplyr-like` aliases of `mergeSEs`, such as `full_join`.

```{r merge1}
# Take subsets for demonstrative purpose
tse1 <- tse[, 1]
tse2 <- tse[, 2]
tse3 <- tse[, 3]
tse4 <- tse[1:100, 4]
```

```{r merge2}
# With inner join, we want to include all shared rows. When using mergeSEs function
# all the samples are always preserved.
tse <- mergeSEs(list(tse1, tse2, tse3, tse4), join = "inner")
tse
```

```{r merge3}
# left join preserves all the rows of the 1st object
tse <- mia::left_join(tse1, tse4, missing_values = 0)
tse
```

### Additional functions
* [mapTaxonomy](https://microbiome.github.io/mia/reference/taxonomy-methods.html)
* [mergeRows/mergeCols](https://microbiome.github.io/mia/reference/merge-methods.html)