07-clustering-analysis.Rmd

# MI-based clustering analysis

In this chapter, we will introduce more about the MICA component, and walk you through the MICA workflow, including preparing inputs, running MICA, visualizing and integrating MICA outputs into SparseEset object.

## Introduction to MICA

**MICA (Mutual Information-based Clustering Analysis)** is a clustering tool designed for single cell genomics data. Compared to most existing single-cell clustering algorithms, MICA has two unique features:

- MICA uses **mutual information** to measure cell-cell similarity for unsupervised clustering analysis, while most existing tools employ **linear-transformation of PCA** and/or **co-expression analysis using linear Pearson or Spearman correlations** that may not capture the nonlinear cell-cell distance.
- MICA uses **all high-quality features** for clustering, while most existing tools select the top highly variable features to improve the clustering speed. This is arbitrary and may lose the information that can distinguish close cell states.  

**MICA** is developed using Python framework, to take its strengths in calculation speed and memory consumption. A lot of effort has been made to improve the interoperability between Python and R. Now MICA works **seamlessly** with the SparseExpressionSet object. The input of MICA can be easily generated from the SparseExpressionSet object by `generateMICAinput()`, and the output of MICA, the clustering results, can be effortlessly visualized by `MICAplot()` and integrated into SparseExpressionSet object by `addMICAoutput()`.

## Generate MICA input

The standard input of MICA is **a normalized and log-transformed gene expression matrix**. scMINER can generate this matrix from sparse eSet object and save it into a file that can be directly read by MICA. MICA accepts `.h5ad` or `.txt` format as the input file, which can be easily generated by embedded function `generateMICAinput()`:
```{r generate-mica-input-txt, eval=FALSE}
## generate MICA input in txt format
generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/work-path/PBMC14k/MICA/micaInput.txt", overwrite = FALSE)

## check the format of MICA input
mica_input <- read.delim(system.file("extdata/demo_pbmc14k/PBMC14k/MICA/micaInput.txt", package = "scMINER"), header = T, sep = "\t", row.names = 1)
mica_input[1:5,1:5]
```

To use the .h5ad format, run the codes below.
```{r generate-mica-input-h5ad, eval=FALSE}
## generate MICA input in h5ad format: anndata package is needed
generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/work-path/PBMC14k/MICA/micaInput.h5ad", overwrite = FALSE)
```

In addition to generating the standard MICA input file, `generateMICAinput()` also returns the recommended commands of running MICA. You can copy the commands, modify according and run them.


## Run MICA

MICA features two different modes named by their different dimension reduction techniques:

- **Multi-Dimensional Scaling (MDS)** mode: this mode is more accurate and robust for small datasets (less than 5,000 cells, be default) due to its global dimension reduction nature;
- **Graph Embedding (GE)** mode: this mode works better with large datasets (more than 5,000 cells, by default) using a graph embedding approach to explore distant neighbor cells.

In this case, since there are 13,605 cells, we will use the `MICA GE` mode for the clustering:
```{r, run-mica-ge, engine = 'bash', eval=FALSE}
mica ge -i /work-path/PBMC14k/MICA/micaInput.txt -o /work-path/PBMC14k/MICA/micaOutput -minr 0.1 -maxr 9.0 -ss 0.05 -nw 4
```

This command will generate the clustering results of multiple resolutions, from 0.1 to 9.0, with a step size of 0.05. MICA supports multi-threading. The `-nw 4` here sets 4 threads to run in parallel. You can change it accordingly to speed up the clustering analysis.  

## Integrate MICA outputs into SparseEset object

MICA generates several files and save all of them in the output directory specified by the user with `-o` argument. The core, and only, output file we need for subsequent analysis is the clustering label file named in the format of `ProjectName_clustering_VisualizeMethod_euclidean_NumberOfDimensions_Resolution.txt`. In this case, since we used a range of resolutions, there are several clustering label files generated, one for each resolution. Based on the knowledge about PBMC14k dataset, we compared the results of different resolutions and picked `clustering_UMAP_euclidean_20_2.05.txt` for subsequent analysis.

```{r mica-output-core}
micaOutput <- read.table(system.file("extdata/demo_pbmc14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), header = TRUE, sep = "\t", quote = "", stringsAsFactors = F)
head(micaOutput)
```
As shown above, the clustering label file contains four columns:

- `ID`: cell barcodes;
- `X`: coordinates of UMAP_1 or tSNE_1;
- `Y`: coordinates of UMAP_2 or tSNE_2;
- `label`: labels of predicted clusters.

The clustering result can be easily easily added to the SparseExpressionSet object by `addMICAoutput()`:
```{r mica-output-vis}
pbmc14k_log2cpm.eset <- addMICAoutput(pbmc14k_log2cpm.eset, mica_output_file = system.file("extdata/demo_pbmc14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), visual_method = "umap")
head(pData(pbmc14k_log2cpm.eset))
```

## Visualize the MICA output

scMINER provides a function, `MICAplot()` to easily visualize the clustering results on a 2D plot, UMAP or tSNE. And it can be colored by multiple variables, including **cluster label**, **sample source**, **nUMI**, **nFeature**, **pctMito** and more.  

### Color-coded by cluster labels

```{r visualize-mica-output-cluster, fig.align='center', fig.width=6, fig.height=5}
library(ggplot2)
MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "clusterID", X = "UMAP_1", Y = "UMAP_2", point.size = 0.1, fontsize.cluster_label = 6)
```

### Color-coded by true label of cell types

```{r visualize-mica-output-truelabel, fig.align='center', fig.width=6.5, fig.height=5}
MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "trueLabel", X = "UMAP_1", Y = "UMAP_2", point.size = 0.1, fontsize.cluster_label = 4)
```

### Color-coded by nUMI, for QC purpose

```{r visualize-mica-output-numi, fig.align='center', fig.width=6.5, fig.height=5}
MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "nUMI", do.logTransform = TRUE, point.size = 0.1)
```


### Color-coded by nFeature, for QC purpose

```{r visualize-mica-output-nfeature, fig.align='center', fig.width=6.5, fig.height=5}
MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "nFeature", do.logTransform = TRUE, point.size = 0.1)
```
### Color-coded by pctMito, for QC purpose

```{r visualize-mica-output-pctmito, fig.align='center', fig.width=6.5, fig.height=5}
MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "pctMito", do.logTransform = FALSE, point.size = 0.1)
```