08-phenotyping.Rmd

# Cell phenotyping 

A common step during single-cell data analysis is the annotation of cells based
on their phenotype. Defining cell phenotypes is often subjective and relies
on previous biological knowledge. The [Orchestrating Single Cell Analysis with Bioconductor](https://bioconductor.org/books/3.14/OSCA.basic/cell-type-annotation.html) book
presents a number of approaches to phenotype cells detected by single-cell RNA
sequencing based on reference datasets or prior knowledge of geneset importance.

In highly-multiplexed imaging, target proteins or molecules are manually
selected based on the biological question at hand. It narrows down the feature
space and facilitates the manual annotation of clusters to derive cell
phenotypes. We will therefore discuss and compare a number of clustering
approaches to group cells based on their similarity in marker expression in
Section \@ref(clustering).

Unlike single-cell RNA sequencing or CyTOF data, single-cell data derived from
highly-multiplexed imaging data often suffers from "lateral spillover" between
neighboring cells. This spillover caused by imperfect segmentation often hinders
accurate clustering to define specific cell phenotypes in multiplexed imaging
data. Tools have been developed to correct lateral spillover between cells
[@Bai2021] but the approach requires careful selection of the markers to
correct. In Section \@ref(classification) we will train and apply a random
forest classifier to classify cell phenotypes in the dataset as alternative
approach for clustering-based cell phenotyping. This approach has been previously used to
identify major cell phenotypes in metastatic melanoma and avoids clustering of
cells [@Hoch2022].

## Load data

We will first read in the previously generated `SpatialExperiment` object and
sample 2000 cells to visualize cluster membership.

```{r read-data-pheno, message=FALSE}
library(SpatialExperiment)
spe <- readRDS("data/spe.rds")

# Sample cells
set.seed(220619)
cur_cells <- sample(seq_len(ncol(spe)), 2000)
```

## Clustering approaches {#clustering}

In the first section to identify cellular phenotypes in the dataset, we will
present clustering approaches that group cells based on their similarity in
marker expression. A number of approaches have been developed to cluster data
derived from single-cell RNA sequencing technologies [@Yu2022] or CyTOF
[@Weber2016]. For demonstration purposes, we will highlight common clustering
approaches that are available in R and have been used for clustering cells
obtained from IMC. Two approaches rely on graph-based clustering and one
approach uses self organizing maps (SOM).

### Rphenograph

The PhenoGraph clustering approach was first described to group cells of a CyTOF
dataset [@Levine2015]. The algorithm first constructs a graph by detecting the
`k` nearest neighbours based on euclidean distance in expression space. In the
next step, edges between nodes (cells) are weighted by their overlap in nearest
neighbor sets. To quantify the overlap in shared nearest neighbor sets, the
jaccard index is used. The Louvain modularity optimization approach is used to
detect connected communities and partition the graph into clusters of cells.
This clustering strategy was used by Jackson, Fischer _et al._  and Schulz _et
al._ to cluster IMC data [@Jackson2020; @Schulz2018].

There are several different PhenoGraph implementations available in R. Here, we
use the one available at
[https://github.com/i-cyto/Rphenograph](https://github.com/i-cyto/Rphenograph).
For large datasets,
[https://github.com/stuchly/Rphenoannoy](https://github.com/stuchly/Rphenoannoy)
offers a more performant implementation of the algorithm.

In the following code chunk, we select the asinh-transformed mean pixel
intensities per cell and channel and subset the channels to the ones containing
biological variation. This matrix is transposed to store cells in rows. Within
the `Rphenograph` function, we select the 45 nearest neighbors for graph
building and louvain community detection (default). The function returns a list
of length 2, the first entry being the graph and the second entry containing the
community object. Calling `membership` on the community object will return
cluster IDs for each cell. These cluster IDs are then stored within the
`colData` of the `SpatialExperiment` object. Cluster IDs are mapped on top of
the UMAP embedding and single-cell marker expression within each cluster are
visualized in form of a heatmap.

```{r, echo=FALSE}
start_time <- Sys.time()
```

```{r rphenograph-1, message=FALSE}
library(Rphenograph)
library(igraph)
library(dittoSeq)
library(viridis)

mat <- t(assay(spe, "exprs")[rowData(spe)$use_channel,])

out <- Rphenograph(mat, k = 45)

clusters <- factor(membership(out[[2]]))

spe$pg_clusters <- clusters

dittoDimPlot(spe, var = "pg_clusters", 
             reduction.use = "UMAP", size = 0.2,
             do.label = TRUE) +
    ggtitle("Phenograph clusters expression on UMAP")
```

```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells], 
             genes = rownames(spe)[rowData(spe)$use_channel],
             assay = "exprs", scale = "none",
             heatmap.colors = viridis(100), 
             annot.by = c("pg_clusters", "patient_id"),
             annot.colors = c(dittoColors(1)[1:length(unique(spe$pg_clusters))],
                              metadata(spe)$color_vectors$patient_id))
```

```{r, echo=FALSE}
end_time <- Sys.time()
```

The `Rphenograph` function call took 
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.

We can observe that some of the clusters only contain cells of a single patient.
This  can often be observed in the tumor compartment. In the next step, we
use the integrated cells (see Section \@ref(batch-effects)) in low dimensional
embedding for clustering. Here, the low dimensional embedding can
be directly accessed from the `reducedDim` slot.

```{r rphenograph-2, message=FALSE}
mat <- reducedDim(spe, "fastMNN")

out <- Rphenograph(mat, k = 45)

clusters <- factor(membership(out[[2]]))

spe$pg_clusters_corrected <- clusters

dittoDimPlot(spe, var = "pg_clusters_corrected", 
             reduction.use = "UMAP_mnnCorrected", size = 0.2,
             do.label = TRUE) +
    ggtitle("Phenograph clusters expression on UMAP, integrated cells")
```

```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells], 
             genes = rownames(spe)[rowData(spe)$use_channel],
             assay = "exprs", scale = "none",
             heatmap.colors = viridis(100), 
             annot.by = c("pg_clusters_corrected", "pg_clusters","patient_id"),
             annot.colors = c(dittoColors(1)[1:length(unique(spe$pg_clusters_corrected))],
                              dittoColors(1)[1:length(unique(spe$pg_clusters))],
                              metadata(spe)$color_vectors$patient_id))
```

Clustering using the integrated embedding leads to clusters that contain cells
of different patients. Cluster annotation can now be performed by manually
labeling cells based on their marker expression (see Notes in Section
\@ref(clustering-notes)).

### Shared nearest neighbour graph {#snn-graph}

The [bluster](https://www.bioconductor.org/packages/release/bioc/html/bluster.html) 
package provides a simple interface to cluster cells using a number of different
[clustering approaches](https://www.bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/clusterRows.html) and different metrics to [access cluster stability](https://www.bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html).  

For simplicity, we will focus on graph based clustering as this is the most
popular and a fast method for single-cell clustering. The `bluster` package
provides functionalities to build k-nearest neighbour (KNN) and it's weighted
version (shared nearest neighbor; SNN) graphs where nodes represent cells.
The user can chose the number of neighbors to consider (parameter `k`), 
the edge weighting method (parameter `type`) and the community detection
function to use (parameter `cluster.fun`). As all parameters affect the clustering
results, the `bluster` package provides the `clusterSweep` function to test
a number of parameter settings in parallel. In the following code chunk,
we select the asinh-transformed mean pixel intensities and subset the markers
of interest. The resulting matrix is transposed to fit to the requirements of 
the bluster package (cells in rows).

We test two different settings for `k`, two for `type` and two for `cluster.fun`.
This function call is parallelized by setting the `BPPARAM` parameter. We use
the `approxSilhouette` function to compute the silhouette width for each cell
and compute the average across all cells per parameter setting. 
Please see `?silhouette` for more information on how the silhouette 
width id computed for each cell. A large average silhouette width indicates 
cells that are well clustered.

For time reasons, we sample 10000 cells but it is recommended to increase
this number to include rare cells into parameter testing.

```{r, echo=FALSE}
start_time <- Sys.time()
```

```{r parameter-sweep, message=FALSE, fig.height=7}
library(bluster)
library(BiocParallel)
library(ggplot2)

sam_cells <- sample(seq_len(ncol(spe)), 10000)

mat <- t(assay(spe, "exprs")[rowData(spe)$use_channel,])

combinations <- clusterSweep(mat[sam_cells,], BLUSPARAM=SNNGraphParam(),
                             k=c(10L, 20L), 
                             type = c("rank", "jaccard"), 
                             cluster.fun=c("walktrap", "louvain"),
                             BPPARAM = MulticoreParam(RNGseed = 220427))

sil <- vapply(as.list(combinations$clusters), 
              function(x) mean(approxSilhouette(mat[sam_cells,], x)$width), 
              0)

ggplot(data.frame(method = names(sil),
                  sil = sil)) +
    geom_point(aes(method, sil)) +
    theme_classic(base_size = 15) +
    theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
    xlab("Cluster parameter combination") +
    ylab("Average silhouette width")
```

```{r, echo=FALSE}
end_time <- Sys.time()
```

The cluster parameter sweep took 
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.

Performing a cluster sweep takes some time as multiple function calls are run in parallel.
We do however recommend testing a number of different parameter settings to
access clustering performance.

Once parameter settings are known, we can either use the `clusterRows` function
of the `bluster` package to cluster cells or its convenient wrapper function
exported by the
[scran](https://bioconductor.org/packages/release/bioc/html/scran.html) package.
The `scran::clusterCells` function accepts a `SpatialExperiment` (or
`SingleCellExperiment`) object which stores cells in columns. By default, the
function detects the 10 nearest neighbours for each cell, performs rank-based
weighting of edges (see `?makeSNNGraph` for more information) and uses the
`cluster_walktrap` function to detect communities in the graph.

As we can see above, the `louvain` community detection approach outperforms the
default `walktrap` approach in this dataset with `k` being 20 and rank-based
edge weighting.

```{r, echo=FALSE}
start_time <- Sys.time()
```

```{r snn-1, message=FALSE}
library(scran)

clusters <- clusterCells(spe[rowData(spe)$use_channel,], 
                         assay.type = "exprs", 
                         BLUSPARAM = SNNGraphParam(k=20, 
                                                  cluster.fun = "louvain",
                                                  type = "rank",
                                                  BPPARAM = bpparam()))

spe$nn_clusters <- clusters

dittoDimPlot(spe, var = "nn_clusters", 
             reduction.use = "UMAP", size = 0.2,
             do.label = TRUE) +
    ggtitle("SNN clusters expression on UMAP")
```

```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells], 
             genes = rownames(spe)[rowData(spe)$use_channel],
             assay = "exprs", scale = "none",
             heatmap.colors = viridis(100), 
             annot.by = c("nn_clusters", "patient_id"),
             annot.colors = c(dittoColors(1)[1:length(unique(spe$nn_clusters))],
                              metadata(spe)$color_vectors$patient_id))
```

```{r, echo=FALSE}
end_time <- Sys.time()
```

The shared nearest neighbor graph clustering approach took 
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.

This function was used by [@Tietscher2022] to cluster cells obtained by IMC. Setting
`type = "jaccard"` performs clustering similar to `Rphenograph` above and [Seurat](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html#cluster-the-cells-1).

Similar to the results obtained by `Rphenograph`, some of the clusters are 
patient-specific. We can now perform clustering of the integrated cells
by directly specifying which low-dimensional embedding to use:

```{r snn-2, message=FALSE}
clusters <- clusterCells(spe[rowData(spe)$use_channel,], 
                         use.dimred = "fastMNN", 
                         BLUSPARAM = NNGraphParam(k=20, 
                                                  cluster.fun = "louvain",
                                                  type = "rank",
                                                  BPPARAM = bpparam()))

spe$nn_clusters_corrected <- clusters

dittoDimPlot(spe, var = "nn_clusters_corrected", 
             reduction.use = "UMAP_mnnCorrected", size = 0.2,
             do.label = TRUE) +
    ggtitle("SNN clusters expression on UMAP, integrated cells")
```

```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells], 
             genes = rownames(spe)[rowData(spe)$use_channel],
             assay = "exprs", scale = "none",
             heatmap.colors = viridis(100), 
             annot.by = c("nn_clusters_corrected", "nn_clusters","patient_id"),
             annot.colors = c(dittoColors(1)[1:length(unique(spe$nn_clusters_corrected))],
                              dittoColors(1)[1:length(unique(spe$nn_clusters))],
                              metadata(spe)$color_vectors$patient_id))
```

### Self organizing maps

An alternative to graph-based clustering is offered by the
[CATALYST](https://bioconductor.org/packages/release/bioc/html/CATALYST.html)
package. The `cluster` function internally uses the
[FlowSOM](https://bioconductor.org/packages/release/bioc/html/FlowSOM.html)
package to group cells into 100 (default) clusters based on self organizing maps
(SOM). In the next step, the
[ConsensusClusterPlus](https://bioconductor.org/packages/release/bioc/html/ConsensusClusterPlus.html)
package is used to perform hierarchical consensus clustering of the previously
detected 100 SOM nodes into 1 to `maxK` clusters. Cluster stability for each `k`
can be assessed by plotting the `delta_area(spe)`. The optimal number
of clusters can be found by selecting the `k` at which a plateau is reached.
In the example below, the optimal `k` lies somewhere around 13.

```{r, echo=FALSE}
start_time <- Sys.time()
```

```{r flowSOM-1, message=FALSE}
library(CATALYST)

# Run FlowSOM and ConsensusClusterPlus clustering
spe <- cluster(spe, 
               features = rownames(spe)[rowData(spe)$use_channel],
               maxK = 30,
               seed = 220410)

# Assess cluster stability
delta_area(spe)

spe$som_clusters <- cluster_ids(spe, "meta13")

dittoDimPlot(spe, var = "som_clusters", 
             reduction.use = "UMAP", size = 0.2,
             do.label = TRUE) +
    ggtitle("SOM clusters expression on UMAP")
```

```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells], 
             genes = rownames(spe)[rowData(spe)$use_channel],
             assay = "exprs", scale = "none",
             heatmap.colors = viridis(100), 
             annot.by = c("som_clusters", "patient_id"),
             annot.colors = c(dittoColors(1)[1:length(unique(spe$som_clusters))],
                              metadata(spe)$color_vectors$patient_id))
```

```{r, echo=FALSE}
end_time <- Sys.time()
```

Running FlowSOM clustering took `r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.

The `CATALYST` package does not provide functionality to perform `FlowSOM` and
`ConsensusClusterPlus` clustering directly on the batch-corrected, integrated cells. As an
alternative to the `CATALYST` package, the `bluster` package provides SOM
clustering when specifying the `SomParam()` parameter. Similar to the `CATALYST`
approach, we will first cluster the dataset into 100 clusters (also called
"codes"). These codes are then further clustered into a maximum of 30 clusters
using `ConsensusClusterPlus` (using hierarchical clustering and euclidean
distance). The delta area plot can be accessed using the (not exported)
`.plot_delta_area` function from `CATALYST`. Here, it seems that the plateau is
reached at a `k` of 16 and we will store the final cluster IDs within the
`SpatialExperiment` object.

```{r flowSOM-2, message=FALSE, results='hide', fig.show='hide'}
library(kohonen)
library(ConsensusClusterPlus)

# Select integrated cells
mat <- reducedDim(spe, "fastMNN")

# Perform SOM clustering
som.out <- clusterRows(mat, SomParam(100), full = TRUE)

# Cluster the 100 SOM codes into larger clusters
ccp <- ConsensusClusterPlus(t(som.out$objects$som$codes[[1]]),
                            maxK = 30,
                            reps = 100, 
                            distance = "euclidean", 
                            seed = 220410, 
                            plot = NULL)
```

```{r}
# Visualize delta area plot
CATALYST:::.plot_delta_area(ccp)

# Link ConsensusClusterPlus clusters with SOM codes and save in object
som.cluster <- ccp[[16]][["consensusClass"]][som.out$clusters]
spe$som_clusters_corrected <- as.factor(som.cluster)

dittoDimPlot(spe, var = "som_clusters_corrected", 
             reduction.use = "UMAP_mnnCorrected", size = 0.2,
             do.label = TRUE) +
    ggtitle("Phenograph clusters expression on UMAP, integrated cells")
```

```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells], 
             genes = rownames(spe)[rowData(spe)$use_channel],
             assay = "exprs", scale = "none",
             heatmap.colors = viridis(100), 
             annot.by = c("som_clusters_corrected", "som_clusters","patient_id"),
             annot.colors = c(dittoColors(1)[1:length(unique(spe$som_clusters_corrected))],
                              dittoColors(1)[1:length(unique(spe$som_clusters))],
                              metadata(spe)$color_vectors$patient_id))
```

The `FlowSOM` clustering approach has been used by [@Hoch2022] to sub-cluster tumor
cells as measured by IMC.

### Compare between clustering approaches

Finally, we can compare the results of different clustering approaches. For
this, we visualize the number of cells that are shared between different
clustering results in a pairwise fashion. In the following heatmaps a high match
between clustering results can be seen for those clusters that are uniquely
detected in both approaches.

First, we will visualize the match between the three different approaches 
applied to the asinh-transformed counts.

```{r raw-counts, message=FALSE}
library(patchwork)
library(pheatmap)
library(gridExtra)

tab1 <- table(paste("Rphenograph", spe$pg_clusters), 
              paste("SNN", spe$nn_clusters))
tab2 <- table(paste("Rphenograph", spe$pg_clusters), 
              paste("SOM", spe$som_clusters))
tab3 <- table(paste("SNN", spe$nn_clusters), 
              paste("SOM", spe$som_clusters))

pheatmap(log10(tab1 + 10), color = viridis(100))
pheatmap(log10(tab2 + 10), color = viridis(100))
pheatmap(log10(tab3 + 10), color = viridis(100))
```

We observe that `Rphenograph` and the shared nearest neighbor (SNN) approach by
`scran` show similar results (first heatmap above). For example, Rphenograph
cluster 17 (a tumor cluster) is perfectly captured by SNN cluster 12. On the
other hand, the Neutrophil cluster (SNN cluster 8) is split into Rphenograph
cluster 8 and Rphenograph cluster 10. A common approach
is to now merge clusters that contain similar cell types and annotate them by
hand.

Below, a comparison between the clustering results of the integrated cells 
is shown.

```{r corrected, message=FALSE}
tab1 <- table(paste("Rphenograph", spe$pg_clusters_corrected), 
              paste("SNN", spe$nn_clusters_corrected))
tab2 <- table(paste("Rphenograph", spe$pg_clusters_corrected), 
              paste("SOM", spe$som_clusters_corrected))
tab3 <- table(paste("SNN", spe$nn_clusters_corrected), 
              paste("SOM", spe$som_clusters_corrected))

pheatmap(log10(tab1 + 10), color = viridis(100))
pheatmap(log10(tab2 + 10), color = viridis(100))
pheatmap(log10(tab3 + 10), color = viridis(100))
```

In comparison to clustering on the non-integrated cells, the clustering results
of the integrated cells show higher overlap. The SNN approach resulted in fewer
clusters and therefore matches better with the SOM clustering approach.

### Further clustering notes {#clustering-notes}

The `bluster` package provides a number of metrics to assess cluster stability
[here](https://www.bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html).
For brevity we only highlighted the use of the silhouette width but different
metrics should be tested to assess cluster stability.

To assign cell types to clusters, we manually annotate clusters based on their
marker expression. For example, SNN cluster 11 (clustering of the integrated
cells) shows high, homogeneous expression of CD20 and we might therefore label
this cluster as B cells. The next chapter \@ref(single-cell-visualization) will 
highlight single-cell visualization methods that can be helpful for manual 
cluster annotations. 

Full cell labeling can be seen in the following section.

## Cell type classification {#classification}

In this section, we will highlight a cell type classification approach based
on ground truth labeling and random forest classification. The rational for
this supervised cell phenotyping approach is to use the information contained
in the pre-defined markers to detect cells of interest. This approach was
used by Hoch _et al._ to classify cell types in a metastatic melanoma IMC
dataset [@Hoch2022].

The antibody panel used in the example data set mainly focuses on immune cell
types and little on tumor cell phenotypes. Therefore we will label the following
cell types:

* Tumor (E-cadherin positive)
* Stroma (SMA, PDGFRb positive)
* Plasma cells (CD38 positive)
* Neutrophil (MPO, CD15 positive)
* Myeloid cells (HLADR positive)
* B cells (CD20 positive)
* B next to T cells (CD20, CD3 positive)
* Regulatory T cells (FOXP3 positive)
* CD8+ T cells (CD3, CD8 positive)
* CD4+ T cells (CD3, CD4 positive)

The "B next to T cell" phenotype (`BnTcell`) is commonly observed in immune
infiltrated regions measured by IMC. We include this phenotype to account for B
cell/T cell interactions where precise classification into B cells or T cells is
not possible. The exact gating scheme can be seen at
[img/Gating_scheme.pdf](img/Gating_scheme.pdf).

As related approaches, [Astir](https://github.com/camlab-bioml/astir) and
[Garnett](https://cole-trapnell-lab.github.io/garnett/) use pre-defined panel
information to classify cell phenotypes based on their marker expression.

### Manual labeling of cells

The [cytomapper](https://www.bioconductor.org/packages/release/bioc/html/cytomapper.html)
package provides the `cytomapperShiny` function that allows gating of cells
based on their marker expression and visualization of selected cells directly
on the images. 

```{r cytomapperShiny, message=FALSE}
library(cytomapper)
if (interactive()) {
    
    images <- readRDS("data/images.rds")
    masks <- readRDS("data/masks.rds")
    
    cytomapperShiny(object = spe, mask = masks, image = images, 
                    cell_id = "ObjectNumber", img_id = "sample_id")
}
```

The labeled cells for this data set can be accessed at
[zenodo.org/record/6554611](https://zenodo.org/record/6554611) and were
downloaded in Section \@ref(prerequisites). Per image, the `cytomapperShiny`
function allows the export of gated cells in form of a `SingleCellExperiment` or
`SpatialExperiment` object. The cell label is stored in
`colData(object)$cytomapper_CellLabel` and the gates are stored in
`metadata(object)`. In the next section, we will read in and consolidate the
labeled data.

### Define color vectors

For consistent visualization of cell types, we will now pre-define their colors:

```{r cell-type-colors}
celltype <- setNames(c("#3F1B03", "#F4AD31", "#894F36", "#1C750C", "#EF8ECC", 
                       "#6471E2", "#4DB23B", "grey", "#F4800C", "#BF0A3D", "#066970"),
                     c("Tumor", "Stroma", "Myeloid", "CD8", "Plasma_cell", 
                       "Treg", "CD4", "undefined", "BnTcell", "Bcell", "Neutrophil"))

metadata(spe)$color_vectors$celltype <- celltype
```

### Read in and consolidate labeled data

Here, we will read in the individual `SpatialExperiment` objects containing the
labeled cells and concatenate them. In the process of concatenating the 
`SpatialExperiment` objects along their columns, the `sample_id` entry is 
appended by `.1, .2, .3, ...` due to replicated entries. 

```{r, read-in-labeled-data, message=FALSE}
library(SingleCellExperiment)
label_files <- list.files("data/gated_cells", 
                          full.names = TRUE, pattern = ".rds$")

# Read in SPE objects
spes <- lapply(label_files, readRDS)

# Merge SPE objects
concat_spe <- do.call("cbind", spes)
```

In the following code chunk we will identify cells that were labeled multiple
times. This occurs when different cell phenotypes are gated per image and can
affect immune cells that are located inside the tumor compartment.

We will first identify those cells that were uniquely labeled. In the next step,
we will identify those cells that were labeled twice AND were labeled as Tumor
cells. These cells will be assigned their immune cell label. Finally, we will
save the unique labels within the original `SpatialExperiment` object.

```{r, consolidate-labels}
cur_tab <- unclass(table(colnames(concat_spe), 
                         concat_spe$cytomapper_CellLabel))
cur_labels <- rep("doublets", nrow(cur_tab))
names(cur_labels) <- rownames(cur_tab)

# Single assignments
single_index <- rowSums(cur_tab) == 1
cur_labels[single_index] <- colnames(cur_tab)[apply(cur_tab[single_index,], 1, 
                                                    which.max)]

# Double assignment within the tumor
double_index <- rowSums(cur_tab) == 2 & cur_tab[,"Tumor"] == 1
no_tumor <- cur_tab[,colnames(cur_tab) != "Tumor"]
cur_labels[double_index] <- colnames(no_tumor)[apply(no_tumor[double_index,], 1, 
                                                    which.max)]

# Remove doublets
cur_labels <- cur_labels[cur_labels != "doublets"]
table(cur_labels)

# Transfer labels to SPE object
spe_labels <- rep("unlabeled", ncol(spe))
names(spe_labels) <- colnames(spe)
spe_labels[names(cur_labels)] <- cur_labels
spe$cell_labels <- spe_labels

# Number of cells labeled per patient
table(spe$cell_labels, spe$patient_id)
```

Based on these labels, we can now train a random forest classifier to classify
all remaining, unlabeled cells.

### Train classifier

In this section, we will use the
[caret](https://topepo.github.io/caret/index.html) framework for machine
learning in R. This package provides an interface to train a number of
regression and classification models in a coherent fashion. We use a random
forest classifier due to low number of parameters, high speed and an observed
high performance for cell type classification [@Hoch2022].

In the following section, we will first split the `SpatialExperiment` object
into labeled and unlabeled cells. Based on the labeled cells, we split
the data into a train (75% of the data) and test (25% of the data) dataset. 
We currently do not provide an independently labeled validation dataset.

The `caret` package provides the `trainControl` function, which specifies model
training parameters and the `train` function, which performs the actual model
training. While training the model, we also want to estimate the best model
parameters. In the case of the chosen random forest model (`method = "rf"`), we
only need to estimate a single parameters (`mtry`) which corresponds to the
number of variables randomly sampled as candidates at each split. To estimate
the best parameter, we will perform a 5-fold cross validation (set within
`trainControl`) over a tune length of 5 entries to `mtry`.

```{r, echo=FALSE}
start_time <- Sys.time()
```

```{r train-classifier, message=FALSE}
library(caret)

# Split between labeled and unlabeled cells
lab_spe <- spe[,spe$cell_labels != "unlabeled"]
unlab_spe <- spe[,spe$cell_labels == "unlabeled"]

# Randomly split into train and test data
set.seed(220925)
trainIndex <- createDataPartition(factor(lab_spe$cell_labels), p = 0.75)
train_spe <- lab_spe[,trainIndex$Resample1]
test_spe <- lab_spe[,-trainIndex$Resample1]

# Specify train parameters for 5-fold cross validation
fitControl <- trainControl(method = "cv",
                           number = 5)

# Select the data for training
cur_mat <- t(assay(train_spe, "exprs")[rowData(train_spe)$use_channel,])

# Train a random forest model for predicting cell labels
# This call also performs parameter tuning
rffit <- train(x = cur_mat, 
               y = factor(train_spe$cell_labels),
               method = "rf", ntree = 1000,
               tuneLength = 5,
               trControl = fitControl)

rffit
```

```{r, echo=FALSE}
end_time <- Sys.time()
```

Training the classifier took 
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.

### Classifier performance

We next observe the accuracy of the classifer when predicting cell 
phenotypes across the cross-validation as well as the test dataset.

First, we can visualize the classification accuracy during parameter
tuning:

```{r accuracy-tuning}
ggplot(rffit) + 
  geom_errorbar(data = rffit$results,
                aes(ymin = Accuracy - AccuracySD,
                    ymax = Accuracy + AccuracySD),
                width = 0.4) +
    theme_classic(base_size = 15)
```

The best value for `mtry` is 28 and is used when predicting new data.

It is often recommended to visualize the variable importance of the
classifier. The following plot specifies which variables (markers) are 
most important for classifying the data.

```{r variable-importance, fig.height = 10}
plot(varImp(rffit))
```

As expected, the markers that were used for gating (Ecad, CD3, CD20, HLADR,
CD8a, CD38, FOXP3) were important for classification.

To assess the accuracy, sensitivity, specificity, among other quality measures of
the classifier, we will now predict cell phenotypes in the test data.

```{r model-testing, message=FALSE}
# Select test data
cur_mat <- t(assay(test_spe, "exprs")[rowData(test_spe)$use_channel,])

# Predict cell phenotypes in test data
cur_pred <- predict(rffit, 
                    newdata = cur_mat)
```

While the overall classification accuracy can appear high, we also want
to check if each cell phenotype class is correctly predicted.
For this, we will calculate the confusion matrix between predicted and actual 
cell labels. This measure highlights individual cell phenotype classes that 
were not correctly predicted by the classifier. When setting `mode = "everything"`,
the `confusionMatrix` function returns all available prediction measures including
sensitivity, specificity, precision, recall and the F1 score per cell 
phenotype class.

```{r confusion-matrix}
cm <- confusionMatrix(data = cur_pred, 
                      reference = factor(test_spe$cell_labels), 
                      mode = "everything")

cm
```

To easily visualize these results, we can now plot the true positive rate (sensitivity)
versus the false positive rate (1 - specificity):

```{r specificity-sensitivity, message=FALSE}
library(tidyverse)

data.frame(cm$byClass) %>%
  mutate(class = sub("Class: ", "", rownames(cm$byClass))) %>%
  ggplot() + 
  geom_point(aes(1 - Specificity, Sensitivity, 
                 size = Detection.Rate,
                 fill = class),
             shape = 21) + 
  scale_fill_manual(values = metadata(spe)$color_vectors$celltype) +
  theme_classic(base_size = 15) + 
  ylab("Sensitivity (TPR)") +
  xlab("1 - Specificity (FPR)")
```

We observe high sensitivity and specificity for most cell types. Plasma cells
show the lowest true positive rate with 92% being sufficiently high. The size of
the circle specifies the number of cells per class.

Finally, to observe which cell phenotypes were wrongly classified, we can visualize
the distribution of classification probabilities per cell phenotype class:

```{r prediciton-probability, fig.height=15}
cur_pred <- predict(rffit, 
                    newdata = cur_mat, 
                    type = "prob")
cur_pred$truth <- factor(test_spe$cell_labels)

cur_pred %>%
  pivot_longer(cols = Bcell:Tumor) %>%
  ggplot() +
  geom_boxplot(aes(x = name, y = value, fill = name), outlier.size = 0.5) +
  facet_wrap(. ~ truth, ncol = 1) + 
  scale_fill_manual(values = metadata(spe)$color_vectors$celltype)  +
  theme(panel.background = element_blank(), 
        axis.text.x = element_text(angle = 45, hjust = 1))
```

The boxplots indicate the classification probabilities per class. The classifier
is well trained if classification probabilities are only high for the one 
specific class.

### Classification of new data

In the final section, we will now use the tuned and tested random forest 
classifier to predict the cell phenotypes of the unlabeled data.

First, we predict the cell phenotypes and extract their classification
probabilities.

```{r, unlabeled-prediction}
# Select unlabeled data
cur_mat <- t(assay(unlab_spe, "exprs")[rowData(unlab_spe)$use_channel,])

# Predict cell phenotypes
cell_class <- as.character(predict.train(rffit, 
                                         newdata = cur_mat, 
                                         type = "raw"))
names(cell_class) <- rownames(cur_mat)

table(cell_class)

# Extract classification probabilities
cell_prob <- predict.train(rffit, 
                           newdata = cur_mat, 
                           type = "prob")
```

Each cell is assigned to the class with highest probability. There are however
cases, where the highest probability is low meaning the cell can not be uniquely
assigned to a class. We next want to identify these cells and label them as 
"undefined". Here, we select a maximum classification probability threshold 
of 40% but this threshold needs to be adjusted for other datasets. The adjusted
cell labels are then stored in the `SpatialExperiment` object.

```{r, undefined-cells}
library(ggridges)

# Distribution of maximum probabilities
tibble(max_prob = rowMax(as.matrix(cell_prob)),
       type = cell_class) %>%
    ggplot() +
        geom_density_ridges(aes(x = max_prob, y = cell_class, fill = cell_class)) +
        scale_fill_manual(values = metadata(spe)$color_vectors$celltype) +
        theme_classic(base_size = 15) +
        xlab("Maximum probability") +
        ylab("Cell type") + 
        xlim(c(0,1.2))

# Label undefined cells
cell_class[rowMax(as.matrix(cell_prob)) < 0.4] <- "undefined"

# Store labels in SpatialExperiment onject
cell_labels <- spe$cell_labels
cell_labels[colnames(unlab_spe)] <- cell_class
spe$celltype <- cell_labels 

table(spe$celltype, spe$patient_id)
```

We can now compare the cell labels derived by classification to the different
clustering strategies. The first comparison is against the clustering results
using the asinh-transformed counts.

```{r compare-raw, message=FALSE}
tab1 <- table(spe$celltype, 
              paste("Rphenograph", spe$pg_clusters))
tab2 <- table(spe$celltype, 
              paste("SNN", spe$nn_clusters))
tab3 <- table(spe$celltype, 
              paste("SOM", spe$som_clusters))

pheatmap(log10(tab1 + 10), color = viridis(100))
pheatmap(log10(tab2 + 10), color = viridis(100))
pheatmap(log10(tab3 + 10), color = viridis(100))
```

We can see that Tumor and Myeloid cells span multiple clusters while
Neutrophiles are detected as an individual cluster by all clustering approaches.

We next compare the cell classification against clustering results using the 
integrated cells.

```{r compare-corrected, message=FALSE}
tab1 <- table(spe$celltype, 
              paste("Rphenograph", spe$pg_clusters_corrected))
tab2 <- table(spe$celltype, 
              paste("SNN", spe$nn_clusters_corrected))
tab3 <- table(spe$celltype, 
              paste("SOM", spe$som_clusters_corrected))

pheatmap(log10(tab1 + 10), color = viridis(100))
pheatmap(log10(tab2 + 10), color = viridis(100))
pheatmap(log10(tab3 + 10), color = viridis(100))
```

We observe a high agreement between the shared nearest neighbor clustering
approach using the integrated cells and the cell phenotypes derived by
classification.

In the next sections, we will highlight visualization strategies to verify the
correctness of the phenotyping approach. Specifically, Section
\@ref(outline-cells) shows how to outline identified cell phenotypes on
composite images.

```{r display-images, echo=FALSE, message = FALSE, results='hide'}
# This section serves quality control purposes.
library(testthat)
test_that("images are written out", {
    skip_on_ci()
    expect_true(TRUE)
    cell_types <- unique(spe$celltype)

    markers <- vector(mode = "list", length = length(cell_types))
    names(markers) <- cell_types

    markers[["Bcell"]] <- c("CD20", "CD3")
    markers[["Plasma_cell"]] <- c("CD38")
    markers[["CD4"]] <- c("CD4", "CD8a", "FOXP3")
    markers[["BnTcell"]] <- c("CD20","CD3")
    markers[["CD8"]] <- c("CD4", "CD8a", "FOXP3")
    markers[["Treg"]] <- c("CD4", "CD8a", "FOXP3")
    markers[["Myeloid"]] <- c("CD11c", "CD68", "HLADR", "CD163")
    markers[["Neutrophil"]] <- c("CD15", "MPO")
    markers[["Stroma"]] <- c("SMA", "PDGFRb")
    markers[["Tumor"]] <- c("Ecad")
    markers[["undefined"]] <- c("Ecad", "CD3", "CD20", "MPO", "CD38")
    
    images <- readRDS("data/images.rds")
    masks <- readRDS("data/masks.rds")
    
    images <- cytomapper::normalize(images, separateImages = TRUE)
    images <- cytomapper::normalize(images, separateImages = TRUE, 
                        inputRange = c(0, 0.2))
    
    if (!dir.exists("data/CellTypeValidation")) {
        dir.create("data/CellTypeValidation")
    }
    
    for(j in unique(spe$celltype)){
        cur_sce <- spe[,spe$celltype == j]
        cur_markers <- markers[[j]]
  
        if(dim(cur_sce)[2] == 0) {
            next(j)
        }
    
        if (length(cur_markers) == 1) {
            cur_col <- "red"
            names(cur_col) <- j
            cur_col <- list(cur_col)
            names(cur_col) <- "celltype"
            plotPixels(image = images, 
                object = cur_sce, 
                mask = masks, 
                cell_id = "ObjectNumber",
                img_id = "sample_id", 
                colour_by = cur_markers, 
                outline_by = "celltype",
                image_title = list(text = names(images),
                                cex = 1), 
                colour = cur_col,
                save_plot = list(filename = paste0("data/CellTypeValidation/",
                            j, ".png")))
        } else {
            plotPixels(image = images,
                   object = cur_sce, 
                   mask = masks, 
                   cell_id = "ObjectNumber",
                   img_id = "sample_id", 
                   colour_by = cur_markers, 
                   outline_by = "celltype",
                   image_title = list(text = names(images),
                                  cex = 1),
                   save_plot = list(filename = paste0("data/CellTypeValidation/",
                            j, ".png")))
        }
    }
    expect_true(TRUE)
    
})
```

Finally, we save the updated `SpatialExperiment` object.

```{r phenotyping-save-object}
saveRDS(spe, "data/spe.rds")
```

## Session Info

<details>
   <summary>SessionInfo</summary>
   
```{r, echo = FALSE}
sessionInfo()
```
</details>