06-quality_control.Rmd

# Image and cell-level quality control

The following section discusses possible quality indicators for data obtained
by IMC and other highly multiplexed imaging technologies. Here, we will focus
on describing quality metrics on the single-cell as well as image level.

## Read in the data

We will first read in the data processed in previous sections:

```{r read-data, message=FALSE}
images <- readRDS("data/images.rds")
masks <- readRDS("data/masks.rds")
spe <- readRDS("data/spe.rds")
```

## Segmentation quality control {#seg-quality}

The first step after image segmentation is to observe its accuracy.
Without having ground-truth data readily available, a common approach to 
segmentation quality control is to overlay segmentation masks on composite images
displaying channels that were used for segmentation. 
The [cytomapper](https://www.bioconductor.org/packages/release/bioc/html/cytomapper.html) 
package supports exactly this tasks by using the `plotPixels` function.

Here, we select 3 random images and perform image- and channel-wise
normalization (channels are first min-max normalized and scaled to a range of
0-1 before clipping the maximum intensity to 0.2).

```{r overlay-masks, message=FALSE}
library(cytomapper)
set.seed(20220118)
img_ids <- sample(seq_len(length(images)), 3)

# Normalize and clip images
cur_images <- images[img_ids]
cur_images <- normalize(cur_images, separateImages = TRUE)
cur_images <- normalize(cur_images, inputRange = c(0, 0.2))

plotPixels(cur_images,
           mask = masks[img_ids],
           img_id = "sample_id",
           missing_colour = "white",
           colour_by = c("CD163", "CD20", "CD3", "Ecad", "DNA1"),
           colour = list(CD163 = c("black", "yellow"),
                         CD20 = c("black", "red"),
                         CD3 = c("black", "green"),
                         Ecad = c("black", "cyan"),
                         DNA1 = c("black", "blue")),
           image_title = NULL,
           legend = list(colour_by.title.cex = 0.7,
                         colour_by.labels.cex = 0.7))
```

We can see that nuclei are centered within the segmentation masks and all cell
types are correctly segmented (note: to zoom into the image you can right click
and select `Open Image in New Tab`). A common challenge here is to segment large (e.g.
epithelial cells - in cyan) _versus_ small (e.g. B cells - in red). However, the
segmentation approach here appears to correctly segment cells across different
sizes.

An additional approach to observe cell segmentation quality and potentially also
antibody specificity issues is to visualize single-cell expression in form of a
heatmap. Here, we sub-sample the dataset to 2000 cells for visualization
purposes and overlay the cancer type from which the cells were extracted.

```{r segmentation-heatmap, message=FALSE, fig.height=7}
library(dittoSeq)
library(viridis)
cur_cells <- sample(seq_len(ncol(spe)), 2000)

dittoHeatmap(spe[,cur_cells], genes = rownames(spe)[rowData(spe)$use_channel],
             assay = "exprs", cluster_cols = TRUE, scale = "none",
             heatmap.colors = viridis(100), annot.by = "indication",
             annotation_colors = list(indication = metadata(spe)$color_vectors$indication))
```

We can differentiate between epithelial cells (Ecad+) and immune cells (CD45RO).
Some of the markers are specifically detected (e.g., Ki67, CD20, Ecad) and
others are more broadly detected (e.g. HLADR, B2M, CD4).

## Image-level quality control {#image-quality}

Image-level quality control is often performed using tools that offer a
graphical user interface such as [QuPath](https://qupath.github.io/) and
[FIJI](https://imagej.net/software/fiji/). Viewers that were specifically
developed for IMC data can be seen
[here](https://bodenmillergroup.github.io/IMCWorkflow/viewers.html). In this
section, we will specifically focus on quantitative metrics to assess image
quality.

It is often of interest to calculate the signal-to-noise ratio (SNR) for
individual channels and markers. Here, we define the SNR as:

$$SNR = I_s/I_n$$

where $I_s$ is the intensity of the signal (mean intensity of pixels with true
signal) and $I_n$ is the intensity of the noise (mean intensity of pixels
containing noise). Finding a threshold that separates pixels containing signal
and pixels containing noise is not trivial and different approaches can be
chosen. Here, we use the `otsu` thresholding approach to find pixels of the
"foreground" (i.e., signal) and "background" (i.e., noise). The SNR is then
defined as the mean intensity of foreground pixels divided by the mean intensity
of background pixels. We compute this measure as well as mean signal intensity
per image as well as the 95% confidence interval. The plot below shows the  
average SNR _versus_ the average signal intensity across all images.  

```{r image-snr, message=FALSE, warning=FALSE}
library(tidyverse)
library(ggrepel)
library(EBImage)
cur_snr <- lapply(images, function(img){
    mat <- apply(img, 3, function(ch){
        # Otsu threshold
        thres <- otsu(ch, range = c(min(ch), max(ch)))
        # Signal-to-noise ratio
        snr <- mean(ch[ch > thres]) / mean(ch[ch <= thres])
        # Signal intensity
        ps <- mean(ch[ch > thres])
        
        return(c(snr = snr, ps = ps))
    })
    t(mat) %>% as.data.frame() %>% 
        mutate(marker = colnames(mat)) %>% 
        pivot_longer(cols = c(snr, ps))
})

cur_snr <- do.call(rbind, cur_snr)

cur_snr %>% 
    group_by(marker, name) %>%
    summarize(mean = mean(value),
              ci = qnorm(0.975)*sd(value)/sqrt(n())) %>%
    pivot_wider(names_from = name, values_from = c(mean, ci)) %>%
    ggplot() +
#    geom_errorbar(aes(y = log2(mean_snr), xmin = log2(mean_ps - ci_ps), 
#                      xmax = log2(mean_ps + ci_ps))) +
#    geom_errorbar(aes(x = log2(mean_ps), ymin = log2(mean_snr - ci_snr), 
#                      ymax = log2(mean_snr + ci_snr))) +
    geom_point(aes(log2(mean_ps), log2(mean_snr))) +
    geom_label_repel(aes(log2(mean_ps), log2(mean_snr), label = marker)) +
    theme_minimal(base_size = 15) + ylab("Signal-to-noise ratio [log2]") +
    xlab("Signal intensity [log2]")
```

We observe PD1, LAG3 and cleaved PARP to have high SNR but low signal intensity
meaning that in general these markers are not abundantly expressed. The Iridium
intercalator (here marked as DNA1 and DNA2) has the highest signal intensity 
but low SNR. This might be due to staining differences between individual nuclei
where some nuclei are considered as background. We do however observe high
SNR and sufficient signal intensity for the majority of markers.

Another quality indicator is the image area covered by cells (or biological
tissue). This metric identifies ROIs where little cells are present, possibly
hinting at incorrect selection of the ROI. We can compute the percentage of
covered image area using the metadata contained in the `SpatialExperiment`
object:

```{r cell-density}
colData(spe) %>%
    as.data.frame() %>%
    group_by(sample_id) %>%
    summarize(cell_area = sum(area),
           no_pixels = mean(width_px) * mean(height_px)) %>%
    mutate(covered_area = cell_area / no_pixels) %>%
    ggplot() +
        geom_point(aes(sample_id, covered_area)) + 
        theme_minimal(base_size = 15) +
        ylim(c(0, 1)) + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
        ylab("% covered area") + xlab("")
```

We observe that two of the 14 images show unusually low cell coverage. These
two images can now be visualized using `cytomapper`.

```{r low-density-images, message=FALSE}
# Normalize and clip images
cur_images <- images[c("Patient4_005", "Patient4_007")]
cur_images <- normalize(cur_images, separateImages = TRUE)
cur_images <- normalize(cur_images, inputRange = c(0, 0.2))

plotPixels(cur_images,
           mask = masks[c("Patient4_005", "Patient4_007")],
           img_id = "sample_id",
           missing_colour = "white",
           colour_by = c("CD163", "CD20", "CD3", "Ecad", "DNA1"),
           colour = list(CD163 = c("black", "yellow"),
                         CD20 = c("black", "red"),
                         CD3 = c("black", "green"),
                         Ecad = c("black", "cyan"),
                         DNA1 = c("black", "blue")),
           legend = list(colour_by.title.cex = 0.7,
                         colour_by.labels.cex = 0.7))
```

These two images display less dense tissue structure but overall the images are
intact and seem segemented correctly.

Finally, it can be beneficial to visualize the mean marker expression per image
to identify images with outlying marker expression. This check does not
indicate image quality _per se_ but can highlight biological differences. Here,
we will use the `aggregateAcrossCells` function of the 
`r BiocStyle::Biocpkg("scuttle")` package to compute the mean expression per 
image. For visualization purposes, we again `asinh` transform the mean expression
values.

```{r mean-expression-per-image, message=FALSE, fig.height=7}
library(scuttle)

image_mean <- aggregateAcrossCells(spe, 
                                   ids = spe$sample_id, 
                                   statistics="mean",
                                   use.assay.type = "counts")
assay(image_mean, "exprs") <- asinh(counts(image_mean))

dittoHeatmap(image_mean, genes = rownames(spe)[rowData(spe)$use_channel],
             assay = "exprs", cluster_cols = TRUE, scale = "none",
             heatmap.colors = viridis(100), 
             annot.by = c("indication", "patient_id", "ROI"),
             annotation_colors = list(indication = metadata(spe)$color_vectors$indication,
                                      patient_id = metadata(spe)$color_vectors$patient_id,
                                      ROI = metadata(spe)$color_vectors$ROI),
             show_colnames = TRUE)
```

We observe extensive biological variation across the 14 images. Some images show
high expression of the macrophage marker CD206, the B cell marker CD20, the
neutrophil marker CD15, and the proliferation marker Ki67. These differences will
be further studied in the following chapters.

## Cell-level quality control {#cell-quality}

In the following paragraphs we will look at different metrics and visualization
approaches to assess data quality (as well as biological differences) on the 
single-cell level. 

Related to the signal-to-noise ratio (SNR) calculated above on the pixel-level, 
a similar measure can be derived on the single-cell level. Here, we will use 
a two component Gaussian mixture model for each marker to find cells
with positive and negative expression. The SNR is defined as:

$$SNR = I_s/I_n$$

where $I_s$ is the intensity of the signal (mean intensity of cells with positive
signal) and $I_n$ is the intensity of the noise (mean intensity of cells
lacking expression). We calculate the SNR and signal intensity by fitting the
mixture model across the transformed counts of all cells contained in the
`SpatialExperiment` object.

```{r cell-snr, message=FALSE, warning=FALSE, results="hide", fig.keep="all"}
library(mclust)

set.seed(220224)
mat <- apply(assay(spe, "exprs"), 1, function(x){
    cur_model <- Mclust(x, G = 2)
    mean1 <- mean(x[cur_model$classification == 1])
    mean2 <- mean(x[cur_model$classification == 2])
    
    signal <- ifelse(mean1 > mean2, mean1, mean2)
    noise <- ifelse(mean1 > mean2, mean2, mean1)
    
    return(c(snr = signal/noise, ps = signal))
})
    
cur_snr <- t(mat) %>% as.data.frame() %>% 
        mutate(marker = colnames(mat))

cur_snr %>% ggplot() +
    geom_point(aes(log2(ps), log2(snr))) +
    geom_label_repel(aes(log2(ps), log2(snr), label = marker)) +
    theme_minimal(base_size = 15) + ylab("Signal-to-noise ratio [log2]") +
    xlab("Signal intensity [log2]")
```

Next, we observe the distributions of cell size across the individual images.
Differences in cell size distributions can indicate segmentation biases due to
differences in cell density or can indicate biological differences due to cell
type compositions (tumor cells tend to be larger than immune cells). 

```{r cell-size, message=FALSE}
colData(spe) %>%
    as.data.frame() %>%
    group_by(sample_id) %>%
    ggplot() +
        geom_boxplot(aes(sample_id, area)) +
        theme_minimal(base_size = 15) + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
        ylab("Cell area") + xlab("")

summary(spe$area)
```

The median cell size is `r median(spe$area)` pixels with a median major axis
length of `r round(median(spe$major_axis_length), digits = 1)`. The largest cell
has an area of `r max(spe$area)` pixels which relates to a diameter of 
`r round(sqrt(max(spe$area)), digits = 1)` pixels assuming a circular shape.
Overall, the distribution of cell sizes is similar across images with images from
`Patient4_005` and `Patient4_007` showing a reduced average cell size. These
images contain fewer tumor cells which can explain the smaller average cell size.

We detect very small cells in the dataset and will remove them.
The chosen threshold is arbitrary and needs to be adjusted per dataset.

```{r remove-small-cells}
sum(spe$area < 5)
spe <- spe[,spe$area >= 5]
```

Another quality indicator can be an absolute measure of cell density often reported in cells per mm$^2$.

```{r no-cells-per-image, message=FALSE}
colData(spe) %>%
    as.data.frame() %>%
    group_by(sample_id) %>%
    summarize(cell_count = n(),
           no_pixels = mean(width_px) * mean(height_px)) %>%
    mutate(cells_per_mm2 = cell_count/(no_pixels/1000000)) %>%
    ggplot() +
        geom_point(aes(sample_id, cells_per_mm2)) + 
        theme_minimal(base_size = 15)  + 
        theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
        ylab("Cells per mm2") + xlab("")
```

The number of cells per mm$^2$ varies across images which also depends on the
number of tumor/non-tumor cells. As we can see in the following sections, some
immune cells appear in cell dense regions while other stromal regions are less
dense.

The data presented here originate from samples from different locations with
potential differences in pre-processing and each sample stained individually.
These (and other) technical aspects can induce staining differences between
samples or batches of samples. Observing potential staining differences can be
crucial to assess data quality. We will use ridgeline visualizations to check
differences in staining patterns:

```{r ridges, message=FALSE, fig.width=7, fig.height=25}
multi_dittoPlot(spe, vars = rownames(spe)[rowData(spe)$use_channel],
               group.by = "patient_id", plots = c("ridgeplot"), 
               assay = "exprs", 
               color.panel = metadata(spe)$color_vectors$patient_id)
```

We observe variations in the distributions of marker expression across patients.
These variations may arise partly from different abundances of cells in
different images (e.g. Patient3 may have higher numbers of CD11c+ and PD1+
cells) as well as staining differences between samples. While most of the
selected markers are specifically expressed in immune cell subtypes, we can see
that E-Cadherin (a marker for epithelial (tumor) cells) shows similar expression across all
patients.

Finally, we will use non-linear dimensionality reduction methods to project
cells from a high-dimensional (40) down to a low-dimensional (2) space. For this
the `r BiocStyle::Biocpkg("scater")` package provides the `runUMAP` and `runTSNE`
function. To ensure reproducibility, we will need to set a seed; 
however different seeds and different parameter settings (e.g. the `perplexity`)
parameter in the `runTSNE` function need to be tested to avoid interpreting
visualization artefacts. For dimensionality reduction, we will use all channels
that show biological variation across the dataset. However, marker selection 
can be performed with different biological questions in mind.

```{r dimred, message=FALSE}
library(scater)

set.seed(220225)
spe <- runUMAP(spe, subset_row = rowData(spe)$use_channel, exprs_values = "exprs") 
spe <- runTSNE(spe, subset_row = rowData(spe)$use_channel, exprs_values = "exprs") 
```

After dimensionality reduction, the low-dimensional embeddings are stored in the 
`reducedDim` slot.

```{r show-dimred-slot}
reducedDims(spe)
head(reducedDim(spe, "UMAP"))
```

Visualization of the low-dimensional embedding facilitates assessment of 
potential "batch effects". The `dittoDimPlot`
function allows flexible visualization. It returns `ggplot` objects which
can be further modified.

```{r visualizing-dimred-1, message=FALSE, fig.height=12}
library(patchwork)

# visualize patient id 
p1 <- dittoDimPlot(spe, var = "patient_id", reduction.use = "UMAP", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on UMAP")
p2 <- dittoDimPlot(spe, var = "patient_id", reduction.use = "TSNE", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
    ggtitle("Patient ID on TSNE")

# visualize region of interest id
p3 <- dittoDimPlot(spe, var = "ROI", reduction.use = "UMAP", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$ROI) +
    ggtitle("ROI ID on UMAP")
p4 <- dittoDimPlot(spe, var = "ROI", reduction.use = "TSNE", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$ROI) +
    ggtitle("ROI ID on TSNE")

# visualize indication
p5 <- dittoDimPlot(spe, var = "indication", reduction.use = "UMAP", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$indication) +
    ggtitle("Indication on UMAP")
p6 <- dittoDimPlot(spe, var = "indication", reduction.use = "TSNE", size = 0.2) + 
    scale_color_manual(values = metadata(spe)$color_vectors$indication) +
    ggtitle("Indication on TSNE")

(p1 + p2) / (p3 + p4) / (p5 + p6)
```

```{r, visualizing-dimred-2, message=FALSE}
# visualize marker expression
p1 <- dittoDimPlot(spe, var = "Ecad", reduction.use = "UMAP", 
                   assay = "exprs", size = 0.2) +
    scale_color_viridis(name = "Ecad") +
    ggtitle("E-Cadherin expression on UMAP")
p2 <- dittoDimPlot(spe, var = "CD45RO", reduction.use = "UMAP", 
                   assay = "exprs", size = 0.2) +
    scale_color_viridis(name = "CD45RO") +
    ggtitle("CD45RO expression on UMAP")
p3 <- dittoDimPlot(spe, var = "Ecad", reduction.use = "TSNE", 
                   assay = "exprs", size = 0.2) +
    scale_color_viridis(name = "Ecad") +
    ggtitle("Ecad expression on TSNE")
p4 <- dittoDimPlot(spe, var = "CD45RO", reduction.use = "TSNE", 
                   assay = "exprs", size = 0.2) +
    scale_color_viridis(name = "CD45RO") +
    ggtitle("CD45RO expression on TSNE")

(p1 + p2) / (p3 + p4)
```

We observe a strong separation of tumor cells (Ecad+ cells) between the
patients. Here, each patient was diagnosed with a different tumor type. The
separation of tumor cells could be of biological origin since tumor cells tend
to display differences in expression between patients and cancer types and/or of
technical origin: the panel only contains a single tumor marker (E-Cadherin) and
therefore slight technical differences in staining causes visible separation
between cells of different patients. Nevertheless, the immune compartment
(CD45RO+ cells) mix between patients and we can rule out systematic staining
differences between patients.

## Save objects

The modified `SpatialExperiment` object is saved for further downstream analysis.

```{r save-objects-quality-control}
saveRDS(spe, "data/spe.rds")
```

## Session Info

<details>
   <summary>SessionInfo</summary>
   
```{r, echo = FALSE}
sessionInfo()
```
</details>