forked from BodenmillerGroup/IMCDataAnalysis
-
Notifications
You must be signed in to change notification settings - Fork 0
/
06-quality_control.Rmd
462 lines (383 loc) · 19.3 KB
/
06-quality_control.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
# Image and cell-level quality control
The following section discusses possible quality indicators for data obtained
by IMC and other highly multiplexed imaging technologies. Here, we will focus
on describing quality metrics on the single-cell as well as image level.
## Read in the data
We will first read in the data processed in previous sections:
```{r read-data, message=FALSE}
images <- readRDS("data/images.rds")
masks <- readRDS("data/masks.rds")
spe <- readRDS("data/spe.rds")
```
## Segmentation quality control {#seg-quality}
The first step after image segmentation is to observe its accuracy.
Without having ground-truth data readily available, a common approach to
segmentation quality control is to overlay segmentation masks on composite images
displaying channels that were used for segmentation.
The [cytomapper](https://www.bioconductor.org/packages/release/bioc/html/cytomapper.html)
package supports exactly this tasks by using the `plotPixels` function.
Here, we select 3 random images and perform image- and channel-wise
normalization (channels are first min-max normalized and scaled to a range of
0-1 before clipping the maximum intensity to 0.2).
```{r overlay-masks, message=FALSE}
library(cytomapper)
set.seed(20220118)
img_ids <- sample(seq_len(length(images)), 3)
# Normalize and clip images
cur_images <- images[img_ids]
cur_images <- normalize(cur_images, separateImages = TRUE)
cur_images <- normalize(cur_images, inputRange = c(0, 0.2))
plotPixels(cur_images,
mask = masks[img_ids],
img_id = "sample_id",
missing_colour = "white",
colour_by = c("CD163", "CD20", "CD3", "Ecad", "DNA1"),
colour = list(CD163 = c("black", "yellow"),
CD20 = c("black", "red"),
CD3 = c("black", "green"),
Ecad = c("black", "cyan"),
DNA1 = c("black", "blue")),
image_title = NULL,
legend = list(colour_by.title.cex = 0.7,
colour_by.labels.cex = 0.7))
```
We can see that nuclei are centered within the segmentation masks and all cell
types are correctly segmented (note: to zoom into the image you can right click
and select `Open Image in New Tab`). A common challenge here is to segment large (e.g.
epithelial cells - in cyan) _versus_ small (e.g. B cells - in red). However, the
segmentation approach here appears to correctly segment cells across different
sizes.
An additional approach to observe cell segmentation quality and potentially also
antibody specificity issues is to visualize single-cell expression in form of a
heatmap. Here, we sub-sample the dataset to 2000 cells for visualization
purposes and overlay the cancer type from which the cells were extracted.
```{r segmentation-heatmap, message=FALSE, fig.height=7}
library(dittoSeq)
library(viridis)
cur_cells <- sample(seq_len(ncol(spe)), 2000)
dittoHeatmap(spe[,cur_cells], genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", cluster_cols = TRUE, scale = "none",
heatmap.colors = viridis(100), annot.by = "indication",
annotation_colors = list(indication = metadata(spe)$color_vectors$indication))
```
We can differentiate between epithelial cells (Ecad+) and immune cells (CD45RO).
Some of the markers are specifically detected (e.g., Ki67, CD20, Ecad) and
others are more broadly detected (e.g. HLADR, B2M, CD4).
## Image-level quality control {#image-quality}
Image-level quality control is often performed using tools that offer a
graphical user interface such as [QuPath](https://qupath.github.io/) and
[FIJI](https://imagej.net/software/fiji/). Viewers that were specifically
developed for IMC data can be seen
[here](https://bodenmillergroup.github.io/IMCWorkflow/viewers.html). In this
section, we will specifically focus on quantitative metrics to assess image
quality.
It is often of interest to calculate the signal-to-noise ratio (SNR) for
individual channels and markers. Here, we define the SNR as:
$$SNR = I_s/I_n$$
where $I_s$ is the intensity of the signal (mean intensity of pixels with true
signal) and $I_n$ is the intensity of the noise (mean intensity of pixels
containing noise). Finding a threshold that separates pixels containing signal
and pixels containing noise is not trivial and different approaches can be
chosen. Here, we use the `otsu` thresholding approach to find pixels of the
"foreground" (i.e., signal) and "background" (i.e., noise). The SNR is then
defined as the mean intensity of foreground pixels divided by the mean intensity
of background pixels. We compute this measure as well as mean signal intensity
per image as well as the 95% confidence interval. The plot below shows the
average SNR _versus_ the average signal intensity across all images.
```{r image-snr, message=FALSE, warning=FALSE}
library(tidyverse)
library(ggrepel)
library(EBImage)
cur_snr <- lapply(images, function(img){
mat <- apply(img, 3, function(ch){
# Otsu threshold
thres <- otsu(ch, range = c(min(ch), max(ch)))
# Signal-to-noise ratio
snr <- mean(ch[ch > thres]) / mean(ch[ch <= thres])
# Signal intensity
ps <- mean(ch[ch > thres])
return(c(snr = snr, ps = ps))
})
t(mat) %>% as.data.frame() %>%
mutate(marker = colnames(mat)) %>%
pivot_longer(cols = c(snr, ps))
})
cur_snr <- do.call(rbind, cur_snr)
cur_snr %>%
group_by(marker, name) %>%
summarize(mean = mean(value),
ci = qnorm(0.975)*sd(value)/sqrt(n())) %>%
pivot_wider(names_from = name, values_from = c(mean, ci)) %>%
ggplot() +
# geom_errorbar(aes(y = log2(mean_snr), xmin = log2(mean_ps - ci_ps),
# xmax = log2(mean_ps + ci_ps))) +
# geom_errorbar(aes(x = log2(mean_ps), ymin = log2(mean_snr - ci_snr),
# ymax = log2(mean_snr + ci_snr))) +
geom_point(aes(log2(mean_ps), log2(mean_snr))) +
geom_label_repel(aes(log2(mean_ps), log2(mean_snr), label = marker)) +
theme_minimal(base_size = 15) + ylab("Signal-to-noise ratio [log2]") +
xlab("Signal intensity [log2]")
```
We observe PD1, LAG3 and cleaved PARP to have high SNR but low signal intensity
meaning that in general these markers are not abundantly expressed. The Iridium
intercalator (here marked as DNA1 and DNA2) has the highest signal intensity
but low SNR. This might be due to staining differences between individual nuclei
where some nuclei are considered as background. We do however observe high
SNR and sufficient signal intensity for the majority of markers.
Another quality indicator is the image area covered by cells (or biological
tissue). This metric identifies ROIs where little cells are present, possibly
hinting at incorrect selection of the ROI. We can compute the percentage of
covered image area using the metadata contained in the `SpatialExperiment`
object:
```{r cell-density}
colData(spe) %>%
as.data.frame() %>%
group_by(sample_id) %>%
summarize(cell_area = sum(area),
no_pixels = mean(width_px) * mean(height_px)) %>%
mutate(covered_area = cell_area / no_pixels) %>%
ggplot() +
geom_point(aes(sample_id, covered_area)) +
theme_minimal(base_size = 15) +
ylim(c(0, 1)) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
ylab("% covered area") + xlab("")
```
We observe that two of the 14 images show unusually low cell coverage. These
two images can now be visualized using `cytomapper`.
```{r low-density-images, message=FALSE}
# Normalize and clip images
cur_images <- images[c("Patient4_005", "Patient4_007")]
cur_images <- normalize(cur_images, separateImages = TRUE)
cur_images <- normalize(cur_images, inputRange = c(0, 0.2))
plotPixels(cur_images,
mask = masks[c("Patient4_005", "Patient4_007")],
img_id = "sample_id",
missing_colour = "white",
colour_by = c("CD163", "CD20", "CD3", "Ecad", "DNA1"),
colour = list(CD163 = c("black", "yellow"),
CD20 = c("black", "red"),
CD3 = c("black", "green"),
Ecad = c("black", "cyan"),
DNA1 = c("black", "blue")),
legend = list(colour_by.title.cex = 0.7,
colour_by.labels.cex = 0.7))
```
These two images display less dense tissue structure but overall the images are
intact and seem segemented correctly.
Finally, it can be beneficial to visualize the mean marker expression per image
to identify images with outlying marker expression. This check does not
indicate image quality _per se_ but can highlight biological differences. Here,
we will use the `aggregateAcrossCells` function of the
`r BiocStyle::Biocpkg("scuttle")` package to compute the mean expression per
image. For visualization purposes, we again `asinh` transform the mean expression
values.
```{r mean-expression-per-image, message=FALSE, fig.height=7}
library(scuttle)
image_mean <- aggregateAcrossCells(spe,
ids = spe$sample_id,
statistics="mean",
use.assay.type = "counts")
assay(image_mean, "exprs") <- asinh(counts(image_mean))
dittoHeatmap(image_mean, genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", cluster_cols = TRUE, scale = "none",
heatmap.colors = viridis(100),
annot.by = c("indication", "patient_id", "ROI"),
annotation_colors = list(indication = metadata(spe)$color_vectors$indication,
patient_id = metadata(spe)$color_vectors$patient_id,
ROI = metadata(spe)$color_vectors$ROI),
show_colnames = TRUE)
```
We observe extensive biological variation across the 14 images. Some images show
high expression of the macrophage marker CD206, the B cell marker CD20, the
neutrophil marker CD15, and the proliferation marker Ki67. These differences will
be further studied in the following chapters.
## Cell-level quality control {#cell-quality}
In the following paragraphs we will look at different metrics and visualization
approaches to assess data quality (as well as biological differences) on the
single-cell level.
Related to the signal-to-noise ratio (SNR) calculated above on the pixel-level,
a similar measure can be derived on the single-cell level. Here, we will use
a two component Gaussian mixture model for each marker to find cells
with positive and negative expression. The SNR is defined as:
$$SNR = I_s/I_n$$
where $I_s$ is the intensity of the signal (mean intensity of cells with positive
signal) and $I_n$ is the intensity of the noise (mean intensity of cells
lacking expression). We calculate the SNR and signal intensity by fitting the
mixture model across the transformed counts of all cells contained in the
`SpatialExperiment` object.
```{r cell-snr, message=FALSE, warning=FALSE, results="hide", fig.keep="all"}
library(mclust)
set.seed(220224)
mat <- apply(assay(spe, "exprs"), 1, function(x){
cur_model <- Mclust(x, G = 2)
mean1 <- mean(x[cur_model$classification == 1])
mean2 <- mean(x[cur_model$classification == 2])
signal <- ifelse(mean1 > mean2, mean1, mean2)
noise <- ifelse(mean1 > mean2, mean2, mean1)
return(c(snr = signal/noise, ps = signal))
})
cur_snr <- t(mat) %>% as.data.frame() %>%
mutate(marker = colnames(mat))
cur_snr %>% ggplot() +
geom_point(aes(log2(ps), log2(snr))) +
geom_label_repel(aes(log2(ps), log2(snr), label = marker)) +
theme_minimal(base_size = 15) + ylab("Signal-to-noise ratio [log2]") +
xlab("Signal intensity [log2]")
```
Next, we observe the distributions of cell size across the individual images.
Differences in cell size distributions can indicate segmentation biases due to
differences in cell density or can indicate biological differences due to cell
type compositions (tumor cells tend to be larger than immune cells).
```{r cell-size, message=FALSE}
colData(spe) %>%
as.data.frame() %>%
group_by(sample_id) %>%
ggplot() +
geom_boxplot(aes(sample_id, area)) +
theme_minimal(base_size = 15) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
ylab("Cell area") + xlab("")
summary(spe$area)
```
The median cell size is `r median(spe$area)` pixels with a median major axis
length of `r round(median(spe$major_axis_length), digits = 1)`. The largest cell
has an area of `r max(spe$area)` pixels which relates to a diameter of
`r round(sqrt(max(spe$area)), digits = 1)` pixels assuming a circular shape.
Overall, the distribution of cell sizes is similar across images with images from
`Patient4_005` and `Patient4_007` showing a reduced average cell size. These
images contain fewer tumor cells which can explain the smaller average cell size.
We detect very small cells in the dataset and will remove them.
The chosen threshold is arbitrary and needs to be adjusted per dataset.
```{r remove-small-cells}
sum(spe$area < 5)
spe <- spe[,spe$area >= 5]
```
Another quality indicator can be an absolute measure of cell density often reported in cells per mm$^2$.
```{r no-cells-per-image, message=FALSE}
colData(spe) %>%
as.data.frame() %>%
group_by(sample_id) %>%
summarize(cell_count = n(),
no_pixels = mean(width_px) * mean(height_px)) %>%
mutate(cells_per_mm2 = cell_count/(no_pixels/1000000)) %>%
ggplot() +
geom_point(aes(sample_id, cells_per_mm2)) +
theme_minimal(base_size = 15) +
theme(axis.text.x = element_text(angle = 90, hjust = 1, size = 8)) +
ylab("Cells per mm2") + xlab("")
```
The number of cells per mm$^2$ varies across images which also depends on the
number of tumor/non-tumor cells. As we can see in the following sections, some
immune cells appear in cell dense regions while other stromal regions are less
dense.
The data presented here originate from samples from different locations with
potential differences in pre-processing and each sample stained individually.
These (and other) technical aspects can induce staining differences between
samples or batches of samples. Observing potential staining differences can be
crucial to assess data quality. We will use ridgeline visualizations to check
differences in staining patterns:
```{r ridges, message=FALSE, fig.width=7, fig.height=25}
multi_dittoPlot(spe, vars = rownames(spe)[rowData(spe)$use_channel],
group.by = "patient_id", plots = c("ridgeplot"),
assay = "exprs",
color.panel = metadata(spe)$color_vectors$patient_id)
```
We observe variations in the distributions of marker expression across patients.
These variations may arise partly from different abundances of cells in
different images (e.g. Patient3 may have higher numbers of CD11c+ and PD1+
cells) as well as staining differences between samples. While most of the
selected markers are specifically expressed in immune cell subtypes, we can see
that E-Cadherin (a marker for epithelial (tumor) cells) shows similar expression across all
patients.
Finally, we will use non-linear dimensionality reduction methods to project
cells from a high-dimensional (40) down to a low-dimensional (2) space. For this
the `r BiocStyle::Biocpkg("scater")` package provides the `runUMAP` and `runTSNE`
function. To ensure reproducibility, we will need to set a seed;
however different seeds and different parameter settings (e.g. the `perplexity`)
parameter in the `runTSNE` function need to be tested to avoid interpreting
visualization artefacts. For dimensionality reduction, we will use all channels
that show biological variation across the dataset. However, marker selection
can be performed with different biological questions in mind.
```{r dimred, message=FALSE}
library(scater)
set.seed(220225)
spe <- runUMAP(spe, subset_row = rowData(spe)$use_channel, exprs_values = "exprs")
spe <- runTSNE(spe, subset_row = rowData(spe)$use_channel, exprs_values = "exprs")
```
After dimensionality reduction, the low-dimensional embeddings are stored in the
`reducedDim` slot.
```{r show-dimred-slot}
reducedDims(spe)
head(reducedDim(spe, "UMAP"))
```
Visualization of the low-dimensional embedding facilitates assessment of
potential "batch effects". The `dittoDimPlot`
function allows flexible visualization. It returns `ggplot` objects which
can be further modified.
```{r visualizing-dimred-1, message=FALSE, fig.height=12}
library(patchwork)
# visualize patient id
p1 <- dittoDimPlot(spe, var = "patient_id", reduction.use = "UMAP", size = 0.2) +
scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
ggtitle("Patient ID on UMAP")
p2 <- dittoDimPlot(spe, var = "patient_id", reduction.use = "TSNE", size = 0.2) +
scale_color_manual(values = metadata(spe)$color_vectors$patient_id) +
ggtitle("Patient ID on TSNE")
# visualize region of interest id
p3 <- dittoDimPlot(spe, var = "ROI", reduction.use = "UMAP", size = 0.2) +
scale_color_manual(values = metadata(spe)$color_vectors$ROI) +
ggtitle("ROI ID on UMAP")
p4 <- dittoDimPlot(spe, var = "ROI", reduction.use = "TSNE", size = 0.2) +
scale_color_manual(values = metadata(spe)$color_vectors$ROI) +
ggtitle("ROI ID on TSNE")
# visualize indication
p5 <- dittoDimPlot(spe, var = "indication", reduction.use = "UMAP", size = 0.2) +
scale_color_manual(values = metadata(spe)$color_vectors$indication) +
ggtitle("Indication on UMAP")
p6 <- dittoDimPlot(spe, var = "indication", reduction.use = "TSNE", size = 0.2) +
scale_color_manual(values = metadata(spe)$color_vectors$indication) +
ggtitle("Indication on TSNE")
(p1 + p2) / (p3 + p4) / (p5 + p6)
```
```{r, visualizing-dimred-2, message=FALSE}
# visualize marker expression
p1 <- dittoDimPlot(spe, var = "Ecad", reduction.use = "UMAP",
assay = "exprs", size = 0.2) +
scale_color_viridis(name = "Ecad") +
ggtitle("E-Cadherin expression on UMAP")
p2 <- dittoDimPlot(spe, var = "CD45RO", reduction.use = "UMAP",
assay = "exprs", size = 0.2) +
scale_color_viridis(name = "CD45RO") +
ggtitle("CD45RO expression on UMAP")
p3 <- dittoDimPlot(spe, var = "Ecad", reduction.use = "TSNE",
assay = "exprs", size = 0.2) +
scale_color_viridis(name = "Ecad") +
ggtitle("Ecad expression on TSNE")
p4 <- dittoDimPlot(spe, var = "CD45RO", reduction.use = "TSNE",
assay = "exprs", size = 0.2) +
scale_color_viridis(name = "CD45RO") +
ggtitle("CD45RO expression on TSNE")
(p1 + p2) / (p3 + p4)
```
We observe a strong separation of tumor cells (Ecad+ cells) between the
patients. Here, each patient was diagnosed with a different tumor type. The
separation of tumor cells could be of biological origin since tumor cells tend
to display differences in expression between patients and cancer types and/or of
technical origin: the panel only contains a single tumor marker (E-Cadherin) and
therefore slight technical differences in staining causes visible separation
between cells of different patients. Nevertheless, the immune compartment
(CD45RO+ cells) mix between patients and we can rule out systematic staining
differences between patients.
## Save objects
The modified `SpatialExperiment` object is saved for further downstream analysis.
```{r save-objects-quality-control}
saveRDS(spe, "data/spe.rds")
```
## Session Info
<details>
<summary>SessionInfo</summary>
```{r, echo = FALSE}
sessionInfo()
```
</details>