forked from BodenmillerGroup/IMCDataAnalysis
-
Notifications
You must be signed in to change notification settings - Fork 0
/
08-phenotyping.Rmd
1010 lines (809 loc) · 39.5 KB
/
08-phenotyping.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Cell phenotyping
A common step during single-cell data analysis is the annotation of cells based
on their phenotype. Defining cell phenotypes is often subjective and relies
on previous biological knowledge. The [Orchestrating Single Cell Analysis with Bioconductor](https://bioconductor.org/books/3.14/OSCA.basic/cell-type-annotation.html) book
presents a number of approaches to phenotype cells detected by single-cell RNA
sequencing based on reference datasets or prior knowledge of geneset importance.
In highly-multiplexed imaging, target proteins or molecules are manually
selected based on the biological question at hand. It narrows down the feature
space and facilitates the manual annotation of clusters to derive cell
phenotypes. We will therefore discuss and compare a number of clustering
approaches to group cells based on their similarity in marker expression in
Section \@ref(clustering).
Unlike single-cell RNA sequencing or CyTOF data, single-cell data derived from
highly-multiplexed imaging data often suffers from "lateral spillover" between
neighboring cells. This spillover caused by imperfect segmentation often hinders
accurate clustering to define specific cell phenotypes in multiplexed imaging
data. Tools have been developed to correct lateral spillover between cells
[@Bai2021] but the approach requires careful selection of the markers to
correct. In Section \@ref(classification) we will train and apply a random
forest classifier to classify cell phenotypes in the dataset as alternative
approach for clustering-based cell phenotyping. This approach has been previously used to
identify major cell phenotypes in metastatic melanoma and avoids clustering of
cells [@Hoch2022].
## Load data
We will first read in the previously generated `SpatialExperiment` object and
sample 2000 cells to visualize cluster membership.
```{r read-data-pheno, message=FALSE}
library(SpatialExperiment)
spe <- readRDS("data/spe.rds")
# Sample cells
set.seed(220619)
cur_cells <- sample(seq_len(ncol(spe)), 2000)
```
## Clustering approaches {#clustering}
In the first section to identify cellular phenotypes in the dataset, we will
present clustering approaches that group cells based on their similarity in
marker expression. A number of approaches have been developed to cluster data
derived from single-cell RNA sequencing technologies [@Yu2022] or CyTOF
[@Weber2016]. For demonstration purposes, we will highlight common clustering
approaches that are available in R and have been used for clustering cells
obtained from IMC. Two approaches rely on graph-based clustering and one
approach uses self organizing maps (SOM).
### Rphenograph
The PhenoGraph clustering approach was first described to group cells of a CyTOF
dataset [@Levine2015]. The algorithm first constructs a graph by detecting the
`k` nearest neighbours based on euclidean distance in expression space. In the
next step, edges between nodes (cells) are weighted by their overlap in nearest
neighbor sets. To quantify the overlap in shared nearest neighbor sets, the
jaccard index is used. The Louvain modularity optimization approach is used to
detect connected communities and partition the graph into clusters of cells.
This clustering strategy was used by Jackson, Fischer _et al._ and Schulz _et
al._ to cluster IMC data [@Jackson2020; @Schulz2018].
There are several different PhenoGraph implementations available in R. Here, we
use the one available at
[https://github.com/i-cyto/Rphenograph](https://github.com/i-cyto/Rphenograph).
For large datasets,
[https://github.com/stuchly/Rphenoannoy](https://github.com/stuchly/Rphenoannoy)
offers a more performant implementation of the algorithm.
In the following code chunk, we select the asinh-transformed mean pixel
intensities per cell and channel and subset the channels to the ones containing
biological variation. This matrix is transposed to store cells in rows. Within
the `Rphenograph` function, we select the 45 nearest neighbors for graph
building and louvain community detection (default). The function returns a list
of length 2, the first entry being the graph and the second entry containing the
community object. Calling `membership` on the community object will return
cluster IDs for each cell. These cluster IDs are then stored within the
`colData` of the `SpatialExperiment` object. Cluster IDs are mapped on top of
the UMAP embedding and single-cell marker expression within each cluster are
visualized in form of a heatmap.
```{r, echo=FALSE}
start_time <- Sys.time()
```
```{r rphenograph-1, message=FALSE}
library(Rphenograph)
library(igraph)
library(dittoSeq)
library(viridis)
mat <- t(assay(spe, "exprs")[rowData(spe)$use_channel,])
out <- Rphenograph(mat, k = 45)
clusters <- factor(membership(out[[2]]))
spe$pg_clusters <- clusters
dittoDimPlot(spe, var = "pg_clusters",
reduction.use = "UMAP", size = 0.2,
do.label = TRUE) +
ggtitle("Phenograph clusters expression on UMAP")
```
```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells],
genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", scale = "none",
heatmap.colors = viridis(100),
annot.by = c("pg_clusters", "patient_id"),
annot.colors = c(dittoColors(1)[1:length(unique(spe$pg_clusters))],
metadata(spe)$color_vectors$patient_id))
```
```{r, echo=FALSE}
end_time <- Sys.time()
```
The `Rphenograph` function call took
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
We can observe that some of the clusters only contain cells of a single patient.
This can often be observed in the tumor compartment. In the next step, we
use the integrated cells (see Section \@ref(batch-effects)) in low dimensional
embedding for clustering. Here, the low dimensional embedding can
be directly accessed from the `reducedDim` slot.
```{r rphenograph-2, message=FALSE}
mat <- reducedDim(spe, "fastMNN")
out <- Rphenograph(mat, k = 45)
clusters <- factor(membership(out[[2]]))
spe$pg_clusters_corrected <- clusters
dittoDimPlot(spe, var = "pg_clusters_corrected",
reduction.use = "UMAP_mnnCorrected", size = 0.2,
do.label = TRUE) +
ggtitle("Phenograph clusters expression on UMAP, integrated cells")
```
```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells],
genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", scale = "none",
heatmap.colors = viridis(100),
annot.by = c("pg_clusters_corrected", "pg_clusters","patient_id"),
annot.colors = c(dittoColors(1)[1:length(unique(spe$pg_clusters_corrected))],
dittoColors(1)[1:length(unique(spe$pg_clusters))],
metadata(spe)$color_vectors$patient_id))
```
Clustering using the integrated embedding leads to clusters that contain cells
of different patients. Cluster annotation can now be performed by manually
labeling cells based on their marker expression (see Notes in Section
\@ref(clustering-notes)).
### Shared nearest neighbour graph {#snn-graph}
The [bluster](https://www.bioconductor.org/packages/release/bioc/html/bluster.html)
package provides a simple interface to cluster cells using a number of different
[clustering approaches](https://www.bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/clusterRows.html) and different metrics to [access cluster stability](https://www.bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html).
For simplicity, we will focus on graph based clustering as this is the most
popular and a fast method for single-cell clustering. The `bluster` package
provides functionalities to build k-nearest neighbour (KNN) and it's weighted
version (shared nearest neighbor; SNN) graphs where nodes represent cells.
The user can chose the number of neighbors to consider (parameter `k`),
the edge weighting method (parameter `type`) and the community detection
function to use (parameter `cluster.fun`). As all parameters affect the clustering
results, the `bluster` package provides the `clusterSweep` function to test
a number of parameter settings in parallel. In the following code chunk,
we select the asinh-transformed mean pixel intensities and subset the markers
of interest. The resulting matrix is transposed to fit to the requirements of
the bluster package (cells in rows).
We test two different settings for `k`, two for `type` and two for `cluster.fun`.
This function call is parallelized by setting the `BPPARAM` parameter. We use
the `approxSilhouette` function to compute the silhouette width for each cell
and compute the average across all cells per parameter setting.
Please see `?silhouette` for more information on how the silhouette
width id computed for each cell. A large average silhouette width indicates
cells that are well clustered.
For time reasons, we sample 10000 cells but it is recommended to increase
this number to include rare cells into parameter testing.
```{r, echo=FALSE}
start_time <- Sys.time()
```
```{r parameter-sweep, message=FALSE, fig.height=7}
library(bluster)
library(BiocParallel)
library(ggplot2)
sam_cells <- sample(seq_len(ncol(spe)), 10000)
mat <- t(assay(spe, "exprs")[rowData(spe)$use_channel,])
combinations <- clusterSweep(mat[sam_cells,], BLUSPARAM=SNNGraphParam(),
k=c(10L, 20L),
type = c("rank", "jaccard"),
cluster.fun=c("walktrap", "louvain"),
BPPARAM = MulticoreParam(RNGseed = 220427))
sil <- vapply(as.list(combinations$clusters),
function(x) mean(approxSilhouette(mat[sam_cells,], x)$width),
0)
ggplot(data.frame(method = names(sil),
sil = sil)) +
geom_point(aes(method, sil)) +
theme_classic(base_size = 15) +
theme(axis.text.x = element_text(angle = 90, hjust = 1)) +
xlab("Cluster parameter combination") +
ylab("Average silhouette width")
```
```{r, echo=FALSE}
end_time <- Sys.time()
```
The cluster parameter sweep took
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
Performing a cluster sweep takes some time as multiple function calls are run in parallel.
We do however recommend testing a number of different parameter settings to
access clustering performance.
Once parameter settings are known, we can either use the `clusterRows` function
of the `bluster` package to cluster cells or its convenient wrapper function
exported by the
[scran](https://bioconductor.org/packages/release/bioc/html/scran.html) package.
The `scran::clusterCells` function accepts a `SpatialExperiment` (or
`SingleCellExperiment`) object which stores cells in columns. By default, the
function detects the 10 nearest neighbours for each cell, performs rank-based
weighting of edges (see `?makeSNNGraph` for more information) and uses the
`cluster_walktrap` function to detect communities in the graph.
As we can see above, the `louvain` community detection approach outperforms the
default `walktrap` approach in this dataset with `k` being 20 and rank-based
edge weighting.
```{r, echo=FALSE}
start_time <- Sys.time()
```
```{r snn-1, message=FALSE}
library(scran)
clusters <- clusterCells(spe[rowData(spe)$use_channel,],
assay.type = "exprs",
BLUSPARAM = SNNGraphParam(k=20,
cluster.fun = "louvain",
type = "rank",
BPPARAM = bpparam()))
spe$nn_clusters <- clusters
dittoDimPlot(spe, var = "nn_clusters",
reduction.use = "UMAP", size = 0.2,
do.label = TRUE) +
ggtitle("SNN clusters expression on UMAP")
```
```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells],
genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", scale = "none",
heatmap.colors = viridis(100),
annot.by = c("nn_clusters", "patient_id"),
annot.colors = c(dittoColors(1)[1:length(unique(spe$nn_clusters))],
metadata(spe)$color_vectors$patient_id))
```
```{r, echo=FALSE}
end_time <- Sys.time()
```
The shared nearest neighbor graph clustering approach took
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
This function was used by [@Tietscher2022] to cluster cells obtained by IMC. Setting
`type = "jaccard"` performs clustering similar to `Rphenograph` above and [Seurat](https://satijalab.org/seurat/articles/pbmc3k_tutorial.html#cluster-the-cells-1).
Similar to the results obtained by `Rphenograph`, some of the clusters are
patient-specific. We can now perform clustering of the integrated cells
by directly specifying which low-dimensional embedding to use:
```{r snn-2, message=FALSE}
clusters <- clusterCells(spe[rowData(spe)$use_channel,],
use.dimred = "fastMNN",
BLUSPARAM = NNGraphParam(k=20,
cluster.fun = "louvain",
type = "rank",
BPPARAM = bpparam()))
spe$nn_clusters_corrected <- clusters
dittoDimPlot(spe, var = "nn_clusters_corrected",
reduction.use = "UMAP_mnnCorrected", size = 0.2,
do.label = TRUE) +
ggtitle("SNN clusters expression on UMAP, integrated cells")
```
```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells],
genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", scale = "none",
heatmap.colors = viridis(100),
annot.by = c("nn_clusters_corrected", "nn_clusters","patient_id"),
annot.colors = c(dittoColors(1)[1:length(unique(spe$nn_clusters_corrected))],
dittoColors(1)[1:length(unique(spe$nn_clusters))],
metadata(spe)$color_vectors$patient_id))
```
### Self organizing maps
An alternative to graph-based clustering is offered by the
[CATALYST](https://bioconductor.org/packages/release/bioc/html/CATALYST.html)
package. The `cluster` function internally uses the
[FlowSOM](https://bioconductor.org/packages/release/bioc/html/FlowSOM.html)
package to group cells into 100 (default) clusters based on self organizing maps
(SOM). In the next step, the
[ConsensusClusterPlus](https://bioconductor.org/packages/release/bioc/html/ConsensusClusterPlus.html)
package is used to perform hierarchical consensus clustering of the previously
detected 100 SOM nodes into 1 to `maxK` clusters. Cluster stability for each `k`
can be assessed by plotting the `delta_area(spe)`. The optimal number
of clusters can be found by selecting the `k` at which a plateau is reached.
In the example below, the optimal `k` lies somewhere around 13.
```{r, echo=FALSE}
start_time <- Sys.time()
```
```{r flowSOM-1, message=FALSE}
library(CATALYST)
# Run FlowSOM and ConsensusClusterPlus clustering
spe <- cluster(spe,
features = rownames(spe)[rowData(spe)$use_channel],
maxK = 30,
seed = 220410)
# Assess cluster stability
delta_area(spe)
spe$som_clusters <- cluster_ids(spe, "meta13")
dittoDimPlot(spe, var = "som_clusters",
reduction.use = "UMAP", size = 0.2,
do.label = TRUE) +
ggtitle("SOM clusters expression on UMAP")
```
```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells],
genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", scale = "none",
heatmap.colors = viridis(100),
annot.by = c("som_clusters", "patient_id"),
annot.colors = c(dittoColors(1)[1:length(unique(spe$som_clusters))],
metadata(spe)$color_vectors$patient_id))
```
```{r, echo=FALSE}
end_time <- Sys.time()
```
Running FlowSOM clustering took `r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
The `CATALYST` package does not provide functionality to perform `FlowSOM` and
`ConsensusClusterPlus` clustering directly on the batch-corrected, integrated cells. As an
alternative to the `CATALYST` package, the `bluster` package provides SOM
clustering when specifying the `SomParam()` parameter. Similar to the `CATALYST`
approach, we will first cluster the dataset into 100 clusters (also called
"codes"). These codes are then further clustered into a maximum of 30 clusters
using `ConsensusClusterPlus` (using hierarchical clustering and euclidean
distance). The delta area plot can be accessed using the (not exported)
`.plot_delta_area` function from `CATALYST`. Here, it seems that the plateau is
reached at a `k` of 16 and we will store the final cluster IDs within the
`SpatialExperiment` object.
```{r flowSOM-2, message=FALSE, results='hide', fig.show='hide'}
library(kohonen)
library(ConsensusClusterPlus)
# Select integrated cells
mat <- reducedDim(spe, "fastMNN")
# Perform SOM clustering
som.out <- clusterRows(mat, SomParam(100), full = TRUE)
# Cluster the 100 SOM codes into larger clusters
ccp <- ConsensusClusterPlus(t(som.out$objects$som$codes[[1]]),
maxK = 30,
reps = 100,
distance = "euclidean",
seed = 220410,
plot = NULL)
```
```{r}
# Visualize delta area plot
CATALYST:::.plot_delta_area(ccp)
# Link ConsensusClusterPlus clusters with SOM codes and save in object
som.cluster <- ccp[[16]][["consensusClass"]][som.out$clusters]
spe$som_clusters_corrected <- as.factor(som.cluster)
dittoDimPlot(spe, var = "som_clusters_corrected",
reduction.use = "UMAP_mnnCorrected", size = 0.2,
do.label = TRUE) +
ggtitle("Phenograph clusters expression on UMAP, integrated cells")
```
```{r, fig.height=8}
dittoHeatmap(spe[,cur_cells],
genes = rownames(spe)[rowData(spe)$use_channel],
assay = "exprs", scale = "none",
heatmap.colors = viridis(100),
annot.by = c("som_clusters_corrected", "som_clusters","patient_id"),
annot.colors = c(dittoColors(1)[1:length(unique(spe$som_clusters_corrected))],
dittoColors(1)[1:length(unique(spe$som_clusters))],
metadata(spe)$color_vectors$patient_id))
```
The `FlowSOM` clustering approach has been used by [@Hoch2022] to sub-cluster tumor
cells as measured by IMC.
### Compare between clustering approaches
Finally, we can compare the results of different clustering approaches. For
this, we visualize the number of cells that are shared between different
clustering results in a pairwise fashion. In the following heatmaps a high match
between clustering results can be seen for those clusters that are uniquely
detected in both approaches.
First, we will visualize the match between the three different approaches
applied to the asinh-transformed counts.
```{r raw-counts, message=FALSE}
library(patchwork)
library(pheatmap)
library(gridExtra)
tab1 <- table(paste("Rphenograph", spe$pg_clusters),
paste("SNN", spe$nn_clusters))
tab2 <- table(paste("Rphenograph", spe$pg_clusters),
paste("SOM", spe$som_clusters))
tab3 <- table(paste("SNN", spe$nn_clusters),
paste("SOM", spe$som_clusters))
pheatmap(log10(tab1 + 10), color = viridis(100))
pheatmap(log10(tab2 + 10), color = viridis(100))
pheatmap(log10(tab3 + 10), color = viridis(100))
```
We observe that `Rphenograph` and the shared nearest neighbor (SNN) approach by
`scran` show similar results (first heatmap above). For example, Rphenograph
cluster 17 (a tumor cluster) is perfectly captured by SNN cluster 12. On the
other hand, the Neutrophil cluster (SNN cluster 8) is split into Rphenograph
cluster 8 and Rphenograph cluster 10. A common approach
is to now merge clusters that contain similar cell types and annotate them by
hand.
Below, a comparison between the clustering results of the integrated cells
is shown.
```{r corrected, message=FALSE}
tab1 <- table(paste("Rphenograph", spe$pg_clusters_corrected),
paste("SNN", spe$nn_clusters_corrected))
tab2 <- table(paste("Rphenograph", spe$pg_clusters_corrected),
paste("SOM", spe$som_clusters_corrected))
tab3 <- table(paste("SNN", spe$nn_clusters_corrected),
paste("SOM", spe$som_clusters_corrected))
pheatmap(log10(tab1 + 10), color = viridis(100))
pheatmap(log10(tab2 + 10), color = viridis(100))
pheatmap(log10(tab3 + 10), color = viridis(100))
```
In comparison to clustering on the non-integrated cells, the clustering results
of the integrated cells show higher overlap. The SNN approach resulted in fewer
clusters and therefore matches better with the SOM clustering approach.
### Further clustering notes {#clustering-notes}
The `bluster` package provides a number of metrics to assess cluster stability
[here](https://www.bioconductor.org/packages/release/bioc/vignettes/bluster/inst/doc/diagnostics.html).
For brevity we only highlighted the use of the silhouette width but different
metrics should be tested to assess cluster stability.
To assign cell types to clusters, we manually annotate clusters based on their
marker expression. For example, SNN cluster 11 (clustering of the integrated
cells) shows high, homogeneous expression of CD20 and we might therefore label
this cluster as B cells. The next chapter \@ref(single-cell-visualization) will
highlight single-cell visualization methods that can be helpful for manual
cluster annotations.
Full cell labeling can be seen in the following section.
## Cell type classification {#classification}
In this section, we will highlight a cell type classification approach based
on ground truth labeling and random forest classification. The rational for
this supervised cell phenotyping approach is to use the information contained
in the pre-defined markers to detect cells of interest. This approach was
used by Hoch _et al._ to classify cell types in a metastatic melanoma IMC
dataset [@Hoch2022].
The antibody panel used in the example data set mainly focuses on immune cell
types and little on tumor cell phenotypes. Therefore we will label the following
cell types:
* Tumor (E-cadherin positive)
* Stroma (SMA, PDGFRb positive)
* Plasma cells (CD38 positive)
* Neutrophil (MPO, CD15 positive)
* Myeloid cells (HLADR positive)
* B cells (CD20 positive)
* B next to T cells (CD20, CD3 positive)
* Regulatory T cells (FOXP3 positive)
* CD8+ T cells (CD3, CD8 positive)
* CD4+ T cells (CD3, CD4 positive)
The "B next to T cell" phenotype (`BnTcell`) is commonly observed in immune
infiltrated regions measured by IMC. We include this phenotype to account for B
cell/T cell interactions where precise classification into B cells or T cells is
not possible. The exact gating scheme can be seen at
[img/Gating_scheme.pdf](img/Gating_scheme.pdf).
As related approaches, [Astir](https://github.com/camlab-bioml/astir) and
[Garnett](https://cole-trapnell-lab.github.io/garnett/) use pre-defined panel
information to classify cell phenotypes based on their marker expression.
### Manual labeling of cells
The [cytomapper](https://www.bioconductor.org/packages/release/bioc/html/cytomapper.html)
package provides the `cytomapperShiny` function that allows gating of cells
based on their marker expression and visualization of selected cells directly
on the images.
```{r cytomapperShiny, message=FALSE}
library(cytomapper)
if (interactive()) {
images <- readRDS("data/images.rds")
masks <- readRDS("data/masks.rds")
cytomapperShiny(object = spe, mask = masks, image = images,
cell_id = "ObjectNumber", img_id = "sample_id")
}
```
The labeled cells for this data set can be accessed at
[zenodo.org/record/6554611](https://zenodo.org/record/6554611) and were
downloaded in Section \@ref(prerequisites). Per image, the `cytomapperShiny`
function allows the export of gated cells in form of a `SingleCellExperiment` or
`SpatialExperiment` object. The cell label is stored in
`colData(object)$cytomapper_CellLabel` and the gates are stored in
`metadata(object)`. In the next section, we will read in and consolidate the
labeled data.
### Define color vectors
For consistent visualization of cell types, we will now pre-define their colors:
```{r cell-type-colors}
celltype <- setNames(c("#3F1B03", "#F4AD31", "#894F36", "#1C750C", "#EF8ECC",
"#6471E2", "#4DB23B", "grey", "#F4800C", "#BF0A3D", "#066970"),
c("Tumor", "Stroma", "Myeloid", "CD8", "Plasma_cell",
"Treg", "CD4", "undefined", "BnTcell", "Bcell", "Neutrophil"))
metadata(spe)$color_vectors$celltype <- celltype
```
### Read in and consolidate labeled data
Here, we will read in the individual `SpatialExperiment` objects containing the
labeled cells and concatenate them. In the process of concatenating the
`SpatialExperiment` objects along their columns, the `sample_id` entry is
appended by `.1, .2, .3, ...` due to replicated entries.
```{r, read-in-labeled-data, message=FALSE}
library(SingleCellExperiment)
label_files <- list.files("data/gated_cells",
full.names = TRUE, pattern = ".rds$")
# Read in SPE objects
spes <- lapply(label_files, readRDS)
# Merge SPE objects
concat_spe <- do.call("cbind", spes)
```
In the following code chunk we will identify cells that were labeled multiple
times. This occurs when different cell phenotypes are gated per image and can
affect immune cells that are located inside the tumor compartment.
We will first identify those cells that were uniquely labeled. In the next step,
we will identify those cells that were labeled twice AND were labeled as Tumor
cells. These cells will be assigned their immune cell label. Finally, we will
save the unique labels within the original `SpatialExperiment` object.
```{r, consolidate-labels}
cur_tab <- unclass(table(colnames(concat_spe),
concat_spe$cytomapper_CellLabel))
cur_labels <- rep("doublets", nrow(cur_tab))
names(cur_labels) <- rownames(cur_tab)
# Single assignments
single_index <- rowSums(cur_tab) == 1
cur_labels[single_index] <- colnames(cur_tab)[apply(cur_tab[single_index,], 1,
which.max)]
# Double assignment within the tumor
double_index <- rowSums(cur_tab) == 2 & cur_tab[,"Tumor"] == 1
no_tumor <- cur_tab[,colnames(cur_tab) != "Tumor"]
cur_labels[double_index] <- colnames(no_tumor)[apply(no_tumor[double_index,], 1,
which.max)]
# Remove doublets
cur_labels <- cur_labels[cur_labels != "doublets"]
table(cur_labels)
# Transfer labels to SPE object
spe_labels <- rep("unlabeled", ncol(spe))
names(spe_labels) <- colnames(spe)
spe_labels[names(cur_labels)] <- cur_labels
spe$cell_labels <- spe_labels
# Number of cells labeled per patient
table(spe$cell_labels, spe$patient_id)
```
Based on these labels, we can now train a random forest classifier to classify
all remaining, unlabeled cells.
### Train classifier
In this section, we will use the
[caret](https://topepo.github.io/caret/index.html) framework for machine
learning in R. This package provides an interface to train a number of
regression and classification models in a coherent fashion. We use a random
forest classifier due to low number of parameters, high speed and an observed
high performance for cell type classification [@Hoch2022].
In the following section, we will first split the `SpatialExperiment` object
into labeled and unlabeled cells. Based on the labeled cells, we split
the data into a train (75% of the data) and test (25% of the data) dataset.
We currently do not provide an independently labeled validation dataset.
The `caret` package provides the `trainControl` function, which specifies model
training parameters and the `train` function, which performs the actual model
training. While training the model, we also want to estimate the best model
parameters. In the case of the chosen random forest model (`method = "rf"`), we
only need to estimate a single parameters (`mtry`) which corresponds to the
number of variables randomly sampled as candidates at each split. To estimate
the best parameter, we will perform a 5-fold cross validation (set within
`trainControl`) over a tune length of 5 entries to `mtry`.
```{r, echo=FALSE}
start_time <- Sys.time()
```
```{r train-classifier, message=FALSE}
library(caret)
# Split between labeled and unlabeled cells
lab_spe <- spe[,spe$cell_labels != "unlabeled"]
unlab_spe <- spe[,spe$cell_labels == "unlabeled"]
# Randomly split into train and test data
set.seed(220925)
trainIndex <- createDataPartition(factor(lab_spe$cell_labels), p = 0.75)
train_spe <- lab_spe[,trainIndex$Resample1]
test_spe <- lab_spe[,-trainIndex$Resample1]
# Specify train parameters for 5-fold cross validation
fitControl <- trainControl(method = "cv",
number = 5)
# Select the data for training
cur_mat <- t(assay(train_spe, "exprs")[rowData(train_spe)$use_channel,])
# Train a random forest model for predicting cell labels
# This call also performs parameter tuning
rffit <- train(x = cur_mat,
y = factor(train_spe$cell_labels),
method = "rf", ntree = 1000,
tuneLength = 5,
trControl = fitControl)
rffit
```
```{r, echo=FALSE}
end_time <- Sys.time()
```
Training the classifier took
`r round(as.numeric(difftime(end_time, start_time, units = "mins")), digits = 2)` minutes.
### Classifier performance
We next observe the accuracy of the classifer when predicting cell
phenotypes across the cross-validation as well as the test dataset.
First, we can visualize the classification accuracy during parameter
tuning:
```{r accuracy-tuning}
ggplot(rffit) +
geom_errorbar(data = rffit$results,
aes(ymin = Accuracy - AccuracySD,
ymax = Accuracy + AccuracySD),
width = 0.4) +
theme_classic(base_size = 15)
```
The best value for `mtry` is 28 and is used when predicting new data.
It is often recommended to visualize the variable importance of the
classifier. The following plot specifies which variables (markers) are
most important for classifying the data.
```{r variable-importance, fig.height = 10}
plot(varImp(rffit))
```
As expected, the markers that were used for gating (Ecad, CD3, CD20, HLADR,
CD8a, CD38, FOXP3) were important for classification.
To assess the accuracy, sensitivity, specificity, among other quality measures of
the classifier, we will now predict cell phenotypes in the test data.
```{r model-testing, message=FALSE}
# Select test data
cur_mat <- t(assay(test_spe, "exprs")[rowData(test_spe)$use_channel,])
# Predict cell phenotypes in test data
cur_pred <- predict(rffit,
newdata = cur_mat)
```
While the overall classification accuracy can appear high, we also want
to check if each cell phenotype class is correctly predicted.
For this, we will calculate the confusion matrix between predicted and actual
cell labels. This measure highlights individual cell phenotype classes that
were not correctly predicted by the classifier. When setting `mode = "everything"`,
the `confusionMatrix` function returns all available prediction measures including
sensitivity, specificity, precision, recall and the F1 score per cell
phenotype class.
```{r confusion-matrix}
cm <- confusionMatrix(data = cur_pred,
reference = factor(test_spe$cell_labels),
mode = "everything")
cm
```
To easily visualize these results, we can now plot the true positive rate (sensitivity)
versus the false positive rate (1 - specificity):
```{r specificity-sensitivity, message=FALSE}
library(tidyverse)
data.frame(cm$byClass) %>%
mutate(class = sub("Class: ", "", rownames(cm$byClass))) %>%
ggplot() +
geom_point(aes(1 - Specificity, Sensitivity,
size = Detection.Rate,
fill = class),
shape = 21) +
scale_fill_manual(values = metadata(spe)$color_vectors$celltype) +
theme_classic(base_size = 15) +
ylab("Sensitivity (TPR)") +
xlab("1 - Specificity (FPR)")
```
We observe high sensitivity and specificity for most cell types. Plasma cells
show the lowest true positive rate with 92% being sufficiently high. The size of
the circle specifies the number of cells per class.
Finally, to observe which cell phenotypes were wrongly classified, we can visualize
the distribution of classification probabilities per cell phenotype class:
```{r prediciton-probability, fig.height=15}
cur_pred <- predict(rffit,
newdata = cur_mat,
type = "prob")
cur_pred$truth <- factor(test_spe$cell_labels)
cur_pred %>%
pivot_longer(cols = Bcell:Tumor) %>%
ggplot() +
geom_boxplot(aes(x = name, y = value, fill = name), outlier.size = 0.5) +
facet_wrap(. ~ truth, ncol = 1) +
scale_fill_manual(values = metadata(spe)$color_vectors$celltype) +
theme(panel.background = element_blank(),
axis.text.x = element_text(angle = 45, hjust = 1))
```
The boxplots indicate the classification probabilities per class. The classifier
is well trained if classification probabilities are only high for the one
specific class.
### Classification of new data
In the final section, we will now use the tuned and tested random forest
classifier to predict the cell phenotypes of the unlabeled data.
First, we predict the cell phenotypes and extract their classification
probabilities.
```{r, unlabeled-prediction}
# Select unlabeled data
cur_mat <- t(assay(unlab_spe, "exprs")[rowData(unlab_spe)$use_channel,])
# Predict cell phenotypes
cell_class <- as.character(predict.train(rffit,
newdata = cur_mat,
type = "raw"))
names(cell_class) <- rownames(cur_mat)
table(cell_class)
# Extract classification probabilities
cell_prob <- predict.train(rffit,
newdata = cur_mat,
type = "prob")
```
Each cell is assigned to the class with highest probability. There are however
cases, where the highest probability is low meaning the cell can not be uniquely
assigned to a class. We next want to identify these cells and label them as
"undefined". Here, we select a maximum classification probability threshold
of 40% but this threshold needs to be adjusted for other datasets. The adjusted
cell labels are then stored in the `SpatialExperiment` object.
```{r, undefined-cells}
library(ggridges)
# Distribution of maximum probabilities
tibble(max_prob = rowMax(as.matrix(cell_prob)),
type = cell_class) %>%
ggplot() +
geom_density_ridges(aes(x = max_prob, y = cell_class, fill = cell_class)) +
scale_fill_manual(values = metadata(spe)$color_vectors$celltype) +
theme_classic(base_size = 15) +
xlab("Maximum probability") +
ylab("Cell type") +
xlim(c(0,1.2))
# Label undefined cells
cell_class[rowMax(as.matrix(cell_prob)) < 0.4] <- "undefined"
# Store labels in SpatialExperiment onject
cell_labels <- spe$cell_labels
cell_labels[colnames(unlab_spe)] <- cell_class
spe$celltype <- cell_labels
table(spe$celltype, spe$patient_id)
```
We can now compare the cell labels derived by classification to the different
clustering strategies. The first comparison is against the clustering results
using the asinh-transformed counts.
```{r compare-raw, message=FALSE}
tab1 <- table(spe$celltype,
paste("Rphenograph", spe$pg_clusters))
tab2 <- table(spe$celltype,
paste("SNN", spe$nn_clusters))
tab3 <- table(spe$celltype,
paste("SOM", spe$som_clusters))
pheatmap(log10(tab1 + 10), color = viridis(100))
pheatmap(log10(tab2 + 10), color = viridis(100))
pheatmap(log10(tab3 + 10), color = viridis(100))
```
We can see that Tumor and Myeloid cells span multiple clusters while
Neutrophiles are detected as an individual cluster by all clustering approaches.
We next compare the cell classification against clustering results using the
integrated cells.
```{r compare-corrected, message=FALSE}
tab1 <- table(spe$celltype,
paste("Rphenograph", spe$pg_clusters_corrected))
tab2 <- table(spe$celltype,
paste("SNN", spe$nn_clusters_corrected))
tab3 <- table(spe$celltype,
paste("SOM", spe$som_clusters_corrected))
pheatmap(log10(tab1 + 10), color = viridis(100))
pheatmap(log10(tab2 + 10), color = viridis(100))
pheatmap(log10(tab3 + 10), color = viridis(100))
```
We observe a high agreement between the shared nearest neighbor clustering
approach using the integrated cells and the cell phenotypes derived by
classification.
In the next sections, we will highlight visualization strategies to verify the
correctness of the phenotyping approach. Specifically, Section
\@ref(outline-cells) shows how to outline identified cell phenotypes on
composite images.
```{r display-images, echo=FALSE, message = FALSE, results='hide'}
# This section serves quality control purposes.
library(testthat)
test_that("images are written out", {
skip_on_ci()
expect_true(TRUE)
cell_types <- unique(spe$celltype)
markers <- vector(mode = "list", length = length(cell_types))
names(markers) <- cell_types
markers[["Bcell"]] <- c("CD20", "CD3")
markers[["Plasma_cell"]] <- c("CD38")
markers[["CD4"]] <- c("CD4", "CD8a", "FOXP3")
markers[["BnTcell"]] <- c("CD20","CD3")
markers[["CD8"]] <- c("CD4", "CD8a", "FOXP3")
markers[["Treg"]] <- c("CD4", "CD8a", "FOXP3")
markers[["Myeloid"]] <- c("CD11c", "CD68", "HLADR", "CD163")
markers[["Neutrophil"]] <- c("CD15", "MPO")
markers[["Stroma"]] <- c("SMA", "PDGFRb")
markers[["Tumor"]] <- c("Ecad")
markers[["undefined"]] <- c("Ecad", "CD3", "CD20", "MPO", "CD38")
images <- readRDS("data/images.rds")
masks <- readRDS("data/masks.rds")
images <- cytomapper::normalize(images, separateImages = TRUE)
images <- cytomapper::normalize(images, separateImages = TRUE,
inputRange = c(0, 0.2))
if (!dir.exists("data/CellTypeValidation")) {
dir.create("data/CellTypeValidation")
}
for(j in unique(spe$celltype)){
cur_sce <- spe[,spe$celltype == j]
cur_markers <- markers[[j]]
if(dim(cur_sce)[2] == 0) {
next(j)
}
if (length(cur_markers) == 1) {
cur_col <- "red"
names(cur_col) <- j
cur_col <- list(cur_col)
names(cur_col) <- "celltype"
plotPixels(image = images,
object = cur_sce,
mask = masks,
cell_id = "ObjectNumber",
img_id = "sample_id",
colour_by = cur_markers,
outline_by = "celltype",
image_title = list(text = names(images),
cex = 1),
colour = cur_col,
save_plot = list(filename = paste0("data/CellTypeValidation/",
j, ".png")))
} else {
plotPixels(image = images,
object = cur_sce,
mask = masks,
cell_id = "ObjectNumber",
img_id = "sample_id",
colour_by = cur_markers,
outline_by = "celltype",
image_title = list(text = names(images),
cex = 1),
save_plot = list(filename = paste0("data/CellTypeValidation/",
j, ".png")))
}
}
expect_true(TRUE)
})
```
Finally, we save the updated `SpatialExperiment` object.
```{r phenotyping-save-object}
saveRDS(spe, "data/spe.rds")
```