-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathgraph_diffusion.qmd
566 lines (431 loc) · 16.9 KB
/
graph_diffusion.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
---
title: "Diffusion distances"
---
#### Load example data
Before getting into the topic, we load the usual example data and performing standard preprocessing
```{r}
suppressPackageStartupMessages({
library( tidyverse )
library( Matrix )
library( sparseMatrixStats )
library( Seurat ) })
ReadMtx( "~/Downloads/ifnagrko/ifnagrko_raw_counts.mtx.gz",
"~/Downloads/ifnagrko/ifnagrko_obs.csv",
"~/Downloads/ifnagrko/ifnagrko_var.csv",
cell.sep=",", feature.sep=",", skip.cell=1, skip.feature=1,
mtx.transpose=TRUE) -> count_matrix
```
```{r}
count_matrix %>%
CreateSeuratObject() %>%
NormalizeData() %>%
FindVariableFeatures() %>%
ScaleData() %>%
RunPCA( npcs=20 ) %>%
FindNeighbors( dims=1:20 ) %>%
FindClusters( resolution=0.5 ) %>%
RunUMAP( dims=1:20 ) -> seu
```
```{r}
UMAPPlot( seu, label=TRUE ) + coord_equal()
```
#### Nearest neighbors
In this lesson, we will mainly work with the nearest neighbor data. Seurat has already calculated this but we do this again here:
```{r}
FNN::get.knn( Embeddings( seu, "pca" ), k=15 ) -> nn
head( nn$nn.index)
```
#### Adjacency matrix
We next construct the adjacency matrix of the undirected nearest neighbor graph. To connect each vertex to its $i$th nearest neighbor, we need the following edges:
```{r}
ncells <- ncol(seu)
i <- 3
cbind( vertex_A=1:ncells, vertex_B=nn$nn.index[,i] ) %>% head()
```
We can use this as indices of the matrix cells we want to set to one and thus construct a sparse matrix
```{r}
sparseMatrix( i=1:ncells, j=nn$nn.index[,i], x=1, dims=c(ncells,ncells) ) %>% summary() %>% head()
```
Adding up on such matrix for i running from 1 to 15 gives us the adjacency matrix
```{r}
adjm <- sparseMatrix( i=integer(), j=integer(), x=numeric(), dims=c(ncells,ncells) ) # zero matrix
for( i in 1:ncol(nn$nn.index) ) {
adjm <- adjm + sparseMatrix( i=1:ncells, j=nn$nn.index[,i], x=1, dims=c(ncells,ncells) ) }
summary(adjm) %>% head()
```
We make the matrix symmetric
```{r}
adjm <- adjm + t(adjm)
```
Now, some matrix entries have become 2 rather than 1. We set everything back to 1:
```{r}
adjm@x[] <- 1
```
Now, we have an adjacency matrix for our nearest neighbor graph.
#### Random walk
We now define a random walk on our graph as follows: A "walker" (or: "token")
starts at a vertex $i$. In each time step, it choses one of the vertex's
neighbors at random and moves there. What is the probability of the walker
being on vertex $j$ after $\ell$ steps?
We represent the walker being at vertex $i$ with the unit vector in direction $i$,
i.e., the vector $\vec e_i$, with a 1 at component $i$ and zero elsewhere.
The transition matrix $T$ with elements $T_{ij}$ tells us the probability of the walker
moving to vertex $j$ in a step if it was before at vertex $i$:
$$ T_{ij} = A_{ij} \Big/ \sum_{j'}{A_{ij'}}. $$
The division normalizes the probabiliities by dividing by the number of neighbors
that the walker can chose from.
```{r}
trm <- adjm / rowSums(adjm)
```
Check normalization:
```{r}
rowSums(trm) %>% head()
```
To try this out, we pick a cell close to the point (-6,6) in the UMAP:
```{r}
cell <- which.min( ( Embeddings(seu,"umap")[,1] + 6 )^2 + ( Embeddings(seu,"umap")[,2] - 6 )^2 )
cell
```
Here's a UMAP plot of this cell and it's neighbors:
```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( w = case_when(
row_number() == cell ~ "cell",
adjm[cell,] == 1 ~ "neighbor",
TRUE ~ "other"
)) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=w ), size=.3 ) + coord_equal() +
scale_color_manual( values=c("darkgreen","magenta","#00000006"))
```
Let's perform 10 steps. We start with a sparse vector with a single 1 at the chosen cell's
index and 0 elsewhere, then multiply this 10 times with $T$:
```{r}
u <- sparseVector( i=cell, x=1, length=ncells )
for( i in 1:10 )
u <- u %*% trm
```
We first check whether `u` is still normalized:
```{r}
sum(u)
```
Here's a plot of $\vec{u}=\vec{e}_i^\top T^{10}$.
```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( u = as.vector(u) ) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=u ), size=.3 ) + coord_equal() +
scale_color_viridis_c(direction=-1)
```
#### Exponantiating the transition matrix
Calculating $\ell$ steps by repeated multiplication is wasteful. We should use
a matrix exponential.
As preparation for this, we define the diagonal "degree matrix" $D$, that contains the vertex degrees:
$$ D_{ij} = \delta_{ij} \sum_{j'} A_{ij'}. $$
Now, we have:
$$ T = D^{-1}A.$$
$T$ is a row-stochastic matrix, i.e., its values are all non-negative and its
rows sum to 1.
The probability mass vector for a walker starting at vertex $i$ after one step is
$\vec{e}_i^\top T$, and after $\ell$ steps, $\vec{e}_i^\top T^\ell$.
To calculate $T^\ell$, we will need the eigendecomposition of the symmetrized transition matrix
$\tilde T = D^{-1/2} T D^{-1/2}$:
$$ \tilde T = U\Lambda U^\top,$$
with the columns of $U$ containing the eigenvectors of $\tilde T$ and the diagonal matrix $\Lambda$
containing the eigenvalues.
With this, we get
$$ \begin{align}
T^\ell &= \left(D^{-1} A\right)^\ell \\
&= D^{-1/2} \left(D^{-1/2} A D^{-1/2}\right)^\ell D^{1/2} \\
&= D^{-1/2} \left(U \Lambda U^T\right)^\ell D^{1/2} \\
&= \underbrace{D^{-1/2} U \Lambda^\ell}_{=X_\ell} U^T D^{1/2}.
\end{align} $$
We construct $D$
```{r}
degdiag <- sparseMatrix( i=1:ncells, j=1:ncells, x=rowSums(adjm) )
```
We also write down $D^{-1}$, $D^{1/2}$ and $D^{-1/2}$:
```{r}
invdegdiag <- sparseMatrix( i=1:ncells, j=1:ncells, x=1/rowSums(adjm) )
sqrtdegdiag <- sparseMatrix( i=1:ncells, j=1:ncells, x=sqrt(rowSums(adjm)) )
invsqrtdegdiag <- sparseMatrix( i=1:ncells, j=1:ncells, x=1/sqrt(rowSums(adjm)) )
```
Now we get the eigensystem of $\tilde T$, requesting the 100 eigenvalues
that are largest by magnitude:
```{r}
eigtrm <- RSpectra::eigs_sym( invsqrtdegdiag %*% adjm %*% invsqrtdegdiag, k=100 )
```
Now, we can calculate $\vec e_i^\top T^\ell$ as follows:
We calculate first $X_\ell = D^{-1/2} U \Lambda^\ell$:
```{r}
invsqrtdegdiag %*% eigtrm$vectors %*% diag( eigtrm$values^10 ) %>% as.matrix() -> x10
dim(x10)
```
Note that the matrix $X_\ell$ has been trimmed to only the first 50 columns. That is ok
because the factor $\Lambda^\ell$ gets small quickly and therefore, the columns stay close
to zero once we get past the first few on the left:
```{r}
plot( colSums(x10^2) )
```
Of course, this only works for $\ell \gg 1$ and with $\ell=10$ we may have only just enough steps for the
approximation becoming valid.
We take the row of $X_\ell$ that corresponds to our cell and multiply this with $U^\top D^{1/2}$ to get
$e_i^\top X_\ell U^\top D^{1/2}=e_i^\top T^\ell$:
```{r}
x10[cell,] %*% t(eigtrm$vectors) %*% sqrtdegdiag %>% as.vector() -> u10
```
Here is a plot of that vector:
```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( u = u10 ) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=u ), size=.3 ) + coord_equal() +
scale_color_viridis_c(direction=-1)
```
This looks quite similar as the plot we made before.
Here's a comparison of the two results
```{r}
plot( u, u10, asp=1, cex=.2, xlab="exact calculation", ylab="using top 100 eigenvectors" )
abline( 0, 1, col="#00000020" )
abline( h=0, v=0, col="#00000020" )
```
#### Diffusion distances
We now define a new distance metrix for our cells. Intuitively: To quantify the
distance between two cells $i$ and $j$, we start random walks at both cells,
evolve them for $\ell$ steps, and then ask about the overlap between the resulting
probability vectors.
We might therefore use
$$ \left\| e_i^\top T^\ell - e_j
^\top T^\ell \right\|_2, $$
i.e., the Euclidean distance between rows $i$ and $j$ of $T^\ell$ as the distance between cells $i$ and $j$.
Note that the number of steps, $\ell$, selects a "length scale" at which the
that distance is informative.
There is a practical difficulty in using this definition, though: When multiplying
the small $n\times k$ matrix $X_\ell$ with the transpose of the $n\times k$ eigenvector matrix $U$
(with $n$ being the number of cells and $k$ being the number of eigenvectors that have been calculated)
our data blows up to an unwieldy $n\times n$ matrix.
We can avoid this by using the rows of $X_\ell$ instead of the rows of $T^\ell$,
and therefore define:
The *$\ell$-steps diffusion distance* between cells $i$ and $j$ is
$$ d_{\ell,ij} = \left\| e_i^\top X_\ell - e_j^\top X_\ell \right\|_2, $$
Note that this is also
$$ d_{\ell,ij} = \left\| \left( e_i^\top T^\ell - e_i^\top T^\ell\right) D^{-1/2} \right\|_2, $$
i.e., the components of the resulting probability vectors get reweighted by $D^{-1/2}$ before calculating the
norm of the difference. If we accept this (somewhat unmotivated) reweighting, because it does not change much (as the vertex
degrees do not differ that much from each other), we have a computationally efficient way of calculating $d_{\ell,ij}$:
All we need is $X_\ell$.
By calculating the Euclidean distance of every row of $X_\ell$ to the row for our selected cell,
we get the $\ell$-step diffusion distance of that cell to all other cells:
```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( d = sqrt( rowSums( t( t(x10) - x10[cell,] )^2 ) ) ) %>%
arrange(-d) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=d ), size=.3 ) + coord_equal() +
scale_color_viridis_c(direction=-1)
```
Below, this plot is repeated for several values of $\ell$. (But remember that our
approximation error is larger for $\ell \lesssim 10$.)
```{r}
for( l in c(3, 10, 30, 100, 300, 1000 ) ) {
invsqrtdegdiag %*% eigtrm$vectors %*% diag( eigtrm$values^l ) %>% as.matrix() -> xm
print(
Embeddings(seu,"umap") %>%
as_tibble() %>%
mutate( d = sqrt( rowSums( t( t(xm) - xm[cell,] )^2 ) ) ) %>%
arrange(-d) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=d ), size=.3 ) + coord_equal() +
scale_color_viridis_c(direction=-1) + ggtitle( sprintf( "%d steps", l ) ) )
}
```
#### Pseudotime
We now want to define a pseudotime along the lineage, from cluster 0 to cluster 2 and on to 7.
We define a start cell and an end cell, by taking our cell (which is arguably somewhere in the
middle) and find the cell with the largest distance to it within cluster 0 and 7, respectivelty.
We will work with $\ell=300$:
```{r}
l <- 300
xm <- as.matrix( invsqrtdegdiag %*% eigtrm$vectors %*% diag( eigtrm$values^l ) )
# distances to "intermediate" cell:
d <- sqrt( rowSums( t( t(xm) - xm[cell,] )^2 ) )
```
Here is the distance of the cells in cluster 0 (the astrocyte / neuronal stem cell cluster, where the lineage starts) to our intermediate cell:
```{r}
hist( d[seu$seurat_clusters==0], 100 )
abline(v=0.0013, col="orange")
```
It seems reasonable to assume that the cells at the steep cliff are the actual start of the
lineage and the cells further out or some outliers (perhaps cells that started into
another direction)
```{r}
start_cell <- which.min( ( d * (seu$seurat_clusters==0) - 0.0013 )^2 )
```
Let's also find a reasonable "end cell" in the neuron cluster 7:
```{r}
hist( d[seu$seurat_clusters==7], 100 )
```
Here' let's simply take the last cell:
```{r}
end_cell <- which.max( d * (seu$seurat_clusters==7) )
```
Now, get distances to these two cells:
```{r}
dist_to_start <- sqrt( rowSums( t( t(xm) - xm[start_cell,] )^2 ) )
dist_to_end <- sqrt( rowSums( t( t(xm) - xm[end_cell,] )^2 ) )
```
```{r}
tibble(
dist_to_start, dist_to_end,
cluster=seu$seurat_clusters ) %>%
mutate(
type = case_when(
cluster %in% c( 0, 3, 5, 1, 2, 7 ) ~ "lineage_straight",
cluster %in% c( 6, 11 ) ~ "lineage_cycle",
TRUE ~ "other" ) ) %>%
ggplot +
geom_point( aes( x=dist_to_end, y=dist_to_start, col=type ), size=.1 ) +
coord_equal()
```
Let's rotate this plot by 45°:
```{r}
tibble(
dist_to_start, dist_to_end,
cluster=seu$seurat_clusters ) %>%
mutate(
type = case_when(
cluster %in% c( 0, 3, 5, 1, 2, 7 ) ~ "lineage_straight",
cluster %in% c( 6, 11 ) ~ "lineage_cycle",
TRUE ~ "other" ) ) %>%
ggplot +
geom_point( aes(
x = dist_to_start - dist_to_end,
y = dist_to_start + dist_to_end, col=type ), size=.1 )
```
Our new x-axis should be suitable as pseudotime:
```{r}
pseudotime <- dist_to_start - dist_to_end
```
Let's plot this into the UMAP:
```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
add_column( pseudotime ) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=pseudotime ), size=.3 ) + coord_equal() +
scale_color_gradientn( colours=rje::cubeHelix(100,r=4) )
```
Before we dive into this, let's also plot the `y axis of `dist+to+start+dfist+to_end`:
```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
add_column( distsum = dist_to_start + dist_to_end ) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=distsum ), size=.3 ) + coord_equal() +
scale_color_gradient2( midpoint = .01, limits = c(0,.02), oob=scales::oob_squish )
```
Comparing with the plot above, our pseudotime is valid for the red, white and
perhaps the very faintly blue regions.
### Comaprison to principal curve
Let's recreate the principal curve that we used before for pseudotime.
As before, we calculate the principal curve using the cells in the lineage as
input, without the cells from the two cycling cluster, and use smoothing splines
with 10 degrees of freedom.
```{r}
princurve::principal_curve(
Embeddings(seu,"pca")[ seu$seurat_clusters %in% c( 0, 3, 5, 1, 2, 7 ), ],
df = 10, approx_points=1000 ) -> prc
```
We again project the remaining cells onto outo the cells on the curve and assign to them
the same pseudotime as the one of the nearest point on the curve:
```{r}
FNN::get.knnx( prc$s, Embeddings(seu,"pca"), k=1 ) -> prc_nn
pseudotime_prc <- prc$lambda[ prc_nn$nn.index ]
```
Here is the principal-curve-based pseudotime:
```{r}
Embeddings(seu,"umap") %>%
as_tibble() %>%
add_column( pseudotime_prc ) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=pseudotime_prc ), size=.3 ) + coord_equal() +
scale_color_gradientn( colours=rje::cubeHelix(100,r=4) )
```
We can also compare the two:
```{r}
tibble(
pt_diffusion = pseudotime,
pt_prc = -pseudotime_prc,
cluster = seu$seurat_clusters
) %>%
mutate(
type = case_when(
cluster %in% c( 0, 3, 5, 1, 2, 7 ) ~ "lineage_straight",
cluster %in% c( 6, 11 ) ~ "lineage_cycle",
TRUE ~ "other" ) ) %>%
ggplot +
geom_point( aes( x=pt_diffusion, y=pt_prc, col=type ), size=.2 )
```
Here, teh same, but coloured for cluster (using only the lineage cells). See below for a UMAP
coloured in the same manner.
```{r}
tibble(
pt_diffusion = pseudotime,
pt_prc = -pseudotime_prc,
cluster = seu$seurat_clusters
) %>%
filter(
cluster %in% c( 0, 3, 5, 1, 2, 7, 6, 11 ) ) %>%
ggplot +
geom_point( aes( x=pt_diffusion, y=pt_prc, col=cluster ), size=.2 )
```
### Diffusion space
The space of $X_\ell$ can also be interpreted as an alternative to the feature space,
called "diffusion space". Using the first few components provides a dimension redution,
the "diffusion map".
For this to work well, our neighborhood graph should be connected.
Ours has two connection components, however, as is evident from the fact that our
transition matrix has two unit eigenvalues:
```{r}
head( eigtrm$values )
```
To make our live easier, let's reduce the data to only the lineage cells
```{r}
in_lineage <- seu$seurat_clusters %in% c( 0, 3, 5, 1, 2, 7, 6, 11 )
```
Subset the adjacency matrix to these and recalculate the transition matrix and
its spectrum
```{r}
adjml <- adjm[ in_lineage, ][ , in_lineage ]
ncells <- nrow(adjml)
invsqrtdegdiag <- sparseMatrix( i=1:ncells, j=1:ncells, x=1/sqrt(rowSums(adjml)) )
eigtrm <- RSpectra::eigs_sym( invsqrtdegdiag %*% adjml %*% invsqrtdegdiag, k=10 )
x300 <- as.matrix( invsqrtdegdiag %*% eigtrm$vectors %*% diag( eigtrm$values^300 ) )
```
```{r}
as_tibble( x300[,2:3 ] ) %>%
add_column( cluster = seu$seurat_clusters[in_lineage] ) %>%
ggplot +
geom_point( aes( x=V1, y=V2, col=cluster ), size=.1 ) + coord_equal()
```
For comparison, the UMAP with the same c
```{r}
as_tibble( Embeddings(seu,"umap")[in_lineage,] ) %>%
add_column( cluster = seu$seurat_clusters[in_lineage] ) %>%
ggplot +
geom_point( aes( x=umap_1, y=umap_2, col=cluster ), size=.1 ) + coord_equal()
```
### References
The idea of diffusion distances and diffusion maps has been introduced in this paper:
- Coifman and Lafon (2006): *Diffusion Maps*. Applied and Computational Harmonic Analysis, Vol. 21, Pages 5-30. [doi:10.1016/j.acha.2006.04.006](https://doi.org/10.1016/j.acha.2006.04.006)
Applying these ideas to single-cell data is explored in
- Haghverdi, Büttner, Theis (2015): *Diffusion maps for high-dimensional single-cell analysis of differentiation data*
Bioinformatics, Vol. 31, Pages 2989–2998, [doi:10.1093/bioinformatics/btv325](https://doi.org/10.1093/bioinformatics/btv325)
- Haghverdi, Büttner, Wolf, Buettner, Theis (2016): *Diffusion pseudotime robustly reconstructs lineage branching*
Nature Methods, Vol. 13, Pages 845–848
- Angerer, Haghverdi, Büttner, Theis, Marr, Buettner: *destiny: diffusion maps for large-scale single-cell data in R*,
Bioinformatic, Vol. 32, Pages 1241-1243, [doi:10.1093/bioinformatics/btv715](https://doi.org/10.1093/bioinformatics/btv715)