-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathpedestrian.qmd
159 lines (101 loc) · 4.71 KB
/
pedestrian.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
---
title: "Doing the Seurat analysis \"on foot\""
---
We load our example data
```{r}
suppressPackageStartupMessages( {
library( Seurat )
library( Matrix ) } )
```
```{r}
count_matrix <- Read10X( "data/pbmc3k/filtered_gene_bc_matrices/hg19/" )
```
Qu: Examine the sparse storage format of the matrix.
### Normalization
The total rad counts per cell (the column sums of the count matrix) differ noticeably.
In order to compare between cells, we should not look at the absolute counts but better at the fraction that each gene's count contributes to the total. Hence, we divide the matrix by the column sum:
```{r}
fractions <- t( t(count_matrix) / colSums(count_matrix) )
```
### Variance-mean dependence
We calculate for each gene mean and variance across all cells and plot them against each other:
```{r}
library( sparseMatrixStats )
plot( rowMeans(fractions), rowVars(fractions), cex=.3 )
```
Qu: The function `rowVars` is here taken from the "sparseMatrixStats". How would you calculate the row variances of a matrix given in one of the sparse storage formats?
It is helpful to make the plot logarithmic
```{r}
plot( rowMeans(fractions), rowVars(fractions), cex=.3, log="xy", col=scales::alpha( "black", .3 ) )
segments( 1e-6, 1e-10, 1e-2, 1e-6 ) # line with slope 1 on log-log plot --> v ∝ µ
segments( 1e-7, 1e-10, 1e-4, 1e-4 ) # line with slope 2 on log-log plot --> v ∝ µ²
```
We have a look at the squared coefficient of variation (CV²):
```{r}
plot( rowMeans(fractions), rowVars(fractions) / rowMeans(fractions),
cex=.3, log="xy", col=scales::alpha( "black", .3 ) )
abline( h = mean( 1/colSums(count_matrix)), col="orange")
```
For later use, we keep a list of the 1000 genes with the highest CV²:
```{r}
hvg <- names( head( sort( rowVars(fractions) / rowMeans(fractions), decreasing=TRUE ), 1000 ) )
```
We call these the "highly variable genes" (HVG).
### Log-transformation
For the more strongly expressed (and hence more informative) genes, SD seems to be proportional to mean, i.e., we have multiplicative noise. TO make this homoskedastic, we should logarithmize our fractions.
However, $\log(0)=-\infty$. As a "hack" to avoid this, let us as a small value to each faction before taking the log. Seurat used $10^{-4}$.
So, if $k_{ij}$ is the read count for gene $i$ in cell $j$ and $s_j=\sum_i k_{ij}$ the total read count for cell $j$, we might use $\log_{10}\left(\frac{k_{ij}}{s_j}+10^{-4}\right)$. However, is seems convenient to use a transformation that maps 0 to 0. Therefore, Seurat (and many other work flows use):
$$ y_{ij} = \log_{10}\left(\frac{k_{ij}}{s_j}\cdot 10^4 + 1\right)$$
The ``+1'' is often referred to as the "pseudocount", because it is roughly one extra read count -- if $s_j$ is around $10^4$.
```{r}
expr <- log1p( fractions * 1e4 + 1 )
```
Task: Compare with Seurat and check whether it uses natural or decadic logarithm.
Note: The function `log1p` preserves the matrix's sparse storage format. Why does this not work with `log( 1 + . )`?
### Distances
Let's pick a random cell and check its distance to all other cells, using Euclidean distance on `expr`.
```{r}
cell <- sample.int( ncol(expr), 1 )
hist( sqrt( colSums( ( expr - expr[,cell] )^2 ) ), 100 )
```
Remembering that the cells fall into three "clusters", we might have hoped to get a bimodal distribution, with the cells from the chosen cell's own cluster being closer and the other being farther. This is not the case, even if we try many cells.
PCA will save us, however.
### PCA
We perform a principal component analysis (PCA) on the `expr` matrix. To save time, we
- calculate only the top 20 PCs, and use a PCA function that can take advantage of this and of the fact that our matrix is sparse
- use only the 1000 most variable genes (as seen above)
```{r}
pca <- irlba::prcomp_irlba( t( expr[hvg,] ), n=20, center=TRUE, scale.=TRUE )
rownames(pca$rotation) <- hvg
rownames(pca$x) <- colnames(expr)
```
### Distances again
Now, let's try again with the distances:
```{r}
cell <- sample.int( ncol(pca$x), 1 )
hist( sqrt( rowSums( ( pca$x - pca$x[cell,] )^2 ) ), 100 )
```
Now, for at least some cells, we get a clearly bimodal distribution.
### Neighbor graph
As we only have a small data set, we can calculate the full distance matrix:
```{r}
dm <- as.matrix( dist( pca$x ) )
```
Qu: Pick two cells, calculate the Euclidean distance of the their PCA scores "manually" and compare with the matrix entry. It should agree.
For a given cell, find its 20 nearest neighbors:
```{r}
cell <- 134
order( dm[134,] )[1:20]
```
Do so for all cells:
```{r}
t(sapply( 1:ncol(expr), function(cell)
order( dm[134,] )[1:20] )) -> nn
```
Make a graph out of this
```{r}
library( igraph )
# ...
```
Apply a clustering algorithm onto it
...