Skip to content

Commit

Permalink
A few minor edits to the intro vignette up to 'Step 2'.
Browse files Browse the repository at this point in the history
  • Loading branch information
pcarbo committed Nov 27, 2023
1 parent 8a15d20 commit 65b557d
Showing 1 changed file with 37 additions and 35 deletions.
72 changes: 37 additions & 35 deletions vignettes/intro_poisson_mash.Rmd
Original file line number Diff line number Diff line change
Expand Up @@ -22,18 +22,19 @@ Overview
--------

This vignette illustrates how to use
[Poisson multivariate adaptive shrinkage][paper] to analyze a data
set. Poisson mash was developed for multi-condition differential
expression analysis with single-cell RNA-sequencing (scRNA-seq) data in
which gene expression is measured in many (e.g., dozens) of different
[Poisson multivariate adaptive shrinkage ("Poisson mash")][paper] to
analyze single-cell RNA-sequencing (scRNA-seq) data. Poisson mash was
developed for *multi-condition* differential expression analysis with
scRNA-seq data in which gene expression is measured in many different
conditions. The aim is to detect which genes are differentially
expressed (DE) and to estimate expression differences (log-fold
changes) among multiple conditions. This approach makes use of the
multivariate adaptive shrinkage ([mash][mash]) prior.

We illustrate this approach by analyzing scRNA-seq data for learning
about the effects of cytokine stimulation on gene expression.
We will work through the Poisson mash analysis step-by-step.
We illustrate Poisson mash by analyzing a scRNA-seq data set that was
developed in order to learn about the effects of cytokine stimulation
on gene expression. We will work through the Poisson mash analysis of
this dataa set step-by-step.

First, we load the R packages needed to run the analysis.

Expand All @@ -52,8 +53,8 @@ results are reproducible.
set.seed(1)
```

Create a data object for Poisson mash analysis
----------------------------------------------
Create a data object for Poisson mash
-------------------------------------

To run Poisson mash, you need to provide: (1) a matrix of UMI counts,
stored as a $J \times N$ matrix $Y$ with entries $y_{ji}$, where $i =
Expand All @@ -62,8 +63,8 @@ vector of length $n$ in which the $i^{\mathrm{th}}$ element gives the
condition label of cell $i$. We denote the number of conditions by
$R$. ($R$ should be much larger than 2.)

The original data contained counts for 13,362 cells, 8,543 genes and
45 conditions (44 cytokine treatments plus one control). For the this
The original data set contained counts for 13,362 cells, 8,543 genes
and 45 conditions (44 cytokine treatments plus one control). For this
vignette, we will use subset (1,936 cells, 2,158 genes, 12
conditions), but later we will inspect the results obtained from
running Poisson mash on the full data set.
Expand All @@ -80,8 +81,7 @@ table(conditions)
After loading in the data, we compute the cell-specific size factors.
This can be done in various ways. The simplest option is to take the
sum of the counts $y_{ji}$ across all genes in each cell $i$. Here
we take a deconvolution-based approach implemented in the
**scran** package:
we take a deconvolution-based approach implemented in scran:

```{r calculate-size-factor}
clusters <- quickCluster(Y)
Expand All @@ -90,7 +90,7 @@ names(si) <- colnames(Y)
```

Next, we create a data object for poisson mash analysis using the
function `pois_mash_set_data`.
function `pois_mash_set_data()`.

```{r create-data-object}
dat <- pois_mash_set_data(Y,conditions,si)
Expand All @@ -106,7 +106,7 @@ to perform a DE analysis across multiple treatment conditions, jointly
for multiple cell types. In this case, $M$ is the number of cell
types, and $R$ represents the total number of combinations of
treatment conditions and cell types. Please refer to the documentation
for the function `pois_mash_set_data` to see how to do this.)
for the function `pois_mash_set_data()` to see how to do this.)

Estimate latent factors capturing unwanted variation
----------------------------------------------------
Expand All @@ -115,7 +115,7 @@ Unwanted variation present in the data can induce dependence among
gene-wise tests and confound the DE analysis. In some cases, the
confounding variables are known, such as when these capture batch
effects. More often, the unwanted variation is due to unmeasured
factors. Here we estimating the confounding variables using
factors. Here we estimate the unmeasured confounding variables using
[glmpca][glmpca].

```{r ruv, results="hide", message=FALSE, warning=FALSE}
Expand Down Expand Up @@ -157,10 +157,10 @@ p(\beta_j; \pi, U) =
$$
In the mash prior, $w_l$ is a pre-specified grid of scaling
coefficients, spanning from very small to very large to capture the
full range of possible DE effect sizes; $U_k$ is a set of covariance
matrices, each of which capture a different pattern of covariation
across conditions; $\pi_{kl}$ are mixture proportions that are
estimated from the data during model fitting.
full range of possible DE effect sizes; each $U_k$ is a covariance
matrix that captures a different pattern of covariation across
conditions; $\pi_{kl}$ are the mixture proportions that are estimated
from the data during model fitting.

To use Poisson mash, you need to specify these prior covariance
matrices. This may include "canonical" covariance matrices that
Expand All @@ -170,7 +170,7 @@ arbitrary patterns of DE among conditions.

For the canonical matrices, here we consider the matrices capturing
condition-specific effects for each of the $R$ conditions. These
matrices can be generated using the `pois_cov_canonical` function:
matrices can be generated using the `pois_cov_canonical()` function:

```{r canonical}
ulist.c <- pois_cov_canonical(dat)
Expand All @@ -179,11 +179,11 @@ ulist.c <- pois_cov_canonical(dat)
Data-driven covariance matrices are estimated using a three-step
process: (1) initialization, (2) refinement, (3) diagonal entry
modification. The initialization phase is implemented by the function
`pois_cov_init`. To estimate data-driven covariance matrices, we only
`pois_cov_init()`. To estimate the data-driven covariance matrices, we only
use the subset of genes containing "strong signals"; that is, the
genes that are considered DE by a multinomial goodness-of-fit test.
The selection of genes with strong signals is implicitly performed
by the call to `pois_cov_init`, e.g., setting `cutoff = 3` selects
by the call to `pois_cov_init()`, e.g., setting `cutoff = 3` selects
the genes for which the magnitude of z-score (inverted from the
p-value of the goodness-of-fit test) exceeds 3.

Expand All @@ -193,15 +193,12 @@ res.pca <- pois_cov_init(dat,cutoff = 3)

The refinement phase refines the initial estimates of data-driven
covariance matrices using the Extreme Eeconvolution (ED) algorithm,
and is implemented by the function `pois_cov_ed`. As in the
and is implemented by the function `pois_cov_ed()`. As in the
initialization phase, only data from the genes with strong signals are
used. It should be noted that only data-driven covariance matrices
used. It should be noted that only the data-driven covariance matrices
need to be refined, so they need to be distinguished from the
canonical covariance matrices in the `pois_cov_ed` call using the
argument `ulist.dd`. For the purposes of this vignette, we just ran
`pois_cov_ed` for 10 iterations, but we recommend performing more
iterations when analyzing your own data (the default number of
iterations is 500).
canonical covariance matrices in the `pois_cov_ed()` call using the
argument `ulist.dd`.

```{r cov-ed, results="hide"}
R <- ncol(dat$X)
Expand All @@ -218,6 +215,11 @@ fit.ed <- pois_cov_ed(dat,subset = res.pca$subset,Ulist = res.pca$Ulist,
control = list(maxiter = 10))
```

For the purposes of this vignette, we just ran
`pois_cov_ed()` for 10 iterations, but we recommend performing more
iterations when analyzing your own data (the default number of
iterations is 500).

The modification phase adds a small constant $\epsilon^2$ (e.g.,
0.01) to the diagonal entries of the estimated data-driven covariance
matrices. This step is done to alleviate issues of inflated
Expand All @@ -235,24 +237,24 @@ H <- length(Ulist)
for(h in 1:H){
Uh <- Ulist[[h]]
Uh <- Uh/max(diag(Uh))
Ulist[[h]] <- Uh + 1e-2*diag(R)
Ulist[[h]] <- Uh + 0.01*diag(R)
}
# add epsilon2*I to each rank-1 data-driven prior covariance matrix
# only a subset of the rank-1 covariance matrices are data-driven and need this modification
G <- length(fit.ed$ulist)
epsilon2.G <- rep(1e-8, G)
names(epsilon2.G) <- names(fit.ed$ulist)
epsilon2.G[ulist.dd] <- 1e-2
epsilon2.G[ulist.dd] <- 0.01
```

### Step 3: Fit the Poisson mash model

Now we are ready to fit the Poisson mash model to data from all genes,
which is implemented by the function `pois_mash`.
which is implemented by the function `pois_mash()`.

By default, differences in expression are measured relative to the
mean across all conditions. To change this, see the `pois_mash` input
mean across all conditions. To change this, see the `pois_mash()` input
arguments `C` and `res.colnames`.

```{r fit-model, results="hide"}
Expand All @@ -263,7 +265,7 @@ res <- pois_mash(data = dat,Ulist = Ulist,ulist = fit.ed$ulist,
control = list(maxiter = 10,nc = 2))
```

For this vignette, we ran `pois_mash` for only 10 iterations, but
For this vignette, we ran `pois_mash()` for only 10 iterations, but
in practice we recommend setting `maxiter = 100`, or perhaps more.

# Examining the Poisson mash results
Expand Down

0 comments on commit 65b557d

Please sign in to comment.