---
title: "Differential Gene Expression"
subtitle: "Workshop on RNA-Seq"
editor_options: 
  chunk_output_type: console
output:
  bookdown::html_document2:
    highlight: textmate
    toc: true
    toc_float:
      collapsed: true
      smooth_scroll: true
      print: false
    toc_depth: 4
    number_sections: true
    df_print: default
    code_folding: none
    self_contained: false
    keep_md: false
    encoding: 'UTF-8'
    css: "assets/lab.css"
    include:
      after_body: assets/footer-lab.html
---

```{r,child="assets/header-lab.Rmd"}
```

<div class="boxy boxy-exclamation boxy-yellow">
Set `~/RNAseq/labs/` as your working directory (i.e. your working directory should be the directory where you saved this .Rmd file). Create a subdirectory named `data` in this directory (`~/RNAseq/labs/data/`) for input and output files.
</div>

Load R packages and source the download function.

```{r}
library(dplyr) # data wrangling
library(ggplot2) # plotting
library(DESeq2) # rna-seq
library(edgeR) # rna-seq

# source download function
source("https://raw.githubusercontent.com/NBISweden/workshop-RNAseq/master/assets/scripts.R")
```

# Preparation

For differential gene expression, we use the **DESeq2** package. We use the raw filtered counts and metadata.
  
```{r}
download_data("data/gene_counts.csv")
cf <- as.matrix( read.csv("data/gene_counts.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1) )

download_data("data/metadata_raw.csv")
mr <- read.csv("data/metadata_raw.csv",header=TRUE,stringsAsFactors=FALSE,row.names=1)
```

The data is converted to a DESeq2 object. The model we use is simple since we only have one variable of interest: `~Group`.

If we had other covariates to control for, we would add them to the model like so: `~var+Group`. The variable of interest is placed last. This model finds differentially expressed genes between the levels of `Group` while controlling for the effect of `var`. Similarly, batch effects can be controlled for by including them in the model: `~batch+Group`.

```{r}
library(DESeq2)
mr$Group <- factor(mr$Group)
d <- DESeqDataSetFromMatrix(countData=cf,colData=mr,design=~Group)
```
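
The effect of adding a covariate can be inspected before any fitting by looking at the design matrix that a formula implies. The sketch below uses base R's `model.matrix()` on a small made-up metadata table; `toy_meta`, its `batch` column and its values are hypothetical and not part of the workshop data.

```{r}
# hypothetical metadata: 6 samples, 2 batches, 2 groups (not the workshop data)
toy_meta <- data.frame(
  batch=factor(c("b1","b1","b2","b2","b1","b2")),
  Group=factor(c("day00","day00","day00","day07","day07","day07"))
)

# design matrix implied by the formula; the variable of interest comes last
toy_design <- model.matrix(~batch+Group,data=toy_meta)
colnames(toy_design)
```

Each non-reference level gets its own coefficient. The last column corresponds to the `Group` effect estimated while accounting for `batch`.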

# Size factors

The first step is estimating size factors. The data is normalised for sequencing depth and compositional bias as done for the VST step. DESeq2 uses a method called *median-of-ratios* for this step.

```{r,chunk.title=TRUE}
d <- DESeq2::estimateSizeFactors(d,type="ratio")
```

<div class="boxy boxy-lightbulb">

**Optional**

For those interested in the details of the *median-of-ratios* method, click below.

<p>
<button class="btn btn-sm btn-collapse btn-collapse-normal collapsed" type="button" data-toggle="collapse" data-target="#dge-size-factor" aria-expanded="false" aria-controls="dge-size-factor">
</button>
</p>
<div class="collapse" id="dge-size-factor">
<div class="card card-body">

This is a step-by-step guide to computing normalisation factors (size factors) using the *median-of-ratios* method.

1. The geometric mean is computed across all samples for each gene.

```{r,chunk.title=TRUE}
gm_mean = function(x, na.rm=TRUE){
  exp(sum(log(x[x > 0]), na.rm=na.rm) / length(x))
}

gmean <- apply(cf,1,gm_mean)
head(gmean)
```

2. Read counts for each gene are divided by the geometric mean of that gene to create a ratio.

```{r,chunk.title=TRUE}
ratio <- cf/gmean
head(ratio)[,1:5]
```

3. The median ratio for each sample (across all genes) is taken as the size factor for that sample.

```{r,chunk.title=TRUE}
sf <- apply(ratio,2,median)
sf
```

We can verify that these values are correct by comparing with size factors generated by DESeq2.

```{r,chunk.title=TRUE}
# deseq2 size factors
sizeFactors(d)
```

If we plot the size factors for each sample against the total counts for each sample, we get the plot below. We can see that they correlate very well. Size factors are mostly correcting for total counts, i.e. sequencing depth.

```{r,fig.height=4.5,fig.width=4.5}
plot(sizeFactors(d),colSums(cf),xlab="Size factors",ylab="Total counts")
```

The raw counts can then be divided by the size factor to yield normalised read counts.

```{r,chunk.title=TRUE}
# custom
head(t(t(cf)/sf))[,1:5]
# deseq2
head(counts(d,normalized=TRUE))[,1:5]
```
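
The custom calculation above can differ slightly from `sizeFactors(d)` when some counts are zero. As far as we know, DESeq2 works on the log scale and leaves out genes that have a zero count in any sample. The sketch below reproduces that variant in base R on a made-up matrix; `toy_counts` and its values are hypothetical.

```{r}
# hypothetical count matrix: 4 genes x 3 samples, gene g4 has a zero count
toy_counts <- matrix(c(10,20,40,
                       100,200,400,
                       5,10,20,
                       0,8,3),
                     nrow=4,byrow=TRUE,
                     dimnames=list(paste0("g",1:4),paste0("s",1:3)))

# log geometric mean per gene; -Inf if any count in that gene is zero
toy_loggm <- rowMeans(log(toy_counts))

# per sample: median of log ratios over genes with a finite log geometric mean
toy_sf <- apply(toy_counts,2,function(cnts){
  exp(median((log(cnts)-toy_loggm)[is.finite(toy_loggm) & cnts>0]))
})
toy_sf

# normalised counts: each sample (column) divided by its size factor
t(t(toy_counts)/toy_sf)
```

In this toy example the samples differ only by a scaling factor for genes g1 to g3, so these genes end up with identical normalised counts across samples.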

</div>
</div>
</div>

<i class="fas fa-comments"></i> The function `estimateSizeFactors()` has an option to set a custom `locfunc` other than the median. Why is this useful? What happens if you change it to `shorth`? Check out the help page for `estimateSizeFactors()`.

# Gene dispersion

When it comes to comparing values between groups, some measure of variation is needed to estimate the variability in gene counts within groups. The most common measures of variation, such as variance and standard deviation, are not good measures for RNA-Seq data because they correlate with the mean. For negative binomially distributed data, dispersion is a better measure of variation in the data set.

<div class="boxy boxy-lightbulb">

**Optional**

For some more clarification on dispersion, click below.

<p>
<button class="btn btn-sm btn-collapse btn-collapse-normal collapsed" type="button" data-toggle="collapse" data-target="#dge-dispersion" aria-expanded="false" aria-controls="dge-dispersion">
</button>
</p>
<div class="collapse" id="dge-dispersion">
<div class="card card-body">

We can create a mean counts vs variance plot for all genes in our data set.

```{r,fig.height=4.5,fig.width=4.5,chunk.title=TRUE}
dm <- apply(cf,1,mean)
dv <- apply(cf,1,var)

ggplot(data.frame(mean=log10(dm+1),var=log10(dv+1)),
       aes(mean,var))+
  geom_point(alpha=0.2)+
  geom_smooth(method="lm")+
  labs(x=expression('Log'[10]~'Mean counts'),y=expression('Log'[10]~'Variance'))+
  theme_bw()
```

We see a mostly linear relationship on the log scale. The blue line denotes a linear fit. Genes that have larger read counts show higher variance. It's hard to say which genes are more variable based on this alone. Therefore, variance is not a good measure to identify variation in read counts. A measure that controls for this mean-variance relationship is what we need.

One option is the coefficient of variation (CV). Let's compute the CV for each gene and plot CV vs the mean.

```{r,fig.height=4.5,fig.width=4.5,chunk.title=TRUE}
cva <- function(x) sd(x)/mean(x)
dc <- apply(cf,1,cva)

ggplot(data.frame(mean=log10(dm+1),var=dc),
       aes(mean,var))+
  geom_point(alpha=0.2)+
  geom_smooth()+
  labs(x=expression('Log'[10]~'Mean counts'),y="Coefficient of variation")+
  theme_bw()
```

Now we see that genes with lower counts have higher variability and genes with larger counts have lower variability. The CV is the ratio of the standard deviation to the mean: `cv=sd(x)/mean(x)`.

This becomes even more apparent if we compute the CV and mean over replicates within sample groups (Time).

```{r,fig.height=3.5,fig.width=5,chunk.title=TRUE}
dx1 <- data.frame(day00=apply(cf[,1:3],1,cva),day07=apply(cf[,4:6],1,cva))
dx1$gene <- rownames(dx1)
dx1 <- tidyr::gather(dx1,key=sample,value=cv,-gene)
rownames(dx1) <- paste0(dx1$gene,"-",dx1$sample)

dx2 <- data.frame(day00=apply(cf[,1:3],1,mean),day07=apply(cf[,4:6],1,mean))
dx2$gene <- rownames(dx2)
dx2 <- tidyr::gather(dx2,key=sample,value=mean,-gene)
rownames(dx2) <- paste0(dx2$gene,"-",dx2$sample)

dx3 <- merge(dx1,dx2,by=0)

ggplot(dx3,aes(x=log10(mean+1),y=cv))+
  geom_point(alpha=0.2)+
  geom_smooth()+
  facet_wrap(~sample.x)+
  labs(x=expression('Log'[10]~'Mean counts'),y="Coefficient of variation")+
  theme_bw()
```

We find that CV strongly declines with increasing counts. Genes with low counts show higher variability. For the sake of completeness, we can also plot the relationship between CV and variance for the same sample groups.

```{r,fig.height=3.5,fig.width=5,chunk.title=TRUE}
dx1 <- data.frame(day00=apply(cf[,1:3],1,cva),day07=apply(cf[,4:6],1,cva))
dx1$gene <- rownames(dx1)
dx1 <- tidyr::gather(dx1,key=sample,value=cv,-gene)
rownames(dx1) <- paste0(dx1$gene,"-",dx1$sample)

dx2 <- data.frame(day00=apply(cf[,1:3],1,var),day07=apply(cf[,4:6],1,var))
dx2$gene <- rownames(dx2)
dx2 <- tidyr::gather(dx2,key=sample,value=var,-gene)
rownames(dx2) <- paste0(dx2$gene,"-",dx2$sample)

dx3 <- merge(dx1,dx2,by=0)

ggplot(dx3,aes(x=log10(var+1),y=cv))+
  geom_point(alpha=0.2)+
  geom_smooth()+
  facet_wrap(~sample.x)+
  labs(x=expression('Log'[10]~'Variance'),y="Coefficient of variation")+
  theme_bw()
```

And we see that they have a weak relationship.
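
Part of the decline in CV with increasing mean is expected from counting noise alone. For Poisson-distributed counts the variance equals the mean, so the CV is `1/sqrt(mean)`. A minimal illustration with made-up mean values:

```{r}
# hypothetical mean counts
pois_mean <- c(1,10,100,1000)

# for a Poisson distribution, sd = sqrt(mean)
pois_cv <- sqrt(pois_mean)/pois_mean
round(pois_cv,3)
```

Any variability beyond counting noise, such as biological variation between replicates, keeps the CV above this curve at high counts.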

</div>
</div>
</div>

DESeq2 computes its own version of dispersion in a more robust manner, taking low count values into account. The DESeq2 dispersion estimates are inversely related to the mean and directly related to the variance. The dispersion estimate is a good measure of the variation in gene expression for a certain mean value.
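
In the negative binomial model used by DESeq2, the variance is related to the mean `mu` and the dispersion `alpha` as `var = mu + alpha*mu^2`. The sketch below simulates counts with base R's `rnbinom()` and recovers the dispersion with a simple method-of-moments estimate. This is only an illustration; it is not how DESeq2 estimates dispersion, and the values of `nb_mu` and `nb_alpha` are made up.

```{r}
set.seed(1)
nb_mu <- 100     # hypothetical mean count
nb_alpha <- 0.2  # hypothetical dispersion

# rnbinom() uses size = 1/dispersion in this parameterisation
nb_x <- rnbinom(10000,mu=nb_mu,size=1/nb_alpha)

# variance expected under the model vs observed in the simulation
nb_var_expected <- nb_mu+nb_alpha*nb_mu^2
c(expected=nb_var_expected,observed=var(nb_x))

# method-of-moments dispersion estimate
nb_alpha_hat <- (var(nb_x)-mean(nb_x))/mean(nb_x)^2
nb_alpha_hat
```

With `alpha=0`, the variance equals the mean and the model reduces to the Poisson distribution.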

Now, the variance or dispersion estimate for genes with low counts is unreliable when there are too few replicates. To overcome this, DESeq2 borrows information from other genes, assuming that genes with similar expression levels have similar dispersion values. Dispersion estimates are first computed for each gene separately using maximum likelihood estimation. A curve is then fitted to these gene-wise dispersion estimates, and the gene-wise estimates are 'shrunk' towards the fitted curve.

Gene-wise dispersions, the fitted curve and shrinkage can be visualised using the `plotDispEsts()` function.

```{r,fig.height=5.5,fig.width=5,chunk.title=TRUE}
d <- DESeq2::estimateDispersions(d)
plotDispEsts(d)
```

The black points denote the maximum likelihood dispersion estimate for each gene. The red curve denotes the fitted curve. The blue points denote the final gene dispersion estimates after shrinkage towards the curve. The circled blue points denote estimates that are not shrunk as they are too far away from the curve. Shrinkage is thus important to reduce false positives in DGE analyses involving few replicates.

<i class='fas fa-lightbulb'></i> It is a good idea to visually check the dispersion shrinkage plot to verify that the method works for your data set.

# Testing

Overdispersion is the reason why RNA-Seq data is better modelled with a negative binomial distribution than with a Poisson distribution. The Poisson distribution has a variance equal to the mean, while the negative binomial distribution allows the variance to exceed the mean. The last step in the DESeq2 workflow is fitting the negative binomial model for each gene and performing differential expression testing. This is based on the log fold change values computed on the corrected count estimates between groups.

The log fold change is computed from the corrected count values as follows:

```
FC = corrected counts group B / corrected counts group A
logFC = log2(FC)
```

and the fold change can be computed back from the log fold change as follows:

```
FC = 2^logFC
```
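
As a small worked example in R, with made-up corrected counts for one gene (`fc_a` and `fc_b` are hypothetical values):

```{r}
fc_a <- 50   # hypothetical corrected count in group A (reference)
fc_b <- 200  # hypothetical corrected count in group B

fc <- fc_b/fc_a
lfc <- log2(fc)
c(FC=fc,logFC=lfc,FC_back=2^lfc)

# swapping the groups flips the sign of the log fold change
log2(fc_a/fc_b)
```

A positive log fold change means higher expression in group B and a negative one means lower expression. The log scale is symmetric around zero.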

The most commonly used test for comparing two groups in DESeq2 is the Wald test. The null hypothesis is that the groups are not different, i.e. `logFC=0`. The list of available comparisons can be seen using `resultsNames()`. Then we can pick our comparison of interest.

```{r,chunk.title=TRUE}
dg <- nbinomWaldTest(d)
resultsNames(dg)
```

And we can get the results table for the comparison of interest. The summary of the results object shows the number of genes that are differentially expressed with positive or negative fold change, as well as the number of outliers.

```{r,chunk.title=TRUE}
res <- results(dg,name="Group_day07_vs_day00",alpha=0.05)
res$padj[is.na(res$padj)] <- 1
# res <- na.omit(res)
summary(res)
write.csv(as.data.frame(res),"data/dge_results.csv",row.names=TRUE)
```

You can also build up the comparison using **contrasts**. Contrasts need the condition, level to compare and the reference level (base level). For example, `results(dg,contrast=c("Group","day07","day00"),alpha=0.05)`.

Note that `day00` is the reference level and other levels are compared to this. Therefore, a fold change of 2 would mean that a gene is expressed 2-fold higher in `day07` compared to `day00`.
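
R orders factor levels alphabetically by default, and the first level is used as the reference. If that is not the intended baseline, the factor can be releveled before creating the DESeq2 object. A minimal base R sketch with hypothetical group labels:

```{r}
# hypothetical group labels
ref_group <- factor(c("treated","control","treated","control"))
levels(ref_group)  # alphabetical order, so "control" is the reference

# make "treated" the reference level instead
ref_group2 <- relevel(ref_group,ref="treated")
levels(ref_group2)
```

In this workshop data set, `day00` sorts before `day07`, so it becomes the reference without any releveling.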

The `results()` function has many useful arguments. One can test against a log fold change threshold using `lfcThreshold`. By default, no threshold is applied to the logFC values. Outliers are detected using Cook's distance and their p-values are set to NA automatically (`cooksCutoff`). `independentFiltering=TRUE` filters out genes with low mean counts by setting their adjusted p-values to NA.

```{r,chunk.title=TRUE}
head(res)
```

The results table contains the mean expression value (baseMean), log2 fold change (log2FoldChange), log2 fold change standard error (lfcSE), Wald test statistic (stat), Wald test p-value (pvalue) and BH adjusted p-value (padj) for each gene.
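
The relationship between these columns can be sketched in a few lines of base R. The Wald statistic is the log2 fold change divided by its standard error, and the p-value comes from comparing it to a standard normal distribution. The numbers below are made up, and DESeq2's `padj` additionally involves independent filtering, so this is an illustration rather than a reimplementation.

```{r}
# hypothetical estimates for three genes
wald_lfc <- c(2.0,-0.5,0.1)
wald_se <- c(0.4,0.25,0.3)

# Wald statistic and two-sided p-value from the standard normal distribution
wald_stat <- wald_lfc/wald_se
wald_p <- 2*pnorm(abs(wald_stat),lower.tail=FALSE)

# Benjamini-Hochberg adjustment for multiple testing
wald_padj <- p.adjust(wald_p,method="BH")

data.frame(log2FoldChange=wald_lfc,lfcSE=wald_se,stat=wald_stat,pvalue=wald_p,padj=wald_padj)
```

The second gene has an unadjusted p-value just below 0.05 but an adjusted p-value above it, which is why filtering is done on `padj`.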

<i class='fas fa-lightbulb'></i>  Note that the results object is a `DESeqResults` class object and not a data.frame. It can be converted to a data.frame using `as.data.frame()` for exporting to a file.

It is a good idea to look at the distribution of unadjusted p-values.

```{r,fig.height=5,fig.width=5}
hist(res$pvalue,main="Pval distribution",xlab="P-values")
```

This is the kind of distribution to be expected when the p-values are "well-behaved". For more explanation on p-value distributions, see [here](http://varianceexplained.org/statistics/interpreting-pvalue-histogram/). If this distribution is very different or strange, then it might indicate an underlying problem.
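
To get a feel for what a well-behaved histogram looks like, p-values can be simulated. In the sketch below, p-values for genes without a true difference are drawn from a uniform distribution and p-values for genes with a true difference are drawn from a distribution concentrated near zero. The 80/20 split and the beta distribution parameters are arbitrary choices for illustration.

```{r,fig.height=5,fig.width=5}
set.seed(42)
sim_p_null <- runif(8000)        # no true difference: uniform p-values
sim_p_alt <- rbeta(2000,0.5,10)  # true difference: p-values pile up near 0
sim_p <- c(sim_p_null,sim_p_alt)

hist(sim_p,breaks=20,main="Simulated pval distribution",xlab="P-values")
```

The flat part corresponds to the genes without a true difference and the peak near zero to the genes with a real effect.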

We can filter the results table as needed.

```{r,chunk.title=TRUE}
# all genes
nrow(as.data.frame(res))
# only genes with padj <0.05
nrow(dplyr::filter(as.data.frame(res),padj<0.05))
# only genes with padj <0.05 and an absolute log2 fold change >2 (more than 4-fold)
nrow(dplyr::filter(as.data.frame(res),padj<0.05,abs(log2FoldChange)>2))
```

Note that manually filtering by **log2FoldChange** on the results table is not the same as setting the `lfcThreshold` argument in the `results()` function.
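
The difference can be sketched with a Wald-type test in base R. Manual filtering tests against `logFC=0` and then discards small estimates, whereas a threshold test asks whether the log fold change is significantly larger in magnitude than the threshold. The estimates below are made up, and the formula is our simplified reading of the default threshold test in `results()`, so treat it as an assumption and consult the DESeq2 documentation for details.

```{r}
# hypothetical estimates for two genes
thr_lfc <- c(3.0,1.2)
thr_se <- c(0.3,0.3)
thr_cut <- 1  # log2 fold change threshold

# test against logFC = 0
thr_p_zero <- 2*pnorm(abs(thr_lfc)/thr_se,lower.tail=FALSE)

# test against |logFC| <= thr_cut
thr_p_cut <- pmin(1,2*pnorm((abs(thr_lfc)-thr_cut)/thr_se,lower.tail=FALSE))

data.frame(log2FoldChange=thr_lfc,p_zero=thr_p_zero,p_threshold=thr_p_cut)
```

The second gene passes a manual filter (`abs(log2FoldChange)>1` and a small p-value against zero), but its estimate is not convincingly beyond the threshold once its uncertainty is taken into account.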

## lfcShrink

This is an optional extra step to generate more accurate log2 fold changes. It shrinks the log2 fold changes of genes with low counts or high dispersion towards zero. This does not change the p-values or the list of DE genes.

```{r,chunk.title=TRUE}
lres <- lfcShrink(dg,coef="Group_day07_vs_day00",res=res,type="normal")
summary(lres)
```

```{r,chunk.title=TRUE}
head(lres)
```

We see that the number of genes up and down has not changed. But, we can see that the logFC distribution has changed in the density plot below.

```{r,fig.height=4,fig.width=8}
par(mfrow=c(1,2))
plot(density(na.omit(res$log2FoldChange)),main="log2FC")
plot(density(na.omit(lres$log2FoldChange)),main="log2FC lfcShrink")
par(mfrow=c(1,1))
```

Shrunken fold changes may be useful in downstream steps where genes need to be filtered based on fold change, or when fold change values are used in functional analyses such as GSEA.

<i class="fas fa-comments"></i> How many DE genes do you get?

# Visualisation

In this section, we explore some useful visualisations of the differential gene expression output.

## MA plot

The MA plot shows mean expression vs log fold change for all genes. The `plotMA()` function from DESeq2 takes a results object as input. Differentially expressed genes are marked in bright colour.

```{r,fig.height=5.5,fig.width=5,chunk.title=TRUE}
DESeq2::plotMA(res)
```

<i class="fas fa-comments"></i> How does this plot change if you set a minimum log fold change threshold of 1? How does the plot change when you use `lfcShrink()`?

## Volcano plot

A volcano plot is similar to the MA plot. It plots log fold change vs adjusted p-values.

```{r,fig.height=5,fig.width=5,chunk.title=TRUE}
ggplot()+
  geom_point(data=as.data.frame(res),aes(x=log2FoldChange,y=-log10(padj)),col="grey80",alpha=0.5)+
  geom_point(data=dplyr::filter(as.data.frame(res),padj<0.05),aes(x=log2FoldChange,y=-log10(padj)),col="red",alpha=0.7)+
  geom_hline(aes(yintercept=-log10(0.05)),alpha=0.5)+
  theme_bw()
```

The x-axis denotes the log fold change and the y-axis denotes the -log10 adjusted p-value. The adjusted p-value is transformed so that the smallest p-values appear at the top. The horizontal grey line denotes the significance threshold. All genes above this line (coloured red) are considered significant.

<i class="fas fa-comments"></i> Why is the y-axis (p-value) on a -log scale?

## Counts plot

It can be a good idea to manually verify some of these genes by plotting their actual read counts. We can use the function `plotCounts()` to visualise the data points for a gene of interest. Below, we plot the counts for the first gene in the results table before and after normalisation.

```{r,fig.height=5,fig.width=5,chunk.title=TRUE}
plotCounts(d,gene=rownames(res)[1],intgroup="Group",normalized=FALSE)
plotCounts(d,gene=rownames(res)[1],intgroup="Group",normalized=TRUE)
```

<i class="fas fa-comments"></i> By looking at the count plots, do you agree that the top DE genes differ significantly between the groups compared? Does the fold change make sense given the reference group?

***