Added R notebook

BUStools · Jun 16, 2019 · cdc632e · cdc632e
1 parent f8e76a7
commit cdc632e
Show file tree

Hide file tree

Showing 3 changed files with 739 additions and 0 deletions.
diff --git a/.gitignore b/.gitignore
@@ -0,0 +1,5 @@
+.Rproj.user
+.Rhistory
+.RData
+.Ruserdata
+*.Rproj
diff --git a/getting_started.Rmd b/getting_started.Rmd
@@ -0,0 +1,152 @@
+---
+title: "Getting started"
+author: "Lambda"
+date: "6/15/2019"
+output: html_document
+---
+
+```{r setup, include=FALSE}
+knitr::opts_chunk$set(echo = TRUE,
+                      fig.align = "center")
+```
+
+In thihs notebook, we load the output of `bustools` and do some basic analysis with R.
+```{r, message=FALSE}
+library(Seurat)
+library(DropletUtils)
+library(Matrix)
+library(data.table)
+library(ggplot2)
+library(magrittr)
+fn <- "/home/single_cell_analysis/kallisto_out_single/kallisto_SRR8599150_v2/"
+```
+
+Here's a function that can load the output into R in one step. This is part of the `BUSpaRse` package, which implement utility functions related to `bustools`.
+```{r}
+# Function to load matrix
+#' Read matrix along with barcode and gene names
+#' 
+#' This function takes in a directory and name and reads the mtx file, genes,
+#' and barcodes from the output of `bustools` to return a sparse matrix with 
+#' column names and row names. 
+#' 
+#' @param dir Directory with the bustools count outputs.
+#' @param name The files in the output directory should be <name>.mtx, <name>.genes.txt,
+#' and <name>.barcodes.txt.
+#' @param tcc Logical, whether the matrix of interest is a TCC matrix. Defaults
+#' to \code{FALSE}.
+#' @return A dgCMatrix with barcodes as column names and genes as row names.
+#' 
+read_count_output <- function(dir, name, tcc = TRUE) {
+  dir <- normalizePath(dir, mustWork = TRUE)
+  m <- readMM(paste0(dir, "/", name, ".mtx"))
+  m <- t(m)
+  m <- as(m, "dgCMatrix")
+  # The matrix read has cells in rows
+  ge <- if (tcc) ".ec.txt" else ".genes.txt"
+  genes <- fread(paste0(dir, "/", name, ge), header = FALSE)$V1
+  barcodes <- fread(paste0(dir, "/", name, ".barcodes.txt"), header = FALSE)$V1
+  colnames(m) <- barcodes
+  rownames(m) <- genes
+  return(m)
+}
+```
+
+```{r}
+# The directory where the data is placed
+list.files(fn)
+```
+
+```{r}
+# Load data
+m <- read_count_output(fn, "genes", tcc = FALSE)
+dim(m)
+```
+We expect 3949 cells, but there are way more barcodes (columns) than expected.
+
+# Remove empty droplets
+The vast majority of barcodes we see in the data are empty droplets. We will remove barcodes that are likely to be empty droplets and keep those that are likely to be real cells. Here we use the classic knee plot method to remove empty droplets.
+
+```{r}
+bc_rank <- barcodeRanks(m)
+```
+
+```{r}
+qplot(bc_rank$rank, bc_rank$total, geom = "line") +
+  geom_hline(yintercept = metadata(bc_rank)$knee, color = "blue", linetype = 2) +
+  geom_hline(yintercept = metadata(bc_rank)$inflection, color = "green", linetype = 2) +
+  annotate("text", x = 1000, y = 1.2 * c(metadata(bc_rank)$knee, metadata(bc_rank)$inflection),
+           label = c("knee", "inflection"), color = c("blue", "green")) +
+  scale_x_log10() +
+  scale_y_log10() +
+  labs(x = "Rank", y = "Total reads") +
+  theme_bw()
+```
+
+Now we filter the barcodes based on the inflectionsn point. We will also remove genes not detected in any cell.
+```{r}
+# Filter the barcodes based on 
+tot_counts <- colSums(m)
+tot_genes <- rowSums(m)
+m_filtered <- m[, tot_counts > metadata(bc_rank)$inflection]
+m_filtered <- m_filtered[tot_genes > 0, ]
+dim(m_filtered)
+```
+
+That's a more reasonable number of barcodes.
+
+# Basic analysis
+First, we normali and scale the data with [`SCTransform`](https://github.com/ChristophH/sctransform).
+
+```{r, message=FALSE, warning=FALSE}
+seu <- CreateSeuratObject(m_filtered) %>% 
+  SCTransform(verbose = FALSE)
+```
+
+Some QC: distribution of total UMI counts and number of genes detected
+```{r}
+VlnPlot(seu, c("nCount_RNA", "nFeature_RNA"), pt.size = 0.1)
+```
+
+Also see how total UMI counts in each cell relates to the number of genes detected.
+```{r}
+ggplot([email protected], aes(nCount_RNA, nFeature_RNA)) +
+  geom_point(size = 0.5, alpha = 0.3) +
+  theme_bw() +
+  labs(x = "Total UMI counts", y = "Number of genes detected")
+```
+
+Then PCA.
+```{r}
+seu <- RunPCA(seu, npcs = 50, verbose = FALSE)
+ElbowPlot(seu, ndims = 50) + labs(x = "Principal component")
+```
+
+The y axis of this plot is the standard deviation of the data explained by the principal component (PC). Judging from the elbow plot, we will use the first 30 PCs for tSNE.
+
+```{r}
+# Run tSNE
+seu <- RunTSNE(seu, dims = 1:30, verbose = FALSE)
+```
+
+Here we cluster the data with the Louvain algorithm. The algorithm used in Seurat can be changed.
+```{r}
+seu <- FindNeighbors(seu, verbose = FALSE)
+seu <- FindClusters(seu, verbose = FALSE)
+```
+
+What do the clusters look like?
+```{r}
+# On PCA
+DimPlot(seu, pt.size = 0.5, reduction = "pca")
+```
+
+```{r}
+# On tSNE
+DimPlot(seu, pt.size = 0.5, reduction = "tsne")
+```
+
+```{r}
+sessionInfo()
+```
+