diff --git a/vignettes/activity_analysis.Rmd b/vignettes/activity_analysis.Rmd deleted file mode 100644 index bd84b8d..0000000 --- a/vignettes/activity_analysis.Rmd +++ /dev/null @@ -1,125 +0,0 @@ ---- -title: "Activity-based Data Analysis" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{activity_analysis} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -options(rmarkdown.html_vignette.check_title = FALSE) -``` - -```{r setup, message=FALSE, warning=FALSE} -library(scMINER) -library(ggplot2) -library(dplyr) -``` - -# Introduction - -The **driver activity estimation** is one of the most important features of scMINER. **Mathematically**, the activity of one driver is a type of mean of the expressions of its targets. And **biologically**, the activity can be interpreted as a measure that describes how actively the driver functions, like the enzymes in digesting their subtracts, kinase in activating their downstream genes. Given the gene expression profiles and networks, scMINER can estimate the activities of some predefined drivers, including not only transcription factors (TFs) but also signaling genes (SIGs). scMINER provides a few functions to effortlessly calculate the activities, identify the hidden drivers and visualize them in multiple ways. - -In this tutorial, we will use PBMC14k dataset for demonstration purpose. - -
-**How was the PBMC14K dataset generated?** -```{r, prepare-input-eset, echo=FALSE} -## Load the raw count matrix -data("pbmc14k_rawCount") - -## Read the true labels -pbmc14k_trueLabel <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = T, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE) - -## The true_label much cover all cells in the expression matrix -table(colnames(pbmc14k_rawCount) %in% row.names(pbmc14k_trueLabel)) - -## Create the sparse eSet object using the true_label -pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = pbmc14k_trueLabel, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE) - -## Filter he sparse eSet object using the auto mode -pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both") - -## Normalize he sparse eSet object using the auto mode -pbmc14k_log2cpm.eset <- normalizeSparseEset(pbmc14k_filtered.eset, scale_factor = 1000000, log_base = 2, log_pseudoCount = 1) - -## Add the clustering results from MICA -pbmc14k_log2cpm.eset <- addMICAoutput(pbmc14k_log2cpm.eset, mica_output_file = system.file("extdata/demo_pbmc14k/PBMC14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), visual_method = "umap") - -## Add cell type annotation results -celltype_map <- c(`1`="CD4TN", `2`="CD4TCM", `3`="CD8TN", `4`="NK", `5`="B", `6`="Monocyte", `7`="CD4Treg") -pbmc14k_log2cpm.eset$cell_type <- as.character(celltype_map[pbmc14k_log2cpm.eset$clusterID]) -``` -
-\ - -# Calculate the activities - -scMINER provides two functions, `getActivity_individual()` and `getActivity_inBatch()`, to calculate the driver activities: - -## Calculate the activities per group - -`getActivity_individual()` is designed to calculate the activities per group. It takes the network files as the input: -```{r calculate-activity-individually, eval=FALSE} -## let's use B cell as an example -activity_B.eset <- getActivity_individual(input_eset = pbmc14k_log2cpm.eset[, pData(pbmc14k_log2cpm.eset)$trueLabel == "B"], - network_file.tf = system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/B/TF/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"), - network_file.sig = system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/B/SIG/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"), - driver_type = "TF_SIG") -``` - -If you need to calculate the activity for multiple groups, this is usually the case, you can do it using `getActivity_individual()` as shown above one by one and merge the esets after that. Or, scMINER privides another function, `getActivity_inBatch()`, to calculate the activity in batch: -```{r calculate-activity-in-batch} -## let's use B cell as an example -activity.eset <- getActivity_inBatch(input_eset = pbmc14k_log2cpm.eset, sjaracne_dir = system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe", package = "scMINER"), group_name = "trueLabel", driver_type = "TF_SIG", activity_method = "mean", do.z_normalization = TRUE) -``` - -```{r save-activity-eset, eval=FALSE} -saveRDS(activity.eset, file = "/work-path/PBMC14k/DATA/activity.eset") -``` - - -# Differential activity analysis - -Similar to `getDE()`, scMINER provides a function, `getDA()`, to perform the differential activity analysis and identify the group-specific drivers. - -```{r differential-activity-analysis-1} -## 1. To perform differential expression analysis in a 1-vs-rest manner for all groups -da_res1 <- getDA(input_eset = activity.eset, group_by = "cell_type", use_method = "t.test") -head(da_res1) -``` - -```{r differential-activity-analysis-2, eval=FALSE} -## 2. To perform differential expression analysis in a 1-vs-rest manner for one specific group -da_res2 <- getDA(input_eset = activity.eset, group_by = "cell_type", g1 = c("B"), use_method = "t.test") - -## 3. To perform differential expression analysis in a rest-vs-1 manner for one specific group -da_res3 <- getDA(input_eset = activity.eset, group_by = "cell_type", g0 = c("B"), use_method = "t.test") - -## 4. To perform differential expression analysis in a 1-vs-1 manner for any two groups -da_res4 <- getDA(input_eset = activity.eset, group_by = "cell_type", g1 = c("CD4Treg"), g0 = c("CD4TCM"), use_method = "t.test") -``` - -The `getTopFeatures()` function can aslo be used to easily extract the group-specific markers from the differential expression result: -```{r get-top-drivers} -top_drivers <- getTopFeatures(input_table = da_res1, number = 10, group_by = "g1_tag", sort_by = "log2FC", sort_decreasing = TRUE) - -dim(top_drivers) - -head(top_drivers) -``` - -**The activity-based analysis is done!** - -
-**Session info** -```{r session-info} -sessioninfo::session_info() -``` -
-\ diff --git a/vignettes/celltype_annotation.Rmd b/vignettes/celltype_annotation.Rmd deleted file mode 100644 index 773a865..0000000 --- a/vignettes/celltype_annotation.Rmd +++ /dev/null @@ -1,193 +0,0 @@ ---- -title: "Cell type annotation" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{celltype_annotation} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -options(rmarkdown.html_vignette.check_title = FALSE) -``` - -```{r setup, message=FALSE, warning=FALSE} -library(scMINER) -library(ggplot2) -library(dplyr) -``` - -# Introduction - -Currently, there are two types of strategies to annotate the clusters: **supervised** and **unsupervised**. The **supervised** methods use a list of known markers of potential cell types curated from somoe existing studies of the same/similar contexts. While in contrast, the **unsupervised** methods are usually based on the differentially expressed genes. scMINER provide several useful functions to support both types of strategies. - -In this tutorial, we will use PBMC14k dataset for demonstration purpose. - -
-**How was the PBMC14K dataset generated?** -```{r, prepare-input-eset, message=FALSE, warning=FALSE} -## Load the raw count matrix -data("pbmc14k_rawCount") - -## Read the true labels -pbmc14k_trueLabel <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = T, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE) - -## The true_label much cover all cells in the expression matrix -table(colnames(pbmc14k_rawCount) %in% row.names(pbmc14k_trueLabel)) - -## Create the sparse eSet object using the true_label -pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = pbmc14k_trueLabel, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE) - -## Filter he sparse eSet object using the auto mode -pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both") - -## Normalize he sparse eSet object using the auto mode -pbmc14k_log2cpm.eset <- normalizeSparseEset(pbmc14k_filtered.eset, scale_factor = 1000000, log_base = 2, log_pseudoCount = 1) - -## Add the clustering results from MICA -pbmc14k_log2cpm.eset <- addMICAoutput(pbmc14k_log2cpm.eset, mica_output_file = system.file("extdata/demo_pbmc14k/PBMC14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), visual_method = "umap") -``` -
-\ - - -# Supervised cell type annotation - -In the past a few years, the scRNA-seq data has dramatically increased in both quality and quantity. For the majority of tissue types, some existing studies on the same/similar tissue type are most likely available, and from these existing studies, we can figure out a list of candidate cell types to expect and curate a list of markers for each of them. In this case, we know the 7 cell types involved in the dataset, and curated a marker list from some existing PBMCs studies. - -## Annotate using signature scores - -Given a marker list of candidate cell types, scMINER can estimate a signature score, which is mathematically `the weighted mean of the expression of marker genes involved`, for each candidate cell type across all cell cluster. To do so, you will need to generate a signature table with three columns: - -- `signature_name`: name of cell types/signatures; -- `signature_feature`: markers genes/features of corresponding cell type/signature; -- `weight`: weight of corresponding maker/feature in corresponding cell type/signature. It ranges from -1 to 1, so both positive and negtive markers are supoorted. -```{r signature-table} -## Signature table of PBMC14k dataset -signature_table <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_signatureTable.txt", package = "scMINER"), header = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE) -head(signature_table) -``` - -With this signature table, `draw_bubbleplot()` can estimate the signature scores and visualize them using bubble plot: -```{r signature-bubble, fig.width=7, fig.height=5, fig.align='center'} -## Violin plot of marker genes across clusters -draw_bubbleplot(input_eset = pbmc14k_log2cpm.eset, signature_table = signature_table, group_by = "clusterID") -``` -In the bubble plot above, the color of the bubbles is proportional to the mean of signature score, and the size of the bubbles is proportional to the percentage of cells with higher signature score than mean. The cell type of each cluster is clear, except the cluster 7, which shows equally-high signature score of both CD4+ TCM and CD4+ Reg and higher percentage of CD4+ TCM cells. - -## Annotate using individual marker genes - -scMINER also provides a variety of functions to visualize the selected features: -```{r selected-markers} -## For the demonstration purposes, we picked two well known markers for each of the 7 known cell types, plus "CD3D" and "CD4". -genes_of_interest <-c("CD14", "LYZ", "GZMB", "NKG7", "CD19", "MS4A1", "CD8A", "CD8B", "SELL", "CCR7", "IL2RA", "FOXP3", "IL7R", "S100A4", "CD3D", "CD4") -``` - -### feature visualization: violin plot -```{r featurePlot-vln, fig.width=7, fig.height=6, fig.align='center'} -## Violin plot of marker genes across clusters -feature_vlnplot(input_eset = pbmc14k_log2cpm.eset, features = genes_of_interest, group_by = "clusterID", ncol = 4) -``` - -### feature visualization: box plot -```{r featurePlot-box, fig.width=7, fig.height=6, fig.align='center'} -## Box plot of marker genes across clusters -feature_boxplot(input_eset = pbmc14k_log2cpm.eset, features = genes_of_interest, group_by = "clusterID", ncol = 4) -``` - -### feature visualization: scatter plot -```{r featurePlot-scatter, fig.width=7, fig.height=7, fig.align='center'} -## UMAP scatter plot of marker genes -feature_scatterplot(input_eset = pbmc14k_log2cpm.eset, features = genes_of_interest, ncol = 4, location_x = "UMAP_1", location_y = "UMAP_2", point.size = 0.5, legend.key_height = 0.3, legend.key_width = 0.2, fontsize.legend_title = 8, fontsize.legend_text = 6, fontsize.axis_title = 8, legend.position = "none") -``` - -### feature visualization: bubble plot -```{r featurePlot-bubble, fig.width=7,fig.height=5, fig.align='center'} -## Bubble plot of marker genes across clusters -feature_bubbleplot(input_eset = pbmc14k_log2cpm.eset, features = genes_of_interest, group_by = "clusterID", xlabel.angle = 45) -``` - -### feature visualization: heatmap -```{r featurePlot-heatmap, fig.width=7, fig.height=5,fig.align='center'} -## Heatmap of marker genes across clusters -feature_heatmap(input_eset = pbmc14k_log2cpm.eset, features = genes_of_interest, group_by = "clusterID", scale_method = "none", annotation_columns = c("trueLabel")) -``` - - -# Unsupervised cell type annotation - -Existing studies in the same or similar contexts are not always available, and there is a significant concern regarding the reliability of reference studies. This reliability largely depends on the expertise of the original authors who defined the markers and assigned the cell types. Therefore, we strongly encourage users to also try unsupervised methods, which can serve as a means of cross-validation. - -scMINER provides a function, `getDE()`, to perform the differential expression analysis and identify the markers of each cluster. The `getDE()` function supports three different methods to perform the differential expression analysis, `limma`, `wilcoxon` and `t.test`. And it allows the users to define the groups to compare in a flexible way: - -```{r differential-expression-analysis-1} -## 1. To perform differential expression analysis in a 1-vs-rest manner for all groups -de_res1 <- getDE(input_eset = pbmc14k_log2cpm.eset[500,], group_by = "clusterID", use_method = "limma") -head(de_res1) -``` -Here is an brief introduction to the results of `getDE()`: - -- **feature**: feature name; -- **g1_tag**: a vector of clusters or subgroups involved in g1, the fore-ground group; -- **g0_tag**: a vector of clusters or subgroups involved in g0, the back-ground group; -- **g1_avg**: mean of gene expression of cells in g1; -- **g0_tag**: mean of gene expression of cells in g0; -- **g1_pct**: percentage of cells expressing the corresponding genes in group 1; -- **g0_pct**: percentage of cells expressing the corresponding genes in group 0; -- **log2FC**: log2Fold change of gene expression between g1 and g0; -- **Pval**: P values of g1-g0 comparison; -- **FDR**: FDR of g1-g0 comparison; -- **Zscore**: Z score of g1-g0 comparison, signed by `log2FC`; - -```{r differential-expression-analysis-2, eval=FALSE} -## 2. To perform differential expression analysis in a 1-vs-rest manner for one specific group -de_res2 <- getDE(input_eset = pbmc14k_log2cpm.eset, group_by = "clusterID", g1 = c("1"), use_method = "limma") - -## 3. To perform differential expression analysis in a rest-vs-1 manner for one specific group -de_res3 <- getDE(input_eset = pbmc14k_log2cpm.eset, group_by = "clusterID", g0 = c("1"), use_method = "limma") - -## 4. To perform differential expression analysis in a 1-vs-1 manner for any two groups -de_res4 <- getDE(input_eset = pbmc14k_log2cpm.eset, group_by = "clusterID", g1 = c("1"), g0 = c("3"), use_method = "limma") -``` - -scMINER also provides a function, `getTopFeatures()`, to easily extract the group-specific markers from the differential expression result: -```{r get-top-markers} -cluster_markers <- getTopFeatures(input_table = de_res1, number = 10, group_by = "g1_tag", sort_by = "log2FC", sort_decreasing = TRUE) -dim(cluster_markers) -head(cluster_markers) -``` - -# Add cell type annotations to SparseExpressionSet object - -Based on the supervised and unsupervised methods, we have annotated the cell types for each cluster. To add the cell type annotation information into the sparse eset object: -```{r add-cell-type-annotation-to-eset} -celltype_map <- c(`1`="CD4TN", `2`="CD4TCM", `3`="CD8TN", `4`="NK", `5`="B", `6`="Monocyte", `7`="CD4Treg") -pbmc14k_log2cpm.eset$cell_type <- as.character(celltype_map[pbmc14k_log2cpm.eset$clusterID]) -head(pData(pbmc14k_log2cpm.eset)) -``` - -The `draw_barplot()` function can visualize the cell composition of self-defined groups. We can use it to show the purity of MICA clusters: -```{r group-barplot, fig.width=6, fig.height=6, fig.align='center'} -## Violin plot of marker genes across clusters -draw_barplot(input_eset = pbmc14k_log2cpm.eset, group_by = "cell_type", color_by = "trueLabel_full", xlabel.angle = 45) -``` - -Don't forget to save the SparseExpressionSet object after the cell type annotation added. -```{r save-eset-obj-after-annotation, eval=FALSE} -saveRDS(pbmc14k_log2cpm.eset, file = "/work-path/PBMC14k/DATA/pbmc14k_log2CPM_annotated.rds") -``` - -**The cell type annotation is done!** - -
-**Session info** -```{r session-info} -sessioninfo::session_info() -``` -
-\ - diff --git a/vignettes/clustering_analysis.Rmd b/vignettes/clustering_analysis.Rmd deleted file mode 100644 index 37db5cc..0000000 --- a/vignettes/clustering_analysis.Rmd +++ /dev/null @@ -1,158 +0,0 @@ ---- -title: "Mutual Information-based Clustering Analysis (MICA)" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{clustering_analysis} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -options(rmarkdown.html_vignette.check_title = FALSE) -``` - -```{r setup, message=FALSE, warning=FALSE} -library(scMINER) -library(ggplot2) -library(dplyr) -``` - -# Introduction - -**MICA (Mutual Information-based Clustering Analysis)** is a clustering tool designed for single cell genomics data. Compared to most existing single-cell clustering algorithms, MICA has two unique features: - -- MICA uses **mutual information** to measure cell-cell similarity for unsupervised clustering analysis, while most existing tools employ **linear-transformation of PCA** and/or **co-expression analysis using linear Pearson or Spearman correlations** that may not capture the nonlinear cell-cell distance. -- MICA uses **all high-quality features** for clustering, while most existing tools select the top highly variable features to improve the clustering speed. This is arbitrary and may lose the information that can distinguish close cell states. - -**MICA** is developed using Python framework, to take its strengths in calculation speed and memory consumption. A lot of effort has been made to improve the interoperability between Python and R. Now MICA works **seamlessly** with the SparseExpressionSet object. The input of MICA can be easily generated from the SparseExpressionSet object by `generateMICAinput()`, and the output of MICA, the clustering results, can be effortlessly visualized by `MICAplot()` and integrated into SparseExpressionSet object by `addMICAoutput()`. - -We will go through them one by one in this tutorial using the PBMC14k dataset. - -
-**How was the PBMC14K dataset generated?** -```{r, prepare-input-eset} -## Load the raw count matrix -data("pbmc14k_rawCount") - -## Read the true labels -pbmc14k_trueLabel <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = T, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE) - -## The true_label much cover all cells in the expression matrix -table(colnames(pbmc14k_rawCount) %in% row.names(pbmc14k_trueLabel)) - -## Create the sparse eSet object using the true_label -pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = pbmc14k_trueLabel, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE) - -## Filter he sparse eSet object using the auto mode -pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both") - -## Normalize he sparse eSet object using the auto mode -pbmc14k_log2cpm.eset <- normalizeSparseEset(pbmc14k_filtered.eset, scale_factor = 1000000, log_base = 2, log_pseudoCount = 1) -``` -
-\ - - -# Generate the MICA input - -The standard input of MICA is **a normalized and log-transformed gene expression matrix**. scMINER can generate this matrix from sparse eSet object and save it into a file that can be directly read by MICA. MICA accepts `.h5ad` or `.txt` format as the input file, which can be easily generated by embedded function `generateMICAinput()`: -```{r generate-mica-input-txt, eval=FALSE} -## generate MICA input in txt format -generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/work-path/PBMC14k/MICA/micaInput.txt", overwrite = FALSE) - -## check the format of MICA input -mica_input <- read.delim(system.file("extdata/demo_pbmc14k/PBMC14k/MICA/micaInput.txt", package = "scMINER"), header = T, sep = "\t", row.names = 1) -mica_input[1:5,1:5] -``` - -To use the .h5ad format, run the codes below. -```{r generate-mica-input-h5ad, eval=FALSE} -## generate MICA input in h5ad format: anndata package is needed -generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/work-path/PBMC14k/MICA/micaInput.h5ad", overwrite = FALSE) -``` - -In addition to generating the standard MICA input file, `generateMICAinput()` also returns the recommended commands of running MICA. You can copy the commands, modify according and run them. - - -# Run the MICA - -MICA features two different modes named by their different dimension reduction techniques: - -- **Multi-Dimensional Scaling (MDS)** mode: this mode is more accurate and robust for small datasets (less than 5,000 cells, be default) due to its global dimension reduction nature; -- **Graph Embedding (GE)** mode: this mode works better with large datasets (more than 5,000 cells, by default) using a graph embedding approach to explore distant neighbor cells. - -In this case, since there are 13,605 cells, we will use the `MICA GE` mode for the clustering: -```{r, run-mica-ge, engine = 'bash', eval=FALSE} -mica ge -i /work-path/PBMC14k/MICA/micaInput.txt -o /work-path/PBMC14k/MICA/micaOutput -minr 0.1, -maxr 9.0 -ss 0.05 -nw 40 -``` - -This command will generate the clustering results of multiple resolutions, from 0.1 to 9.0, with a step size of 0.05. - -# add MICA output into sparse eSet object - -MICA generates several files and save all of them in the output directory specified by the user with `-o` argument. The core, and only, output file we need for subsequent analysis is the clustering label file named in the format of `ProjectName_clustering_VisualizeMethod_euclidean_NumberOfDimensions_Resolution.txt`. In this case, since we used a range of resolutions, there are several clustering label files generated, one for each resolution. Based on the knowledge about PBMC14k dataset, we compared the results of different resolutions and picked `clustering_UMAP_euclidean_20_2.05.txt` for subsequent analysis. - -```{r mica-output-core} -micaOutput <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), header = TRUE, sep = "\t", quote = "", stringsAsFactors = F) -head(micaOutput) -``` -As shown above, the clustering label file contains four columns: - -- `ID`: cell barcodes; -- `X`: coordinates of UMAP_1 or tSNE_1; -- `Y`: coordinates of UMAP_2 or tSNE_2; -- `label`: labels of predicted clusters. - -The clustering result can be easily easily added to the SparseExpressionSet object by `addMICAoutput()`: -```{r mica-output-vis} -pbmc14k_log2cpm.eset <- addMICAoutput(pbmc14k_log2cpm.eset, mica_output_file = system.file("extdata/demo_pbmc14k/PBMC14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), visual_method = "umap") -head(pData(pbmc14k_log2cpm.eset)) -``` - -# Visualize the MICA output - -scMINER provides a function, `MICAplot()` to easily visualize the clustering results on a 2D plot, UMAP or tSNE. And it can be colored by multiple variables, including **cluster label**, **sample source**, **nUMI**, **nFeature**, **pctMito** and more. - -## Color-coded by cluster labels - -```{r visualize-mica-output-cluster, fig.align='center', fig.width=6, fig.height=5} -MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "clusterID", X = "UMAP_1", Y = "UMAP_2", point.size = 0.1, fontsize.cluster_label = 6) -``` - -## Color-coded by true label of cell types - -```{r visualize-mica-output-truelabel, fig.align='center', fig.width=6.5, fig.height=5} -MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "trueLabel", X = "UMAP_1", Y = "UMAP_2", point.size = 0.1, fontsize.cluster_label = 4) -``` - -## Color-coded by nUMI, for QC purpose - -```{r visualize-mica-output-numi, fig.align='center', fig.width=6.5, fig.height=5} -MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "nUMI", do.logTransform = TRUE, point.size = 0.1) -``` - - -## Color-coded by nFeature, for QC purpose - -```{r visualize-mica-output-nfeature, fig.align='center', fig.width=6.5, fig.height=5} -MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "nFeature", do.logTransform = TRUE, point.size = 0.1) -``` -## Color-coded by pctMito, for QC purpose - -```{r visualize-mica-output-pctmito, fig.align='center', fig.width=6.5, fig.height=5} -MICAplot(input_eset = pbmc14k_log2cpm.eset, color_by = "pctMito", do.logTransform = FALSE, point.size = 0.1) -``` - -**The mutual information-based clustering analysis is done!** - -
-**Session info** -```{r session-info} -sessioninfo::session_info() -``` -
-\ diff --git a/vignettes/data_preprocessing.Rmd b/vignettes/data_preprocessing.Rmd deleted file mode 100644 index 1fab332..0000000 --- a/vignettes/data_preprocessing.Rmd +++ /dev/null @@ -1,350 +0,0 @@ ---- -title: "Create SparseExpressionSet objects from multiple input formats" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{data_preprocessing} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -options(rmarkdown.html_vignette.check_title = FALSE) -``` - -```{r setup, message=FALSE, warning=FALSE} -library(scMINER) -library(ggplot2) -library(dplyr) -``` - -# Introduction - -This tutorial describes how to create the normalized SparseExpressionSet (sparse eSet) objects from multiple input formats. This vignette will walk through the basic workflow of data preprocessing with scMINER, including generation of gene expression matrix, creation of sparse eSet project, quality control, data filtration and normalization. - -# Create the project space - -We would ***strongly recommend*** you to create a **project space** to centralize scMINER outputs of your project. A **project space** created by scMINER is a folder of specified project name. It contains 4 subfolders: - -- **DATA**: to save the sparse eSet objects and other files; -- **MICA**: to save the inputs and outputs of mutual information-based clustering analysis; -- **SJARACNe**: to save the inputs and outputs of mutual information-based network rewiring and quality control; -- **PLOT**: to save the files of data visualization. - -This can be easily done by `createProjectSpace()` in scMINER: -```{r create-project-space, eval=FALSE} -## create the project space folder named "PBMC14k" in "/path-you-like" -createProjectSpace(project_dir = "/work-path", project_name = "PBMC14k", do.unlink = TRUE) -``` - -The structure of project space is shown below: -```{r show-structure-of-project-space} -list.dirs(system.file("extdata/demo_pbmc14k/PBMC14k", package = "scMINER"), full.names = FALSE, recursive = F) -``` - -# Generating the gene expression matrix - -In this section, we will generate the gene expression matrix, genes by cells, from **multiple input formats** commonly used in single-cell RNA sequencing. For demonstration purposes, scMINER embedded four datasets in `extdata/demoData_readInput`, with one for each input format. All of these four samples were generated by downsampling the real scRNA-seq data. - -## From data directory generated by 10x Genomics - -This is the most popular input format of scRNA-seq data generated by 10x Genomics. Usually, the data directory contains three files: - -- **matrix.mtx**: a sparse matrix format containing the raw UMI count per cell and gene combination -- **barcodes.tsv**: a tab-separated matrix containing the cell barcodes. -- **features.tsv**: a tab-separated matrix containing the features/genes and their annotations. - -For more details about this format, please check out [here](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-mex-matrices). - -```{r read-10x-data-from-directory} -data_dir <- system.file("extdata/demo_inputs/cell_matrix_10x", package = "scMINER") -list.files(path = data_dir, full.names = FALSE) - -demo1_mtx <- readInput_10x.dir(input_dir = data_dir, featureType = "gene_symbol", removeSuffix = TRUE, addPrefix = "demo1") - -demo1_mtx[1:5,1:5] -``` - -The `readInput_10x.dir()` function can handle these conditions: - -- Alternative file names for feature data: for the datasets generated by CellRanger 3.0 or earlier, the file name is **genes.tsv**; -- Compressed input files: one or more input files are compressed, usually in "**.gz**" format; -- Data with multiple modalities: like the single cell multiome data. In this case, `readInput_10x.dir()` only retains the data of "Gene Expression" by default. - -## From HDF5 file generated by 10x Genomics - -This is another popular input format of scRNA-seq data generated by 10x Genomics. The Hierarchical Data Format version 5 (HDF5 or H5) is a binary format that can compress and access data much more efficiently than text formats. It's super useful when dealing with large datasets. - -For more details about this format, please check out [here](https://www.10xgenomics.com/support/software/cell-ranger/latest/analysis/outputs/cr-outputs-h5-matrices). - -```{r read-10x-data-from-h5} -h5_file <- system.file("extdata/demo_inputs/hdf5_10x/demoData2.h5", package = "scMINER") -demo2_mtx <- readInput_10x.h5(h5_file = h5_file, featureType = "gene_symbol", removeSuffix = TRUE, addPrefix = "demo2") - -demo2_mtx[1:5,1:5] -``` - -***NOTE:*** The `readInput_10x.h5()` function is developed exclusively for the HDF5 file generated by CellRanger of 10x Genomics. The HDF5 files from other source may have different hierarchical structures and can not be read by this function. - -## From text-table file - -This is definitely the most compatible text format for scRNA-seq data. which can be used by all single-cell RNA-seq technologies, like 10x Genomics, Smart-Seq, Drop-Seq and more. The commonly used text table file formats include **txt** (text file format), **csv** (comma-separated values) and **tsv** (tab-separated values). - -```{r read-10x-data-from-table-file} -table_file <- system.file("extdata/demo_inputs/table_file/demoData3.txt.gz", package = "scMINER") -demo3_mtx <- readInput_table(table_file = table_file, sep = "\t", is.geneBYcell = TRUE, removeSuffix = TRUE, addPrefix = "demo3") # set is.geneBYcell = FALSE to read features in columns and cell in in rows - -demo3_mtx[1:5,1:5] -``` - -***NOTE:*** A major concern to read the gene expression matrix from text-table files is that **the special characters in column names might be changed to dots (".")**, especially when the matrix is organized in cells by genes. This may cause failures in the identification of mitochondrial genes (usually defined by "MT-|mt-") or spike-in RNAs (usually defined by "ERCC-|Ercc-"). The `readInput_table()` function has set `check.names = FALSE` to avoid this issue. However, if this issue already exists in the source data, you will have to fix it manually. - -## From H5AD file - -The H5AD file is another well-used format for scRNA-seq data. Derived from HDF5 file format, the H5AD file is designed to store large amounts of data efficiently and to facilitate fast access to subsets of the data. Now it's getting more and more popular in scRNA-seq data analysis, visualization and sharing. - -For more details about this format, please check out [here](https://anndata.dynverse.org/). - -```{r read-10x-data-from-h5ad-file} -h5ad_file <- system.file("extdata/demo_inputs/h5ad_file/demoData4.h5ad", package = "scMINER") -demo4_obj <- readInput_h5ad(h5ad_file = h5ad_file, removeSuffix = TRUE, addPrefix = "demo4") # set is.geneBYcell = FALSE to read features in columns and cell in in rows - -t(demo4_obj$X)[1:5,1:5] -``` - -***NOTE:*** Unlike the other three readInput functions which return a gene expression matrx, the `readInput_h5ad()` returns an AnnData object. Here are the key components of an AnnData object: - -- **X**: the primary data matrix, cells by genes; -- **obs**: observations (cells) metadata; -- **var**: variables (features/genes) metadata. - - -# Create sparse eSet object - -In this section, we will **create the SparseExpressionSet object** from the gene expression matrix generated above. The demo dataset used in this section is the **PBMC14k** dataset, which contains 7 known cell types of 2,000 cell per cell type. This dataset was generated from a dataset with 10 sorted Peripheral Blood Mononuclear Cells (PBMCs) populations [[Zheng et al., 2017](./data-reference.md#[Zheng et al., 2017])]. The original dataset is freely available under accession number [SRP073767](https://www.ncbi.nlm.nih.gov/sra?term=SRP073767) and [Zenodo](https://zenodo.org/record/3357167#.YhQNF2RKj6V). - -
-**How was the PBMC14K dataset generated from the original dataset?** -```{r, eval = FALSE} -## Step 1: rectify the invalid gene symbols /Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC20K/00_sourceData/Filtered_DownSampled_SortedPBMC_data.csv -counts <- read.csv("/Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC20K/00_sourceData/Filtered_DownSampled_SortedPBMC_data.csv", row.names = 1) # "Filtered_DownSampled_SortedPBMC_data.csv" is the raw count matrix directly downloaded from Zenodo -d <- t(counts); dim(d) # it includes 21592 genes and 20000 cells - -officialGene <- read.table("/Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC20K/00_sourceData/genesymbol_from_GTF_GRCh37.txt", header = T, sep = "\t", quote = "", stringsAsFactors = F); head(officialGene) # "genesymbol_from_GTF_GRCh37.txt" contains the official gene ids and symbols extracted from GTF file downloaded from https://ftp.ensembl.org/pub/grch37/current/gtf/homo_sapiens/ -officialGene$dotted_symbol <- gsub("-", "\\.", officialGene$gene_name); officialGene$dotted_symbol <- make.unique(officialGene$dotted_symbol) -table(row.names(d) %in% officialGene$dotted_symbol); row.names(d)[! row.names(d) %in% officialGene$dotted_symbol] # two genes are not in: X7SK.1 and X7SK.2 -row.names(d) <- gsub("X7SK.1", "7SK", row.names(d)); row.names(d) <- gsub("X7SK.2", "7SK.1", row.names(d)) -table(row.names(d) %in% officialGene$dotted_symbol) # all true - -row.names(officialGene) <- officialGene$dotted_symbol -officialGene <- officialGene[row.names(d),] -row.names(d) <- make.unique(officialGene$gene_name) -write.table(d, file = "/Users/qpan/Downloads/PBMC20K_rawCount.txt", sep = "\t", row.names = TRUE, col.names = TRUE, quote = FALSE, append = FALSE) # 21592 genes, 20000 cells - -celltype <- read.csv("/Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC20K/00_sourceData/Labels.csv"); head(celltype); # "Labels.csv" contains the true label of cell types directly downloaded from Zenodo -table(celltype$x) # 2000 cells for each of 10 cell types: CD14+ Monocyte, CD19+ B, CD34+, CD4+ T Helper2, CD4+/CD25 T Reg, CD4+/CD45RA+/CD25- Naive T, CD4+/CD45RO+ Memory, CD56+ NK, CD8+ Cytotoxic T, CD8+/CD45RA+ Naive Cytotoxic -df <- data.frame(cell_barcode = colnames(d), trueLabel_full = celltype$x); dim(df) -truelabel_map <- c(`CD14+ Monocyte`="Monocyte", `CD19+ B`="B", `CD34+`="CD34pos", `CD4+ T Helper2`="CD4Th2", `CD4+/CD25 T Reg`="CD4Treg", - `CD4+/CD45RA+/CD25- Naive T`="CD4TN", `CD4+/CD45RO+ Memory`="CD4TCM", `CD56+ NK`="NK", `CD8+ Cytotoxic T`="CD8CTL", `CD8+/CD45RA+ Naive Cytotoxic`="CD8TN") -df$trueLabel <- as.character(truelabel_map[df$trueLabel_full]) -write.table(df, file = "/Users/qpan/Downloads/PBMC20K_trueLabel.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE, append = FALSE) - -## Step 2: extract 7 populations -df.14k <- df[df$trueLabel_full %in% c("CD14+ Monocyte", "CD19+ B", "CD4+/CD25 T Reg", "CD4+/CD45RA+/CD25- Naive T", "CD4+/CD45RO+ Memory", "CD56+ NK", "CD8+/CD45RA+ Naive Cytotoxic"),] -write.table(df.14k, file = "/Users/qpan/Downloads/PBMC14K_trueLabel.txt", sep = "\t", row.names = TRUE, col.names = TRUE, quote = FALSE, append = FALSE) - -d.14k <- d[,df.14k$cell_barcode] -d.14k <- d.14k[rowSums(d.14k) > 0,] -write.table(d.14k, file = "/Volumes/projects/scRNASeq/yu3grp/scMINER/NG_Revision/QPan/scminer_R/Datasets/PBMC14K/PBMC14K_rawCount.txt", sep = "\t", row.names = FALSE, col.names = TRUE, quote = FALSE, append = FALSE) # 17986 genes, 14000 cells -``` -
-\ - -```{r load-the-pbmc14k-data} -## Read the raw count matrix -data("pbmc14k_rawCount") -dim(pbmc14k_rawCount) - -pbmc14k_rawCount[1:5,1:5] -``` - -## Solely from the gene expression matrix - -This is **the most commonly used way** to create the sparse eSet object with scMINER: -```{r create-sparse-eset-solely-from-matrix} -pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, projectID = "PBMC14k", addMetaData = TRUE) - -pbmc14k_raw.eset -``` -- **input_matrix**: it's usually but not limited to a sparse matrix of raw UMI count. - - As for the data format, it accepts **`dgCMatrix`**, **`dgTMatrix`**, **`dgeMatrix`**, **`matrix`**, **`data.frame`**. - - As for the type of quantification measures, it takes **raw counts**, normalized counts (e.g. **`CPM`** or **`CP10k`**), **`TPM`** (Transcripts Per Million), **`FPKM/RPKM`** (Fragments/Reads Per Kilobase of transcript per Million) and others. - - **What if a data frame object is given to it?** When a non-matrix table is passed to `input_matrix` argument, the `createSparseEset()` function will automatically convert it to a matrix. And it the matrix, either converted from other format or directly passed from users, is not sparse. `createSparseEset()` will automatically convert it into sparse matrix, by default. This is controlled by another argument called `do.sparseConversion`, the default of which is `TRUE`. It's not recommended but the users can set it as `FALSE` to disable the conversion. Then `createSparseEset()` will create the eSet based on the regular matrix. -- **addMetaData**: when this argument is set `TRUE` (this is the default), `createSparseEset()` will automatically generate 5 statistics, 4 for cells and 1 for features, and add them into the `phenoData` and `featureData` slots. These 5 statistics will be used in quality control and data filtration. - -```{r sparse-eset-metadata} -## check the phenoData: metadata of cells -head(pData(pbmc14k_raw.eset)) - -## check the featureData: metadata of features -head(fData(pbmc14k_raw.eset)) -``` - -## Using self-customized meta data - -In some cases, you may have more meta data of either cells (e.g. sample id, treatment condition) or features (e.g. gene full name, gene type, genome location) which will be used in downstream analysis and you do want to add them into the sparse eSet object. the `createSparseEset()` function provides another two arguments, **`cellData`** and **`featureData`**, to take the self-customized meta data. For the PBMC14k dataset, we have the true labels of cell type and would like to add them to the sparse eSet object. - -```{r create-sparse-eset-solely-with-cutomized-matadata} -## read the true labels of cell type for PBMC14k dataset -true_label <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = T, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE) - -head(true_label) - -table(true_label$trueLabel_full) - -## the true_label much cover all cells in the expression matrix -table(colnames(pbmc14k_rawCount) %in% row.names(true_label)) - -## create the sparse eSet object using the true_label -pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = true_label, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE) - -## check the true labels of cell type from sparse eSet object -head(pData(pbmc14k_raw.eset)) -table(pData(pbmc14k_raw.eset)$trueLabel_full) -``` - -## From multiple samples - -**What if you have multiple samples for one project?** Now it's pretty common to profile multiple samples of the same tissue but under different conditions (e.g. drug treatment) in one project. Analyzing these samples one by one is crucial, and analyzing them in a combined manner may give you more prospects. For this purpose, scMINER provides a function, `combineSparseEset()`, to easily combine the sparse eSet objects of multiple samples. - -```{r combine-sparse-eset-readinput, echo=FALSE} -## create a sparse eSet object of each sample to combined -demo1_mtx <- readInput_10x.dir(input_dir = data_dir, featureType = "gene_symbol", removeSuffix = TRUE) -demo1.eset <- createSparseEset(input_matrix = demo1_mtx, projectID = "demo1", addMetaData = TRUE) - -demo2_mtx <- readInput_10x.h5(h5_file = h5_file, featureType = "gene_symbol", removeSuffix = TRUE) -demo2.eset <- createSparseEset(input_matrix = demo2_mtx, projectID = "demo2", addMetaData = TRUE) - -demo3_mtx <- readInput_table(table_file = table_file, sep = "\t", is.geneBYcell = TRUE, removeSuffix = TRUE) -demo3.eset <- createSparseEset(input_matrix = demo3_mtx, projectID = "demo3", addMetaData = TRUE) - -demo4_obj <- readInput_h5ad(h5ad_file = h5ad_file, removeSuffix = TRUE) -demo4.eset <- createSparseEset(input_matrix = t(demo4_obj$X), projectID = "demo4", addMetaData = TRUE) -``` - -```{r combine-sparse-eset-merge} -## combine the 4 sparse eSet objects -combined.eset <- combineSparseEset(eset_list = c(demo1.eset, demo2.eset, demo3.eset, demo4.eset), - projectID = c("sample1", "sample2", "sample3", "sample4"), - addPrefix = c("demo1", "demo2", "demo3", "demo4"), - addSurfix = NULL, addMetaData = TRUE, imputeNA = TRUE) -dim(combined.eset) -``` - -A few questions you may have about the `combineSparseEset()` function: -- **What if the input eSets have different features?** `combineSparseEset()` ALWAYS keep all features from the input eSets, and generate NA values wherever the data is not available. By default, this function impute the NA values with the minimum value of the combined matrix, which is usually but not always zero. If this imputation method doesn't fit your study, you can set `imputeNA` to `FALSE` to disable it. If so, the NAs will retain in the eSet object, and you can manually impute them with your own method. -- **What if the input eSets have some same cell barcodes?** `combineSparseEset()` ALWAYS keep all cells from the input eSets, and will report an error when same barcodes are found in different input eSets. This function provides two arguments, `addPrefix` and `addSurfix`, to solve this issue. You can easily avoid the same barcodes of different input eSets by adding a eSet-specific prefix and/or surfix to the barcodes. -```{r show-barcodes-of-combined-eset} -head(pData(combined.eset)) -``` - -- **I have some customized column in the phenoData and/or featureData slots. How does `combineSparseEset()` handle them?** `combineSparseEset()` only keep the columns of phenoData and featureData that shared by all input eSets. Your customized columns would be kept only when they are available in all input eSets. -- **Are the 5 meta data statistics in the combined eSet still same with those generated in each eSet?** No. By default, `combineSparseEset()` will update (add, if they are not available in input eSets) these 5 meta data statistics based on the combined matrix. It's not recommended but you can disable it by setting `addMataData` to `FALSE`. - - -# Quality control and filter the sparse eSet object - -In this section, we will introduce you how scMINER assess the scRNA-seq data quality, estimate the cutoffs for filtration, and remove the cells and feature of low quality from the sparse eSet object. We will use the PBMC14k dataset to conduct the demo. - -## QC metrics available in scMINER - -As we mentioned before, scMINER can automatically generate 5 meta data statistics and add them to the sparse eSet object. These 5 meta data statistics are the metrics scMINER uses to assess the quality of cells and features: - -- For cell quality assessment, scMINER provides 4 metrics that [commonly used by the community](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4758103/): - - **nUMI**: number of total UMIs in each cell. Cells with abnormally high nUMI usually indicate doublets, while those with abnormally low nUMI usually indicate poorly sequenced cells or empty droplets. - - **nFeature**: number of expressed features/genes in each cell. Similar to nUMI. - - **pctMito**: percentage of UMIs of mitochondrial genes (defined by "^mt-|^MT-") in each cell. Cells with aberrantly high pctMito usually indicate dying cells. - - **pctSpikeIn**: percentage of UMIs of spike-in RNAs (defined by "^ERCC-|^Ercc-")) in each cell. This is used to estimate the normalization factor. Cells with extremely high or low pctSpikeIn need to be removed. - -- For feature quality assessment, scMINER provides one metrics: - - **nCell**: number of cells expressing the features/genes. Genes with extremely low nCell are poorly sequenced and are usually of low variance. - -## Generate the QC report - -To help assess the data quality and determine the cutoffs for data filtration, scMINER provides a function, `drawSparseEsetQC()`, to generate a html-format quality control report: -```{r generateQCreport-eset, eval=FALSE} -## To generate the -drawSparseEsetQC(input_eset = pbmc14k_raw.eset, output_html_file = "/work-path/PBMC14k/PLOT/pbmc14k_rawCount.html", overwrite = TRUE) - -## scMINER supports group-specific QC highlights -drawSparseEsetQC(input_eset = pbmc14k_raw.eset, output_html_file = "/work-path/PBMC14k/PLOT/pbmc14k_rawCount.html", overwrite = FALSE, group_by = "true_label") -``` - -The quality control report consists of 4 parts: - -- **Key Statistics**: it highlights 5 key statistics of given eset object, including `number of cells`, `number of genes`, mean of `genes per cell`, mean of `UMIs per cell` and mean of `cells per gene`. -- **Detailed statistics of key metrics**: it summarizes and visualizes the detailed statistics of 5 key metrics that scMINER uses for filtration: `nUMI`, `nFeature`, `pctMito`, `pctSpikeIn`, `nCell`. -- **Detailed statistics per cell and gene**: it lists the detailed statistics of each gene and cell. -- **Filtration cutoffs by scMINER**: it provides the cutoffs estimated automatically by scMINER based on Median ± 3 * MAD (maximum absolute deviance), and the pseudo-filtration statistics on both genes and cells with these cutoffs. - -## Filter the sparse eset object - -From the quality control report generated above, we have got a better sense about the data quality and the cutoffs to use for filtration. scMINER provides a function, `filterSparseEset()` for this purpose, and it can work in two modes: - -- **"auto"**: in this mode, scMINER will use the cutoffs estimated by Median ± 3*MAD (maximum absolute deviation). Based on our tests, in most cases, this mode works well with the matrix of both raw UMI counts and TPM values. -- **"manual"**: in this mode, the users can manually specify the cutoffs, both low and high, of all 5 metrics: **nUMI**, **nFeature**, **pctMito**, **pctSpikeIn** for cells, and **nCell** for genes. No cells or features would be removed under the default cutoffs of each metrics. - -No matter which mode to use, `filterSparseEset()` returns a summary table with detailed information of filtration statistics. You can refer to it and adjust the cutoffs accordingly. - -### Filtering under auto mode -To conduct the filtering using the cutoffs recommended by scMINER: -```{r filter-eset-auto} -## Filter eSet under the auto mode -pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both") -``` - -In some cases, you may find that most of the cutoffs generated by the auto mode are good, except one or two. Though there is no 'hybrid' mode, scMINER does allow you to customize some of the cutoffs generated by the auto mode. This can be easily done by adding the cutoffs you would customize under the auto mode: -```{r filter-eset-auto-customize, eval=FALSE} -## Filter eSet under the auto mode, with customized values -pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both", gene.nCell_min = 5) -``` - -With the code above, scMINER will filter the eSet using all of the cutoffs generated by auto mode, except `gene.nCell_min`. - -### Filtering under manual mode -To apply the self-customized cutoffs: -```{r filter-eset-manual, eval=FALSE} -## Filter eSet under the manual mode -pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "manual", filter_type = "both", gene.nCell_min = 10, cell.nUMI_min = 500, cell.nUMI_max = 6500, cell.nFeature_min = 200, cell.nFeature_max = 2500, cell.pctMito_max = 0.1) -``` - -For any unspecified cutoff arguments, like `gene.nCell_max`, `filterSparseEset()` will automatically assign the default values to them. The default values of any cutoff argument would not filter out any cells or features. So, if you want to skip some metrics, just leave the cutoffs of them unspecified. For example, in the codes above, `gene.nCell_max` is unspecified. Then `filterSparseEset()` wil assign the default value, which is `Inf`, to it. No features would be filtered out by this argument. - -## Normalize the SparseExpressionSet object - -After the filtration of low-quality features and cells, the eSet object is ready for normalization. We recommend to use "log2CPM" method for normalization: the raw counts in each cell are normalized to a library size of 1 million, followed by log2 transformation. -```{r normalize-eset} -pbmc14k_log2cpm.eset <- normalizeSparseEset(pbmc14k_filtered.eset, scale_factor = 1000000, log_base = 2, log_pseudoCount = 1) -``` - -This normalized and log-transformed SparseExpresionSet object can be directly used for Mutual Information-based clustering, network inference and other downstream analysis. - -## Save the SparseExpressionSet object - -```{r save-eset, eval=FALSE} -saveRDS(pbmc14k_log2cpm.eset, file = "/work-path/PBMC14k/DATA/pbmc14k_log2cpm.rds") -``` - -**The data preprocessing is done!** - - -
-**Session info** -```{r session-info} -sessioninfo::session_info() -``` -
-\ diff --git a/vignettes/network_inference.Rmd b/vignettes/network_inference.Rmd deleted file mode 100644 index 92524ae..0000000 --- a/vignettes/network_inference.Rmd +++ /dev/null @@ -1,187 +0,0 @@ ---- -title: "Mutual Information-based network rewiring by SJARACNe" -output: rmarkdown::html_vignette -vignette: > - %\VignetteIndexEntry{network_inference} - %\VignetteEngine{knitr::rmarkdown} - %\VignetteEncoding{UTF-8} ---- - -```{r, include = FALSE} -knitr::opts_chunk$set( - collapse = TRUE, - comment = "#>" -) -options(rmarkdown.html_vignette.check_title = FALSE) -``` - -```{r setup, message=FALSE, warning=FALSE} -library(scMINER) -library(ggplot2) -library(dplyr) -``` - -# Introduction - -[SJARACNe](https://academic.oup.com/bioinformatics/article/35/12/2165/5156064) is a scalable software tool for gene network reverse engineering from big data. As an improved implementation of the ARACNe, SJARACNe achieves a dramatic improvement in computational performance in both time and memory usage and implements new features while preserving the network inference accuracy of the original algorithm. - -Similar to MICA, SJARACNe is also a component of scMINER framework, and can work seamlessly with the SparseExpressionSet object. The input for MICA can be easily generated from the SparseExpressionSet object by `generateSJARACNeInput()`, and the output of SJARACNe, the networks, can be effortlessly assessed by `drawNetworkQC()` and directly taken for driver activity estimation by `getActivity_individual()` and `getActivity_inBatch()`. - -In this tutorial, we will use PBMC14k dataset for demonstration purpose. - -
-**How was the PBMC14K dataset generated?** -```{r, prepare-input-eset, echo=FALSE} -## Load the raw count matrix -data("pbmc14k_rawCount") - -## Read the true labels -pbmc14k_trueLabel <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = T, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE) - -## The true_label much cover all cells in the expression matrix -table(colnames(pbmc14k_rawCount) %in% row.names(pbmc14k_trueLabel)) - -## Create the sparse eSet object using the true_label -pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = pbmc14k_trueLabel, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE) - -## Filter he sparse eSet object using the auto mode -pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both") - -## Normalize he sparse eSet object using the auto mode -pbmc14k_log2cpm.eset <- normalizeSparseEset(pbmc14k_filtered.eset, scale_factor = 1000000, log_base = 2, log_pseudoCount = 1) - -## Add the clustering results from MICA -pbmc14k_log2cpm.eset <- addMICAoutput(pbmc14k_log2cpm.eset, mica_output_file = system.file("extdata/demo_pbmc14k/PBMC14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), visual_method = "umap") - -## Add cell type annotation results -celltype_map <- c(`1`="CD4TN", `2`="CD4TCM", `3`="CD8TN", `4`="NK", `5`="B", `6`="Monocyte", `7`="CD4Treg") -pbmc14k_log2cpm.eset$cell_type <- as.character(celltype_map[pbmc14k_log2cpm.eset$clusterID]) -``` -
-\ - -# Generate the SJARACNe input files - -The network inference is usually conducted in a **cluster-** or **cell type-specific** basis. Given the column names for grouping, `generateSJARACNeInput()` will create a folder for each group named by the group label. - -***IMPORTANT NOTE***: Any **illegal characters in path** in group labels may cause issues in subsequent analysis. To avoid it, scMINER only accept letters(A-Za-z), numbers(0-9), underscores('_') and periods('.'). - -In this case, the true labels of cell type are available, so we use them to define the groups for network inference. - -```{r generate-sjaracne-inputs-from-true-label, eval=FALSE} -## Columns with any illegal characters can not be used for groupping -generateSJARACNeInput(input_eset = pbmc14k_log2cpm.eset, group_name = "trueLabel", sjaracne_dir = "/work-path/PBMC14k/SJARACNe", species_type = "hg", driver_type = "TF_SIG", downSample_N = NULL) -``` - -For big datasets, `generateSJARACNeInput()` provides an argument, `downSample_N`, to allow users to down sample size of each group. The default value of `downSample_N` is 1,000, any group with >= 1,000 cells will be down-sample to 1,000. - -```{r sjarace-input-file-structure} -## one folder for each group -list.dirs(system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe", package = "scMINER"), full.names = FALSE, recursive = FALSE) - -## file structure of each folder -list.files(system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/B", package = "scMINER"), full.names = FALSE, recursive = TRUE, include.dirs = FALSE, pattern = "[^consensus_network_ncol_.txt]") -``` - -The standard input files of SJARACNe, for each group, include: - -- a "**`.exp.txt`**" file: a tab-separated genes/transcripts/proteins by cells/samples expression matrix with the first two columns being ID and symbol. -- a "**`TF`**" folder containing a "**`.tf.txt`**" file: a list of significant gene/transcript/protein IDs of TF drivers. -- a "**`SIG`**" folder containing a "**`.sig.txt`**" file: a list of significant gene/transcript/protein IDs of SIG drivers. -- a bash script (**`runSJARACNe.sh`**) to run SJARACNe. Further modification is needed to run it. -- a json file (**`config_cwlexec.json`**) containing parameters to run SJARACNe. - -Usually, the ground truth of cell types is not available. Then the **cluster labels**, or **cell type annotations of the clusters**, can be used for grouping in network rewiring, since it's expected that cells with same cluster label/annotated cell type are of similar gene expression profiles. To generate from annotated cell types, you can run: - -```{r generate-sjaracne-inputs-from-cell-type, eval=FALSE} -generateSJARACNeInput(input_eset = pbmc14k_log2cpm.eset, group_name = "cell_type", sjaracne_dir = "/work-path/PBMC14k/SJARACNe/bycelltype", species_type = "hg", driver_type = "TF_SIG") -``` - - -# Run the SJARACNe - -By default, the `generateSJARACNeInput()` function also generates a `runSJARACNe.sh` file in the folder of each group. This file much be modified before you can run it: - -- **removed unneeded lines**: There are usually 4 lines in this file: the lines starting with "sjaracne lsf" are the command lines to run on IBM LSF cluster, while the lines starting with "sjaracne local" are the command lines runing on a single machine (Linux/OSX). Please select the lines based on your situation and remove the others. -- **-n**: number of bootstrap networks to generate. Default: 100. -- **-pc**: p value threshold to select edges in building consensus network. Default: e-2 for single-cell data, e-3 for meta-cell data, and e-5 for bulk sample data. -Please use "sjaracne lsf -h" or "sjaracne local -h" to check more details of arguments available in SJARACNe. - -There is another file, `config_cwlexec.json`, available in the folder. It contains the information (e.g. memory request for each step of SJARACNe run) used for LSF job submission. This file is only needed for LSF runs and the default values works well in most cases. If you are running SJARACNe on a big dataset, you may need to request more memory from it. - -In this case, we use LSF to run the SJARACNe: -```{r run-sjaracne-lsf, engine = 'bash', eval=FALSE} -## let's use B cell as an example -# for TF -sjaracne lsf -e /work-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /work-path/PBMC14k/SJARACNe/B/TF/B.835_1902.tf.txt -o /work-path/PBMC14k/SJARACNe/B/TF/bt100_pc001 -n 100 -pc 0.01 -j /work-path/PBMC14k/SJARACNe/B/config_cwlexec.json - -# for SIG -sjaracne lsf -e /work-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /work-path/PBMC14k/SJARACNe/B/SIG/B.4148_1902.sig.txt -o /work-path/PBMC14k/SJARACNe/B/SIG/bt100_pc001 -n 100 -pc 0.01 -j /work-path/PBMC14k/SJARACNe/B/config_cwlexec.json -``` - -We created a folder named "**bt100_pc001**" in both TF and SIG folders of each group, to save the networks generated under **100 bootstraps** (`-n 100`) and **0.01 consensus p value** (`-pc 0.01`). - -To run SJARACNe on a local machine: -```{r run-sjaracne-local, engine = 'bash', eval=FALSE} -## let's use B cell as an example -# for TF -sjaracne local -e /work-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /work-path/PBMC14k/SJARACNe/B/TF/B.835_1902.tf.txt -o /work-path/PBMC14k/SJARACNe/B/TF/bt100_pc001 -n 100 -pc 0.01 - -# for SIG -sjaracne local -e /work-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /work-path/PBMC14k/SJARACNe/B/SIG/B.4148_1902.sig.txt -o /work-path/PBMC14k/SJARACNe/B/SIG/bt100_pc001 -n 100 -pc 0.01 -``` - - -# Assess the quality of networks - -## Introduction to the network file by SJARACNe - -The core output of SJARACNe is the network file named `consensus_network_ncol_.txt`. It contains 9 columns: - -- **source**: ID of the source gene, can be the gene symbol; -- **target**: ID of the target gene, can be the gene symbol; -- **source.symbol**: symbol of the source gene; -- **target.symbol**: symbol of the target gene; -- **MI**: mutual information of source-gene pair; -- **pearson**: Pearson correlation coefficient, [-1,1] -- **pearson**: Spearman correlation coefficient, [-1,1] -- **slope**: slop of the regression line, returned by `stats.linregression()` -- **p.value**: p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic - -```{r check-network-file-format} -network_format <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/B/TF/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"), - header = T, sep = "\t", quote = "", stringsAsFactors = F) -head(network_format) -``` - -## Generate network QC report - -There is no simple standards to tell the reliability of networks. Empirically, a network with **50-300** target size is good. scMINER provides a function, `drawNetworkQC()`, to quickly assess the quality of networks in batch. - -```{r assess-network-quality, eval=FALSE} -network_stats <- drawNetworkQC(sjaracne_dir = "/work-path/PBMC14K/SJARACNe", generate_html = FALSE) # Set `generate_html = TRUE` to generate html-format QC report for each network file -``` - -```{r show-network-qc-stats} -## The network QC statistics table is saved seperately, for demonstration purposes. -network_stats <- readRDS(system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/network_stats.rds", package = "scMINER")) -head(network_stats) -``` -**The Mutual Information-based network rewiring by SJARACNe is done!** - -
-**Session info** -```{r session-info} -sessioninfo::session_info() -``` -
-\ - - - - - - - - - diff --git a/vignettes/quick_tutorial.Rmd b/vignettes/quick_tutorial.Rmd new file mode 100644 index 0000000..124657d --- /dev/null +++ b/vignettes/quick_tutorial.Rmd @@ -0,0 +1,626 @@ +--- +title: "A quick tutorial to run scMINER" +author: "Qingfei Pan" +date: "`r Sys.Date()`" +output: + html_document: + theme: yeti + toc: yes +--- + +```{r, include = FALSE} +knitr::opts_chunk$set( + collapse = TRUE, + comment = "#>" +) +options(rmarkdown.html_vignette.check_title = FALSE) +``` + +Welcome to the **quick** tutorial of scMINER! This tutorial aims to provide **immediate and practical guidance** on running scMINER with your own data. For full functionality and detailed documentation of scMINER, please refer to the [Full Documentation](https://qingfeipan.github.io/scMINER_qpan/bookdown/index.html). In this tutorial, we will walk through the core analysis of scMINER step-by-step using a ground truth dataset called **PBMC14k**. + +# Get started +```{r setup, message=FALSE, warning=FALSE} +library(scMINER) +library(dplyr) +library(ggplot2) +library(anndata) +library(hdf5r) +``` + +# Data preprocessing + +## Create project space + +The *project space* created by scMINER is a folder that can not only keep your data centralized and organized but also make the scMINER pipeline more smooth and robust. We encourge you to create a project space for each of your studies. + +```{r create-project-space, eval=FALSE} +scminer_dir <- createProjectSpace(project_dir = "/your-path", project_name = "PBMC14k") +``` + +This creates a folder named **`PBMC14k`** in the directory of **`/your-path`**, and creates four subfolders in it: + +- **`DATA`**: to save the sparse eSet objects and other files; +- **`MICA`**: to save the inputs and outputs of mutual information-based clustering analysis; +- **`SJARACNe`**: to save the inputs and outputs of network inference and quality control; +- **`PLOT`**: to save the files of data visualization. + +## Generate gene expression matrix + +scMINER provides four functions to generate gene expression matrix from multiple-format inputs: + +```{r generate-gene-expression-matrix, eval=FALSE} +## Input type 1: Data directory by 10x Genomics, containing matrix.mtx, barcodes.tsv and features.tsv (or genes.tsv) +demo1_mtx <- readInput_10x.dir(input_dir = system.file("extdata/demo_inputs/cell_matrix_10x", package = "scMINER"), + featureType = "gene_symbol", removeSuffix = TRUE, addPrefix = "demo1") + +## Input type 2: Text-table file, eg. txt, tsv, csv +demo2_mtx <- readInput_table(table_file = system.file("extdata/demo_inputs/table_file/demoData2.txt.gz", package = "scMINER"), + is.geneBYcell = TRUE, # set is.geneBYcell = FALSE to read features in columns and cells in rows + sep = "\t", removeSuffix = TRUE, addPrefix = "demo2") + +## Input type 3: HDF5 file by 10x Genomics +demo2_mtx <- readInput_10x.h5(h5_file = system.file("extdata/demo_inputs/hdf5_10x/demoData2.h5", package = "scMINER"), + featureType = "gene_symbol", removeSuffix = TRUE, addPrefix = "demo3") + +## Input type 4: H5AD file +demo4_obj <- readInput_h5ad(h5ad_file = system.file("extdata/demo_inputs/h5ad_file/demoData4.h5ad", package = "scMINER"), + removeSuffix = TRUE, addPrefix = "demo4") +``` + +The raw count matrix of PBMC14k dataset is embedded in scMINER and can be easily fetched by: + +```{r fetch-pbmc14k-rawcount-data} +## load the raw count matrix of PBMC14k dataset +data("pbmc14k_rawCount") +dim(pbmc14k_rawCount) +pbmc14k_rawCount[1:5,1:4] +``` + +## Create SparseEset object + +The **`SparseExpressionSet`** (or **`SparseEset`** for short) is a new class created by scMINER to handle the sparsity in scRNA-seq data. It is derived from [ExpressionSet](https://www.bioconductor.org/packages/release/bioc/vignettes/Biobase/inst/doc/ExpressionSetIntroduction.pdf), and enables to compress, store and access efficiently and conveniently. The SparseEset object is the **center** of scRNA-seq data analysis by scMINER. + +The SparseEset object can be easily created from the gene expression matrix: + +```{r create-sparse-eset-solely-from-matrix, eval=FALSE} +## Create SparseEset object solely from gene expression matrix, meta data is automatically added +pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, projectID = "PBMC14k", addMetaData = TRUE) +``` + +`createSparseEset()` offers an argument, `addMetaData`, to automatically generate and add 5 meta data statistics for cells and genes into the SparseEset object. It also provides another two arguments, `cellData` and `featureData`, to allow you add your customized `phenoData` or `featureData`. In this case, we have the true labels of cell types and would like to add them to the SparseEset object: + +```{r create-sparse-eset-with-cutomized-matadata} +## Read the true lables of cell types embedded in scMINER R package +true_label <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_trueLabel.txt.gz", package = "scMINER"), header = T, row.names = 1, sep = "\t", quote = "", stringsAsFactors = FALSE) + +## Create SparseEset object with self-customized metadata +pbmc14k_raw.eset <- createSparseEset(input_matrix = pbmc14k_rawCount, cellData = true_label, featureData = NULL, projectID = "PBMC14k", addMetaData = TRUE) + +head(pData(pbmc14k_raw.eset)) +table(pData(pbmc14k_raw.eset)$trueLabel_full) +``` + +If you have **multiple samples for one project**, please create one SparseEset object for each of the samples and combined these SparseEset objects into one: + +```{r create-sparse-eset-from-multiple-samples, eval=FALSE} +## Create SparseEset from multiple samples +# Step 1: create an SparseEset for each sample +demo1_mtx <- readInput_10x.dir(input_dir = system.file("extdata/demo_inputs/cell_matrix_10x", package = "scMINER"), + featureType = "gene_symbol", removeSuffix = TRUE) +demo1.eset <- createSparseEset(input_matrix = demo1_mtx, addMetaData = TRUE) + +demo2_mtx <- readInput_table(table_file = system.file("extdata/demo_inputs/table_file/demoData2.txt.gz", package = "scMINER"), + is.geneBYcell = TRUE, sep = "\t", removeSuffix = TRUE) +demo2.eset <- createSparseEset(input_matrix = demo2_mtx, addMetaData = TRUE) + +# Step 2: combine the SparseEset objects of all samples +combined.eset <- combineSparseEset(eset_list = c(demo1.eset, demo2.eset), + projectID = c("sample1", "sample2"), + addPrefix = c("demo1", "demo2"), + addSurfix = NULL, addMetaData = TRUE, imputeNA = TRUE) +``` + + +## Filter SparseEset object + +As we mentioned before, scMINER can automatically generate and add 5 meta data statistics to SparseEset object. These 5 meta data statistics are the metrics scMINER uses to assess the quality of cells and features: + +- For cell quality assessment, scMINER provides 4 metrics that [commonly used by the community](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4758103/): + - **nUMI**: number of total UMIs in each cell. Cells with abnormally high nUMI usually indicate doublets, while those with abnormally low nUMI usually indicate poorly sequenced cells or empty droplets. + - **nFeature**: number of expressed features/genes in each cell. Similar to nUMI. + - **pctMito**: percentage of UMIs of mitochondrial genes (defined by "^mt-|^MT-") in each cell. Cells with aberrantly high pctMito usually indicate dying cells. + - **pctSpikeIn**: percentage of UMIs of spike-in RNAs (defined by "^ERCC-|^Ercc-")) in each cell. This is used to estimate the normalization factor. Cells with extremely high or low pctSpikeIn need to be removed. + +- For feature quality assessment, scMINER provides one metrics: + - **nCell**: number of cells expressing the features/genes. Genes with extremely low nCell are poorly sequenced and are usually of low variance. + +To help assess the data quality and determine the cutoffs used for filtration, scMINER can generate a html-format QC report: + +```{r generateQCreport-eset, eval=FALSE} +## To generate QC report from SparseEset object +drawSparseEsetQC(input_eset = pbmc14k_raw.eset, output_html_file = "/your-path/PBMC14k/PLOT/pbmc14k_rawCount.html", overwrite = FALSE) + +## scMINER also supports group-specific QC highlights +drawSparseEsetQC(input_eset = pbmc14k_raw.eset, output_html_file = "/your-path/PBMC14k/PLOT/pbmc14k_rawCount.html", overwrite = FALSE, group_by = "trueLabel") +``` + +This QC report contains a variety of tables and plot of the key statistics of your data that can help you get a better sense about data quality and determine the cutoffs for filtration. + + +scMINER provides two modes to perform SparseEset object filtration: + +- **`auto`**: in this mode, the filtration cutoffs are **automatically** generated by scMINER in `Median ± 3*MAD` (maximum absolute deviation) method. This mode works well in the majority of our test datasets. +- **`manual`**: in this mode, you can **manually** specify the cutoffs, both low and high, of all 5 metrics. **No cells or features would be removed under the default cutoffs of each metrics**. + +To apply the **`auto`** mode in SparseEset filtration: + +```{r filter-sparse-eset-auto} +## Filter SparseEset object with the cutoffs automatically generated by scMINER +pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both") +``` + +This command generates a filtered SparseEset object `pbmc14k_filtered.eset` and returns a summary table with detailed information of filtration statistics. You can refer to it and adjust the cutoffs accordingly. + +In some cases, you may find that most of the cutoffs generated by the auto mode are good, except one or two. Though there is no 'hybrid' mode, scMINER does allow you to customize some of the cutoffs generated by the auto mode. This can be easily done by adding the cutoffs you would customize under the auto mode: +```{r filter-sparse-eset-auto-customize, eval=FALSE} +## Filter eSet under the auto mode, with customized values +pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "auto", filter_type = "both", gene.nCell_min = 5) +``` + +If the cutoffs generated in auto model do not work well in your case and you would like to go with self-customized cutoffs, you can easily apply them by: +```{r filter-eset-manual, eval=FALSE} +## Filter SparseEset object with self-customized cutoffs +pbmc14k_filtered.eset <- filterSparseEset(pbmc14k_raw.eset, filter_mode = "manual", filter_type = "both", gene.nCell_min = 10, cell.nUMI_min = 500, cell.nUMI_max = 6500, cell.nFeature_min = 200, cell.nFeature_max = 2500, cell.pctMito_max = 0.1) +``` + + +## Normalize SparseEset object + +scMINER recommends the **`log2CPM`** method for normalization: the raw counts in each cell are normalized to a library size of 1 million, followed by log2 transformation. +```{r normalize-sparse-eset} +pbmc14k_log2cpm.eset <- normalizeSparseEset(pbmc14k_filtered.eset, scale_factor = 1000000, log_base = 2, log_pseudoCount = 1) +``` + +This normalized and log-transformed SparseEset object can be directly used for Mutual Information-based clustering, network inference and other downstream analysis. And it's recommended to save it into the project space. + +```{r save-eset-log2cpm, eval=FALSE} +saveRDS(pbmc14k_log2cpm.eset, file = "/your-path/PBMC14k/DATA/pbmc14k_log2cpm.rds") +``` + + +# MI-based clustering analysis + +**MICA** (**M**utual **I**nformation-based **C**lustering **A**nalysis) is a clustering tool designed for scRAN-seq data. It is developed with Python to take it's strengths in calculation speed and memory consumption. As a component of scMINER framework, MICA works seamlessly with the scMINER R package and SparseEset object. + +## Generate MICA input + +The standard input of MICA is **a normalized and log-transformed gene expression matrix**. scMINER can generate this matrix from SparseEset object and save it into a file that can be directly read by MICA. + +scMINER uses **`.txt`** as the default input file format. It can by generated by: + +MICA accepts `.h5ad` or `.txt` format as the input file, which can be easily generated by embedded function `generateMICAinput()`: +```{r generate-mica-input-txt, eval=FALSE} +## Generate MICA input in txt format +generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/your-path/PBMC14k/MICA/micaInput.txt", overwrite = FALSE) + +## Check the format of MICA input +mica_input <- read.delim(system.file("extdata/demo_pbmc14k/PBMC14k/MICA/micaInput.txt", package = "scMINER"), header = T, sep = "\t", row.names = 1) +mica_input[1:5,1:5] +``` + +scMINER also supports **`.h5ad`** as the input file format which is getting more popular in scRNA-seq data storage and sharing. It can by generated by: +```{r generate-mica-input-h5ad, eval=FALSE} +## Generate MICA input in h5ad format +generateMICAinput(input_eset = pbmc14k_log2cpm.eset, output_file = "/your-path/PBMC14k/MICA/micaInput.h5ad", overwrite = FALSE) +``` + +In addition to generating the standard MICA input file, `generateMICAinput()` also returns the recommended commands of running MICA. You can copy the commands, modify accordingly and run. + +## Run MICA + +MICA features two different modes named by their different dimension reduction techniques: + +- **Multi-Dimensional Scaling (MDS)** mode: this mode is more accurate and robust for small datasets (less than 5,000 cells, be default) due to its global dimension reduction nature; +- **Graph Embedding (GE)** mode: this mode works better with large datasets (more than 5,000 cells, by default) using a graph embedding approach to explore distant neighbor cells. + +To run **`MDS`** model: +```{r, run-mica-mds, engine = 'bash', eval=FALSE} +mica mds -i /your-path/PBMC14k/MICA/micaInput.txt -o /your-path/PBMC14k/MICA/MDS -nck 5 6 7 8 9 10 +``` + +The MDS model uses **`K-Means`** by default for clustering. The argument `-nck` above specifies the number of cluster for K-Means. + + +In this case, since there are 13,605 cells, we will use the **`GE`** mode for the clustering: +```{r, run-mica-ge, engine = 'bash', eval=FALSE} +mica ge -i /your-path/PBMC14k/MICA/micaInput.txt -o /your-path/PBMC14k/MICA/GE -minr 0.1 -maxr 9.0 -ss 0.05 -nw 40 +``` + +The GE mode uses **`Louvain`** for clustering. The command above will generate the clustering results of multiple resolutions, from 0.1 to 9.0, with a step size of 0.05. + + +## Add MICA output to SparseEset object + +MICA generates several files and save all of them in the output directory specified by the user with `-o` argument. The core, and only, output file we need for subsequent analysis is the clustering label file named in the format of `ProjectName_clustering_VisualizeMethod_euclidean_NumberOfDimensions_Resolution.txt`. In this case, since we used a range of resolutions, there are several clustering label files generated, one for each resolution. Based on the knowledge about PBMC14k dataset, we compared the results of different resolutions and picked `clustering_UMAP_euclidean_20_2.05.txt` for subsequent analysis. + +```{r mica-output-core} +## Read the selected MICA output file +micaOutput <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), header = TRUE, sep = "\t", quote = "", stringsAsFactors = F) +head(micaOutput) +``` + +As shown above, the clustering label file contains four columns: + +- **`ID`**: cell barcodes; +- **`X`**: coordinates of UMAP_1 or tSNE_1; +- **`Y`**: coordinates of UMAP_2 or tSNE_2; +- **`label`**: labels of predicted clusters. + +The clustering result can be easily easily added to the SparseEset object by `addMICAoutput()`: +```{r mica-output-vis} +## Add MICA output into SparseEset object +pbmc14k_clustered.eset <- addMICAoutput(input_eset = pbmc14k_log2cpm.eset, + mica_output_file = system.file("extdata/demo_pbmc14k/PBMC14k/MICA/clustering_UMAP_euclidean_20_2.05.txt", package = "scMINER"), + visual_method = "umap") # use "tsne" if t-SNE was used in MICA +head(pData(pbmc14k_clustered.eset)) +``` + +It's optional but recommend to save the SparseEset object with clustering resluts added: +```{r save-eset-clustered, eval=FALSE} +saveRDS(pbmc14k_clustered.eset, file = "/your-path/PBMC14k/DATA/pbmc14k_clustered.eset") +``` + + +## Visulize MICA output + +scMINER provides a function, `MICAplot()` to easily visualize the clustering results on a 2D plot, UMAP or tSNE. And it can be colored by multiple variables, including **cluster label**, **sample source**, **nUMI**, **nFeature**, **pctMito** and more. + +To visualize the clustering results: + +```{r visualize-mica-output-cluster, fig.align='center', fig.width=6, fig.height=5} +MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "clusterID", X = "UMAP_1", Y = "UMAP_2", point.size = 0.1, fontsize.cluster_label = 6) +``` + +To visualize the true labels of cell types: + +```{r visualize-mica-output-truelabel, fig.align='center', fig.width=6.5, fig.height=5} +MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "trueLabel", X = "UMAP_1", Y = "UMAP_2", point.size = 0.1, fontsize.cluster_label = 4) +``` + +To visualize the **`nUMI`** on UMAP/t-SNE plot: + +```{r visualize-mica-output-numi, fig.align='center', fig.width=6.5, fig.height=5} +MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "nUMI", do.logTransform = TRUE, point.size = 0.1) +``` + +You can also visualize the **`nFeature`** **`pctMito`** and **`pctSpikeIn`**: + +```{r visualize-mica-output-nfeature, fig.align='center', fig.width=6.5, fig.height=5} +MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "nFeature", do.logTransform = TRUE, point.size = 0.1) +MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "pctMito", do.logTransform = FALSE, point.size = 0.1) +MICAplot(input_eset = pbmc14k_clustered.eset, color_by = "pctSpikeIn", do.logTransform = FALSE, point.size = 0.1) +``` + + +# Cell type annotation + +Currently, there are two types of strategies to annotate the clusters: **supervised** and **unsupervised**. The **supervised** methods use a list of known markers of potential cell types curated from some existing studies of the same/similar contexts. While in contrast, the **unsupervised** methods are usually based on the differentially expressed genes. scMINER provides several useful functions to support both types of strategies. + +## Supervised cell type annotation + +In this showcase, we know the 7 cell types involved in the PBMC14k dataset, and curated a marker list from some existing PBMCs studies. + +### Using signature scores + +Given a marker list of candidate cell types, scMINER can estimate a **signature score**, which is mathematically `the weighted mean of the expression of marker genes involved`, for each candidate cell type across all cell cluster. To do so, you will need to generate a signature table with three columns: + +- `signature_name`: name of cell types/signatures; +- `signature_feature`: markers genes/features of corresponding cell type/signature; +- `weight`: weight of corresponding maker/feature in corresponding cell type/signature. It ranges from -1 to 1, so both positive and negative markers are supported. + +```{r signature-table} +## Signature table of PBMC14k dataset +signature_table <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k_signatureTable.txt", package = "scMINER"), header = TRUE, sep = "\t", quote = "", stringsAsFactors = FALSE) +head(signature_table) +``` + +With this signature table, `draw_bubbleplot()` can estimate the signature scores and visualize them using bubble plot: +```{r signature-bubble, fig.width=7, fig.height=5, fig.align='center'} +## Bubble plot of signature scores across clusters +draw_bubbleplot(input_eset = pbmc14k_clustered.eset, signature_table = signature_table, group_by = "clusterID") +``` + +In the bubble plot above, the color of the bubbles is proportional to the mean of signature scores, and the size of the bubbles is proportional to the percentage of cells with higher signature score than mean. + +### Using individual marker genes + +scMINER also provides a variety of functions to visualize the selected features: + +```{r selected-markers} +## For the demonstration purposes, we picked two well known markers for each of the 7 known cell types, plus "CD3D" and "CD4". +genes_of_interest <-c("CD14", "LYZ", "GZMB", "NKG7", "CD19", "MS4A1", "CD8A", "CD8B", "SELL", "CCR7", "IL2RA", "FOXP3", "IL7R", "S100A4", "CD3D", "CD4") +``` + +#### feature visualization: violin plot +```{r featurePlot-vln, fig.width=7, fig.height=6, fig.align='center'} +## Violin plot of marker genes across clusters +feature_vlnplot(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, group_by = "clusterID", ncol = 4) +``` + +#### feature visualization: box plot +```{r featurePlot-box, fig.width=7, fig.height=6, fig.align='center'} +## Box plot of marker genes across clusters +feature_boxplot(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, group_by = "clusterID", ncol = 4) +``` + +#### feature visualization: scatter plot +```{r featurePlot-scatter, fig.width=7, fig.height=7, fig.align='center'} +## UMAP scatter plot of marker genes +feature_scatterplot(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, ncol = 4, location_x = "UMAP_1", location_y = "UMAP_2", point.size = 0.5, legend.key_height = 0.3, legend.key_width = 0.2, fontsize.legend_title = 8, fontsize.legend_text = 6, fontsize.axis_title = 8, legend.position = "none") +``` + +#### feature visualization: bubble plot +```{r featurePlot-bubble, fig.width=7,fig.height=5, fig.align='center'} +## Bubble plot of marker genes across clusters +feature_bubbleplot(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, group_by = "clusterID", xlabel.angle = 45) +``` + +#### feature visualization: heatmap +```{r featurePlot-heatmap, fig.width=7, fig.height=5,fig.align='center'} +## Heatmap of marker genes across clusters +feature_heatmap(input_eset = pbmc14k_clustered.eset, features = genes_of_interest, group_by = "clusterID", scale_method = "none", annotation_columns = c("trueLabel")) +``` + + +## Unsupervised cell type annotation + +scMINER provides a function, `getDE()`, to perform the differential expression analysis and identify the markers of each cluster. The `getDE()` function supports three different methods to perform the differential expression analysis, `limma`, `wilcoxon` and `t.test`. And it allows the users to define the groups to compare in a highly flexible way: + +```{r differential-expression-analysis-1} +## 1. To perform differential expression analysis in a 1-vs-rest manner for all groups +de_res1 <- getDE(input_eset = pbmc14k_clustered.eset, group_by = "clusterID", use_method = "limma") +head(de_res1) +``` + +Here is an brief introduction to the results of `getDE()`: + +- **feature**: feature name; +- **g1_tag**: a vector of clusters or subgroups involved in g1, the fore-ground group; +- **g0_tag**: a vector of clusters or subgroups involved in g0, the back-ground group; +- **g1_avg**: mean of gene expression of cells in g1; +- **g0_tag**: mean of gene expression of cells in g0; +- **g1_pct**: percentage of cells expressing the corresponding genes in group 1; +- **g0_pct**: percentage of cells expressing the corresponding genes in group 0; +- **log2FC**: log2Fold change of gene expression between g1 and g0; +- **Pval**: P values of g1-g0 comparison; +- **FDR**: FDR of g1-g0 comparison; +- **Zscore**: Z score of g1-g0 comparison, signed by `log2FC`; + +```{r differential-expression-analysis-2, eval=FALSE} +## 2. To perform differential expression analysis in a 1-vs-rest manner for one specific group +de_res2 <- getDE(input_eset = pbmc14k_clustered.eset, group_by = "clusterID", g1 = c("1"), use_method = "limma") + +## 3. To perform differential expression analysis in a rest-vs-1 manner for one specific group +de_res3 <- getDE(input_eset = pbmc14k_clustered.eset, group_by = "clusterID", g0 = c("1"), use_method = "limma") + +## 4. To perform differential expression analysis in a 1-vs-1 manner for any two groups +de_res4 <- getDE(input_eset = pbmc14k_clustered.eset, group_by = "clusterID", g1 = c("1", "4"), g0 = c("3","5"), use_method = "limma") +``` + +scMINER also provides a function, `getTopFeatures()`, to easily extract the group-specific markers from the differential expression result: +```{r get-top-markers} +cluster_markers <- getTopFeatures(input_table = de_res1, number = 10, group_by = "g1_tag", sort_by = "log2FC", sort_decreasing = TRUE) +dim(cluster_markers) +head(cluster_markers) +``` + +## Add cell type annotation to SparseEset object + +Based on the supervised and unsupervised methods, we have annotated the cell types for each cluster. To add the cell type annotation information into the SparseEset object: + +```{r add-cell-type-annotation-to-eset} +## Add cell type annotation to SparseEset object +pbmc14k_log2cpm_annotated.eset <- pbmc14k_clustered.eset +celltype_map <- c(`1`="CD4TN", `2`="CD4TCM", `3`="CD8TN", `4`="NK", `5`="B", `6`="Monocyte", `7`="CD4Treg") +pbmc14k_log2cpm_annotated.eset$cell_type <- as.character(celltype_map[pbmc14k_log2cpm_annotated.eset$clusterID]) +head(pData(pbmc14k_log2cpm_annotated.eset)) +``` + +The `draw_barplot()` function can visualize the cell composition of self-defined groups. We can use it to show the purity of MICA clusters: + +```{r group-barplot, fig.width=6, fig.height=6, fig.align='center'} +## Show the composition of true labels of cell types among the annotated cell types +draw_barplot(input_eset = pbmc14k_log2cpm_annotated.eset, group_by = "cell_type", color_by = "trueLabel_full", xlabel.angle = 45) +``` + +Don't forget to save the SparseEset object after the cell type annotation is added. + +```{r save-eset-annotated, eval=FALSE} +saveRDS(pbmc14k_log2cpm_annotated.eset, file = "/your-path/PBMC14k/DATA/pbmc14k_log2cpm_annotated.eset") +``` + +# Network inference + +scMINER constructs the cellular networks using [SJARACNe](https://academic.oup.com/bioinformatics/article/35/12/2165/5156064), a scalable software tool for gene network reverse engineering from big data. Similar to MICA, SJARACNe is also a component of scMINER framework, and can work seamlessly with scMINER R package and SparseEset object. + +## Generate SJARACNe inputs + +Network inference should be performed on groups of homogeneous cells. Usually it's in a **cluster-** or **cell type-specific** basis. In this showcase, we know the true labels of cell types and would like to use them for grouping: + +```{r generate-sjaracne-inputs-from-true-label, eval=FALSE} +## Columns with any illegal characters can not be used for groupping +generateSJARACNeInput(input_eset = pbmc14k_log2cpm_annotated.eset, group_name = "trueLabel", sjaracne_dir = "/your-path/PBMC14k/SJARACNe", species_type = "hg", driver_type = "TF_SIG", downSample_N = NULL) +``` + +***IMPORTANT NOTE***: Any **illegal characters in path** in group labels may cause issues in subsequent analysis. To avoid it, scMINER only accepts letters(A-Za-z), numbers(0-9), underscores('_') and periods('.'). + +For big datasets, `generateSJARACNeInput()` provides an argument, `downSample_N`, to allow you to down sample size of each group. The default value of `downSample_N` is 1,000, any group with >= 1,000 cells will be down-sampled to 1,000. + +`generateSJARACNeInput()` creates a folder for each of the groups availble in **`trueLabel`** column, and generates the standard input files inside of it: + +- a "**`.exp.txt`**" file: a tab-separated genes/transcripts/proteins by cells/samples expression matrix with the first two columns being ID and symbol. +- a "**`TF`**" folder containing a "**`.tf.txt`**" file: a list of significant gene/transcript/protein IDs of TF drivers. +- a "**`SIG`**" folder containing a "**`.sig.txt`**" file: a list of significant gene/transcript/protein IDs of SIG drivers. +- a bash script (**`runSJARACNe.sh`**) to run SJARACNe. Further modification is needed to run it. +- a json file (**`config_cwlexec.json`**) containing parameters to run SJARACNe. + +## Run SJARACNe + +As mentioned above, `generateSJARACNeInput()` generates a `runSJARACNe.sh` file in the folder of each group. You will need to make some modifications before you can run it: + +- **remove unneeded lines**: there are usually 4 lines in this file: the lines starting with **`sjaracne lsf`** are the command lines to run on IBM LSF cluster, while those starting with **`sjaracne local`** are the command lines running on a local machine (Linux/OSX). Please select the lines based on your situation and remove the others. +- **modify some key parameters**: + + - **-n**: number of bootstrap networks to generate. Default: 100. + - **-pc**: p value threshold to select edges in building consensus network. Default: `e-2` for single-cell data, `e-3` for meta-cell data, and `e-5` for bulk sample data. + +Please use **`sjaracne lsf -h`** or **`sjaracne local -h`** to check more details of arguments available in SJARACNe. + + +There is another file, `config_cwlexec.json`, available in the folder. It contains the information (e.g. memory request for each step of SJARACNe run) used for LSF job submission. This file is only needed for LSF runs and the default values works well in most cases. If you are running SJARACNe on a big dataset, you may need to request more memory from it. + +In this case, we use LSF to run the SJARACNe: +```{r run-sjaracne-lsf, engine = 'bash', eval=FALSE} +## Let's use B cell as an example +# For TF +sjaracne lsf -e /your-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /your-path/PBMC14k/SJARACNe/B/TF/B.835_1902.tf.txt -o /your-path/PBMC14k/SJARACNe/B/TF/bt100_pc001 -n 100 -pc 0.01 -j /your-path/PBMC14k/SJARACNe/B/config_cwlexec.json + +# For SIG +sjaracne lsf -e /your-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /your-path/PBMC14k/SJARACNe/B/SIG/B.4148_1902.sig.txt -o /your-path/PBMC14k/SJARACNe/B/SIG/bt100_pc001 -n 100 -pc 0.01 -j /work-path/PBMC14k/SJARACNe/B/config_cwlexec.json +``` + +We manually created a folder named "**bt100_pc001**" in both TF and SIG folders of each group, to save the networks generated under **100 bootstraps** (`-n 100`) and **0.01 consensus p value** (`-pc 0.01`). + +To run SJARACNe on a local machine: + +```{r run-sjaracne-local, engine = 'bash', eval=FALSE} +## Let's use B cell as an example +# For TF +sjaracne local -e /your-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /your-path/PBMC14k/SJARACNe/B/TF/B.835_1902.tf.txt -o /your-path/PBMC14k/SJARACNe/B/TF/bt100_pc001 -n 100 -pc 0.01 + +# For SIG +sjaracne local -e /your-path/PBMC14k/SJARACNe/B/B.8572_1902.exp.txt -g /your-path/PBMC14k/SJARACNe/B/SIG/B.4148_1902.sig.txt -o /your-path/PBMC14k/SJARACNe/B/SIG/bt100_pc001 -n 100 -pc 0.01 +``` + +## Quality control of networks + +The core output of SJARACNe is the network file named `consensus_network_ncol_.txt`. + +```{r check-network-file-format} +network_format <- read.table(system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/B/TF/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"), + header = T, sep = "\t", quote = "", stringsAsFactors = F) +head(network_format) +``` + +As shown above, it contains 9 columns: + +- **source**: ID of the source gene, can be the gene symbol; +- **target**: ID of the target gene, can be the gene symbol; +- **source.symbol**: symbol of the source gene; +- **target.symbol**: symbol of the target gene; +- **MI**: mutual information of source-gene pair; +- **pearson**: Pearson correlation coefficient, [-1,1] +- **pearson**: Spearman correlation coefficient, [-1,1] +- **slope**: slop of the regression line, returned by `stats.linregression()` +- **p.value**: p-value for a hypothesis test whose null hypothesis is that the slope is zero, using Wald Test with t-distribution of the test statistic + +To help assess the quality of SJARACNe networks, scMINER provides a function `drawNetworkQC()`: + +```{r assess-network-quality, eval=FALSE} +## Network QC on single network file +network_stats <- drawNetworkQC(network_file = system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/B/TF/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"), generate_html = FALSE) + +## Network QC on all network files under a directory +network_stats <- drawNetworkQC(sjaracne_dir = "/your-path/PBMC14K/SJARACNe", generate_html = FALSE) # Set `generate_html = TRUE` to generate html-format QC report for each network file +``` + +```{r show-network-qc-stats} +## The network QC statistics table is saved separately, for demonstration purposes. +network_stats <- readRDS(system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/network_stats.rds", package = "scMINER")) +head(network_stats) +``` + +`drawNetworkQC()` returns a summary table of key statistics of SJARACNe networks. Empirically, a network with **50-300** target size is good. + +# Activity-related analysis + +There are many signaling proteins (e.g., kinases), transcription factors, and other factors that are crucial drivers of phenotypes. These factors are not genetically or epigenetically altered, nor are they differentially expressed at the mRNA or protein level. Instead, they are altered by post-translational or other modifications, and are therefore termed ***hidden drivers***. The gene activity-based analysis is proved to be an effective way to expose these hidden driver. + +scMINER aims to expose the cell type-specific hidden drivers of various biological activities and provides a few useful functions to effortlessly calculate driver activities, identify the hidden drivers and visualize them in multiple ways. + + +## Calculate driver activities + +The **driver activity estimation** is one of the most important features of scMINER. **Mathematically**, the activity of one driver is a type of mean of the expressions of its targets. And **biologically**, the activity can be interpreted as a measure that describes how actively the driver functions, like the enzymes in digesting their subtracts, kinase in activating their downstream genes. Given the gene expression profiles and networks, scMINER can estimate the activities of some predefined drivers, including not only transcription factors (TFs) but also signaling genes (SIGs). + +scMINER provides two functions, `getActivity_individual()` and `getActivity_inBatch()`, to calculate the driver activities individually or in batch: + +`getActivity_individual()` is designed to calculate the driver activities for individual group. It takes the **SparseEset** and **network files** of group-to-calculate as the inputs: + +```{r calculate-activity-individually, eval=FALSE} +## Let's use B cell as an example +activity_B.eset <- getActivity_individual(input_eset = pbmc14k_log2cpm_annotated.eset[, pData(pbmc14k_log2cpm_annotated.eset)$trueLabel == "B"], + network_file.tf = system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/B/TF/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"), + network_file.sig = system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe/B/SIG/bt100_pc001/consensus_network_ncol_.txt", package = "scMINER"), + driver_type = "TF_SIG") +``` + +While `getActivity_inBatch()` is designed to calculate the driver activities in batch. Instead of networks files, it takes the directory that contains the network files of multiples groups and automatically retrieve them for activity estimation: + +```{r calculate-activity-in-batch} +## let's use B cell as an example +activity.eset <- getActivity_inBatch(input_eset = pbmc14k_log2cpm_annotated.eset, + sjaracne_dir = system.file("extdata/demo_pbmc14k/PBMC14k/SJARACNe", package = "scMINER"), + group_name = "trueLabel", driver_type = "TF_SIG", activity_method = "mean", do.z_normalization = TRUE) +``` + +Both functions return an eSet object of driver activities. The full `phenoData` and part of `featureData` of the activity eSet object is inherited from `SparseEset` object. + +Don't forget to save the activity eSet object. +```{r save-activity-eset, eval=FALSE} +## Save activity eSet object +saveRDS(activity.eset, file = "/your-path/PBMC14k/DATA/pbmc14k_activity.eset") +``` + + +## Differential activity analysis + +scMINER provides a function, `getDA()`, to perform the differential activity analysis and identify the cell type-specific drivers. Similar to `getDE()`, `getDA()` also allows you to compare different groups in a highly flexible way: + +```{r differential-activity-analysis-1} +## 1. To perform differential expression analysis in a 1-vs-rest manner for all groups +da_res1 <- getDA(input_eset = activity.eset, group_by = "cell_type", use_method = "t.test") +head(da_res1) +``` + +```{r differential-activity-analysis-2, eval=FALSE} +## 2. To perform differential expression analysis in a 1-vs-rest manner for one specific group +da_res2 <- getDA(input_eset = activity.eset, group_by = "cell_type", g1 = c("B"), use_method = "t.test") + +## 3. To perform differential expression analysis in a rest-vs-1 manner for one specific group +da_res3 <- getDA(input_eset = activity.eset, group_by = "cell_type", g0 = c("B"), use_method = "t.test") + +## 4. To perform differential expression analysis in a 1-vs-1 manner for any two groups +da_res4 <- getDA(input_eset = activity.eset, group_by = "cell_type", g1 = c("CD4Treg"), g0 = c("CD4TCM"), use_method = "t.test") +``` + +The `getTopFeatures()` function can also be used to effortlessly highlight the cell type-specific drivers from the differential activity analysis results: + +```{r get-top-drivers} +top_drivers <- getTopFeatures(input_table = da_res1, number = 10, group_by = "g1_tag", sort_by = "log2FC", sort_decreasing = TRUE) +dim(top_drivers) +head(top_drivers) +``` + + +
+**Session info** +```{r session-info} +sessioninfo::session_info() +``` +
+\ + + + +