SingleCellNet (SCN) is a tool to perform 'cell typing', or classification of single cell RNA-Seq data. Two nice features of SCN are that it works (1) across species and (2) across platforms. See the original paper for more details. This repository contains the Python version of SCN. The original code was written in R.
pip install pandas numpy sklearn scanpy sklearn statsmodels scipy matplotlib seaborn umap-learn
!pip install git+https://github.com/pcahan1/PySingleCellNet/
Below is a brief tutorial that shows you how to use SCN. In this example, we train a classifier based on mouse lung cells, we assess the performance of the classifier on held out data, then we apply the classifier to analyze indepdendent mouse lung data.
SCN has to be trainined on well-annotated reference data. In this example, we use data generated as part of the Tabula Muris (Senis) project. Specifically, we use the droplet lung data. We have compiled several other training data sets as listed below.
To illustrate how you might use SCN to perform cell tying, we apply it to another dataset from mouse lung:
Angelidis I, Simon LM, Fernandez IE, Strunz M et al. An atlas of the aging lung mapped by single cell transcriptomics and deep tissue proteomics. Nat Commun 2019 Feb 27;10(1):963. PMID: 30814501
Query expression data <- You will need to decompress this file prior to loading it.
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import scanpy as sc
import scipy as sp
import numpy as np
# Loompy is only needed if using loom files
# import loompy
import anndata
sc.settings.verbosity = 3
sc.logging.print_header()
import pySingleCellNet as pySCN
You should always start with the raw counts.
adTrain = sc.read("adLung_TabSen_100920.h5ad")
adTrain
# AnnData object with n_obs × n_vars = 14813 × 21969 ...
You should always start with the raw counts. If your expression is stored as a numpy array, you can convert it with the check_adX(adata) function.
qDatT = sc.read_mtx("GSE124872_raw_counts_single_cell.mtx")
qDat = qDatT.T
genes = pd.read_csv("genes.csv")
qDat.var_names = genes.x
qMeta = pd.read_csv("GSE124872_Angelidis_2018_metadata.csv")
qMeta.columns.values[0] = "cellid"
qMeta.index = qMeta["cellid"]
qDat.obs = qMeta.copy()
# If your expression data is stored as a numpy array, convert it
# type(qDat.X)
# <class 'numpy.ndarray'>
# pySCN.check_adX(qDat)
When you train the classifier, you should ensure that the query data and the reference data are limited to a common set of genes. In this case, we also limit the query data to those cells with at least 500 genes.
genesTrain = adTrain.var_names
genesQuery = qDat.var_names
cgenes = genesTrain.intersection(genesQuery)
len(cgenes)
# 16543
adTrain1 = adTrain[:,cgenes]
adQuery = qDat[:,cgenes].copy()
adQuery = adQuery[adQuery.obs["nGene"]>=500,:].copy()
adQuery
# AnnData object with n_obs × n_vars = 4240 × 16543
Ideally, we would assess performance on an indepdendent data. The dLevel parameter indicates the label used to group cells into categories or classes. Set this argument as appropriate for your training data.
expTrain, expVal = pySCN.splitCommonAnnData(adTrain1, ncells=200,dLevel="cell_ontology_class")
This can take several minutes.
[cgenesA, xpairs, tspRF] = pySCN.scn_train(expTrain, nTopGenes = 100, nRand = 100, nTrees = 1000 ,nTopGenePairs = 100, dLevel = "cell_ontology_class", stratify=True, limitToHVG=True)
Rows indicate class labels as defined in the dLevel argument. Columns represent cells, which are grouped by the class with the maximum score.
adVal = pySCN.scn_classify(expVal, cgenesA, xpairs, tspRF, nrand = 0)
ax = sc.pl.heatmap(adVal, adVal.var_names.values, groupby='SCN_class', cmap='viridis', dendrogram=False, swap_axes=True)
The assessment object holds other evaluation metrics including multiLogLoss, Kappa, and accuracy.
assessment = pySCN.assess_comm(expTrain, adVal, resolution = 0.005, nRand = 0, dLevelSID = "cell", classTrain = "cell_ontology_class", classQuery = "cell_ontology_class")
pySCN.plot_PRs(assessment)
plt.show()
The heatmap groups the cells according to the cell type with the maximum SCN classification score. Cells in the 'rand' SCN_class or category have a higher SCN score in the 'random' SCN_class than any cell type from the training data.
adQlung = pySCN.scn_classify(adQuery, cgenesA, xpairs, tspRF, nrand = 0)
ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='SCN_class', cmap='viridis', dendrogram=False, swap_axes=True)
Now group cells according to the annotation provided by the associated study.
ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='celltype', cmap='viridis', dendrogram=False, swap_axes=True)
We add the SCN scores as well as the softmax classification (i.e. a label corresponding to the cell type with the maximum SCN score -- this goes in adata.obs["SCN_class"]).
pySCN.add_classRes(adQuery, adQlung)
adM1Norm = adQuery.copy()
sc.pp.filter_genes(adM1Norm, min_cells=5)
sc.pp.normalize_per_cell(adM1Norm, counts_per_cell_after=1e4)
sc.pp.log1p(adM1Norm)
sc.pp.highly_variable_genes(adM1Norm, min_mean=0.0125, max_mean=4, min_disp=0.5)
adM1Norm.raw = adM1Norm
sc.pp.scale(adM1Norm, max_value=10)
sc.tl.pca(adM1Norm, n_comps=100)
sc.pl.pca_variance_ratio(adM1Norm, 100)
npcs = 20
sc.pp.neighbors(adM1Norm, n_neighbors=10, n_pcs=npcs)
sc.tl.leiden(adM1Norm,.1)
sc.tl.umap(adM1Norm, .5)
sc.pl.umap(adM1Norm, color=["leiden", "SCN_class"], alpha=.9, s=15, legend_loc='on data')
To plot a heatmap with the clustering information, you need to add this annotation to the annData object that is returned from scn_classify()
adQlung.obs['leiden'] = adM1Norm.obs['leiden'].copy()
adQlung.uns['leiden_colors'] = adM1Norm.uns['leiden_colors']
ax = sc.pl.heatmap(adQlung, adQlung.var_names.values, groupby='leiden', cmap='viridis', dendrogram=False, swap_axes=True)
sc.pl.umap(adM1Norm, color=["epithelial cell", "stromal cell", "B cell"], alpha=.9, s=15, legend_loc='on data', wspace=.3)
Cross-species classification depends on an ortholog table. Mouse-to-human ortholog table
To use this, you need to convert query gene symbols to the ortholog names of the species of the training data
oTab = pd.read_csv("oTab.csv")
[adQuery,adTrain] = pySCN.csRenameOrth(adQuery, adTrain, oTab)
Then you can proceed with the same training and analysis steps as above, starting with the call to splitCommonAnnData.