DeepTree gene selection algorithm

This tutorial shows how to use DeepTree algorithm to select feature genes from any gene list (e.g. all genes, or highly variable genes)

Citation

These scripts were developed for He et al. 2020 Nature

Introduction

Highly variable genes are not just about genes supporting cell types and cell states, they also contain sporadic genes that contribute to the noise of the transcriptome. By using co-expression information extracted from hierarchical clustering, we are able to remove some of these noises to make the modular gene expression structure clearer in downstream analyses.

If you de-noise the gene list per each batch and then merge the gene list for combined analysis, your batch effect will also be mitigated. Mitigating batch effect may save weak cell-type signatures compared to hard batch-effect correction methods that may mask some useful features.

This tutorial is only for MATLAB users. The python version is highly recommended because it is compatible with Scanpy. Check out here

Installing

Download the whole folder, and then point MATLAB to this folder as one of customer scripts. This is the code for Mac Users:

addpath(genpath('/Users/path_to_your_downloaded_and_extracted_codes'));

PC and Linux users need to pay attention to the path especially forward slash versus backward slash

Getting Started

We will use Seurat package's sample data here with Matlab 2018a on Mac.

Import Data

These commands are for Unix terminal

Extracting the file on Unix terminal (or double click on MatLab Finder)

tar -xzf pbmc3k_filtered_gene_bc_matrices.tar.gz

The data will be in a folder called "filtered_gene_bc_matrices/hg19". Go to that folder.

Below are commands for Matlab. Please make sure your current directory is correct (see picture below)

Then import matrixmarket data to DataMatrix, a type similar to CellDataSet used by Monocle and SeuratObject in Seurat, on R platform.

PMBCdm=read10XCount('matrix.mtx','genes.tsv','barcodes.tsv');

or, by default if you used CellRanger to generate matrix.mtx, genes.tsv and barcodes.tsv in the current folder

PMBCdm=read10XCount();

Normalize and Filter Data

Log normalize the data

PMBC_perc=log2(PMBCdm./sum(PMBCdm)*10000+1);

Check Data Quality

test=Gene10XCount(PMBCdm,1);

The plot shows the numbers of genes with 1 count/2 count/3count etc. The height of each bar represents the total number of genes detected in each cell.

Filter out low-quality cells (less than 500 genes detected) and low-quality genes (detected in less than 0.5% cells)

PMBC_perc_filtered=PMBC_perc(sum((PMBC_perc>0),2)>sum(sum(test)>500,2)/200,sum(test)>500);

Calculate high-dispersion genes

Dispersion=var(PMBC_perc_filtered,0,2)./mean(PMBC_perc_filtered,2);
[P I]=sort(Dispersion,'descend');

For the downstream analyses, following the codes shown above, a straightforward and naive method is provided here, which directly clusters the cells and comments on gene signature and dimensionality without bioinformatic tricks. More sophisticated methods are given below (please continue reading).

*If you want to remove mitochondria genes, you could use strnotcontain('_mt',XXXX) to remove all the genes whose names start with "mt".

Extracting cell type-specific genes for clustering analysis

Mainstream methods like Seurat calculates "highly variable genes" based on dispersion (variance/mean of log-transformed abundance estimate), which is only able to enrich cell type-specific genes and deplete cell-cycle associated and house-keeping genes, but still include a large number of sporadic noises. One key difference between real cell markers and noise is that real markers usually have their followers due to tight regulation of the gene network. So they sit in "modules" of co-regulated genes that are highly correlated with each other. This type of strong co-linearity can be captured as "deepest branches" of a phylogenetic tree of genes. Below I'm showing you how this Deep-tree analysis (python version can be found here) identifies markers. By doing this, you don't need to perform a PCA for cell clustering.

Here I use the top 4000 high-dispersion genes as an initial input.

PMBCDeepTree=DeepTreeCluster(PMBC_perc_filtered(I(1:4000),:),0.8,2)

This image shows a two-way clustergram of all the 4000 genes, most of which are sporadic noises without coherent patterns, seen as green areas with random red dots. The co-expressed genes, however, form deep branches in the tree, highlighted with distinguishable colors in the dendrogram. These 149 genes were extracted and used for a second clustering shown below.

which can be annotated interactively following Matlab Clustergram Manual to make it look like the following

In addition to the 8 clusters Seurat identified, small clusters (of as few as 3 cells) can also be identified, summing up to 11 cell types. In fact, high-dispersion cut favors identification of rare cell types but certain clustering methods may ignore small clusters of cells. Here we see that even for small clusters they may still be well supported by multiple marker genes that are co-expressed.

Promisingly, these 149 marker genes contain 8 of the 9 cherry-picked marker genes in Seurat Tutorial, confirming that real markers should be only a fraction of high-dispersion genes. CD3E is a marker for the most dominant cell type that was not captured by the "deep-tree" method but can still be retrived by looking for significantly different genes between defined clusters.

Checking gene expressions

Here I use the 9 markers shown in Seurat Tutorial as an example to plot a 2-D heatmap.

SeuratMarkers={'ENSG00000156738_MS4A1','ENSG00000115523_GNLY','ENSG00000198851_CD3E','ENSG00000170458_CD14','ENSG00000179639_FCER1A','ENSG00000203747_FCGR3A','ENSG00000090382_LYZ','ENSG00000163736_PPBP','ENSG00000153563_CD8A'};
HeatMap(PMBC_perc_filtered(SeuratMarkers,get(PMBCDeepTree,'ColumnLabels')),'Standardize',2,'DisplayRange',2.5,'Colormap',colormap(jet),'Symmetric',true)

A HeatMap matching the cell order of the deep-tree clustergram is then generated.

If you only know the gene name (e.g. HOPX) but not the gene ID, you can look it up using this:

get(PMBCdm(strcontain('HOPX',get(PMBCdm,'RowNames')),:),'RowNames')

It will show the full name

t-SNE plot

If you want to use the clusters defined by hierarchical clustering, let's first drop down our manually assigned cell clusters from clustering analysis by right clicking the branches and saving them as Cluster1, Cluster2... Based on the coloring in the example image, the clusters are: Cluster 1 - Purple Cluster 2 - Orange Cluster 3 - Blue Cluster 4 - Dark Green Cluster 5 - Dark Blue Cluster 6 - Pink Cluster 7 - Cyan Cluster 8 - Red Cluster 9 - Green Cluster 10 - Yellow

ClusterID=repmat([0],size(get(PMBCDeepTree,'ColumnLabels')));
ClusterID(GetOrder(get(Cluster1,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=1;
ClusterID(GetOrder(get(Cluster2,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=2;
ClusterID(GetOrder(get(Cluster3,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=3;
ClusterID(GetOrder(get(Cluster4,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=4;
ClusterID(GetOrder(get(Cluster5,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=5;
ClusterID(GetOrder(get(Cluster6,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=6;
ClusterID(GetOrder(get(Cluster7,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=7;
ClusterID(GetOrder(get(Cluster8,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=8;
ClusterID(GetOrder(get(Cluster9,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=9;
ClusterID(GetOrder(get(Cluster10,'ColumnLabels'),get(PMBCDeepTree,'ColumnLabels')))=10;

Now calculate t-SNE coordinates and plot

mappedX=tsne(double(PMBC_perc_filtered(get(PMBCDeepTree,'RowLabels'),get(PMBCDeepTree,'ColumnLabels')))','Algorithm','barneshut','NumPCAComponents',0,'Distance','spearman','Perplexity',30);

You should right click mappedX in Workspace window to save it to a file. So that next time you can use mappedX=importdata(); to import it

figure
scatter(mappedX(:,1),mappedX(:,2),50,[0.8 0.8 0.8],'.')

The t-SNE map may vary from run to run due to different seeds used. But you should see that most cells fall in large "islands" while a few outliers are far apart.

Now paint the t-SNE plot with cluster colors.

hold on
scatter(mappedX(ClusterID==1,1),mappedX(ClusterID==1,2),50,[0.49 0.18 0.56],'.')
scatter(mappedX(ClusterID==2,1),mappedX(ClusterID==2,2),50,[0.85 0.33 0.1],'.')
scatter(mappedX(ClusterID==3,1),mappedX(ClusterID==3,2),50,[0 0.45 0.74],'.')
scatter(mappedX(ClusterID==4,1),mappedX(ClusterID==4,2),50,[0.47 0.67 0.19],'.')
scatter(mappedX(ClusterID==5,1),mappedX(ClusterID==5,2),50,[0.3 0.75 0.93],'.')
scatter(mappedX(ClusterID==6,1),mappedX(ClusterID==6,2),50,[1 0 1],'.')
scatter(mappedX(ClusterID==7,1),mappedX(ClusterID==7,2),50,[0 1 1],'.')
scatter(mappedX(ClusterID==8,1),mappedX(ClusterID==8,2),50,[1 0 0],'.')
scatter(mappedX(ClusterID==9,1),mappedX(ClusterID==9,2),50,[0 1 0],'.')
scatter(mappedX(ClusterID==10,1),mappedX(ClusterID==10,2),50,[1 1 0],'.')

Unsupervised k-means clustering

Of course, you can do an unsupervised clustering using the top 10 PCs I mentioned above. kmeans_colors=distinguishable_colors(10); figure;scatter(mappedX(:,1),mappedX(:,2),50,kmeans_colors(kmeans(double(PMBC_perc_filtered(get(PMBCDeepTree,'RowLabels'),get(PMBCDeepTree,'ColumnLabels')))',9),:),'.');

Check expression of a specific gene

figure
scatter(mappedX(:,1),mappedX(:,2),50,PMBC_perc_filtered('ENSG00000105369_CD79A',get(PMBCDeepTree,'ColumnLabels')),'.')
colormap(hot)

Compatibility with Scanpy on python

To export data matrix from Scanpy object, use this

adata.to_df().to_csv('matrix.csv')

Then import this matrix to MATLAB

PBMC=readtable('matrix.csv');

Then convert to DataMatrix format

PBMCdm=Table2DataMatrix(PBMC);

Then you can perform the normal workflow on MATLAB. Skip the normalization step if the matrix is already normalized.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DeepTree_Tutorial.md

DeepTree_Tutorial.md

DeepTree gene selection algorithm

Citation

Introduction

Installing

Getting Started

Import Data

Normalize and Filter Data

Extracting cell type-specific genes for clustering analysis

Checking gene expressions

t-SNE plot

Unsupervised k-means clustering

Check expression of a specific gene

Compatibility with Scanpy on python

Files

DeepTree_Tutorial.md

Latest commit

History

DeepTree_Tutorial.md

File metadata and controls

DeepTree gene selection algorithm

Citation

Introduction

Installing

Getting Started

Import Data

Normalize and Filter Data

Extracting cell type-specific genes for clustering analysis

Checking gene expressions

t-SNE plot

Unsupervised k-means clustering

Check expression of a specific gene

Compatibility with Scanpy on python