According to the CDC, kidney disease is the ninth leading cause of death in the United States, affecting more than 1 in 7 adults. However, advancements in RNA sequencing technologies promise to provide answers, giving revolutionary insight into the complex mechanisms of kidney disease at cell-level resolution. This project compares the accuracy of machine learning algorithms in predicting kidney cell types from sc/snRNA-seq data.
While nearly all other steps in the sc/snRNA-seq analysis pipeline are automated, as visualized by the flowchart below, the identification of cell type clusters is often performed manually. This approach has limitations: manual annotation is time-consuming, requires expert-level knowledge of the landscape of the human transcriptome, introduces subjectivity to otherwise data-driven analyses, creates non-standard labeling vocabularies, and has low reproducibility in the selection of biomarkers used to identify cell types.
By creating a pipeline to automatically identify kidney cells, we hope to demonstrate the effectiveness and accessibility of an automated approach and thereby address these concerns. We emphasize reproducibility and transparency in our methods, hoping to provide a model for automating the annotation process and a pipeline to standardize and harness existing data for data-driven annotation of unknown cells.
To maximize the applicability of our model, we included the most diverse collection of samples we could obtain. We used cells from different single-cell and single-nucleus sequencing technologies, biopsy locations, ages, and sexes. A summary of our samples is shown in the table below.
The code we used to replicate the original authors' analyses, as well as links to our data sources, is available in the Dataset Replications folder of this repository.
The workflow of our project is visualized below. After obtaining our data, we removed poorly annotated cells using the original authors' notes, UMAP visualization, and SVM outlier detection. Next, we merged and standardized the samples before removing batch effects using Seurat rPCA integration. Finally, we standardized the ontology across studies by plotting the correlation between the original authors' annotations. This processed data was then used to evaluate the efficacy of various machine learning models by predicting cell types in a single study using the other four as a naive reference.
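The evaluation scheme described above, in which each study is predicted from the other four, can be sketched in plain Python. This is a minimal illustration, not the code in PerformanceTesting.py: the function name `leave_one_study_out`, the `(cell_id, study)` pair layout, and the toy cell identifiers are all assumptions, and the study labels simply reuse the first-author names from the reference list.

```python
from collections import defaultdict


def leave_one_study_out(cells):
    """Yield (held_out_study, train_cells, test_cells) for each study.

    `cells` is a list of (cell_id, study) pairs. Both fields are
    hypothetical stand-ins for the metadata stored with each cell.
    """
    by_study = defaultdict(list)
    for cell_id, study in cells:
        by_study[study].append(cell_id)

    for held_out in sorted(by_study):
        # The held-out study is the query; all other studies form the reference.
        train = [c for s in sorted(by_study) if s != held_out for c in by_study[s]]
        test = list(by_study[held_out])
        yield held_out, train, test


# Toy metadata: five studies with two made-up cells each.
toy_cells = [
    (f"{study}_cell{i}", study)
    for study in ["Lake", "Liao", "Menon", "Wu", "Young"]
    for i in range(2)
]

splits = list(leave_one_study_out(toy_cells))
for held_out, train, test in splits:
    print(held_out, len(train), len(test))
```

Each iteration keeps the held-out study entirely out of the reference set, so the reported scores reflect performance on a study the model has never seen.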
Our performance testing included five different models, listed below, in a rejection scheme that marked uncertain cells as unknown. We used 0.6 as our threshold for rejection; however, this threshold may be tuned for more specific applications.
- Support Vector Machine
- Random Forest
- Multi-Layer Perceptron
- K Neighbors
- XGBoost
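A rejection scheme of this kind can be sketched as follows. This assumes the 0.6 threshold is applied to the highest predicted class probability, which the text above does not state explicitly; the function name `predict_with_rejection` and the toy probabilities are hypothetical.

```python
def predict_with_rejection(class_probs, threshold=0.6, unknown_label="unknown"):
    """Return the most probable label, or `unknown_label` if confidence is low.

    `class_probs` maps a cell type name to its predicted probability.
    Assumption: a cell is rejected when its top probability is below `threshold`.
    """
    best_label = max(class_probs, key=class_probs.get)
    if class_probs[best_label] < threshold:
        return unknown_label
    return best_label


# Toy probabilities for two cells (made-up values).
confident = {"podocyte": 0.91, "proximal tubule": 0.06, "endothelial": 0.03}
uncertain = {"podocyte": 0.45, "proximal tubule": 0.40, "endothelial": 0.15}

print(predict_with_rejection(confident))  # podocyte
print(predict_with_rejection(uncertain))  # unknown
```

Raising the threshold rejects more cells and should leave fewer mislabeled ones; lowering it does the opposite, which is the trade-off to tune for a given application.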
All algorithms showed strong performance, as visualized in the heatmap below. The best performer was XGBoost, which achieved an average median F1 score of 0.98 across the five datasets along with an average median rejection rate of known cells just slightly above 0. This highest performing model is implemented in an accessible Colab workflow linked to this page.
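For readers unfamiliar with these metrics, here is a minimal sketch of how a median F1 score and a rejection rate could be computed for one dataset. It assumes the median is taken over per-cell-type F1 scores and that a rejected cell counts as a missed prediction for its true type; the actual aggregation in PerformanceTesting.py may differ. The helper names and toy labels are hypothetical.

```python
from statistics import median


def per_class_f1(y_true, y_pred):
    """Compute F1 for every cell type that appears in `y_true`."""
    pairs = list(zip(y_true, y_pred))
    scores = {}
    for label in sorted(set(y_true)):
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def rejection_rate(y_pred, unknown_label="unknown"):
    """Fraction of cells that the classifier marked as unknown."""
    return sum(p == unknown_label for p in y_pred) / len(y_pred)


# Toy labels: one cell of type A is rejected, everything else is correct.
y_true = ["A", "A", "A", "B", "B", "C"]
y_pred = ["A", "A", "unknown", "B", "B", "C"]

scores = per_class_f1(y_true, y_pred)
median_f1 = median(scores.values())
rate = rejection_rate(y_pred)
print(scores, median_f1, rate)
```

Using the median rather than the mean keeps a single poorly predicted cell type from dominating the summary for a dataset.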
Start by downloading our dataset, MergedObject.RDS, from Zenodo.
Place the dataset in a directory named data in the repository root, so that it is reachable as data/ from the directory containing the Snakefile (e.g., if the root directory is named home, the dataset should be under home/data/).
Next, make sure that your system has Singularity and Snakemake installed, and load them into your environment. On our Linux server, we run:
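Before launching the pipeline, the expected layout can be checked with a short script. This is a convenience sketch, not part of the repository: the function name `check_layout` is hypothetical, and it only assumes what is stated above, namely that the Snakefile sits in the repository root and the dataset lives at data/MergedObject.RDS.

```python
from pathlib import Path


def check_layout(repo_root="."):
    """Return a list of problems with the expected directory layout."""
    root = Path(repo_root)
    problems = []
    if not (root / "Snakefile").is_file():
        problems.append("Snakefile not found in repository root")
    if not (root / "data" / "MergedObject.RDS").is_file():
        problems.append("data/MergedObject.RDS not found")
    return problems


# Run from the repository root; an empty list means the layout looks right.
print(check_layout("."))
```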
module load singularity
conda activate snakemake
Next, change your directory to the one containing the Snakefile.
Finally, use this command to run the script:
snakemake --use-singularity --cores <n cores>
Here's what's happening under the hood:
- (File 1: IntegrateAll.R) Integration of all five datasets for quality control
  - Input: Pre-batch correction object of ~62,000 pre-QC cells
  - Output: Batch corrected object of ~62,000 pre-QC cells
- (File 2: QualityControl.py) SVM quality control
  - Input: Output from IntegrateAll.R
  - Output: A CSV of binary cell quality control designations
- (File 3: IntegrateSome.R) Train-test split
  - Input: Pre-batch correction object of ~62,000 pre-QC cells and CSV of binary cell quality control designations produced by QualityControl.py
  - Output: Five objects of ~57,000 batch corrected, post-QC cells, marked with training and testing split
- (File 4: PerformanceTesting.py) Machine learning performance testing
  - Input: Five train-test split objects produced by IntegrateSome.R
  - Output: Figures and classification report (unknown percentages, F1 scores, and confusion matrices)
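The order in which these four stages must run follows from their inputs and outputs. The sketch below encodes only the stage-to-stage dependencies listed above and derives a valid run order with the standard library; it is an illustration of the dependency chain, not the contents of the Snakefile, and the variable names are hypothetical.

```python
from graphlib import TopologicalSorter

# Each stage maps to the set of stages whose output it consumes.
stage_dependencies = {
    "IntegrateAll.R": set(),
    "QualityControl.py": {"IntegrateAll.R"},
    "IntegrateSome.R": {"QualityControl.py"},
    "PerformanceTesting.py": {"IntegrateSome.R"},
}

# static_order() yields every stage after all of its dependencies.
run_order = list(TopologicalSorter(stage_dependencies).static_order())
print(" -> ".join(run_order))
```

Snakemake resolves the same kind of ordering automatically from the input and output files declared in each rule.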
First, create a Seurat object for your data, and upload it to Google Drive.
Second, open our Colab workflow with the link below and follow the included instructions to produce an annotated Seurat object saved to your Google Drive.
If you have any questions or run into issues, please leave them in the issues tab or contact us by email.
Maintainers: Stephen Blough [email protected] & Adam Tisch [email protected] & Fadhl Alakwaa [email protected]
- Lake, B.B. et al. A single-nucleus RNA-sequencing pipeline to decipher the molecular anatomy and pathophysiology of human kidneys. Nat Commun 10, 2832 (2019).
- Liao, J. et al. Single-cell RNA sequencing of human kidney. Sci Data 7, 4 (2020).
- Menon, R. et al. Single cell transcriptomics identifies focal segmental glomerulosclerosis remission endothelial biomarker. JCI Insight 5, e133267 (2020).
- Wu, H. et al. Single-cell transcriptomics of a human kidney allograft biopsy specimen defines a diverse inflammatory response. J Am Soc Nephrol 29, 2069–2080 (2018).
- Young, M. D. et al. Single-cell transcriptomes from human kidneys reveal the cellular identity of renal tumors. Science 361, 594–599 (2018).