Identification of Kidney Cell Types in scRNA-seq Data Using Machine Learning Algorithms

Introduction

According to the CDC, kidney disease is the ninth leading cause of death in the United States, affecting more than 1 in 7 adults. Advances in RNA sequencing technologies promise to provide answers, giving new insight into the complex mechanisms of kidney disease at single-cell resolution. This project compares the accuracy of machine learning algorithms in predicting kidney cell types from sc/snRNA-seq data.

Pipeline Objective

While nearly all other steps in the sc/snRNA-seq analysis pipeline are automated, as visualized by the flowchart below, the identification of cell type clusters is often performed manually. Manual annotation has limitations: it is time-consuming, requires expert-level knowledge of the human transcriptome, introduces subjectivity into otherwise data-driven analyses, creates non-standard labeling vocabularies, and has low reproducibility in the selection of biomarkers used to identify cell types.

By creating a pipeline to automatically identify kidney cells, we hope to demonstrate the effectiveness and accessibility of an automated approach and thereby address these concerns. We emphasize reproducibility and transparency in our methods, hoping to provide a model for automating the annotation process and a pipeline that standardizes and harnesses existing data for data-driven annotation of unknown cells.

Data Composition

To maximize the applicability of our model, we included the most diverse collection of samples we could obtain. We used cells from different single-cell and single-nucleus sequencing technologies, biopsy locations, ages, and sexes. A summary of our samples is shown in the table below.

The code we used to replicate the original authors' analyses, as well as links to our data sources, is available in the Dataset Replications folder of this repository.

Workflow

The workflow of our project is visualized below. After obtaining our data, we removed poorly annotated cells based on original author notes, UMAP visualization, and SVM outlier detection. Next, we merged and standardized the samples before removing batch effects using Seurat rPCA integration. Finally, we standardized the ontology across studies by plotting the correlation between original author annotations. We then used this processed data to evaluate the efficacy of various machine learning models by predicting cell types in a single study, using the other four studies as a naive reference.
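The evaluation scheme described above, where each study is held out in turn and the remaining studies serve as the reference, can be sketched in a few lines of Python. This is a minimal illustration and not the code in PerformanceTesting.py; the function name `leave_one_study_out` and the toy study names are hypothetical.

```python
from collections import defaultdict


def leave_one_study_out(study_labels):
    """Yield (held_out_study, train_idx, test_idx) once per study.

    study_labels holds one study identifier per cell, in cell order.
    """
    by_study = defaultdict(list)
    for idx, study in enumerate(study_labels):
        by_study[study].append(idx)

    for held_out in sorted(by_study):
        # Cells from the held-out study form the test set.
        test_idx = by_study[held_out]
        # Cells from every other study form the reference (training) set.
        train_idx = [
            i for study in sorted(by_study) if study != held_out
            for i in by_study[study]
        ]
        yield held_out, train_idx, test_idx


# Toy example with three hypothetical studies (the project uses five).
studies = ["A", "A", "B", "C", "C", "B"]
splits = list(leave_one_study_out(studies))
for held_out, train_idx, test_idx in splits:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

Splitting by study rather than by random cell means the classifier never sees cells from the test study during training, which is closer to how the model would be used on a new dataset.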

Performance

Our performance testing included five different models, listed below, in a rejection scheme that marked uncertain cells as unknown. We used 0.6 as our threshold for rejection; however, this threshold may be tuned for more specific applications.

Support Vector Machine
Random Forest
Multi-Layer Perceptron
K Neighbors
XGBoost
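
As a rough illustration of the rejection scheme, the sketch below labels a cell as unknown when its highest predicted class probability falls below the threshold. It assumes the 0.6 threshold is applied to the maximum class probability and that a probability equal to the threshold is accepted; both are assumptions, as are the function name `predict_with_rejection` and the example class names. In practice the probability rows would come from a fitted classifier.

```python
def predict_with_rejection(probabilities, classes, threshold=0.6, unknown="unknown"):
    """Return the most probable class per cell, or `unknown` below the threshold."""
    labels = []
    for row in probabilities:
        # Index of the class with the highest predicted probability.
        best = max(range(len(classes)), key=lambda i: row[i])
        labels.append(classes[best] if row[best] >= threshold else unknown)
    return labels


# Hypothetical class names and per-cell class probabilities.
classes = ["podocyte", "proximal tubule", "endothelial"]
probabilities = [
    [0.92, 0.05, 0.03],  # confident call
    [0.40, 0.35, 0.25],  # no class reaches the threshold
    [0.10, 0.65, 0.25],  # just above the threshold
]
labels = predict_with_rejection(probabilities, classes)
print(labels)
```

Raising the threshold trades more rejected cells for fewer confident mislabels, so the right value depends on how costly a wrong annotation is for the application.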

All algorithms performed strongly, as visualized in the heatmap below. The best performer was XGBoost, which achieved an average median F1 score of 0.98 across the five datasets, along with an average median rejection rate of known cells just slightly above 0. This highest-performing model is implemented in an accessible Colab workflow linked to this page.
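
For readers unfamiliar with these metrics, here is a minimal sketch of how a median F1 score and a rejection rate could be computed for one dataset. It assumes that a cell predicted as unknown counts as a miss for its true class, which may differ from how PerformanceTesting.py treats rejected cells; the helper names and toy labels are hypothetical.

```python
from statistics import median


def per_class_f1(y_true, y_pred):
    """Return {class: F1 score} for every class present in y_true."""
    scores = {}
    for cls in sorted(set(y_true)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        denom = 2 * tp + fp + fn
        scores[cls] = 2 * tp / denom if denom else 0.0
    return scores


def rejection_rate(y_pred, unknown="unknown"):
    """Fraction of cells that the rejection scheme labeled as unknown."""
    return sum(p == unknown for p in y_pred) / len(y_pred)


# Toy labels for a single held-out dataset.
y_true = ["podocyte", "podocyte", "endothelial", "endothelial", "proximal tubule"]
y_pred = ["podocyte", "unknown", "endothelial", "endothelial", "proximal tubule"]

f1 = per_class_f1(y_true, y_pred)
median_f1 = median(f1.values())
rejected = rejection_rate(y_pred)
print(f1, median_f1, rejected)
```

Taking the median over cell types keeps a single rare, poorly predicted type from dominating the summary; averaging the per-dataset medians then gives one figure across the held-out studies.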

To Reproduce Our Results

Start by downloading our dataset, MergedObject.RDS, from Zenodo.

Place the dataset in a directory named data inside the repository root, so that it is reachable at data/ from the directory containing the Snakefile. For example, if the root directory is named home, the dataset should be at home/data/MergedObject.RDS.
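
If you want to confirm the layout before launching the pipeline, a small check like the following can help. It is an optional convenience sketch and not part of the repository; the function name `dataset_in_place` is hypothetical, and it only tests that the file exists at the expected relative path.

```python
from pathlib import Path


def dataset_in_place(repo_root="."):
    """Return True if data/MergedObject.RDS exists under the Snakefile directory."""
    return (Path(repo_root) / "data" / "MergedObject.RDS").is_file()


# Run from the directory containing the Snakefile.
if dataset_in_place():
    print("Dataset found at data/MergedObject.RDS")
else:
    print("Dataset missing: expected data/MergedObject.RDS")
```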

Next, make sure that your system has singularity and snakemake installed, and load them into your environment. For our Linux server, we type the commands:

module load singularity 
conda activate snakemake

Next, change your directory to the one containing the Snakefile.

Finally, use this command to run the script:

snakemake --use-singularity --cores <n cores>

Here's what's happening under the hood:

(File 1: IntegrateAll.R) Integration of all five datasets for quality control

Input: Pre-batch correction object of ~62,000 pre-QC cells
Output: Batch corrected object of ~62,000 pre-QC cells

(File 2: QualityControl.py) SVM quality control

Input: Output from IntegrateAll.R
Output: A CSV of binary cell quality control designations

(File 3: IntegrateSome.R) Train-test split

Input: Pre-batch correction object of ~62,000 pre-QC cells and CSV of binary cell quality control designations produced by QualityControl.py
Output: Five objects of ~57,000 batch corrected, post-QC cells, marked with training and testing split

(File 4: PerformanceTesting.py) Machine learning performance testing

Input: Five train-test split objects produced by IntegrateSome.R
Output: Figures and classification report (unknown percentages, F1 scores, and confusion matrices)

To Query Our Merged Object

First, create a Seurat object for your data, and upload this object to a Google Drive.

Second, open our Colab workflow with the link below and follow the included instructions to produce an annotated Seurat object saved to your Google Drive.

Questions and Issues

If you have any questions or run into issues, please leave them in the issues tab or contact us by email.

Maintainers: Stephen Blough [email protected] & Adam Tisch [email protected] & Fadhl Alakwaa [email protected]

Citations

Included Datasets:

CDC Statistics:

Centers for Disease Control and Prevention. Chronic Kidney Disease in the United States, 2021. Atlanta, GA: US Department of Health and Human Services, Centers for Disease Control and Prevention; 2021.

Project Inspiration:

Abdelaal, T. et al. A comparison of automatic cell identification methods for single-cell RNA sequencing data. Genome Biol 20, 194 (2019).

FADHLyemen/IKCTML
