scGenAI is a Python package for single-cell RNA sequencing (scRNA-seq) data prediction and analysis using large language models (LLMs). The package allows users to train, fine-tune, and make predictions on single-cell data using transformer-based models, including custom versions of LLaMA, GPT, BigBird, and scGenT. It provides multi-GPU support with PyTorch DistributedDataParallel (DDP).
To install the package, follow these steps:

- Optional: create a conda environment for scGenAI:

      conda create -n scGenAI python==3.10
      conda activate scGenAI

- Clone the repository:

      git clone https://github.com/VOR-Quantitative-Biology/scGenAI.git

- Navigate to the project directory:

      cd scGenAI

- Install the dependencies, then install scGenAI:

      pip install -r requirements.txt
      pip install .
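After installation, a quick sanity check can confirm that the package and its command-line entry point are visible in the active environment. The snippet below is a minimal sketch using only the Python standard library. The distribution name `scGenAI` is an assumption (the name registered by `pip install .` may differ), and `check_install` is a hypothetical helper, not part of scGenAI.

```python
import shutil
from importlib import metadata


def check_install(dist_name: str = "scGenAI", cli_name: str = "scgenai") -> dict:
    """Report the installed version and CLI location, or None if not found.

    dist_name is an assumed distribution name; adjust it if pip registered
    the package under a different name.
    """
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None
    # shutil.which returns the full path of the executable, or None.
    return {"version": version, "cli_path": shutil.which(cli_name)}


if __name__ == "__main__":
    print(check_install())
```

If either value is `None`, the environment used for `pip install .` is probably not the one currently active.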
Once installed, scGenAI can be used either (1) from a Python IDE or notebook or (2) through the command-line interface (CLI). In both cases, you train, predict, or fine-tune a model by calling scGenAI with a configuration YAML file that contains your settings.
As a quick start, we highly recommend beginning with the following tutorials, which use a small test dataset (40 cells) and config template files matched to each training or prediction purpose:
Testing Case | Tutorial | Training/Finetune Config Template | Prediction Config Template |
---|---|---|---|
Modeling healthy cell type | TrainData_Tutorial | config_Train | config_Prediction |
Modeling disease/cancer cell type | TrainData_Tutorial | config_Train or config_Train | config_Prediction |
Modeling cell genotype | TrainData_Tutorial | config_Train | config_Prediction |
Modeling using multiomics data | TrainData_Tutorial | config_Train | config_Prediction |
Fine-tune using pretrained model | FinetuneData_Tutorial | config_Train | config_Prediction |
In addition to the test data, we also provide full-size datasets and config files for each training or prediction purpose:
Study | Tutorial | Training/Finetune Config | Prediction Config |
---|---|---|---|
Mouse eye cell types (gene list context) | TrainData_Tutorial | config_Train | config_Prediction |
Myeloma myeloid cell types (genomic context) | TrainData_Tutorial | config_Train | config_Prediction |
AML cell types and cell status (biofunction context) | TrainData_Tutorial | config_Train | config_Prediction |
PBMC CITE-Seq | TrainData_Tutorial | config_Train | config_prediction |
Fine-tune of bone marrow cell types | FinetuneData_Tutorial | config_Finetune | config_Prediction |
The CLI supports the following commands:
- Train a model:

      scgenai train --config_file <path_to_config.yaml>

- Make predictions:

      scgenai predict --config_file <path_to_config.yaml>

- Fine-tune a pre-trained model:

      scgenai finetune --config_file <path_to_config.yaml>
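When several runs need to be scripted (for example, training followed by prediction), the CLI commands above can be driven from Python. The following is a minimal sketch that only wraps the documented `scgenai <command> --config_file <path>` invocation with `subprocess`. The helper names `build_command` and `run_scgenai` and the `dry_run` option are hypothetical and not part of scGenAI; the package's native Python API is not shown here.

```python
import shlex
import subprocess
from pathlib import Path

# The three subcommands documented for the scGenAI CLI.
SUBCOMMANDS = ("train", "predict", "finetune")


def build_command(mode: str, config_file: str) -> list[str]:
    """Build the argument list for `scgenai <mode> --config_file <path>`."""
    if mode not in SUBCOMMANDS:
        raise ValueError(f"mode must be one of {SUBCOMMANDS}, got {mode!r}")
    return ["scgenai", mode, "--config_file", str(config_file)]


def run_scgenai(mode: str, config_file: str, dry_run: bool = False) -> int:
    """Run the CLI and return its exit code (hypothetical wrapper).

    With dry_run=True the command is only printed, which is handy for
    checking a batch of runs before launching them.
    """
    cmd = build_command(mode, config_file)
    if dry_run:
        print(shlex.join(cmd))
        return 0
    if not Path(config_file).is_file():
        raise FileNotFoundError(f"config file not found: {config_file}")
    # check=True raises CalledProcessError if scgenai exits with an error.
    return subprocess.run(cmd, check=True).returncode
```

For example, `run_scgenai("train", "config_Train.yaml", dry_run=True)` prints the command that would be executed without starting a training run; the config file name here is a placeholder.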
Please see the full documentation for detailed usage of scGenAI.
- The use of scGenAI is governed by a custom license permitting non-commercial use only (see the LICENSE file). This package is freely available to individuals, universities, non-profit organizations, educational institutions, and government entities for non-commercial research or journalistic purposes.
- By cloning or downloading this repository, the user acknowledges having read and understood the terms outlined in the LICENSE file and agrees to abide by them.