This document describes the input files and parameters required by scGenAI for its different modes of operation: Train, Predict, and Finetune.
mode: Specifies the mode of operation. Choices: Train, Predict, Finetune. Default: Train.
These files are used in most configurations. Depending on the mode (Train, Predict, Finetune) and the other settings, only some of them are required:
train_file: Required when mode is Train or Finetune. The primary training data file, in .h5ad format.
val_file: Optional. The validation data file used during training.
train_ADTfile: Required when multiomics is set to Yes and mode is Train or Finetune. The training file for ADT (antibody-derived tag) data.
val_ADTfile: Optional. The validation file for ADT data.
cytofile: Required when context_method is set to genomic and mode is Train. The cytoband data file used for the genomic context. It should be a tab-separated file containing two columns, gene_symbol and cytobandID.
gmtfile: Required when context_method is set to biofounction and mode is Train. The GMT-format file for the biofunction context.
glstfile: Required when context_method is set to genelist and mode is Train or Finetune. The gene list file used for the genelist context. It should be a single-column file of gene names with no header.
predict_file: Required when mode is Predict. The input file for prediction tasks.
predict_ADTfile: Required when multiomics is set to Yes and mode is Predict. The ADT file for prediction tasks (if using multi-omics data).
outputfile: Required when mode is Predict. The output CSV file where the prediction results will be saved.
cache_dir: Required. Directory used to cache model templates during training and prediction. The first time a model template is used, its cache files are saved in this folder; later runs load them directly from there.
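The conditional requirements above can be summarized in a short sketch. This is not part of scGenAI: the function names `required_inputs` and `missing_inputs` are hypothetical, and the sketch assumes the configuration has already been loaded into a flat Python dict whose keys match the parameter names and whose values are plain strings such as "Yes" and "Train".

```python
def required_inputs(config):
    """Return the file-related keys that must be set, following the rules above.

    Hypothetical helper for illustration; assumes a flat dict of string values.
    """
    mode = config.get("mode", "Train")                    # documented default
    multiomics = config.get("multiomics", "No") == "Yes"  # documented default: No
    context = config.get("context_method", "random")      # documented default

    required = ["cache_dir"]  # always required
    if mode in ("Train", "Finetune"):
        required.append("train_file")
        if multiomics:
            required.append("train_ADTfile")
        if context == "genelist":
            required.append("glstfile")
    if mode == "Train":
        if context == "genomic":
            required.append("cytofile")
        elif context == "biofounction":  # spelled as in the option value
            required.append("gmtfile")
    if mode == "Predict":
        required += ["predict_file", "outputfile"]
        if multiomics:
            required.append("predict_ADTfile")
    return required


def missing_inputs(config):
    """List the required keys that are absent or empty in the config."""
    return [key for key in required_inputs(config) if not config.get(key)]
```

Calling `missing_inputs` on a loaded configuration before starting a run gives a quick list of entries that still need to be filled in.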
log_dir: Required when savelog is set to Yes. Directory to store logs. Default: examples/logs.
model_dir: Required when mode is Train or Finetune. Directory where models are saved during training; it is also the input folder in Finetune mode.
finetune_dir: Required when mode is Finetune. Directory used for saving fine-tuned models. Default: examples/finetune.
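As a rough illustration of the directory rules above, the following sketch works out which directories a given run needs and fills in the documented defaults. The helper names `required_dirs` and `resolve_dirs` are hypothetical and not part of scGenAI; the sketch assumes a flat dict of string values.

```python
# Defaults as documented above; model_dir has no documented default.
DEFAULT_DIRS = {"log_dir": "examples/logs", "finetune_dir": "examples/finetune"}


def required_dirs(config):
    """Return the directory keys needed for the given mode and savelog setting."""
    mode = config.get("mode", "Train")
    required = []
    if config.get("savelog", "No") == "Yes":
        required.append("log_dir")
    if mode in ("Train", "Finetune"):
        required.append("model_dir")
    if mode == "Finetune":
        required.append("finetune_dir")
    return required


def resolve_dirs(config):
    """Map each required directory key to a path, using defaults where documented."""
    mode = config.get("mode", "Train")
    resolved = {}
    for key in required_dirs(config):
        path = config.get(key, DEFAULT_DIRS.get(key))
        if path is None:
            raise ValueError(f"{key} must be set when mode is {mode}")
        resolved[key] = path
    return resolved
```

For example, a Finetune run with only `model_dir` set would resolve `finetune_dir` to its default, while a Train run without `model_dir` raises an error.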
min_cells: Optional. Minimum number of cells required for filtering. Default: 50.
target_feature: Required. The feature to predict (e.g., celltype). Default: celltype.
multiomics: Optional. Whether multi-omics data (e.g., RNA and ADT) is used. Choices: Yes, No. Default: No.
savelog: Optional. Whether to save logs to the log_dir. Choices: Yes, No. Default: No.
num_bins: Number of bins used for gene expression normalization. Default: 10.
model_backbone_name: Required when mode is Train. The backbone model template to use. Choices: gpt, bigbird, llama, scGenT. Default: llama.
model_backbone_size: Required when mode is Train. The size of the model. Choices: small, normal, large. Default: normal. Suggested size: small for llama and bigbird; normal for gpt and scGenT.
context_method: Method for generating context. Choices: random, genomic, biofounction, genelist. Default: random.
batch_size: Required. The batch size for training. Default: 8. We suggest starting with 1 to avoid GPU out-of-memory errors.
learning_rate: Optional. The learning rate for model training. Default: 1e-5.
num_epochs: Optional. Number of training epochs. Default: 30. We suggest 30 for training and 20 to 30 for Finetune mode.
weight_decay: Optional. Weight decay rate for the optimizer. Default: 0.01.
depth: Optional. Depth for the genomic/biofounction/genelist context methods. Default: 2.
seed: Optional. Random seed for reproducibility. Default: 1314521.
max_length: Required when mode is Train. Maximum sequence length for tokenization. Default: 1024. Suggested size: 5120 for llama; 1024 for gpt; 4096 for bigbird; 2048 for scGenT.
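The per-backbone suggestions for model_backbone_size and max_length, together with the advice to start with a batch size of 1, can be collected into a small lookup. This is only a convenience sketch: the `SUGGESTED` table and the `suggested_settings` function are hypothetical names, and the values are copied from the suggestions in this document.

```python
# Suggested values per backbone, taken from the parameter descriptions above.
SUGGESTED = {
    "llama":   {"model_backbone_size": "small",  "max_length": 5120},
    "bigbird": {"model_backbone_size": "small",  "max_length": 4096},
    "gpt":     {"model_backbone_size": "normal", "max_length": 1024},
    "scGenT":  {"model_backbone_size": "normal", "max_length": 2048},
}


def suggested_settings(backbone="llama"):
    """Return a starting set of training settings for the given backbone."""
    if backbone not in SUGGESTED:
        raise ValueError(f"unknown backbone: {backbone!r}")
    # batch_size 1 is the suggested starting point to avoid GPU out-of-memory errors.
    settings = {"model_backbone_name": backbone, "batch_size": 1}
    settings.update(SUGGESTED[backbone])
    return settings
```

The returned dict can be merged into a Train configuration and then adjusted, for example by raising batch_size once a run fits in GPU memory.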
All required parameters must be defined in the YAML configuration file. See the Configuration documentation for details.