This document explains how to manage the configuration files for scGenAI. You can configure the package by either:
- Editing the YAML configuration file manually, using the provided template.
- Generating the YAML configuration file programmatically via the `config.py` script using command-line arguments.
The configuration file is a YAML file that defines settings such as input files, model parameters, and training details. We strongly recommend starting from one of the prebuilt templates in config_templates.
As a quick start, pick the template that matches your analysis purpose from the table below and revise it as needed (a programmatic sketch for editing a template follows the table):
| Analysis Purpose | Configuration Template |
|---|---|
| Healthy cell type training | config_Train_genelist_template_llama.yaml |
| Disease/Cancer cell type training and prediction | config_Train_biofounction_context_template_llama.yaml or config_Train_genomic_context_template_llama.yaml |
| Prediction | config_Prediction_template.yaml |
| Training using multi-omics data | config_Train_MultiOmicsData_template.yaml |
| Fine-tuning using a pretrained model | config_Finetune_MultiOmicsData_template.yaml |
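If you prefer to adapt a template programmatically rather than by hand, a minimal sketch is shown below. It assumes PyYAML is installed and uses hypothetical local paths for the template copy and the output file; the keys to override depend on the template you chose.

```python
import yaml  # PyYAML

# Hypothetical local paths; replace with your own copy of the chosen template.
template_path = "config_templates/config_Train_genelist_template_llama.yaml"
output_path = "my_train_config.yaml"

# Load the prebuilt template and override only the fields that differ for this run.
with open(template_path) as f:
    config = yaml.safe_load(f)

config["train_file"] = "/path_to_your_trainfile/train.h5ad"
config["model_dir"] = "/path_to_your_output_model_dir/"
config["num_epochs"] = 30

# Write the edited configuration to a new YAML file.
with open(output_path, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
```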
An example of the config file format is shown below:
```yaml
# Mode
mode: "Train"

# Directories
cache_dir: "/path_to_your_cache_dir/cache"    ## cache dir to save the model template files
model_dir: "/path_to_your_output_model_dir/"  ## output model dir
log_dir: "/path_to_your_log_dir/logs/"

#### Input data files ####
train_file: "/path_to_your_trainfile/train.h5ad"
val_file: "/path_to_your_valfile/val.h5ad"    ## Optional

#### General setting ####
savelog: "Yes"
target_feature: "ct"  # Target name for prediction
num_bins: 10          # Bins for gene expression discretization

#### Model template and context method setting ####
model_backbone_name: "llama"  ### "llama", "gpt", "bigbird", "scgent"
model_backbone_size: "small"  ### "small", "normal", "large". Suggest "small" for llama
max_length: 5120
context_method: "random"

#### Other settings ####
min_cells: 50        # suggest 50
batch_size: 1        # set based on GPU memory; a higher batch_size speeds up training but uses much more GPU memory
learning_rate: 1e-5  # suggest 1e-5
num_epochs: 30       # suggest 30
```
- `mode`: Operating mode of the package. It can be `Train`, `Predict`, or `Finetune`.
- `train_file`: The path to the primary training data file in `.h5ad` format. This is required in `Train` and `Finetune` modes.
- `val_file`: (Optional) The validation data file used during training.
- `batch_size`: The batch size for training the model.
- `learning_rate`: The learning rate used during training.
- `num_epochs`: The number of epochs for training the model.
- `model_backbone_name`: The backbone model to be used. Choose from `gpt`, `bigbird`, `llama`, or `scGenT`.
- `model_backbone_size`: Size of the backbone model (`small`, `normal`, or `large`).
- `context_method`: Method for generating context for input sequences. Choices include `random`, `genomic`, `biofounction`, or `genelist`.
- `target_feature`: The feature to predict (e.g., `celltype`).
- `output_dir`: Directory to save the model outputs.
- `log_dir`: Directory to store log files.
- `cache_dir`: Directory to cache models during training and prediction.
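If you want to sanity-check a hand-edited config before launching a run, a small sketch using PyYAML is shown below. The checks are an illustrative assumption based on the descriptions above, not scGenAI's own validation, and the config path is hypothetical.

```python
import yaml  # PyYAML

config_path = "/path_to_your_config/train_config.yaml"  # hypothetical path

with open(config_path) as f:
    config = yaml.safe_load(f)

# Illustrative checks only; scGenAI performs its own validation at runtime.
assert config["mode"] in {"Train", "Predict", "Finetune"}, "unexpected mode"
if config["mode"] in {"Train", "Finetune"}:
    assert "train_file" in config, "train_file is required in Train/Finetune modes"

# Print a few key settings for a quick visual check.
print({k: config.get(k) for k in ("mode", "model_backbone_name", "context_method")})
```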
For more details on input files and parameters, please see the Input Documentation.
You can also generate the configuration file programmatically using the `config.py` script with command-line arguments. This approach lets you dynamically override default values based on your specific requirements.
Run the following command to generate a YAML configuration file:
```bash
python config.py --mode Train --train_file /rootdir/examples/data/train_data.h5ad \
    --val_file /rootdir/examples/data/val_data.h5ad --batch_size 2 \
    --learning_rate 1e-5 --num_epochs 30 --model_backbone_name llama \
    --model_backbone_size small --context_method genomic \
    --cytofile /rootdir/examples/data/cytoband_data.txt --target_feature celltype \
    --outputconfig /rootdir/train_config.yaml
```
This will generate a `train_config.yaml` file based on the provided parameters.
- `--mode`: Specifies the mode of operation (`Train`, `Predict`, `Finetune`).
- `--train_file`: Path to the training file.
- `--val_file`: Path to the validation file.
- `--evaluate_during_training`: Whether to use the validation file in training or prediction (`true`, `false`). Default is `true`. Set this to `false` if there is no validation file.
- `--batch_size`: Batch size for training.
- `--learning_rate`: Learning rate for training.
- `--num_epochs`: Number of training epochs.
- `--model_backbone_name`: Backbone model name (`gpt`, `bigbird`, `llama`, `scGenT`).
- `--model_backbone_size`: Size of the model (`small`, `normal`, `large`).
- `--context_method`: Context generation method (`random`, `genomic`, `biofounction`, `genelist`).
- `--cytofile`: Path to the cytoband data file (if using `genomic` context).
- `--target_feature`: Feature to predict (e.g., `celltype`).
- `--output_dir`: Directory to save outputs.
- `--log_dir`: Directory to save logs.
- `--outputconfig`: The path where the generated configuration YAML file will be saved.
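For reference, the general pattern of turning command-line arguments into a YAML file looks roughly like the sketch below. This is an illustrative simplification, not the actual implementation of `config.py`, and it covers only a few of the arguments listed above.

```python
import argparse
import yaml  # PyYAML

# Illustrative subset of the arguments; the real config.py accepts many more.
parser = argparse.ArgumentParser(description="Generate a YAML config (sketch, not the real config.py).")
parser.add_argument("--mode", choices=["Train", "Predict", "Finetune"], required=True)
parser.add_argument("--train_file")
parser.add_argument("--batch_size", type=int, default=1)
parser.add_argument("--learning_rate", type=float, default=1e-5)
parser.add_argument("--num_epochs", type=int, default=30)
parser.add_argument("--outputconfig", required=True)
args = parser.parse_args()

# Collect every provided argument except the output path into the config dictionary.
config = {k: v for k, v in vars(args).items() if k != "outputconfig" and v is not None}

with open(args.outputconfig, "w") as f:
    yaml.safe_dump(config, f, sort_keys=False)
print(f"Wrote configuration to {args.outputconfig}")
```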
For more details on input files and parameters, please see the Input Documentation.
Once the config file is built, you can run scGenAI. Please see the Run scGenAI documentation for details.