This document explains the structure of the output files and folders generated by scGenAI during training, fine-tuning, and prediction.
The Train and Finetune modes produce the same output folder and files, listed below:
├── best_model
│ ├── config.json
│ ├── expression_vocab.npy
│ ├── gene_vocab.npy
│ ├── label_encoder_classes.npy
│ ├── pad_token_id.npy
│ ├── scGenAI_model.pt
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ ├── train_setting.yaml
│ └── trained_genes.npy
├── last_model
│ ├── config.json
│ ├── expression_vocab.npy
│ ├── gene_vocab.npy
│ ├── label_encoder_classes.npy
│ ├── pad_token_id.npy
│ ├── scGenAI_model.pt
│ ├── special_tokens_map.json
│ ├── tokenizer.json
│ ├── tokenizer_config.json
│ ├── train_setting.yaml
│ └── trained_genes.npy
├── combined_epoch_results.csv
└── train_summary.pdf
combined_epoch_results.csv contains the combined results of all training/finetune epochs, including metrics such as loss and accuracy over time.
train_summary.pdf provides a summary of the training/finetune process, including visualizations such as loss curves and accuracy metrics over time.
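To inspect combined_epoch_results.csv programmatically, something like the following minimal sketch could be used. The column names of this file are not listed in this document, so the metric name is supplied by the caller; `best_epoch_row` and its parameters are hypothetical names chosen for illustration, not part of scGenAI.

```python
import csv


def best_epoch_row(csv_path, metric, lower_is_better=True):
    """Return the row of the epoch results CSV with the best value of `metric`.

    `metric` must be a column that actually exists in the file; the column
    names are an assumption of the caller, not something this sketch knows.
    """
    with open(csv_path, newline="") as handle:
        reader = csv.DictReader(handle)
        columns = reader.fieldnames or []
        if metric not in columns:
            raise KeyError(f"{metric!r} not found; available columns: {columns}")
        rows = list(reader)
    # Pick the smallest value for loss-like metrics, the largest otherwise.
    pick = min if lower_is_better else max
    return pick(rows, key=lambda row: float(row[metric]))
```

If the requested column is missing, the error message lists the columns that are present, which is a quick way to discover what the file actually records.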
The best_model folder contains the best-performing model checkpoint, together with its configuration and tokenization data, after training/finetune. The files include:
config.json: Contains the model's configuration details.
expression_vocab.npy: The expression vocabulary file used during model training/finetune.
gene_vocab.npy: The gene vocabulary file used during model training/finetune.
label_encoder_classes.npy: The label encoder classes (e.g., cell types) used for training/finetune.
pad_token_id.npy: The padding token ID used during tokenization.
scGenAI_model.pt: The PyTorch model file containing the trained weights.
special_tokens_map.json: A mapping of special tokens used by the tokenizer.
tokenizer.json: The tokenizer configuration used during training/finetune.
tokenizer_config.json: Detailed configuration of the tokenizer.
train_setting.yaml: The YAML configuration file used during training/finetune.
trained_genes.npy: The list of genes the model was trained on.
Similar to the best_model folder, the last_model folder contains the model's last checkpoint, saved after the final epoch. Its contents are the same as those of best_model, including the configuration, vocabularies, and tokenizer settings.
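As a quick sanity check after a Train or Finetune run, the layout described above can be verified with a short script. This is a minimal sketch that only uses the file names from the directory listing; `missing_outputs` is a hypothetical helper, not a scGenAI function, and the output directory path is whatever was set for the run.

```python
from pathlib import Path

# File names taken from the directory listing in this document.
EXPECTED_MODEL_FILES = [
    "config.json",
    "expression_vocab.npy",
    "gene_vocab.npy",
    "label_encoder_classes.npy",
    "pad_token_id.npy",
    "scGenAI_model.pt",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "train_setting.yaml",
    "trained_genes.npy",
]
EXPECTED_TOP_LEVEL_FILES = ["combined_epoch_results.csv", "train_summary.pdf"]


def missing_outputs(output_dir):
    """Return relative paths of expected outputs that are absent from output_dir."""
    root = Path(output_dir)
    expected = [
        f"{folder}/{name}"
        for folder in ("best_model", "last_model")
        for name in EXPECTED_MODEL_FILES
    ]
    expected += EXPECTED_TOP_LEVEL_FILES
    # Keep only the entries that do not exist as regular files.
    return [rel for rel in expected if not (root / rel).is_file()]
```

An empty list means every file from the listing is present; otherwise the returned paths show which outputs are missing.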
The prediction output is a CSV file, as defined in the configuration file. It contains the original metadata extracted from the input prediction file (the obs slot) along with three additional prediction columns: context_id, PredictedFeature, and prediction_score.
context_id: The context used to determine the prediction for the corresponding cell.
PredictedFeature: The final feature predicted for the cell by the trained model.
prediction_score: The confidence level of the prediction, with a maximum value of 1.
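One way to work with the prediction CSV is to count the predicted features and flag low-confidence cells for review. The sketch below assumes only the PredictedFeature and prediction_score columns described above; `summarize_predictions` is a hypothetical helper, and the default threshold of 0.5 is an arbitrary illustrative value rather than a recommendation from scGenAI.

```python
import csv
from collections import Counter


def summarize_predictions(csv_path, threshold=0.5):
    """Count predicted features and collect rows scoring below `threshold`.

    Returns (counts, low_confidence_rows), where counts maps each
    PredictedFeature value to its number of cells and low_confidence_rows
    is a list of row dicts with prediction_score < threshold.
    """
    with open(csv_path, newline="") as handle:
        rows = list(csv.DictReader(handle))
    counts = Counter(row["PredictedFeature"] for row in rows)
    low_confidence_rows = [
        row for row in rows if float(row["prediction_score"]) < threshold
    ]
    return counts, low_confidence_rows
```

Each returned row keeps all of its original columns, so the metadata carried over from the obs slot stays available for the flagged cells.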