GenML Pipeline is a genomic machine learning pipeline designed to preprocess genomic data, tokenize sequences, and extract features using pretrained foundation models. The pipeline supports multiple encoders and tokenizers, and can be configured to process data with custom parameters.
- Load and preprocess genomic data, e.g., sequences
- Tokenize sequences with loaded tokenizers
- Extract features using different encoders/foundation models
- Easy configuration through YAML files
- Clone the repository:
  git clone <repository_url>
  cd genml
- Create and activate a virtual environment:
  python -m venv .venv
  source .venv/bin/activate
  Alternatively, install Anaconda and use it to create the environment.
- Install the required dependencies:
  pip install -r requirements.txt
  Tip: for DNABERT2, a separate environment is suggested. In that environment, run
  pip install -r requirements_db2.txt
  and then
  pip uninstall triton
- Listing Available Encoders and Tokenizers
  To list all available encoders and their corresponding tokenizers, run the following from the genml directory:
  python -m src list-encoders
- Configuration
  Go to genml/conf to set the parameter configuration.
  - config.yml configures the feature extraction process.
  - feature_params/encoder.yml configures the foundation models you will use; set download to True on the first run.
- Running the Pipeline
  To run the pipeline with the specified configuration, run the following from the genml directory:
  python -m src extract-feature
  - pat_column: the column containing the patient ID
  - mut_column: the column containing the mutation sequence
  - encoder_name: the encoder to use; pick a name from the output of list-encoders
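To make the three parameters above concrete, here is a minimal sketch of how column-based settings could select patient IDs and mutation sequences from a table. The dictionary layout, the column names `patient_id` and `mutated_seq`, the encoder name `example_encoder`, and the helper `select_columns` are all hypothetical; the real pipeline reads its settings from the YAML files in conf and may load data differently.

```python
import csv
import io

# Hypothetical settings mirroring the three parameters above.
# In the real pipeline these come from the YAML files under conf.
config = {
    "pat_column": "patient_id",
    "mut_column": "mutated_seq",
    "encoder_name": "example_encoder",  # placeholder, not a real encoder name
}

# Tiny in-memory table standing in for the input data file.
raw_table = """patient_id,mutated_seq,age
P001,ACGTACGT,54
P002,TTGACCGA,61
"""


def select_columns(table_text, cfg):
    """Return (patient id, sequence) pairs using the configured column names."""
    reader = csv.DictReader(io.StringIO(table_text))
    missing = {cfg["pat_column"], cfg["mut_column"]} - set(reader.fieldnames or [])
    if missing:
        # Fail early when a configured column is absent from the input.
        raise KeyError(f"columns not found in input: {sorted(missing)}")
    return [(row[cfg["pat_column"]], row[cfg["mut_column"]]) for row in reader]


pairs = select_columns(raw_table, config)
print(pairs)  # [('P001', 'ACGTACGT'), ('P002', 'TTGACCGA')]
```

Checking the column names before reading any rows means a typo in pat_column or mut_column surfaces immediately instead of partway through feature extraction.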
(Create a new branch first.)
- Including a new Encoder
  a. Add a class NewEncoderStrategy(EncoderStrategy) in 'src/feature_extraction/encoder_strategy.py'
  b. Register the new Encoder in 'src/feature_extraction/encoder_factory.py'
  c. Configure the new Encoder in 'conf/feature_params/encoder.yml'
- After validating an Encoder
  Add the mapping to conf/feature_params/mapping.yml
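Steps a and b can be pictured with a short strategy-plus-registry sketch. Only the class names `EncoderStrategy` and `NewEncoderStrategy` come from the steps above; the `encode` method, the `ENCODER_REGISTRY` dictionary, the `register_encoder` decorator, the `get_encoder` helper, and the toy base-frequency features are assumptions for illustration and are not the repository's actual interface.

```python
from abc import ABC, abstractmethod

# Hypothetical registry; the real factory in encoder_factory.py may differ.
ENCODER_REGISTRY = {}


def register_encoder(name):
    """Class decorator that records an encoder class under a name."""
    def decorator(cls):
        ENCODER_REGISTRY[name] = cls
        return cls
    return decorator


class EncoderStrategy(ABC):
    """Assumed base interface: turn one sequence into a feature vector."""

    @abstractmethod
    def encode(self, sequence):
        ...


@register_encoder("new_encoder")
class NewEncoderStrategy(EncoderStrategy):
    """Toy encoder: base frequencies stand in for a real model's embeddings."""

    def encode(self, sequence):
        length = max(len(sequence), 1)  # avoid division by zero on empty input
        return [sequence.count(base) / length for base in "ACGT"]


def get_encoder(name):
    """Look up a registered encoder by name and instantiate it."""
    try:
        return ENCODER_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"unknown encoder: {name!r}") from None


features = get_encoder("new_encoder").encode("ACGTAACC")
print(features)  # [0.375, 0.375, 0.125, 0.125]
```

A name-based lookup like this is one plausible way an encoder_name setting could be resolved to a class; a real encoder would load a pretrained model and its tokenizer inside the strategy instead of counting bases.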