GenML Pipeline

Overview

GenML Pipeline is a genomic machine learning pipeline designed to preprocess genomic data, tokenize sequences, and extract features using pretrained foundation models. The pipeline supports multiple encoders and tokenizers, and can be configured to process data with custom parameters.

Features

Load and preprocess genomic data, e.g. sequences
Tokenize sequences with loaded tokenizers
Extract features using different encoders/foundation models
Easy configuration through YAML files
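
The three stages listed above (load, tokenize, extract features) can be pictured with a toy sketch. Everything in it is an assumption made for illustration: the function names, the row layout, the overlapping k-mer tokenizer, and the token-count "features" are stand-ins, not the tokenizers or foundation models that GenML actually loads.

```python
from collections import Counter


def load_sequences(rows, pat_column, mut_column):
    """Map each patient ID to its cleaned, upper-cased sequence."""
    return {row[pat_column]: row[mut_column].strip().upper() for row in rows}


def tokenize_kmers(sequence, k=3):
    """Split a sequence into overlapping k-mers (toy stand-in for a real tokenizer)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def extract_features(tokens, vocabulary):
    """Count each vocabulary token (toy stand-in for a pretrained encoder)."""
    counts = Counter(tokens)
    return [counts[token] for token in vocabulary]


# Hypothetical input rows; the column names are illustrative only.
rows = [
    {"patient_id": "P1", "sequence": "acgtac"},
    {"patient_id": "P2", "sequence": "ttacgt"},
]

sequences = load_sequences(rows, "patient_id", "sequence")
tokens = {pid: tokenize_kmers(seq) for pid, seq in sequences.items()}
vocabulary = sorted({tok for toks in tokens.values() for tok in toks})
features = {pid: extract_features(toks, vocabulary) for pid, toks in tokens.items()}

for pid, vector in features.items():
    print(pid, vector)
```

In the real pipeline the tokenizer and encoder come from the configured foundation model, so only the overall data flow carries over from this sketch.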

Installation

Clone the repository:
```
git clone <repository_url>
cd genml
```
Create and activate a virtual environment:
```
python -m venv .venv
source .venv/bin/activate
```
Alternatively, install Anaconda and create a conda environment instead.
Install the required dependencies:
```
pip install -r requirements.txt
```
Tip: for DNABERT2, a separate environment is recommended. In that environment, install its dependencies:
```
pip install -r requirements_db2.txt
```
Then uninstall triton:
```
pip uninstall triton
```

Usage

Listing Available Encoders and Tokenizers
To list all available encoders and their corresponding tokenizers, run the following from the genml directory:
```
python -m src list-encoders
```
Configuration
Set the parameters in the genml/conf directory:
- config.yml configures the feature extraction process.
- feature_params/encoder.yml configures the foundation models you will use. Set download to True the first time you use a model.
Running the Pipeline
To run the pipeline with the specified configuration, run the following from the genml directory:
```
python -m src extract-feature
```

Configuration

pat_column: name of the column containing the patient ID
mut_column: name of the column containing the mutation sequence
encoder_name: name of the encoder to use; pick one from the output of the list-encoders command
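
As a rough illustration of how these three keys fit together, the sketch below checks a configuration dictionary before a run. It is not part of GenML: `validate_config`, the example column names, and the encoder name `example_encoder` are hypothetical, and the set of available encoders is assumed to be copied by hand from the list-encoders output.

```python
REQUIRED_KEYS = ("pat_column", "mut_column", "encoder_name")


def validate_config(config, available_encoders):
    """Check that the required keys exist and the encoder name is known."""
    missing = [key for key in REQUIRED_KEYS if key not in config]
    if missing:
        raise KeyError(f"missing config keys: {', '.join(missing)}")
    if config["encoder_name"] not in available_encoders:
        raise ValueError(
            f"unknown encoder {config['encoder_name']!r}; "
            f"expected one of {sorted(available_encoders)}"
        )
    return config


# Hypothetical values; in practice they would be read from config.yml.
config = {
    "pat_column": "patient_id",
    "mut_column": "mutation_seq",
    "encoder_name": "example_encoder",
}

validated = validate_config(config, available_encoders={"example_encoder"})
print(validated["encoder_name"])
```

Failing early on a misspelled encoder name is cheaper than discovering it after a model download has started.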

Contributing

Create a new branch first.

Including a new Encoder
a. Add a class NewEncoderStrategy(EncoderStrategy) in 'src/feature_extraction/encoder_strategy.py'
b. Register the new Encoder in 'src/feature_extraction/encoder_factory.py'
c. Add the settings for the new Encoder to 'conf/feature_params/encoder.yml'
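
Steps a and b follow the usual strategy-plus-factory pattern, which can be sketched as below. Only the names `EncoderStrategy` and `NewEncoderStrategy` come from this README. The `encode` method, the registry dictionary, the `register_encoder` and `create_encoder` helpers, and the toy length/GC-content features are assumptions made for illustration; the real base class and factory in the repository may look different.

```python
from abc import ABC, abstractmethod


class EncoderStrategy(ABC):
    """Stand-in for the project's base class; the real interface may differ."""

    @abstractmethod
    def encode(self, sequences):
        """Return one feature vector per input sequence."""


class NewEncoderStrategy(EncoderStrategy):
    """Toy encoder: sequence length and GC fraction instead of a real model."""

    def encode(self, sequences):
        vectors = []
        for seq in sequences:
            gc_count = sum(base in "GC" for base in seq.upper())
            gc_fraction = gc_count / len(seq) if seq else 0.0
            vectors.append([len(seq), gc_fraction])
        return vectors


# Hypothetical registry; the real factory may use another mechanism.
ENCODER_REGISTRY = {}


def register_encoder(name, strategy_cls):
    """Make a strategy class available under a configuration name."""
    if not issubclass(strategy_cls, EncoderStrategy):
        raise TypeError(f"{strategy_cls.__name__} must subclass EncoderStrategy")
    ENCODER_REGISTRY[name] = strategy_cls


def create_encoder(name):
    """Instantiate the strategy registered under the given name."""
    try:
        return ENCODER_REGISTRY[name]()
    except KeyError:
        raise ValueError(f"no encoder registered as {name!r}") from None


register_encoder("new_encoder", NewEncoderStrategy)
encoder = create_encoder("new_encoder")
vectors = encoder.encode(["ACGT", "GGCC"])
print(vectors)
```

Keeping the lookup by name in one place means the encoder_name value in the configuration is the only thing that has to change when switching encoders.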
After validating an Encoder
Add the mapping to conf/feature_params/mapping.yml
