GenML Pipeline

Overview

GenML Pipeline is a genomic machine learning pipeline designed to preprocess genomic data, tokenize sequences, and extract features using pretrained foundation models. The pipeline supports multiple encoders and tokenizers, and can be configured to process data with custom parameters.

Features

- Load and preprocess genomic data, e.g., sequences
- Tokenize sequences with the loaded tokenizers
- Extract features using different encoders/foundation models
- Easy configuration through YAML files
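
To make the first two steps concrete, here is a minimal, illustrative sketch of cleaning a DNA sequence and splitting it into overlapping k-mers. It is not the pipeline's own code: the pipeline uses the tokenizer that belongs to each pretrained model, and the function names `preprocess` and `kmer_tokenize` are hypothetical, chosen only for this example.

```python
import re


def preprocess(sequence: str) -> str:
    """Uppercase a DNA sequence and drop every character that is not A, C, G, T, or N."""
    return re.sub(r"[^ACGTN]", "", sequence.upper())


def kmer_tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers with a stride of 1."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence shorter than k yields no tokens.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


cleaned = preprocess("acg t-nga\n")
tokens = kmer_tokenize(cleaned, k=3)
print(cleaned)  # ACGTNGA
print(tokens)   # ['ACG', 'CGT', 'GTN', 'TNG', 'NGA']
```

Overlapping k-mers are only one possible tokenization scheme; the tokenizer actually applied depends on the encoder you select.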

Installation

Clone the repository:
```
git clone <repository_url>
cd genml
```
Create and activate a virtual environment:
```
python -m venv .venv
source .venv/bin/activate
```
Alternatively, install Anaconda and use it to create the environment.
Install the required dependencies:
```
pip install -r requirements.txt
```
Tip: for DNABERT2, a separate environment is suggested. In that environment, install the DNABERT2 requirements:
```
pip install -r requirements_db2.txt
```
and then uninstall triton:
```
pip uninstall triton
```

Usage

Listing Available Encoders and Tokenizers
To list all available encoders and their corresponding tokenizers, run the following from the genml directory:
```
python -m src list-encoders
```
Configuration
Set the parameters in genml/conf:
- config.yml configures the feature extraction process.
- feature_params/encoder.yml configures the foundation models you will use. Set download to True the first time you use a model.
Running the Pipeline
To run the pipeline with the specified configuration, run the following from the genml directory:
```
python -m src extract-feature
```
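
If you prefer to drive the two commands above from a Python script, the sketch below wraps them with `subprocess`. The subcommand names come from this README; the helper names `build_command` and `run_genml`, and the default `repo_dir="genml"`, are assumptions made for illustration.

```python
import subprocess
import sys
from pathlib import Path

# Subcommands documented in this README.
SUBCOMMANDS = ("list-encoders", "extract-feature")


def build_command(subcommand: str) -> list[str]:
    """Build the argument list for `python -m src <subcommand>`."""
    if subcommand not in SUBCOMMANDS:
        raise ValueError(f"unknown subcommand: {subcommand!r}")
    # sys.executable keeps the call inside the currently active environment.
    return [sys.executable, "-m", "src", subcommand]


def run_genml(subcommand: str, repo_dir: str = "genml") -> int:
    """Run a subcommand from the repository root and return its exit code."""
    repo = Path(repo_dir)
    if not repo.is_dir():
        raise FileNotFoundError(f"repository directory not found: {repo}")
    completed = subprocess.run(build_command(subcommand), cwd=repo, check=False)
    return completed.returncode


# Show the commands that would be executed, without running them.
for name in SUBCOMMANDS:
    print(" ".join(build_command(name)))
```

Calling `run_genml("list-encoders")` followed by `run_genml("extract-feature")` mirrors the manual workflow described above.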

Configuration

pat_column: name of the column containing the patient ID
mut_column: name of the column containing the mutation sequence
encoder_name: name of the encoder to use; it must match one of the names reported by list-encoders
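
The three parameters above can be sanity-checked before a run. The sketch below is a hypothetical helper, not part of the pipeline: `validate_config` and every example value (`patient_id`, `sequence`, `example_encoder`) are placeholders, and the config is represented as a plain dictionary rather than loaded from the YAML file.

```python
REQUIRED_KEYS = ("pat_column", "mut_column", "encoder_name")


def validate_config(config: dict, table_columns: list[str],
                    available_encoders: list[str]) -> list[str]:
    """Return a list of problems found in the config; an empty list means none."""
    problems = []
    for key in REQUIRED_KEYS:
        if not config.get(key):
            problems.append(f"missing or empty parameter: {key}")
    # Both column parameters must name a column of the input table.
    for key in ("pat_column", "mut_column"):
        column = config.get(key)
        if column and column not in table_columns:
            problems.append(f"{key}={column!r} is not a column of the input table")
    # The encoder must be one of the names reported by list-encoders.
    encoder = config.get("encoder_name")
    if encoder and encoder not in available_encoders:
        problems.append(f"encoder_name={encoder!r} is not among the listed encoders")
    return problems


# Placeholder values for illustration only.
config = {
    "pat_column": "patient_id",
    "mut_column": "sequence",
    "encoder_name": "example_encoder",
}
print(validate_config(config, ["patient_id", "sequence"], ["example_encoder"]))
```

An empty list means the parameters are consistent with the input table and the listed encoders.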

Contributing

(Create a new branch first.)

Including a new Encoder
a. Add a class NewEncoderStrategy(EncoderStrategy) in 'src/feature_extraction/encoder_strategy.py'.
b. Register the new encoder in 'src/feature_extraction/encoder_factory.py'.
c. Configure the new encoder in 'conf/feature_params/encoder.yml'.
After validating an Encoder
Add the mapping to conf/feature_params/mapping.yml
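
The steps for adding an encoder follow a strategy-plus-factory layout. The sketch below shows that pattern in a self-contained form. Only the class names `EncoderStrategy` and `NewEncoderStrategy` come from this README; the `encode` method, the `EncoderFactory` class with its `register`/`create`/`available` methods, and the toy GC-content feature are assumptions, so check the real base class and factory in `src/feature_extraction` before copying anything.

```python
from abc import ABC, abstractmethod


class EncoderStrategy(ABC):
    """Stand-in for the base class in src/feature_extraction/encoder_strategy.py."""

    @abstractmethod
    def encode(self, sequences: list[str]) -> list[list[float]]:
        """Return one feature vector per input sequence (hypothetical interface)."""


class NewEncoderStrategy(EncoderStrategy):
    """Toy encoder: a single feature per sequence, the fraction of G and C bases."""

    def encode(self, sequences: list[str]) -> list[list[float]]:
        features = []
        for seq in sequences:
            seq = seq.upper()
            gc_count = sum(base in "GC" for base in seq)
            features.append([gc_count / len(seq) if seq else 0.0])
        return features


class EncoderFactory:
    """Hypothetical registry that maps encoder names to strategy classes."""

    _registry: dict[str, type[EncoderStrategy]] = {}

    @classmethod
    def register(cls, name: str, strategy_cls: type[EncoderStrategy]) -> None:
        if not issubclass(strategy_cls, EncoderStrategy):
            raise TypeError("strategy_cls must subclass EncoderStrategy")
        cls._registry[name] = strategy_cls

    @classmethod
    def create(cls, name: str) -> EncoderStrategy:
        try:
            return cls._registry[name]()
        except KeyError:
            raise ValueError(f"unknown encoder: {name!r}") from None

    @classmethod
    def available(cls) -> list[str]:
        return sorted(cls._registry)


EncoderFactory.register("new_encoder", NewEncoderStrategy)
encoder = EncoderFactory.create("new_encoder")
print(EncoderFactory.available())         # ['new_encoder']
print(encoder.encode(["ACGT", "GGCC"]))   # [[0.5], [1.0]]
```

Keeping encoders behind a common interface means the rest of the pipeline only needs the encoder name from the configuration, which is why a new encoder has to be both registered and configured.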

Repository: KatherLab/genml

Folders and files

Latest commit

History

Repository files navigation

GenML Pipeline

Overview

Features

Installation

Usage

Configuration

Contributing

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages