
A feature extraction pipeline for genomic mutations using foundation models.


KatherLab/genml

GenML Pipeline

Overview

GenML Pipeline is a genomic machine learning pipeline designed to preprocess genomic data, tokenize sequences, and extract features using pretrained foundation models. The pipeline supports multiple encoders and tokenizers, and can be configured to process data with custom parameters.

Features

  • Load and preprocess genomic data, e.g., sequences
  • Tokenize sequences with loaded tokenizers
  • Extract features using different encoders/foundation models
  • Easy configuration through YAML files
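
As a rough illustration of how the three stages above fit together, here is a toy sketch. It is not the pipeline's code: the function names are hypothetical, k-mer splitting stands in for whichever tokenizer a given model ships with, and k-mer counting stands in for a pretrained foundation model.

```python
from collections import Counter
from itertools import product


def preprocess(sequence: str) -> str:
    """Uppercase the sequence and drop anything that is not A, C, G, or T."""
    return "".join(base for base in sequence.upper() if base in "ACGT")


def tokenize(sequence: str, k: int = 3) -> list[str]:
    """Split a sequence into overlapping k-mers (stand-in for a real tokenizer)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def extract_features(tokens: list[str], k: int = 3) -> list[int]:
    """Count every possible k-mer (stand-in for a foundation-model embedding)."""
    counts = Counter(tokens)
    vocabulary = ["".join(p) for p in product("ACGT", repeat=k)]
    return [counts[kmer] for kmer in vocabulary]


clean = preprocess("acgtn-ACG")      # ambiguous and gap characters are removed
tokens = tokenize(clean)             # overlapping 3-mers
features = extract_features(tokens)  # fixed-length vector with 4**3 entries
```

The point of the sketch is only the data flow: raw sequence, cleaned sequence, tokens, fixed-length feature vector.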

Installation

  1. Clone the repository:

    git clone <repository_url>
    cd genml
  2. Create and activate a virtual environment:

    python -m venv .venv
    source .venv/bin/activate

    Alternatively, create the environment with Anaconda.

  3. Install the required dependencies:

    pip install -r requirements.txt

    Tip: for DNABERT2, a separate environment is recommended. In that environment, run

    pip install -r requirements_db2.txt

    and then

    pip uninstall triton

Usage

  1. Listing Available Encoders and Tokenizers
    To list all available encoders and their corresponding tokenizers, run the following from the genml directory:

    python -m src list-encoders
  2. Configuration
    Set the parameters in the genml/conf directory:

    • config.yml configures the feature extraction process.
    • feature_params/encoder.yml configures the foundation models you will use. Set download to True the first time you use a model.
  3. Running the Pipeline
    To run the pipeline with the specified configuration, run the following from the genml directory:

    python -m src extract-feature

Configuration

  • pat_column: the name of the column containing the patient ID
  • mut_column: the name of the column containing the mutation sequence
  • encoder_name: the encoder to use; it must match one of the names printed by list-encoders
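
Before starting a long extraction run, it can help to check that these three keys are set and that the two column names exist in the input table. The sketch below is an assumption about how such a check could look, not part of the pipeline. The config values, the table contents, and the check_config helper are all hypothetical; only the three key names come from this README.

```python
import csv
import io

REQUIRED_KEYS = ("pat_column", "mut_column", "encoder_name")


def check_config(config: dict, table_header: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the config looks usable."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]
    for key in ("pat_column", "mut_column"):
        column = config.get(key)
        if column is not None and column not in table_header:
            problems.append(f"{key}={column!r} is not a column in the input table")
    return problems


# Hypothetical values; in the pipeline these would be read from conf/config.yml.
# The encoder name is a placeholder: use a name printed by list-encoders.
config = {"pat_column": "PATIENT", "mut_column": "SEQUENCE", "encoder_name": "dnabert2"}

# Hypothetical input table with one row per patient.
table = io.StringIO("PATIENT,SEQUENCE\nP001,ACGTACGT\n")
header = next(csv.reader(table))

print(check_config(config, header))  # prints []
```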

Contributing

(Create a new branch first.)

  1. Adding a new encoder
    a. Add a class NewEncoderStrategy(EncoderStrategy) in 'src/feature_extraction/encoder_strategy.py'
    b. Register the new encoder in 'src/feature_extraction/encoder_factory.py'
    c. Add the settings for the new encoder to 'conf/feature_params/encoder.yml'

  2. After validating an encoder
    Add its mapping to conf/feature_params/mapping.yml
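
Steps 1a and 1b follow the usual strategy-plus-factory layout, which might look roughly like the sketch below. Only the class names EncoderStrategy and NewEncoderStrategy come from this README. The encode method, the EncoderFactory class with its register and create methods, and the "new_encoder" name are assumptions made for illustration, so check the actual interfaces in encoder_strategy.py and encoder_factory.py before copying anything.

```python
from abc import ABC, abstractmethod


class EncoderStrategy(ABC):
    """Stand-in for the base class in src/feature_extraction/encoder_strategy.py."""

    @abstractmethod
    def encode(self, sequence: str) -> list[float]:
        """Turn one sequence into a feature vector (hypothetical method name)."""


class NewEncoderStrategy(EncoderStrategy):
    """Toy encoder: base composition instead of a real foundation model."""

    def encode(self, sequence: str) -> list[float]:
        length = max(len(sequence), 1)  # avoid dividing by zero on empty input
        return [sequence.count(base) / length for base in "ACGT"]


class EncoderFactory:
    """Hypothetical registry that maps encoder names to strategy classes."""

    _registry: dict[str, type[EncoderStrategy]] = {}

    @classmethod
    def register(cls, name: str, strategy: type[EncoderStrategy]) -> None:
        cls._registry[name] = strategy

    @classmethod
    def create(cls, name: str) -> EncoderStrategy:
        if name not in cls._registry:
            raise ValueError(f"unknown encoder: {name!r}")
        return cls._registry[name]()


# Register the new encoder, then look it up by name as a config entry would.
EncoderFactory.register("new_encoder", NewEncoderStrategy)
encoder = EncoderFactory.create("new_encoder")
vector = encoder.encode("AACG")
```

Keeping the lookup by name in one place means a new encoder can be selected through encoder_name without touching the extraction code.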
