data processing codes for "Investigating the performance of foundation models on human 3’UTR sequences"

sergeyvilov/investigating-foundation-models-3utr

Investigating the performance of foundation models on human 3’UTR sequences

Sergey Vilov and Matthias Heinig

bioRxiv preprint

Code for data analysis

  • effect_prediction : compute functionality scores for ClinVar, gnomAD, and eQTL variants and evaluate the models
  • motif_search : evaluate the models on prediction of RBP binding motifs
  • half_life : predict mRNA half-life (data from Agarwal and Kelley, 2022) from language model embeddings
  • mpra : predict measured MPRA activity for the (Griesemer et al., 2021) and (Siegel et al., 2022) experiments
  • embeddings : generate embeddings for the DNABERT, DNABERT-2, and NT models
  • inference : derive per-base scores for DNABERT, NT, and PhyloP models
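
As an illustration of the evaluation step in effect_prediction, the sketch below computes a ROC AUC from per-variant scores using the rank-sum (Mann-Whitney) formulation. It is not taken from the repository: the function name `roc_auc` and the toy labels and scores are assumptions made for this example, and in practice a library routine such as `sklearn.metrics.roc_auc_score` would typically be used.

```python
def roc_auc(labels, scores):
    """ROC AUC via the rank-sum (Mann-Whitney U) formulation.

    labels: iterable of 0/1 (1 = functional variant, 0 = background)
    scores: iterable of floats, higher = predicted more functional
    Tied scores receive the average rank of their tie group.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        # Extend j to the end of the current group of tied scores
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the group
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("ROC AUC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy data, invented for illustration only
toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
print(roc_auc(toy_labels, toy_scores))  # 0.75
```

A value of 0.5 corresponds to scores that do not separate the two classes, and 1.0 to perfect separation.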

The code for extracting 3'UTR sequences from the Zoonomia .hal alignment and the scripts for model training will be made available upon acceptance of the paper.

The intermediate data for the downstream tasks can be found at https://zenodo.org/records/10655595. The 3'UTR multispecies FASTA files and model weights will be added to the Zenodo repository upon acceptance of the paper.

Links to the scripts used to generate paper figures and tables:

Fig. 1: Odds ratios and mobility distribution for recognition of RBP binding sites

Fig. 2: ROC curves for embeddings-based variant effect predictions on ClinVar, gnomAD, and eQTL data

Fig. S1: Distribution of 3’UTR length for 18,134 transcripts of the human genome.

Fig. S2: Average mobility for the putative functional motifs at 425,413 positions as a function of the conservation distance R.

Table 1: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA expression from (Griesemer et al., 2021).

Table S1: ROC AUC scores for ClinVar, gnomAD, and eQTL data computed based on zero-shot functionality scores for all models.

Table S2: ROC AUC scores from prediction of functional variants on ClinVar, gnomAD, and eQTL data using language model embeddings and PhyloP conservation scores.

Table S3: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA activity from (Griesemer et al., 2021).

Table S4: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022).

Table S5: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022).

Table S6: Pearson r correlation coefficient between mRNA half-life prediction and ground truth data from (Agarwal and Kelley, 2022), using different 3’UTR embeddings.

Table S7: Pearson r correlation coefficient for mRNA half-life prediction with the BC3MS model based on different 3’UTR embeddings and the Saluki model.
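
Several of the tables above report a Pearson r between Ridge-based predictions from sequence embeddings and measured values. The sketch below shows one way such a number can be obtained. It is a minimal illustration under stated assumptions, not the repository's pipeline: the helper names `ridge_fit` and `cross_val_pearson`, the 5-fold scheme with pooled predictions, `alpha=1.0`, and the synthetic "embeddings" are all invented for this example.

```python
import numpy as np


def ridge_fit(X, y, alpha=1.0):
    """Closed-form ridge regression with an unpenalized intercept."""
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - x_mean, y - y_mean
    # Solve (Xc^T Xc + alpha * I) w = Xc^T yc
    w = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(X.shape[1]), Xc.T @ yc)
    return w, y_mean - x_mean @ w


def cross_val_pearson(X, y, alpha=1.0, n_folds=5, seed=0):
    """Pearson r between pooled held-out predictions and ground truth."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    preds = np.empty(len(y))
    for test_idx in folds:
        train = np.ones(len(y), dtype=bool)
        train[test_idx] = False  # hold this fold out of the fit
        w, b = ridge_fit(X[train], y[train], alpha)
        preds[test_idx] = X[test_idx] @ w + b
    return float(np.corrcoef(preds, y)[0, 1])


# Synthetic stand-in for (embedding, measurement) pairs; not real data
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 16))  # 200 sequences, 16-dim embeddings
y = X @ rng.normal(size=16) + 0.1 * rng.normal(size=200)

r = cross_val_pearson(X, y)
print(f"Pearson r = {r:.3f}")
```

Predictions are made only on held-out folds, so the reported r is not inflated by fitting and scoring on the same sequences.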

Installation

  1. Create a new conda environment:

     conda create -n lm-3utr-models python=3.10
     conda activate lm-3utr-models

  2. Install PyTorch v2.0.1.

  3. Install the other requirements using pip:

     pip install -r requirements.txt

  4. To train DNABERT-2 models, also install:

     pip install triton==2.0.0.dev20221202 --force --no-dependencies

Training of DNABERT-2 is currently only possible on NVIDIA A100 GPUs because of the flash attention implementation it uses.
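
Before starting a DNABERT-2 training run, it can be useful to confirm which GPU PyTorch sees. The following is a hedged sketch rather than part of the repository: `is_a100` and `current_gpu_name` are hypothetical helpers, and matching the substring "A100" in the device name is an assumption about how the card is reported.

```python
def is_a100(device_name: str) -> bool:
    """Heuristic: treat any device whose name contains 'A100' as an A100."""
    return "A100" in device_name.upper()


def current_gpu_name():
    """Return the name of CUDA device 0, or None if unavailable."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_name(0)


name = current_gpu_name()
if name is None:
    print("No CUDA GPU visible to PyTorch (or PyTorch is not installed).")
elif is_a100(name):
    print(f"Found {name}: DNABERT-2 training should be possible.")
else:
    print(f"Found {name}: DNABERT-2 training is stated to require an A100.")
```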
