Investigating the performance of foundation models on human 3’UTR sequences

Sergey Vilov and Matthias Heinig

Code for data analysis

effect_prediction : compute functionality scores for ClinVar, gnomAD, and eQTL variants and evaluate the models
motif_search : evaluate the models on prediction of RBP binding motifs
half_life : prediction of mRNA half-life from (Agarwal and Kelley, 2022) based on language model embeddings
mpra : prediction of measured MPRA activity for (Griesemer et al., 2021) and (Siegel et al., 2022) experiments
embeddings : generate embeddings for the DNABERT, DNABERT-2, and NT models
inference : derive per-base scores for DNABERT, NT, and PhyloP models

The code for extracting 3'UTR sequences from the Zoonomia .hal alignment and the model training scripts will be made available upon acceptance of the paper.

The intermediate data for the downstream tasks are available at https://zenodo.org/records/10655595. The multispecies 3'UTR FASTA files and model weights will be added to the Zenodo repository upon acceptance of the paper.

Links to the scripts used to generate paper figures and tables:

Fig. 1: Odds ratios and mobility distribution for RBP binding site recognition

Fig. 2: ROC curves for embeddings-based variant effect predictions on ClinVar, gnomAD, and eQTL data

Fig. S1: Distribution of 3’UTR length for 18,134 transcripts of the human genome.

Fig. S2: Average mobility for the putative functional motifs at 425,413 positions as a function of the conservation distance R.

Table 1: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA expression from (Griesemer et al., 2021).

Table S1: ROC AUC scores for ClinVar, gnomAD, and eQTL data computed based on zero-shot functionality scores for all models.

Table S2: ROC AUC scores from prediction of functional variants on ClinVar, gnomAD, and eQTL data using language model embeddings and PhyloP conservation scores.

Table S3: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA activity from (Griesemer et al., 2021).

Table S4: Pearson r correlation coefficient between Ridge-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022).

Table S5: Pearson r correlation coefficient between SVR-based predictions from sequence embeddings and ground truth MPRA data from (Siegel et al., 2022).

Table S6: Pearson r correlation coefficient between mRNA half-life prediction and ground truth data from (Agarwal and Kelley, 2022), using different 3’UTR embeddings.

Table S7: Pearson r correlation coefficient for mRNA half-life prediction with the BC3MS model based on different 3’UTR embeddings and the Saluki model.

Installation

Create a new conda environment:

conda create -n lm-3utr-models python=3.10
conda activate lm-3utr-models

Install PyTorch v2.0.1.
Install the other requirements using pip:

pip install -r requirements.txt

To train DNABERT-2 models, also install Triton:

pip install triton==2.0.0.dev20221202 --force --no-dependencies

Training DNABERT-2 is currently possible only on NVIDIA A100 GPUs because of the flash attention implementation used.
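
Before starting DNABERT-2 training, it may help to check which GPU PyTorch sees. The sketch below is not part of the repository: the helper names `is_a100` and `current_gpu_name` are hypothetical, and matching the substring "A100" in the device name is an assumption about how the card is reported.

```python
def is_a100(device_name: str) -> bool:
    """Return True if a CUDA device name looks like an NVIDIA A100 (assumed naming)."""
    return "A100" in device_name.upper()


def current_gpu_name():
    """Return the name of CUDA device 0, or None if PyTorch or CUDA is unavailable."""
    try:
        import torch  # only needed for the live check
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_name(0)


name = current_gpu_name()
if name is None:
    print("No CUDA device (or no PyTorch installation) found.")
elif is_a100(name):
    print(f"{name}: matches the A100 requirement.")
else:
    print(f"{name}: not an A100; DNABERT-2 training may not run here.")
```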

sergeyvilov/investigating-foundation-models-3utr
