Sergey Vilov and Matthias Heinig
- effect_prediction : compute functionality scores for ClinVar, gnomAD, and eQTL variants and evaluate the models
- motif_search : evaluate the models on RBP binding motif prediction
- half_life : prediction of mRNA half-life (data from Agarwal and Kelley, 2022) based on language model embeddings
- mpra : prediction of measured MPRA activity for (Griesemer et al., 2021) and (Siegel et al., 2022) experiments
- embeddings : generate embeddings for the DNABERT, DNABERT-2, and NT models
- inference : derive per-base scores for DNABERT, NT, and PhyloP models
The code for extracting 3'UTR sequences from the Zoonomia .hal alignment and the model training scripts will be made available upon acceptance of the paper.
The intermediate data for the downstream tasks can be found at https://zenodo.org/records/10655595. The 3'UTR multispecies FASTA files and model weights will be added to the Zenodo repository upon acceptance of the paper.
Fig. 1: Odds ratios and mobility distribution for RBP binding site recognition
Fig. 2: ROC curves for embeddings-based variant effect predictions on ClinVar, gnomAD, and eQTL data
Fig. S1: Distribution of 3’UTR length for 18,134 transcripts of the human genome.
- Create a new conda environment:
conda create -n lm-3utr-models python=3.10
conda activate lm-3utr-models
- Install PyTorch v2.0.1
- Install the other requirements using pip:
pip install -r requirements.txt
- To train DNABERT-2 models, also install:
pip install triton==2.0.0.dev20221202 --force --no-dependencies
Training DNABERT-2 is currently possible only on NVIDIA A100 GPUs because of the flash attention implementation it uses.
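After installation, it can be useful to confirm that the environment matches the versions listed above. The following is a minimal sketch, not part of this repository: the helper names (`installed_version`, `matches_pin`, `is_a100`, `report`) are hypothetical, and the A100 check is a simple heuristic that assumes the GPU name reported by PyTorch contains the string "A100".

```python
from importlib import metadata

# Versions taken from the installation steps above.
PINNED = {"torch": "2.0.1", "triton": "2.0.0.dev20221202"}


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def matches_pin(installed, pinned):
    """True if `installed` equals `pinned`, ignoring a local build suffix such as '+cu117'."""
    return installed is not None and installed.split("+")[0] == pinned


def is_a100(device_name):
    """Heuristic (assumption): the reported GPU name mentions 'A100'."""
    return "A100" in device_name.upper()


def report():
    """Print a short summary of package versions and the visible GPU."""
    for package, pinned in PINNED.items():
        found = installed_version(package)
        status = "ok" if matches_pin(found, pinned) else "mismatch"
        print(f"{package}: installed={found}, expected={pinned} ({status})")
    try:
        import torch
    except ImportError:
        print("GPU check skipped: torch is not importable")
        return
    if torch.cuda.is_available():
        name = torch.cuda.get_device_name(0)
        print(f"GPU: {name} (A100: {is_a100(name)})")
    else:
        print("GPU: no CUDA device visible")


if __name__ == "__main__":
    report()
```

The triton line only matters if you plan to train DNABERT-2; a "mismatch" there can be ignored for the other models.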