This repository contains the implementation of a protein language model pre-trained with structural, surface, and interaction data.
For more details, see: NeurIPS-MLSB2023.
Language models applied to protein sequence data have gained considerable interest in recent years, mainly due to their ability to capture complex patterns at the protein sequence level. However, their understanding of why certain evolution-related conservation patterns appear is limited. This work explores the potential of protein language models (PLMs) to further incorporate intrinsic protein properties stemming from protein structures, surfaces, and interfaces. The results indicate that this multi-task pretraining allows the PLM to learn more meaningful representations by leveraging information obtained from different protein views. We evaluate and show improvements in performance on various downstream tasks, such as enzyme classification, remote homology detection, and protein engineering datasets.
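To make the multi-task idea concrete, here is a minimal sketch of how a masked-LM loss can be combined with auxiliary losses computed from the structure, surface, and interface views. The loss types and weights below are illustrative assumptions, not the exact objective used in the paper:

```python
import torch.nn as nn

class MultiTaskLoss(nn.Module):
    """Sketch: masked-LM loss plus weighted auxiliary view-prediction losses.
    Loss types and weights are illustrative, not the paper's exact setup."""

    def __init__(self, w_struct=1.0, w_surf=1.0, w_inter=1.0):
        super().__init__()
        self.weights = (w_struct, w_surf, w_inter)
        self.mlm_loss = nn.CrossEntropyLoss()  # masked-token prediction
        self.aux_loss = nn.MSELoss()           # placeholder for view targets

    def forward(self, mlm_logits, mlm_targets,
                struct_pred, struct_true,
                surf_pred, surf_true,
                inter_pred, inter_true):
        total = self.mlm_loss(mlm_logits, mlm_targets)
        total = total + self.weights[0] * self.aux_loss(struct_pred, struct_true)
        total = total + self.weights[1] * self.aux_loss(surf_pred, surf_true)
        total = total + self.weights[2] * self.aux_loss(inter_pred, inter_true)
        return total
```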
The model is trained and evaluated using publicly available datasets:
- PLM pretraining dataset: UniRef90
- Enzyme Commission (EC) dataset: IEConv_proteins
- Fold recognition dataset: TAPE
- FLIP benchmark datasets: FLIP
- SCOPe datasets: SCOPe
We provide all datasets for download as a single folder (apart from UniRef90): all-data.
To pretrain the protein language model, run train_prose_multitask.py. The implementation uses multiple GPUs and can be run on a single machine or on a cluster; the scripts for running it on a cluster can be found at iridis-scripts. Training progress can be monitored using tensorboard.sh. All trained models can be downloaded from the release section.
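For orientation, here is a minimal sketch of what a single-node multi-GPU run with PyTorch DistributedDataParallel looks like when launched via torchrun. The model and data below are dummy stand-ins, not this repository's code; the real entry point remains train_prose_multitask.py:

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler, TensorDataset

def main():
    # torchrun sets RANK / LOCAL_RANK / WORLD_SIZE for each process
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = nn.Linear(128, 128).cuda(local_rank)  # placeholder model
    model = DDP(model, device_ids=[local_rank])

    data = TensorDataset(torch.randn(1024, 128))  # placeholder data
    sampler = DistributedSampler(data)
    loader = DataLoader(data, batch_size=32, sampler=sampler)

    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for epoch in range(3):
        sampler.set_epoch(epoch)  # reshuffle shards across ranks each epoch
        for (x,) in loader:
            x = x.cuda(local_rank)
            loss = model(x).pow(2).mean()  # dummy loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```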
After pretraining the protein language model, you can fine-tune it on downstream tasks by running the following Python files:
- train_enzyme.py for the EC dataset
- train_fold.py for the Fold recognition dataset
- train_flip.py for the FLIP benchmark datasets
If you want to run these experiments on a cluster, take a look at the iridis-scripts folder.
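As a rough illustration of what these fine-tuning scripts involve, the sketch below attaches a classification head to a pretrained encoder. The stand-in encoder, dimensions, and class count are illustrative assumptions; the actual pipelines live in train_enzyme.py, train_fold.py, and train_flip.py:

```python
import torch
import torch.nn as nn

class ECClassifier(nn.Module):
    """Pretrained encoder + linear classification head (sketch)."""

    def __init__(self, encoder, hidden_dim, num_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, tokens):
        h = self.encoder(tokens)  # (batch, length, hidden_dim) residue embeddings
        h = h.mean(dim=1)         # mean-pool over residues
        return self.head(h)

# Stand-in encoder; in practice this would be a pretrained PLM checkpoint.
encoder = nn.Embedding(26, 512)  # 26 ~ amino-acid vocabulary size (assumed)
model = ECClassifier(encoder, hidden_dim=512, num_classes=384)  # class count illustrative
logits = model(torch.randint(0, 26, (4, 100)))  # batch of 4 dummy sequences
```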
To reproduce the plots of the protein embedding projections using t-SNE, use the notebook scop-tsne.ipynb.
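For reference, a t-SNE projection of embeddings boils down to a few lines with scikit-learn. The sketch below uses random stand-in data in place of the real SCOPe embeddings produced by the notebook:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
embeddings = rng.normal(size=(300, 128))  # stand-in for (N, D) protein embeddings
labels = rng.integers(0, 5, size=300)     # stand-in for SCOPe class labels

# Project to 2D and colour points by class label
xy = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
plt.scatter(xy[:, 0], xy[:, 1], c=labels, s=8, cmap="tab10")
plt.title("t-SNE projection of protein embeddings")
plt.show()
```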
If you want to embed a set of protein sequences using any of the models, you can use the embedd.py script. You only need to provide a FASTA file.
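Under the hood, embedding a FASTA file amounts to parsing the sequences, tokenizing them, and pooling the model's residue embeddings. The sketch below uses a stand-in encoder and tokenizer, and proteins.fasta is a placeholder filename; embedd.py remains the supported entry point:

```python
import torch
import torch.nn as nn

def read_fasta(path):
    """Minimal FASTA parser yielding (header, sequence) pairs."""
    header, chunks = None, []
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)

# Stand-in tokenizer and encoder; the real model comes from the release section.
ALPHABET = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWYXBZJUO")}
encoder = nn.Embedding(len(ALPHABET), 512)

embeddings = {}
with torch.no_grad():
    for name, seq in read_fasta("proteins.fasta"):  # placeholder filename
        tokens = torch.tensor([[ALPHABET.get(aa, ALPHABET["X"]) for aa in seq]])
        embeddings[name] = encoder(tokens).mean(dim=1)  # mean-pooled embedding
```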
This repository incorporates code from other sources. If you find the repo useful, please also cite the following work:
Ioan Ieremie, Rob M. Ewing, Mahesan Niranjan
@article{ieremiestructure,
  title={Structure, Surface and Interface Informed Protein Language Model},
  author={Ieremie, Ioan and Niranjan, Mahesan and Ewing, Rob M.}
}
Contact: ii1g17 [at] soton [dot] ac [dot] uk