Precise Generation of Conformational Ensembles for Intrinsically Disordered Proteins Using Fine-tuned Diffusion Models
We developed IDPFold, a generative deep learning model that predicts conformational ensembles of intrinsically disordered proteins (IDPs) directly from their sequences using fine-tuned diffusion models. IDPFold does not require multiple sequence alignments (MSAs) or experimental data, and it achieves accurate predictions of ensemble properties across numerous IDPs.
IDPFold is pretrained on structures from the PDB and fine-tuned on conformational ensembles provided by IDRome, and it samples IDP ensembles more precisely than state-of-the-art deep learning models and MD simulations.
The IDPFold codebase is mainly inspired by Str2Str. We thank Jiarui Lu for his valuable suggestions.
git clone https://github.com/Junjie-Zhu/IDPFold.git
cd IDPFold
# Create a new conda environment
conda env create -f environment.yml
conda activate idpfold
# Install ESM for sequence embedding extraction
pip install fair-esm
# Install IDPFold as a package
pip install -e .
After installation, you need to update the .env file, which contains the paths to the datasets. We provide a script for initializing the .env file; just run the following command:
python initialize.py
To generate conformational ensembles for given sequences, you should:
- Prepare a fasta file containing one or more sequences. An example with 3 IDP sequences is provided in data/example.fasta.
- Check the checkpoint file. Our pretrained model checkpoints can be accessed from Google Drive.
- Run the following commands:
# Extract sequence embeddings
python src/read_seqs.py pred_dir='./data/example.fasta'
# Inference
python src/eval.py ckpt_path='/path/to/ckpt'
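Before running the two commands above, it can help to sanity-check the input fasta file. The snippet below is a minimal sketch and not part of IDPFold: `read_fasta` and `check_records` are hypothetical helper names, and the restriction to the 20 standard amino acids is an assumption about what makes a sensible input, not a documented requirement of the model.

```python
# Hypothetical pre-flight check for a fasta file; not part of the IDPFold codebase.

# Assumption: inputs should use only the 20 standard amino acid letters.
STANDARD_AA = set("ACDEFGHIKLMNPQRSTVWY")


def read_fasta(text):
    """Parse fasta-formatted text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is None:
            raise ValueError("sequence data found before the first '>' header")
        else:
            # Sequences may span several lines; collect them in order.
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def check_records(records):
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for header, seq in records:
        if not seq:
            problems.append(f"{header}: empty sequence")
            continue
        unexpected = sorted(set(seq) - STANDARD_AA)
        if unexpected:
            problems.append(f"{header}: non-standard residues {''.join(unexpected)}")
    return problems


# Small inline demonstration with made-up sequences.
demo = ">seq1\nMKVLAAGE\nDSPEK\n>seq2\nMXKZ\n"
for problem in check_records(read_fasta(demo)):
    print(problem)
```

To check a real file, read its text first, for example `read_fasta(open("data/example.fasta").read())`, and inspect the list returned by `check_records`.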
To be updated ...
This is a test version of IDPFold. If you have any questions, please either create an issue or contact [email protected] directly!