Generating Reaction-specific Enzyme Catalytic Pockets through Flow-Matching and Co-Evolutionary Dynamics
Follow my recent GENzyme on pocket design + pocket inpainting for full enzyme design. GENzyme is trained with AlphaFold losses and additional geometric regularization, for fewer atomic clashes and more reasonable atom-atom distances.
Follow my NeurIPS ReactZyme on the enzyme-reaction dataset and benchmark.
The EnzymeFlow paper is on arXiv.
python>=3.11
CUDA=12.1
torch==2.4.1 (>=2.0.0)
torch_geometric==2.4.0
pip install mdtraj==1.10.0 (install this first: it also installs numpy and scipy, and installing it later might raise dependency issues)
pip install pytorch-warmup==0.1.1
pip install POT==0.9.4
pip install rdkit==2023.9.5
pip install biopython==1.84
pip install tmtools==0.2.0
pip install geomstats==2.7.0
pip install dm-tree==0.1.8
pip install ml_collections==0.1.1
pip install OpenMM
pip install einx
pip install einops
conda install conda-forge::pdbfixer
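After installing, it can help to confirm that the environment matches the pins above. The following is a minimal sketch, not part of the repository: `check_pins` is a hypothetical helper, and the distribution names are assumed to be the same names used in the pip commands. It relies only on the standard-library `importlib.metadata`.

```python
from importlib import metadata

# Pins copied from the requirements list above (distribution names are assumptions).
PINNED = {
    "torch": "2.4.1",
    "torch_geometric": "2.4.0",
    "mdtraj": "1.10.0",
    "pytorch-warmup": "0.1.1",
    "POT": "0.9.4",
    "rdkit": "2023.9.5",
    "biopython": "1.84",
    "tmtools": "0.2.0",
    "geomstats": "2.7.0",
    "dm-tree": "0.1.8",
    "ml_collections": "0.1.1",
}


def check_pins(pins, lookup=metadata.version):
    """Return {package: (wanted, found)} for every mismatched or missing package."""
    problems = {}
    for name, wanted in pins.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        # Ignore local build tags such as "+cu121" when comparing versions.
        if found is None or found.split("+")[0] != wanted:
            problems[name] = (wanted, found)
    return problems


if __name__ == "__main__":
    for name, (wanted, found) in check_pins(PINNED).items():
        print(f"{name}: wanted {wanted}, found {found or 'not installed'}")
```

The check is strict equality, so a torch 2.x install other than 2.4.1 is reported even though the list above allows >=2.0.0.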
- Please refer to the data sections below to see how we prepare the training data.
- configs.py contains all training configurations and hyperparameters.
- Train the model using train_ddp.py for parallel training on multiple GPUs (we trained with 4 A40 GPUs).
CUDA_VISIBLE_DEVICES=0,1,2,3 python -m torch.distributed.launch --nproc_per_node=4 train_ddp.py
- Training loads a pre-trained model by default. You may also train from scratch by setting the parameters ckpt_from_pretrain=False and pretrain_ckpt_path=None in configs.py.
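If you train on a different number of GPUs, the GPU list and `--nproc_per_node` in the launch command must agree. The sketch below rebuilds the same command for an arbitrary GPU list; `build_launch` is a hypothetical helper, not something shipped with the repository, and it assumes one training process per visible GPU.

```python
import os
import sys


def build_launch(gpu_ids, script="train_ddp.py"):
    """Return (env, argv) reproducing the launch command for the given GPU ids."""
    if not gpu_ids:
        raise ValueError("need at least one GPU id")
    env = dict(os.environ)
    # Restrict the job to the chosen devices.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    argv = [
        sys.executable, "-m", "torch.distributed.launch",
        f"--nproc_per_node={len(gpu_ids)}",  # one process per visible GPU
        script,
    ]
    return env, argv


if __name__ == "__main__":
    env, argv = build_launch([0, 1, 2, 3])
    print("CUDA_VISIBLE_DEVICES=" + env["CUDA_VISIBLE_DEVICES"],
          " ".join(["python"] + argv[1:]))
    # To actually start training:
    #   import subprocess; subprocess.run(argv, env=env, check=True)
```

Keeping the process count derived from the GPU list avoids the common mismatch where the two are edited separately.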
A mini-EnzymeFlow checkpoint is available on Google Drive. Once you download it, put it under the ./checkpoint folder.
An EnzymeFlow inference demo is provided in a Jupyter notebook.
An unseen-reaction inference demo is provided in a Jupyter notebook; you only need to generate a ligand.mol2 file.
For RFDiffAA and LigandMPNN, please refer to RFDiffAA-official and LigandMPNN-official. For each enzyme-reaction pair in the evaluation data, we use RFDiffAA with default parameters to generate 100 catalytic pockets (32 residues each) for each unique substrate. We then use LigandMPNN to predict sequences (inverse folding) for the generated catalytic pockets post hoc.
We provide some RFDiffAA-generated samples in the ./data/rfdiffaa_generated folder at link.
We provide LigandMPNN-predicted sequences for RFDiffAA-generated pockets at file.
We provide CLEAN-predicted EC-Class for LigandMPNN-predicted pocket sequences at file.
Baselines such as RFDiffAA do not generate an EC class when designing catalytic pockets. We use CLEAN to infer the EC class from the sequence representations of these pockets. For CLEAN, please refer to CLEAN-official or CLEAN-webserver. We use CLEAN with the greedy max-separation approach for EC-class inference.
For ESM3, please refer to ESM3-official. For each sequence representation of a generated catalytic pocket, we use ESM3 to recover the full enzyme sequence (that is, we extend the 32 pocket residues into a protein sequence of 200 residues). We can perform enzyme retrieval on both (1) enzyme pocket sequences and (2) full enzyme sequences. The ESM3 prompting is at link.
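To illustrate what extending 32 pocket residues into a 200-residue sequence can look like as a prompt, here is a minimal sketch. It assumes the pocket residues are placed at known positions and every other position is masked with `_`, the mask character used in ESM3 sequence prompts. The `build_prompt` helper and its `positions` argument are hypothetical; the prompting actually used is the one at the link above.

```python
MASK = "_"  # mask character for positions the model should fill in


def build_prompt(pocket, positions, total_length=200):
    """Place pocket residues at 0-based positions and mask every other position."""
    if len(pocket) != len(positions):
        raise ValueError("one position per pocket residue is required")
    if len(set(positions)) != len(positions):
        raise ValueError("positions must be unique")
    prompt = [MASK] * total_length
    for residue, pos in zip(pocket, positions):
        if not 0 <= pos < total_length:
            raise ValueError(f"position {pos} is outside the sequence")
        prompt[pos] = residue
    return "".join(prompt)


if __name__ == "__main__":
    # Toy example: three pocket residues in a length-10 sequence.
    print(build_prompt("HDS", [1, 4, 8], total_length=10))
```

The resulting string can then be passed to ESM3 as a partially masked sequence; how positions are chosen for real pockets is outside the scope of this sketch.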
For ranking-based retrieval evaluation, please refer to ReactZyme-paper. We train a pocket-specific enzyme CLIP model with enzyme pocket features computed by the latest ESM3 and reaction features computed by MAT-2D. The training data are those at 60% homology (~50,000 positive samples); the evaluation data are the unique, non-repeated ones. As in ClipZyme, training negative samples are training data that are not annotated to catalyze a specific reaction; evaluation does not use negative data.
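As a rough illustration of ranking-based retrieval, the sketch below scores each reaction against all candidate enzymes and reports where the annotated enzyme lands. The metric choices (top-k accuracy, mean rank, mean reciprocal rank) and the function names are assumptions made for illustration; the metrics actually reported are defined in the ReactZyme paper.

```python
def rank_of_target(scores, target):
    """1-based rank of scores[target] among candidates sorted by descending score."""
    # Ties are counted pessimistically: equal-scoring candidates rank ahead.
    better = sum(
        1 for i, s in enumerate(scores) if i != target and s >= scores[target]
    )
    return better + 1


def retrieval_metrics(score_matrix, targets, ks=(1, 5)):
    """score_matrix[r][e] is the similarity of reaction r and enzyme e;
    targets[r] is the index of the annotated enzyme for reaction r."""
    ranks = [rank_of_target(row, t) for row, t in zip(score_matrix, targets)]
    n = len(ranks)
    metrics = {f"top{k}": sum(r <= k for r in ranks) / n for k in ks}
    metrics["mean_rank"] = sum(ranks) / n
    metrics["mrr"] = sum(1.0 / r for r in ranks) / n
    return metrics


if __name__ == "__main__":
    scores = [
        [0.9, 0.1, 0.3],
        [0.2, 0.8, 0.5],
        [0.6, 0.7, 0.1],
    ]
    print(retrieval_metrics(scores, targets=[0, 1, 2], ks=(1, 2)))
```

In a CLIP-style setup the score matrix would come from similarities between reaction and enzyme embeddings; here it is a plain nested list so the example stays dependency-free.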
./data contains all substrate and product molecules, which can be downloaded at link.
./data contains all enzyme pockets, which can be downloaded at link.
./data folder. More raw data (50%, 60%, 80%, 90% homology) can be downloaded at link.
./data contains reaction MSAs.
./data contains enzyme MSAs, which can be downloaded at link.
./data is the co-evolution vocabulary.
When the raw data (enzyme pockets, molecules, co-evolution) are ready and stored in the right folders, we process them into metadata.
Pass the raw data file with --rawdata_file_name, e.g., python process_data.py --rawdata_file_name rawdata_cutoff-0.4.csv. Warning: metadata.csv stores absolute paths, so you might need to change them to your own paths.
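One way to handle the absolute paths in metadata.csv is to rewrite the path prefix in a copy of the file. The sketch below is an assumption-laden example, not a repository script: `rebase_paths` is a hypothetical helper, and because the column layout of metadata.csv is not documented here, it rewrites any cell that starts with the old root instead of targeting named columns.

```python
import csv


def rebase_paths(src_csv, dst_csv, old_root, new_root):
    """Copy src_csv to dst_csv, replacing old_root at the start of any cell
    with new_root. Returns the number of cells changed."""
    changed = 0
    with open(src_csv, newline="") as fin, open(dst_csv, "w", newline="") as fout:
        writer = csv.writer(fout)
        for row in csv.reader(fin):
            new_row = []
            for cell in row:
                if cell.startswith(old_root):
                    cell = new_root + cell[len(old_root):]
                    changed += 1
                new_row.append(cell)
            writer.writerow(new_row)
    return changed


# Example usage (both roots below are placeholders for your own paths):
#   rebase_paths("metadata.csv", "metadata_local.csv",
#                "/original/author/root", "/your/local/root")
```

Writing to a new file keeps the original metadata.csv intact in case the prefix was wrong.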
./data/processed folder, including:
- ./data/processed/protein folder.
- ./data/processed/ligand folder.
- ./data/processed/msa folder.
- ./data/processed/product folder.
./data folder. Warning: metadata.csv stores absolute paths, so you might need to change them to your own paths.
./data/raw_eval_data folder.
./data/processed_eval folder.
Process the evaluation data with process_data.py. Remember to change the configs, e.g., python process_data.py --rawdata_file_name eval-data_cutoff-0.1_unique-subs-enz_100.csv --metadata_file_name metadata_eval.csv.
No commercial use of either the model or the generated data; see license.md for details.
@article{hua2024enzymeflow,
title={EnzymeFlow: Generating Reaction-specific Enzyme Catalytic Pockets through Flow Matching and Co-Evolutionary Dynamics},
author={Hua, Chenqing and Liu, Yong and Zhang, Dinghuai and Zhang, Odin and Luan, Sitao and Yang, Kevin K and Wolf, Guy and Precup, Doina and Zheng, Shuangjia},
journal={arXiv preprint arXiv:2410.00327},
year={2024}
}