Token-Mol

This repository hosts Token-Mol 1.0: tokenized drug design with large language models.

Environment

The code for pre-training and fine-tuning has been tested on Windows and Linux.
The code for reinforcement learning with docking runs stably only on Linux.

Python Dependencies

Python >= 3.8
PyTorch >= 1.13.1
RDKit >= 2022.09.1
transformers >= 4.24.0
networkx >= 2.8.4
pandas >= 1.5.3
scipy == 1.10.0
easydict (any version)
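
To confirm which of these packages are importable in the current environment, a minimal sketch such as the following can help. It only reports versions and does not enforce the minimums listed above; the import names (for example `torch` for PyTorch) are the usual ones and are assumed here.

```python
import importlib

# Import names of the Python dependencies listed above
# (assumption: PyTorch is imported as "torch", the rest match the list).
DEPENDENCIES = ["torch", "rdkit", "transformers", "networkx",
                "pandas", "scipy", "easydict"]


def installed_versions(module_names):
    """Return {module name: version string, or None if the import fails}."""
    report = {}
    for name in module_names:
        try:
            module = importlib.import_module(name)
        except Exception:
            # A broken install can raise errors other than ImportError.
            report[name] = None
            continue
        # Not every package exposes __version__, so fall back to "unknown".
        report[name] = getattr(module, "__version__", "unknown")
    return report


if __name__ == "__main__":
    for name, version in installed_versions(DEPENDENCIES).items():
        shown = version if version is not None else "NOT INSTALLED"
        print(f"{name:<14}{shown}")
```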

Software Dependencies

CUDA 11.6

For docking/reinforcement learning:

Software	Source
QuickVina2	Download
ADFRsuite	Download
Open Babel	Download
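
Before starting a docking run, it may be worth checking that the three tools are on `PATH`. The sketch below uses `shutil.which`; the executable names (`qvina2.1`, `prepare_receptor`, `obabel`) are assumptions based on common installations and may differ on your system.

```python
import shutil

# Executable names are assumptions; adjust them to match your installation.
DOCKING_TOOLS = {
    "QuickVina2": "qvina2.1",
    "ADFRsuite": "prepare_receptor",
    "Open Babel": "obabel",
}


def locate_tools(tools):
    """Map each tool label to the full path of its executable, or None."""
    return {label: shutil.which(exe) for label, exe in tools.items()}


if __name__ == "__main__":
    for label, path in locate_tools(DOCKING_TOOLS).items():
        print(f"{label:<12}{path or 'not found on PATH'}")
```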

Data

The datasets can be downloaded directly or requested from the sources listed in the following table:

Task	Source or access
Pre-training	GEOM
Conformation generation	Tora3D
Property prediction	MoleculeNet & ADME
Pocket-based generation	ResGen

Pre-training

We pre-trained the model with the GPT-2 architecture using HuggingFace🤗 Transformers.
The weights and configuration files of the pre-trained model are available on Zenodo.

Fine-tuning

Before running fine-tuning, move the weights and configuration files of the pre-trained model into the Pretrain_model folder.
The validation set is already included in the data folder; the processed training set can be downloaded here.

# The paths of the training and validation sets are preset in the code
python pocket_fine_tuning_rmse.py --batch_size 32 --epochs 40 --lr 5e-3 --every_step_save_path Trained_model/pocket_generation

The fine-tuned model will be saved as Trained_model/pocket_generation.pt.

Generation

Either a single encoded pocket (example/ARA2A.pkl) or multiple encoded pockets (data/test_protein_represent.pkl) can be used as input for generation.

Total number of molecules generated per pocket = batch size × epoch

# single pocket
python gen.py --model_path ./Trained_model/pocket_generation.pt --protein_path ./example/ARA2A.pkl --output_path ARA2A.csv --batch 25 --epoch 4

# multiple pockets
python gen.py --model_path ./Trained_model/pocket_generation.pt --protein_path ./data/test_protein_represent.pkl --output_path test.csv --batch 25 --epoch 4
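
As a quick check of the formula above, the commands shown use `--batch 25 --epoch 4`, which gives 100 molecules per pocket. The helper names below are hypothetical and serve only to illustrate the arithmetic, including how many epochs a target count would need for a given batch size.

```python
import math


def molecules_per_pocket(batch, epoch):
    """Molecules generated for one pocket: batch size * epoch."""
    return batch * epoch


def epochs_needed(target, batch):
    """Smallest epoch count that yields at least `target` molecules."""
    return math.ceil(target / batch)


# Values taken from the gen.py commands above.
print(molecules_per_pocket(batch=25, epoch=4))  # 100
print(epochs_needed(target=100, batch=25))      # 4
```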

Pockets can be encoded with GVP. The original pockets in PDB format are provided in data/test_set.zip and example/3rfm_a2a-pocket10.pdb.

Post-processing

The generated sequences must be converted to RDKit Mol objects before they can be used in subsequent applications.

The output *.csv file is the input to confs_to_mols_pocket.py (or confs_to_mols_pocket_multi.py for the preset test set).

Before running the script, we recommend editing the following two lines in the code:

# Set this to the name of your generation output file
csv_file = 'output.csv'
# Set the second reshape dimension to the number of generated molecules (default 100)
generate_mol = pd.read_csv(csv_file, header=None).values.reshape(-1, 100).tolist()
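
The `reshape(-1, 100)` call splits the flat list of CSV cells into consecutive groups of 100 and fails if the total is not a multiple of 100. The standard-library sketch below mimics that row-major grouping on a toy input so the effect of the second dimension is easier to see. It is an illustration, not the repository's code, and `group_cells` is a hypothetical helper.

```python
import csv
import io


def group_cells(csv_text, group_size):
    """Flatten all CSV cells row by row, then split into equal groups."""
    cells = [cell for row in csv.reader(io.StringIO(csv_text)) for cell in row]
    if len(cells) % group_size != 0:
        # reshape(-1, group_size) raises in the same situation.
        raise ValueError(
            f"{len(cells)} cells cannot be split into groups of {group_size}"
        )
    return [cells[i:i + group_size] for i in range(0, len(cells), group_size)]


# Toy input: four sequences, grouped two at a time.
toy_csv = "seq_a\nseq_b\nseq_c\nseq_d\n"
print(group_cells(toy_csv, 2))  # [['seq_a', 'seq_b'], ['seq_c', 'seq_d']]
```

If generation was run with a different `--batch` or `--epoch`, the group size should be changed to match, otherwise the reshape fails or mixes molecules from different pockets.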

Then run the script directly:

python confs_to_mols_pocket.py

The processed molecules are saved in the results folder in pickle format.
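
A minimal sketch for loading the saved results is shown below. The file names and the `.pkl` extension are assumptions, as is the structure of the stored objects; RDKit must be installed for pickled molecule objects to load, and pickle files should only be opened when they come from a trusted source.

```python
import pickle
from pathlib import Path


def load_results(results_dir="results"):
    """Load every *.pkl file in results_dir and return {file name: object}."""
    loaded = {}
    for path in sorted(Path(results_dir).glob("*.pkl")):
        with path.open("rb") as handle:
            loaded[path.name] = pickle.load(handle)
    return loaded


if __name__ == "__main__":
    for name, obj in load_results().items():
        # Print only the type, since the stored structure is not assumed here.
        print(name, type(obj).__name__)
```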

Reinforcement learning

Reward score

You can customize the reward score in reinforce/reward_score.py; the code contains detailed descriptions.

Running

Before running reinforcement learning, the target pocket needs to be specified and encoded.

We strongly recommend running the code on a multi-GPU machine, because a batch size that is too small prevents efficient gradient updates.

python ./reinforce/reinforce_multi_gpu.py --restore-from ./Trained_model/pocket_generation.pt --protein-dir ./reinforce/usecase --agent ./reinforce.pt --world-size 4 --batch-size 8 --num-steps 1000 --max-length 85

Args	Description
--restore-from	Model checkpoint
--protein-dir	Path of target pocket
--agent	Agent model save path
--world-size	World size (DDP)
--batch-size	Batch size on a single GPU
--num-steps	Total running steps
--max-length	Max length of sequence
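
To relate these arguments to the multi-GPU recommendation: assuming one DDP process per GPU and no gradient accumulation, the number of sequences contributing to each update is world size × per-GPU batch size. The parser below is an illustrative sketch that mirrors the documented flags; the types and defaults are assumptions, and it is not the parser used in reinforce_multi_gpu.py.

```python
import argparse


def build_parser():
    """Illustrative parser; argument types and defaults are assumptions."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--restore-from", required=True, help="Model checkpoint")
    parser.add_argument("--protein-dir", required=True, help="Path of target pocket")
    parser.add_argument("--agent", required=True, help="Agent model save path")
    parser.add_argument("--world-size", type=int, default=1, help="World size (DDP)")
    parser.add_argument("--batch-size", type=int, default=8, help="Batch size on a single GPU")
    parser.add_argument("--num-steps", type=int, default=1000, help="Total running steps")
    parser.add_argument("--max-length", type=int, default=85, help="Max length of sequence")
    return parser


def global_batch_size(args):
    """Sequences per update across all GPUs (assumes one process per GPU)."""
    return args.world_size * args.batch_size


# Arguments taken from the command above.
args = build_parser().parse_args([
    "--restore-from", "./Trained_model/pocket_generation.pt",
    "--protein-dir", "./reinforce/usecase",
    "--agent", "./reinforce.pt",
    "--world-size", "4", "--batch-size", "8",
    "--num-steps", "1000", "--max-length", "85",
])
print(global_batch_size(args))  # 32
```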

In our tests, the code with the above arguments runs on a machine with at least 4 × Quadro RTX 8000 GPUs (48 GB VRAM each).

You can also inspect the molecules generated at each step in every_steps_saved.pkl and the detailed reward terms in reward_terms*.csv.
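
For a first look at the reward logs, a sketch like the following lists each file matching `reward_terms*.csv` with its column names and row count. It assumes the first row of each file is a header and that the files sit in the working directory; neither is confirmed here.

```python
import csv
import glob
import os


def summarize_reward_files(directory="."):
    """Return {file name: (column names, number of data rows)}."""
    summary = {}
    pattern = os.path.join(directory, "reward_terms*.csv")
    for path in sorted(glob.glob(pattern)):
        with open(path, newline="") as handle:
            rows = list(csv.reader(handle))
        # Assumption: the first row holds the column names.
        header = rows[0] if rows else []
        summary[os.path.basename(path)] = (header, max(len(rows) - 1, 0))
    return summary


if __name__ == "__main__":
    for name, (columns, n_rows) in summarize_reward_files().items():
        print(f"{name}: {n_rows} rows, columns = {columns}")
```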
