Deep generative models are rapidly becoming popular tools for generating new molecules and optimizing their chemical properties. In this work we introduce a VAE model based on the grammar and semantics of molecular sequences: the Syntax-Directed Variational Autoencoder (SD-VAE) [1].
This repository provides instructions for training an SD-VAE model.
You can download the data from [datalink](https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/molecular_generation/data_SD_VAE.tgz), then:

- create a folder `./data`
- unzip the file into `./data`
```
data (project root)
|__ data_SD_VAE
|__ |__ context_free_grammars
|__ |__ zinc
```
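Equivalently, the download and extraction can be scripted. Below is a minimal Python sketch; the URL and the `./data` target folder are the ones given above, everything else is just standard-library plumbing:

```python
import os
import tarfile
import urllib.request

URL = ('https://baidu-nlp.bj.bcebos.com/PaddleHelix/datasets/'
       'molecular_generation/data_SD_VAE.tgz')

# Download the archive and extract it into ./data
os.makedirs('data', exist_ok=True)
urllib.request.urlretrieve(URL, 'data_SD_VAE.tgz')
with tarfile.open('data_SD_VAE.tgz') as tar:
    tar.extractall('./data')
```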
Before training/evaluation, we need to preprocess the raw text dataset:
```bash
cd data_preprocessing

python make_dataset_parallel.py \
    -info_fold ../data/data_SD_VAE/context_free_grammars \
    -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
    -smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi

python dump_cfg_trees.py \
    -info_fold ../data/data_SD_VAE/context_free_grammars \
    -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
    -smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi
```
The two scripts above compile the text data into a binary file and a CFG dump, respectively.
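As a quick sanity check before preprocessing, the raw dataset can be inspected directly; a small Python sketch, assuming the standard ZINC-250k format of one SMILES string per line:

```python
# Peek at the raw SMILES file (path as in the layout above).
with open('../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi') as f:
    smiles = [line.strip() for line in f if line.strip()]

print(len(smiles))   # roughly 250k molecules
print(smiles[:3])    # first few SMILES strings
```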
The model config contains the parameters used to build the model graph. They are saved in `model_config.json`:
"latent_dim":the hidden size of latent space
"max_decode_steps": maximum steps for making decoding decisions
"eps_std": the standard deviation used in reparameterization tric
"encoder_type": the type of encoder
"rnn_type": The RNN type
To train the model, we need to set the training parameters. The default parameters are saved in `args.py`:
- `loss_type`: the type of loss
- `num_epochs`: the number of training epochs
- `batch_size`: the minibatch size
- `learning_rate`: the learning rate
- `kl_coeff`: the coefficient of the KL-divergence term in the VAE loss
- `clip_grad`: clip gradients to this value
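To see where `kl_coeff` enters, here is a minimal sketch of the standard VAE objective (illustrative only; the repo's actual loss lives in the training code):

```python
import numpy as np

def vae_loss(recon_loss, mu, log_sigma, kl_coeff=1.0):
    """Total loss = reconstruction loss + kl_coeff * KL(q(z|x) || N(0, I)).

    Uses the closed-form KL divergence between a diagonal Gaussian
    N(mu, sigma^2) and the standard normal prior.
    """
    kl = -0.5 * np.sum(1 + 2 * log_sigma - mu**2 - np.exp(2 * log_sigma))
    return recon_loss + kl_coeff * kl
```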
To run the training script:

```bash
CUDA_VISIBLE_DEVICES=0 python train_zinc.py \
    -mode='gpu'
```
You can download the trained model from [SD_VAE_model.tgz](https://baidu-nlp.bj.bcebos.com/PaddleHelix/models/molecular_generation/SD_VAE_model.tgz). Unzip the file and put the model into the `./model` folder:

```
|__ model
|__ |__ train_model_epoch499
```
Sample from the normal-distribution prior:

```bash
python sample_prior.py \
    -info_fold ../data/data_SD_VAE/context_free_grammars \
    -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
    -model_config ../model_config.json \
    -saved_model ../model/train_model_epoch499
```
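Conceptually, prior sampling draws latent vectors from a standard normal and decodes them. A minimal sketch of that idea; the decoder call is hypothetical, and the real logic lives in `sample_prior.py`:

```python
import numpy as np

latent_dim = 128                 # must match "latent_dim" in model_config.json
num_samples = 100

# The VAE prior is a standard normal over the latent space.
z = np.random.randn(num_samples, latent_dim).astype(np.float32)

# Each latent vector is then decoded into a molecule. In SD-VAE the decoder
# emits grammar production rules, so outputs are syntactically constrained.
# smiles = model.decode(z)       # hypothetical call; see sample_prior.py
```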
Reconstruct from the reference sequences:

```bash
python reconstruct_zinc.py \
    -info_fold ../data/data_SD_VAE/context_free_grammars \
    -grammar_file ../data/data_SD_VAE/context_free_grammars/mol_zinc.grammar \
    -model_config ../model_config.json \
    -saved_model ../model/train_model_epoch499 \
    -smiles_file ../data/data_SD_VAE/zinc/250k_rndm_zinc_drugs_clean.smi
```
Metrics of the sampled molecules:

- valid: 0.49
- unique@100: 1.0
- unique@1000: 1.0
- IntDiv: 0.92
- IntDiv2: 0.82
- Filters: 0.30

Reconstruction accuracy: 0.92
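These look like MOSES-style generation metrics. For reference, a minimal sketch of how such metrics are commonly computed with RDKit; this is illustrative only (the repo's evaluation code may differ, and IntDiv2 additionally uses a power mean of order 2 over the pairwise similarities):

```python
from itertools import combinations

from rdkit import Chem
from rdkit.Chem import AllChem, DataStructs

def validity(smiles_list):
    """Fraction of generated strings that RDKit parses into a molecule."""
    return sum(Chem.MolFromSmiles(s) is not None for s in smiles_list) / len(smiles_list)

def unique_at_k(smiles_list, k):
    """Fraction of distinct SMILES among the first k samples."""
    head = smiles_list[:k]
    return len(set(head)) / len(head)

def internal_diversity(smiles_list, radius=2, n_bits=2048):
    """IntDiv: 1 - average pairwise Tanimoto similarity of Morgan fingerprints."""
    mols = [Chem.MolFromSmiles(s) for s in smiles_list]
    fps = [AllChem.GetMorganFingerprintAsBitVect(m, radius, nBits=n_bits)
           for m in mols if m is not None]
    sims = [DataStructs.TanimotoSimilarity(a, b) for a, b in combinations(fps, 2)]
    return 1.0 - sum(sims) / len(sims)
```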
[1] Hanjun Dai, Yingtao Tian, Bo Dai, Steven Skiena, and Le Song. Syntax-Directed Variational Autoencoder for Structured Data. arXiv:1802.08786, 2018.