Lipophilicity is one of the factors determining the permeability of the cell membrane to a drug molecule, so accurate lipophilicity prediction is an essential step in the development of new drugs. We introduce a novel approach to encoding additional graph information by extracting molecular substructures. By adding a set of generalized atomic features of these substructures to an established Direct Message Passing Neural Network (D-MPNN), we achieve a new state-of-the-art result on the task of predicting two main lipophilicity coefficients, the logP and logD descriptors. We further improve our approach by employing multitask learning to predict logP and logD values simultaneously. Additionally, we present a study of model performance on symmetric and asymmetric molecules that may yield insights for further research.
The figure below shows the overall network architecture of our method named StructGNN.
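In code terms, the architecture amounts to concatenating the D-MPNN molecule encoding with a separate encoding of the extracted substructures before the final feed-forward network. The PyTorch sketch below illustrates this idea; the module names, dimensions, and the two-output multitask head are illustrative assumptions, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class StructGNNSketch(nn.Module):
    """Minimal sketch of the StructGNN idea: a molecule-level D-MPNN encoding is
    concatenated with an encoding of extracted substructures before the final FFN.
    Module names, sizes, and the two-output multitask head are illustrative
    assumptions, not the repository's exact code."""

    def __init__(self, dmpnn_encoder, substructure_encoder,
                 hidden_size=800, substructures_hidden_size=300, num_tasks=2):
        super().__init__()
        self.dmpnn_encoder = dmpnn_encoder                # standard chemprop D-MPNN encoder
        self.substructure_encoder = substructure_encoder  # encodes generalized substructure features
        self.ffn = nn.Sequential(
            nn.Linear(hidden_size + substructures_hidden_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, num_tasks),  # two outputs for the logP/logD multitask setting
        )

    def forward(self, mol_graph, substructures):
        h_mol = self.dmpnn_encoder(mol_graph)             # (batch, hidden_size)
        h_sub = self.substructure_encoder(substructures)  # (batch, substructures_hidden_size)
        return self.ffn(torch.cat([h_mol, h_sub], dim=-1))
```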
The following datasets have been used:
Dataset name | Number of Samples | Description | Sources |
---|---|---|---|
logp_wo_logp_json_wo_averaging | 13688 | All logP datasets except logp.json | Diverse1KDataset.csv, NCIDataset.csv, ochem_full.csv, physprop.csv |
logd_Lip_wo_averaging | 4166 | Merged datasets without outlier (very soluble) molecules and with standardized SMILES; for SMILES with duplicate logD values, the most common value was kept | Lipophilicity |
logp_wo_logp_json_logd_Lip_wo_averaging | 17603 | Merged logP and logD datasets; 251 molecules have both logP and logD values | logp_wo_logp_json_wo_averaging, logd_Lip_wo_averaging |
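The "most common value" deduplication rule from the logD dataset description can be reproduced with a short pandas snippet; the column names and data below are hypothetical examples:

```python
import pandas as pd

# Toy example: the same SMILES appears with conflicting logD measurements.
df = pd.DataFrame({
    "smiles": ["CCO", "CCO", "CCO", "c1ccccc1"],
    "logd":   [-0.18, -0.18, -0.30, 2.13],
})

# For each SMILES keep the most common (modal) logD value.
deduplicated = (
    df.groupby("smiles")["logd"]
      .agg(lambda values: values.mode().iloc[0])
      .reset_index()
)
print(deduplicated)
```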
For a detailed description of StructGNN we refer the reader to the paper "Lipophilicity Prediction with Multitask Learning and Molecular Substructures Representation".
If you wish to cite this code, please use the following BibTeX entry:
@misc{lukashina2020lipophilicity,
      title={Lipophilicity Prediction with Multitask Learning and Molecular Substructures Representation},
      author={Nina Lukashina and Alisa Alenicheva and Elizaveta Vlasova and Artem Kondiukov and Aigul Khakimova and Emil Magerramov and Nikita Churikov and Aleksei Shpilman},
      year={2020},
      eprint={2011.12117},
      archivePrefix={arXiv},
      primaryClass={cs.LG}
}
Accepted to the Machine Learning for Molecules Workshop @ NeurIPS 2020.
There are 3 main folders:
- Jupyter Notebooks with EDA, data preprocessing, predictions analysis
- Data files
- Scripts for model training
This repository was built with the help of the chemprop repository (https://github.com/chemprop/chemprop).
To get a local copy up and running follow these simple steps.
To use chemprop with GPUs, you will need:
- cuda >= 8.0
- cuDNN
git clone https://github.com/jbr-ai-labs/lipophilicity-prediction.git
cd lipophilicity-prediction/scripts/SOTA/dmpnn
conda env create -f environment.yml
conda activate chemprop
pip install -e .
To train the model, you can either use the existing DVC pipeline or run training manually. The first step is common to both options.
- Set the parameters in `params.yaml`:
additional_encoder: True # set StructGNN architecture
file_prefix: <name of dataset without format and train/test/val prefix>
split_file_prefix: <name of dataset without format and train/test/val prefix for `train_val_data_preparation.py` script>
input_path: <path to split dataset>
data_path: <path to train_val dataset>
separate_test_path: <path to test dataset>
save_dir: <path to experiments logs>
epochs: <number of training epochs>
patience: <early stopping patience>
delta: <early stopping delta>
features_generator: [rdkit_wo_fragments_and_counts]
no_features_scaling: True
target_columns: <name of target column>
split_type: k-fold
num_folds: <number of folds>
substructures_hidden_size: 300
hidden_size: 800 # dmpnn ffn hidden size
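For illustration, a filled-in `params.yaml` for the merged logP dataset could look like the sketch below; the paths and hyperparameter values here are example assumptions, not the exact settings used in the paper:

```yaml
# Illustrative example only: paths and hyperparameter values are assumptions.
additional_encoder: True
file_prefix: logp_wo_logp_json_wo_averaging
split_file_prefix: logp_wo_logp_json_wo_averaging
input_path: ./data/raw/logp_wo_logp_json_wo_averaging.csv
data_path: ./data/processed/train_val.csv
separate_test_path: ./data/processed/test.csv
save_dir: ./experiments/structgnn_logp
epochs: 150
patience: 30
delta: 0.001
features_generator: [rdkit_wo_fragments_and_counts]
no_features_scaling: True
target_columns: logP
split_type: k-fold
num_folds: 5
substructures_hidden_size: 300
hidden_size: 800
```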
A full list of available arguments can be found in `dmpnn/chemprop/args.py`.
To run training manually:
- Run `python ./scripts/SOTA/dmpnn/train_val_data_preparation.py` to create the dataset for the cross-validation procedure
- Run `python ./scripts/SOTA/dmpnn/train.py --dataset_type regression --config_path_yaml ./params.yaml` to train the model

To use the DVC pipeline:
- Run the `dvc repro` command
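For reference, `dvc repro` executes the stages defined in the repository's DVC pipeline file. A hypothetical sketch of how the two commands above could be wired together is shown below; the stage names and dependencies are assumptions, and the actual pipeline definition may differ:

```yaml
# Hypothetical dvc.yaml sketch: the repository defines its own pipeline,
# which may differ in stage names, dependencies, and outputs.
stages:
  prepare_data:
    cmd: python ./scripts/SOTA/dmpnn/train_val_data_preparation.py
    deps:
      - params.yaml
  train:
    cmd: python ./scripts/SOTA/dmpnn/train.py --dataset_type regression --config_path_yaml ./params.yaml
    deps:
      - params.yaml
```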
Article - Analyzing Learned Molecular Representations for Property Prediction
Original Github Repo - https://github.com/chemprop/chemprop
All the requirements are the same as for StructGNN. The training procedure is also the same as for StructGNN, but set `additional_encoder: False` in `params.yaml`.
Article - Optimal Transport Graph Neural Networks
Original Github Repo - https://github.com/benatorc/OTGNN
conda create -n mol_ot python=3.6.8
sudo apt-get install libxrender1
conda install pytorch torchvision -c pytorch
conda install -c rdkit rdkit
conda install -c conda-forge pot
conda install -c anaconda scikit-learn
conda install -c conda-forge matplotlib
conda install -c conda-forge tqdm
conda install -c conda-forge tensorboardx
Prepare data and splits with the `1_data_preparation.ipynb` notebook.
Running cross-validation:
cd ./scripts/SOTA/otgnn/; python train_proto.py -data logp_wo_json -output_dir output/exp_200 -lr 5e-4 -n_splits 5 -n_epochs 100 -n_hidden 50 -n_ffn_hidden 100 -batch_size 16 -n_pc 20 -pc_size 10 -pc_hidden 5 -distance_metric wasserstein -separate_lr -lr_pc 5e-3 -opt_method emd -mult_num_atoms -nce_coef 0.01
Article - Junction Tree Variational Autoencoder for Molecular Graph Generation
Original Github Repo - https://github.com/wengong-jin/icml18-jtnn
conda create -n jtree python=2.7
conda install pytorch torchvision -c pytorch
conda install -c rdkit rdkit
conda install -c anaconda scikit-learn
conda install -c conda-forge matplotlib
conda install -c conda-forge tqdm
conda install -c conda-forge tensorboardx
The original article proposed training the autoencoder architecture on a reconstruction task; here we use only the encoder part for the regression task.
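A minimal sketch of this encoder-plus-regression-head setup is shown below; the encoder interface, dimensions, and class name are assumptions for illustration, and the actual code in `train_encoder_more_atom_feats_CV.py` may differ:

```python
import torch.nn as nn

class JTreeRegressorSketch(nn.Module):
    """Illustrative sketch: the pretrained junction-tree encoder is reused as a
    feature extractor and a small FFN regresses the target (e.g. logP) from its
    output. The encoder interface and sizes are assumptions, not the exact code."""

    def __init__(self, jtree_encoder, latent_size=56, hidden_size=100):
        super().__init__()
        self.encoder = jtree_encoder  # encoder part of the JT-VAE; the decoder is discarded
        self.head = nn.Sequential(
            nn.Linear(latent_size, hidden_size),
            nn.ReLU(),
            nn.Linear(hidden_size, 1),  # single regression target
        )

    def forward(self, mol_trees):
        z = self.encoder(mol_trees)   # (batch, latent_size) molecule embeddings
        return self.head(z).squeeze(-1)
```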
To run with prepared data, download the pickle file `SMILES_TO_MOLTREE.pickle` from Google Drive and place it in the `data/raw/baselines/jtree/` directory.
NB! The JTree vocabulary can raise exceptions for unknown substructures. To skip such molecules, run `2_encode_molecules.ipynb` with the appropriate data; it will save the necessary files to `data/raw/baselines/jtree/train_errs.txt` (and `val_errs.txt`, `test_errs.txt`).
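To drop those molecules before training, a dataset can be filtered against the saved error files along the lines of the sketch below; the one-SMILES-per-line error-file format and the `smiles` column name are assumptions:

```python
import pandas as pd

def drop_vocab_errors(data_csv, errs_txt):
    """Remove molecules whose substructures are missing from the JTree vocabulary.
    Assumes the error file lists one offending SMILES per line and the dataset
    has a 'smiles' column (both are assumptions for illustration)."""
    with open(errs_txt) as f:
        bad_smiles = {line.strip() for line in f if line.strip()}
    df = pd.read_csv(data_csv)
    return df[~df["smiles"].isin(bad_smiles)]

# Usage example:
# train = drop_vocab_errors("train.csv", "data/raw/baselines/jtree/train_errs.txt")
```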
Running with best parameters:
cd ./scripts/SOTA/jtree/; python train_encoder_more_atom_feats_CV.py --filename "exp" --epochs 200 --patience 35 --vocab_path '../../../data/raw/baselines/jtree/vocab.txt' --file_prefix logp_wo_logp_json_wo_averaging