This Python script is used to train, validate, test deep learning model for prediction of drug-target interaction (DTI) Deep learning model will be built by Keras with tensorflow. You can set almost hyper-parameters as you want, See below parameter description DTI, drug, target protein and their interaction data must be written as csv file format. And feature should be tab-delimited format for script to parse data. Basically, this script builds convolutional neural network on sequence. If you don't want convolutional neural network but traditional dense (fully connected) layers on provide protein feature, specify type of feature and feature length.
tensorflow > 1.0
keras > 2.0
numpy
pandas
scikit-learn
usage: DTI_deep.py [-h] [--test-name [TEST_NAME [TEST_NAME ...]]]
[--test-dti-dir [TEST_DTI_DIR [TEST_DTI_DIR ...]]]
[--test-drug-dir [TEST_DRUG_DIR [TEST_DRUG_DIR ...]]]
[--test-protein-dir [TEST_PROTEIN_DIR [TEST_PROTEIN_DIR ...]]]
[--with-label WITH_LABEL]
[--window-sizes [WINDOW_SIZES [WINDOW_SIZES ...]]]
[--protein-layers [PROTEIN_LAYERS [PROTEIN_LAYERS ...]]]
[--drug-layers [DRUG_LAYERS [DRUG_LAYERS ...]]]
[--fc-layers [FC_LAYERS [FC_LAYERS ...]]]
[--learning-rate LEARNING_RATE] [--n-epoch N_EPOCH]
[--prot-vec PROT_VEC] [--prot-len PROT_LEN]
[--drug-vec DRUG_VEC] [--drug-len DRUG_LEN]
[--activation ACTIVATION] [--dropout DROPOUT]
[--n-filters N_FILTERS] [--batch-size BATCH_SIZE]
[--decay DECAY] [--validation] [--predict]
[--save-model SAVE_MODEL] [--output OUTPUT]
dti_dir drug_dir protein_dir
All training, validation, test should follow specification to be parsed correctly by DeepConv-DTI
-
Model takes 3 types data as a set, Drug-target interaction data, target protein data, compound data.
-
They should be
.csv
format. -
For feature column, each dimension of features in columns should be delimited with tab (
\t
)
After three data are correctly listed, target protein data and compound data will be joined with drug-target data, generating DTI feature.
Drug target interaction data should be at least 2 columns Protein_ID
and Compound_ID
,
and should have Label
column except --test
case. Label
colmun has to have label 0
as negative and 1
as positive.
Protein_ID | Compound_ID | Label |
---|---|---|
PID001 | CID001 | 0 |
... | ... | ... |
PID100 | CID100 | 1 |
Because DeepConvDTI focuses on convolution on protein sequence, protein data specification is little different from other data.
If Sequence
column is specified in data and --prot-vec
is Convolution
, it will execute convolution on protein.
Or if you specify other type of column with --prot-vec
(i.e. Prot2Vec
), it will construct dense (fully connected) network
Protein_ID
column will be used as foreign key from Protein_ID
from Drug-target interaction data.
Protein_ID | Sequence | Prot2Vec |
---|---|---|
PID001 | MALAC....ACC | 0.539\t-0.579\t...\t0.39 |
Basically same with Target protein data, but no Convolution
.
Compound_ID
column will be used as forein key from Compound_ID
from Drug-target interaction data.
Compound_ID | morgan_r2 |
---|---|
CID001 | 0\t1\t...\t0\t1 |
dti_dir Training DTI information [drug, target, label]
drug_dir Training drug information [drug, SMILES,[feature_name,
..]]
protein_dir Training protein information [protein, seq,
[feature_name]]
For training model, you should input 3 files, DTI information file, drug information file and target protein information file, as their format is specified above.
DTI information file for training should have Label
column for training.
--validation Excute validation with independent data, will give AUC
and AUPR (No prediction result)
--predict Predict interactions of independent test set
DeepConvDTI script has two mode, validation and predict mode.
In validation mode, performances (AUC, AUPR, threshold for AUC and AUPR) on each step and selected hyperparameters are recorded.
In test step, prediction results for given test dataset will be reported after training.
--test-name [TEST_NAME [TEST_NAME ...]], -n [TEST_NAME [TEST_NAME ...]]
Name of test data sets
--test-dti-dir [TEST_DTI_DIR [TEST_DTI_DIR ...]], -i [TEST_DTI_DIR [TEST_DTI_DIR ...]]
Test dti [drug, target, [label]]
--test-drug-dir [TEST_DRUG_DIR [TEST_DRUG_DIR ...]], -d [TEST_DRUG_DIR [TEST_DRUG_DIR ...]]
Test drug information [drug, SMILES,[feature_name,
..]]
--test-protein-dir [TEST_PROTEIN_DIR [TEST_PROTEIN_DIR ...]], -t [TEST_PROTEIN_DIR [TEST_PROTEIN_DIR ...]]
Test Protein information [protein, seq,
[feature_name]]
--with-label WITH_LABEL, -W WITH_LABEL
Existence of label information in test DTI
You can input multiple datasets for validation or test with argument specifier.
In addition to DTI information file, drug information file and target protein information file, you need to name of validation or test datasets.
For test dataset, you can inform that test dataset has label or not with -W
value
--n-epoch N_EPOCH, -e N_EPOCH
The number of epochs for training or validation
The number of epoch for model training.
Validation will stop evaluating performance with specified epoch.
Test for given dataset will be executed after specified epoch.
--prot-vec PROT_VEC, -v PROT_VEC
Type of protein feature, if Convolution, it will
execute conlvolution on sequeunce
--prot-len PROT_LEN, -l PROT_LEN
Protein vector length
--drug-vec DRUG_VEC, -V DRUG_VEC
Type of drug feature
--drug-len DRUG_LEN, -L DRUG_LEN
Drug vector length
Parameters for parsing data.
Model will parse column with specified name of column for drug and target protein feature.
Also, you need to give length of feature, which will be parsed.
For special case, with -v Convolution
model will build convolution layer on Sequence
of dataset.
--window-sizes [WINDOW_SIZES [WINDOW_SIZES ...]], -w [WINDOW_SIZES [WINDOW_SIZES ...]]
Window sizes for model (only works for Convolution)
--protein-layers [PROTEIN_LAYERS [PROTEIN_LAYERS ...]], -p [PROTEIN_LAYERS [PROTEIN_LAYERS ...]]
Dense layers for protein
--drug-layers [DRUG_LAYERS [DRUG_LAYERS ...]], -c [DRUG_LAYERS [DRUG_LAYERS ...]]
Dense layers for drugs
--fc-layers [FC_LAYERS [FC_LAYERS ...]], -f [FC_LAYERS [FC_LAYERS ...]]
Dense layers for concatenated layers of drug and
target layer
--n-filters N_FILTERS, -F N_FILTERS
Number of filters for convolution layer, only works
for Convolution
Hyperparameters which determine shape of neural network.
If you want to deeper layer, you can write as -c 128 32
, which will construct two consecutive neural layer on input drug feature with 128 and 32 units
--activation ACTIVATION, -a ACTIVATION
Activation function of model
--dropout DROPOUT, -D DROPOUT
Dropout ratio
Other hyperparameters.
We don't recommend to use dropout
--batch-size BATCH_SIZE, -b BATCH_SIZE
Batch size
--learning-rate LEARNING_RATE, -r LEARNING_RATE
Learning late for training
--decay DECAY, -y DECAY
Learning rate decay
Hyperparameters for training
--save-model SAVE_MODEL, -m SAVE_MODEL
save model
Save model after training done.
--output OUTPUT, -o OUTPUT
Prediction output
Directory of .csv
format file.
In validation mode, it will contain results for every epoch
In test mode, it will contain prediction results for given DTI pairs.
For line example, if you have training dataset, toy_examples/training_dataset/training_dti.csv
, toy_examples/training_dataset/training_compound.csv
and toy_examples/training_dataset/training_protein.csv
with right specification.
You can validate model with validation dataset, toy_examples/validation_dataset/validation_dti.csv
, toy_examples/validation_dataset/validation_compound.csv
and toy_examples/validation_dataset/validation_protein.csv
by using this command line.
python DeepConvDTI.py ./toy_examples/training_dataset/training_dti.csv ./toy_examples/training_dataset/training_compound.csv ./toy_examples/training_dataset/training_protein.csv --validation -n validation_dataset -i ./toy_examples/validation_dataset/validation_dti.csv -d ./toy_examples/validation_dataset/validation_compound.csv -t ./toy_examples/validation_dataset/validation_protein.csv -W -c 512 128 -w 10 15 20 25 30 -p 128 -f 128 -r 0.0001 -n 30 -v Convolution -l 2500 -V morgan_fp_r2 -L 2048 -D 0 -a elu -F 128 -b 32 -y 0.0001 -o ./validation_output.csv -m ./model.model -e 1
This command will train model with given hyper-parameters for 1 epoch. (because of -e 1).
And resulting in validation result ./validation_output.csv
, and corresponding model ./model.model
There are two ways to predict DTIs with command line
When you have toy_examples/test_dataset/test_dti.csv
, toy_examples/test_dataset/test_compound.csv
and toy_examples/test_dataset/test_protein.csv
python DeepConvDTI.py ./toy_examples/training_dataset/training_dti.csv ./toy_examples/training_dataset/training_compound.csv ./toy_examples/training_dataset/training_protein.csv --predict -n predict -i ./toy_examples/test_dataset/test_dti.csv -d ./toy_examples/test_dataset/test_compound.csv -t ./toy_examples/test_dataset/test_protein.csv -c 512 128 -w 10 15 20 25 30 -p 128 -f 128 -r 0.0001 -n 30 -v Convolution -l 2500 -V morgan_fp_r2 -L 2048 -D 0 -a elu -F 128 -b 32 -y 0.0001 -o ./test_output.csv -m ./model.model -e 15 -W
With this command, model will be trained with training dataset for 15 epochs and predict test dataset when it finished training, resulting in test_output.csv
which have prediction score and its true label (because of -W)
The second way of prediction is using predict_with_model.py. If you have model which is saved from validation or something, you can use it.
python predict_with_model.py ./model.model -n predict -i ./toy_examples/test_dataset/test_dti.csv -d ./toy_examples/test_dataset/test_compound.csv -t ./toy_examples/test_dataset/test_protein.csv -v Convolution -l 2500 -V morgan_fp_r2 -L 2048 -W -o test_result.csv
this code will result in same result file with first command.
In addition, you can evaluate the performances of prediction results with label by using evaluate_performance.py.
When you have an optimal threshold from validation and the names of the test dataset.
python evaluate_performance.py test_result.csv -n predict -T 0.2
This command will report performances.
DeepConv-DTI follow GPL 3.0v license. Therefore, DeepConv-DTI is open source and free to use for everyone.
However, compounds which are found by using DeepConv-DTI follows CC-BY-NC-4.0. Thus, those compounds are freely available for academic purpose or individual research, but restricted for commecial use.