The growing use of chemicals in various industries highlights the need for effective methods to predict their environmental impact, especially in terms of biodegradability. As biodegradation is a crucial mechanism for the removal of organic chemicals in natural systems, accurately predicting a chemical's biodegradability is essential for mitigating environmental risks and improving the design of more sustainable chemicals.
This is particularly important for chemicals that enter aquatic environments, whether in large or small quantities. Estimating their biodegradability is necessary for assessing the full scope of their potential hazards. Therefore, developing reliable methods to quickly and accurately evaluate biodegradability is critical to ensuring safer chemical use and minimizing environmental harm.
This project aims to build an effective and reliable QSAR (Quantitative Structure-Activity Relationship) model that accurately predicts the biodegradability of chemical compounds. Four predictive modeling techniques are employed: K-Nearest Neighbors (KNN), Decision Tree, Neural Network, and Logistic Regression. The goal is to compare the performance of these models on accuracy, recall, precision, F1-score, and confusion matrix metrics.
The dataset used is a QSAR biodegradation dataset, which includes molecular descriptors for various chemical compounds, and their corresponding biodegradability class (RB or NRB).
The QSAR biodegradation dataset was built in the Milano Chemometrics and QSAR Research Group (Università degli Studi Milano-Bicocca, Milano, Italy). The research leading to these results has received funding from the European Community's Seventh Framework Programme [FP7/2007-2013] under Grant Agreement n. 238701 of Marie Curie ITN Environmental Chemoinformatics (ECO) project.
The data have been used to develop QSAR (Quantitative Structure Activity Relationships) models for the study of the relationships between chemical structure and biodegradation of molecules. Biodegradation experimental values of 1055 chemicals were collected from the webpage of the National Institute of Technology and Evaluation of Japan (NITE). Classification models were developed in order to discriminate ready (356) and not ready (699) biodegradable molecules by means of three different modelling methods: k Nearest Neighbours, Partial Least Squares Discriminant Analysis and Support Vector Machines. Details on attributes (molecular descriptors) selected in each model can be found in the quoted reference: Mansouri, K., Ringsted, T., Ballabio, D., Todeschini, R., Consonni, V. (2013). Quantitative Structure - Activity Relationship models for ready biodegradability of chemicals. Journal of Chemical Information and Modeling, 53, 867-878.
Table 1: Sample of the QSAR Biodegradation Dataset
The dataset contains values for 41 attributes (molecular descriptors) used to classify 1055 chemicals into 2 classes (ready and not ready biodegradable). The attribute information for the 41 molecular descriptors and 1 experimental class is as follows:
- SpMax_L: Leading eigenvalue from Laplace matrix
- J_Dz(e): Balaban-like index from Barysz matrix weighted by Sanderson electronegativity
- nHM: Number of heavy atoms
- F01[N-N]: Frequency of N-N at topological distance 1
- F04[C-N]: Frequency of C-N at topological distance 4
- NssssC: Number of atoms of type ssssC
- nCb-: Number of substituted benzene C(sp2)
- C%: Percentage of C atoms
- nCp: Number of terminal primary C(sp3)
- nO: Number of oxygen atoms
- F03[C-N]: Frequency of C-N at topological distance 3
- SdssC: Sum of dssC E-states
- HyWi_B(m): Hyper-Wiener-like index (log function) from Burden matrix weighted by mass
- LOC: Lopping centric index
- SM6_L: Spectral moment of order 6 from Laplace matrix
- F03[C-O]: Frequency of C-O at topological distance 3
- Me: Mean atomic Sanderson electronegativity (scaled on Carbon atom)
- Mi: Mean first ionization potential (scaled on Carbon atom)
- nN-N: Number of N hydrazines
- nArNO2: Number of nitro groups (aromatic)
- nCRX3: Number of CRX3
- SpPosA_B(p): Normalized spectral positive sum from Burden matrix weighted by polarizability
- nCIR: Number of circuits
- B01[C-Br]: Presence/absence of C-Br at topological distance 1
- B03[C-Cl]: Presence/absence of C-Cl at topological distance 3
- N-073: Ar2NH / Ar3N / Ar2N-Al / R..N..R
- SpMax_A: Leading eigenvalue from adjacency matrix (Lovasz-Pelikan index)
- Psi_i_1d: Intrinsic state pseudoconnectivity index - type 1d
- B04[C-Br]: Presence/absence of C-Br at topological distance 4
- SdO: Sum of dO E-states
- TI2_L: Second Mohar index from Laplace matrix
- nCrt: Number of ring tertiary C(sp3)
- C-026: R--CX--R
- F02[C-N]: Frequency of C-N at topological distance 2
- nHDon: Number of donor atoms for H-bonds (N and O)
- SpMax_B(m): Leading eigenvalue from Burden matrix weighted by mass
- Psi_i_A: Intrinsic state pseudoconnectivity index - type S average
- nN: Number of Nitrogen atoms
- SM6_B(m): Spectral moment of order 6 from Burden matrix weighted by mass
- nArCOOR: Number of esters (aromatic)
- nX: Number of halogen atoms
- experimental class: ready biodegradable (RB) and not ready biodegradable (NRB)
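As a quick orientation, the sketch below loads the dataset into a feature matrix and a binary response. It assumes the UCI "biodeg.csv" file (semicolon-separated, no header row); the file name and column order are assumptions that follow the list above.

```python
# A minimal loading sketch, assuming the UCI biodeg.csv layout:
# 41 descriptor columns followed by the experimental class column.
import pandas as pd

columns = [
    "SpMax_L", "J_Dz(e)", "nHM", "F01[N-N]", "F04[C-N]", "NssssC", "nCb-",
    "C%", "nCp", "nO", "F03[C-N]", "SdssC", "HyWi_B(m)", "LOC", "SM6_L",
    "F03[C-O]", "Me", "Mi", "nN-N", "nArNO2", "nCRX3", "SpPosA_B(p)", "nCIR",
    "B01[C-Br]", "B03[C-Cl]", "N-073", "SpMax_A", "Psi_i_1d", "B04[C-Br]",
    "SdO", "TI2_L", "nCrt", "C-026", "F02[C-N]", "nHDon", "SpMax_B(m)",
    "Psi_i_A", "nN", "SM6_B(m)", "nArCOOR", "nX", "experimental_class",
]
df = pd.read_csv("biodeg.csv", sep=";", header=None, names=columns)

X = df.drop(columns="experimental_class")           # 41 molecular descriptors
y = (df["experimental_class"] == "RB").astype(int)  # 1 = RB, 0 = NRB
print(X.shape, y.value_counts())                    # (1055, 41); 356 RB / 699 NRB
```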
For K-Nearest Neighbors (KNN) and Decision Tree, the chi-squared method is used for feature selection because chi-squared works well with a categorical target. The dataset response is a binary classification, with 'RB' and 'NRB' as its class values, which is equivalent to a true/false value in binary classification. Chi-squared is therefore the best-suited feature selection method on this dataset for both predictive models, K-Nearest Neighbors and Decision Tree.
Ten of the 41 features, together with the response, were chosen for plotting. These are the ten features with the lowest p-values under the chi-squared test, i.e., the ten features most strongly associated with the response. The top ten features selected by the chi-squared method are:
Table 2: The Selected Features for KNN and Decision Tree
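The selection step can be sketched as follows with scikit-learn. Note that chi2 requires non-negative inputs, so a MinMaxScaler is applied first; that rescaling is an assumption, since the original preprocessing is not specified.

```python
# A sketch of chi-squared feature selection for the KNN and Decision Tree
# models; X and y come from the loading sketch above.
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.preprocessing import MinMaxScaler

X_scaled = MinMaxScaler().fit_transform(X)      # chi2 needs values >= 0
selector = SelectKBest(score_func=chi2, k=10)
X_top10 = selector.fit_transform(X_scaled, y)   # keep the 10 best features

# Score and p-value per descriptor; lower p-values indicate a stronger
# association with the RB/NRB class.
for name, score, p in zip(X.columns, selector.scores_, selector.pvalues_):
    print(f"{name}: score={score:.3f}, p={p:.4f}")
```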
For the Neural Network and Logistic Regression, the feature selection method is ANOVA (Analysis of Variance). ANOVA is a good fit because the input variables are numerical and the target output is categorical, with classes 'RB' and 'NRB'. The dataset response is a binary classification equivalent to a true/false value. As a result, ANOVA is the best-suited feature selection method on the QSAR dataset for both predictive models, the Neural Network and Logistic Regression.
Filter techniques were used for feature selection. A statistical test measures the correlation between each input variable and the target variable: it computes a score and a probability (p) value on the QSAR dataset to quantify the strength of the relationship between each feature and the target response. The lower the p-value, and the higher the score, the stronger the relationship with the response. Since the p-value of every feature is less than 0.05, all features were selected. The list of features with their p-values and scores is shown below.
Table 3: The Selected Features for Neural Network and Logistic Regression
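A short sketch of this ANOVA F-test filter, keeping every feature whose p-value is below 0.05 (X and y are from the loading sketch above):

```python
# ANOVA F-test feature selection for the Neural Network and Logistic
# Regression models.
from sklearn.feature_selection import f_classif

f_scores, p_values = f_classif(X, y)
selected = [name for name, p in zip(X.columns, p_values) if p < 0.05]
print(f"{len(selected)} of {X.shape[1]} features selected")
```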
Table 4: Comparisons Between K-Nearest Neighbors (KNN) and Decision Tree
Table 5: Comparisons Between Neural Network and Logistic Regression
For K-Nearest Neighbors and Decision Tree, the split ratio is 60% training, 20% validation, and 20% testing. The parameters for both predictive models are shown in Table 6.
Table 6: Parameter of the KNN and Decision Tree Predictive Models
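A sketch of the 60/20/20 split described above, done as two successive train_test_split calls; it assumes the chi-squared-selected features from the earlier sketch, and the random_state is illustrative.

```python
from sklearn.model_selection import train_test_split

# First carve off 20% for testing, then split the remaining 80% into
# training and validation (0.25 * 0.8 = 0.2 of the original data).
X_tmp, X_test, y_tmp, y_test = train_test_split(
    X_top10, y, test_size=0.20, random_state=42, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_tmp, y_tmp, test_size=0.25, random_state=42, stratify=y_tmp)
```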
For the K-Nearest Neighbors model, we first need to find the best k value for training. We use validation to determine the optimized value of k: the model is trained and scored against the validation set for a range of k values, and the k with the highest score is selected, as in the sketch below.
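A minimal sketch of that validation search (the candidate range of k is illustrative):

```python
# Train on the training set for each candidate k and keep the k with the
# highest validation accuracy.
from sklearn.neighbors import KNeighborsClassifier

best_k, best_score = None, 0.0
for k in range(1, 31):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    score = knn.score(X_val, y_val)
    if score > best_score:
        best_k, best_score = k, score
print(f"best k = {best_k} (validation accuracy = {best_score:.3f})")
```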
Figure 1: Result of the Best k Value
Figure 2: Analysis for the KNN Model Result
Figure 3: Decision Boundary of the KNN Classifier
The following report shows the Decision Tree accuracy, precision, recall, F1-score, support, and confusion matrix:
Figure 4: Analysis for Decision Tree Model Result
The diagram below shows a decision tree model with a max depth of 11.
Figure 5: Decision Tree Model Diagram
Figure 6: Decision Boundary of the Decision Tree Classifier
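The Decision Tree results above can be reproduced with a sketch along these lines, using the max depth of 11 noted earlier and the 60/20/20 split from before:

```python
# Train the Decision Tree and print its confusion matrix and
# classification report (accuracy, precision, recall, F1-score, support).
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report, confusion_matrix

tree = DecisionTreeClassifier(max_depth=11, random_state=42)
tree.fit(X_train, y_train)
y_pred = tree.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred, target_names=["NRB", "RB"]))
```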
For the Neural Network and Logistic Regression, the split ratio is 80% training and 20% testing. The parameters for both predictive models are shown in Table 7.
Table 7: Parameter of the Neural Network and Logistic Regression Predictive Models
Finding the optimal hyperparameters for the Neural Network model is the first step in determining how the model will perform. Grid search is the optimization technique used. The search concentrates on the batch size, number of epochs, optimizer algorithm, and weight initialization mode that work best for this model. The Adam optimizer is used in this model.
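A minimal sketch of that grid search, assuming a Keras model wrapped with scikeras; the layer sizes, search ranges, and the fresh 80/20 split are illustrative assumptions, not the exact configuration used in the project.

```python
from scikeras.wrappers import KerasClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from tensorflow import keras

def build_model(init_mode="uniform"):
    # Small feed-forward network; 41 inputs match the ANOVA-selected descriptors.
    return keras.Sequential([
        keras.layers.Input(shape=(41,)),
        keras.layers.Dense(16, activation="relu", kernel_initializer=init_mode),
        keras.layers.Dense(1, activation="sigmoid", kernel_initializer=init_mode),
    ])

# The 80/20 split used for the Neural Network and Logistic Regression models.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=42, stratify=y)

clf = KerasClassifier(model=build_model, loss="binary_crossentropy",
                      optimizer="adam", verbose=0)
param_grid = {
    "batch_size": [10, 20, 40],
    "epochs": [50, 100],
    "optimizer": ["adam", "rmsprop"],
    "model__init_mode": ["uniform", "normal"],
}
grid = GridSearchCV(clf, param_grid, cv=3, scoring="accuracy")
grid_result = grid.fit(X_train, y_train)
print(grid_result.best_score_, grid_result.best_params_)
```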
Figure 7: Neural Network - Result of the Best-Optimized Parameter
Figure 8: Neural Network - Accuracy of Training and Test Dataset
Figure 9: Neural Network - Accuracy, Confusion Matrix and Classification Report
Figure 10: Accuracy Curve for the Neural Network Model Over Epochs
Figure 11: Loss Curve for the Neural Network Model Over Epochs
Grid Search CV was used to build the Logistic Regression model. The following report shows the Logistic Regression accuracy, precision, recall, F1-score, support, and confusion matrix:
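A sketch of that grid search; the parameter grid is illustrative rather than the exact one used in the project, and the split comes from the 80/20 split above.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report

param_grid = {"C": [0.01, 0.1, 1, 10, 100], "solver": ["liblinear", "lbfgs"]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                    cv=5, scoring="accuracy")
grid.fit(X_train, y_train)
print(grid.best_params_)
print(classification_report(y_test, grid.predict(X_test)))
```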
Figure 12: Logistic Regression - Accuracy, Confusion Matrix and Classification Report
Figure 13: Hierarchy of Metrics from Raw Measurements or Labeled Data to F1-Score
Precision measures, of all the instances the model predicted as positive, how many are actually positive: Precision = TP / (TP + FP). It is the metric to prioritize when the cost of a false positive is high and false positives must be minimized.
Recall measures how many of the actual positives the model captures by labelling them as positive (true positives): Recall = TP / (TP + FN). It should be the metric used to select the best model when a false negative carries a high cost and we want to minimize the chance of missing positive cases.
Accuracy, computed as (TP + TN) / (TP + TN + FP + FN), is a good measure when the dataset is reasonably balanced and all types of outcomes matter equally. It is largely driven by the number of true negatives, which in most scenarios are not the main focus.
The F1-score, 2 x Precision x Recall / (Precision + Recall), is needed when we want a balance between precision and recall. It is the better measure when the dataset is imbalanced, i.e., there is an uneven class distribution (such as a large number of actual negatives).
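A short sketch tying the four metrics above to confusion-matrix counts; y_true and y_pred stand for the labels and predictions of any of the models above.

```python
# Derive accuracy, precision, recall, and F1 from the confusion matrix.
from sklearn.metrics import confusion_matrix

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = tp / (tp + fp)
recall    = tp / (tp + fn)
accuracy  = (tp + tn) / (tp + tn + fp + fn)
f1        = 2 * precision * recall / (precision + recall)
```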
In simple terms, a model is considered good if it achieves high accuracy on test or production data, i.e., it generalizes well to unseen data. In our opinion, an accuracy greater than 70% indicates good model performance.
Figure 14: KNN and Decision Tree Discussion
Based on the results of each model, the accuracy score of the K-Nearest Neighbors model is higher than that of the Decision Tree, i.e., its accuracy is closer to one. Both models produced the same recall, precision, and F1-score values. Therefore, we conclude that K-Nearest Neighbors is the better predictive model for the QSAR Biodegradation dataset.
Figure 15: Neural Network - Accuracy, Confusion Matrix and Classification Report
Figure 16: Logistic Regression - Accuracy, Confusion Matrix and Classification Report
Both models have the same accuracy value and almost the same precision, recall, and F1-score, so we can conclude that they perform equally. However, we consider Logistic Regression the better choice than the Neural Network because:
- Although a Neural Network can find more complex patterns in the data, which can lead to better performance, it is also more complex and harder to build and maintain.
- Logistic Regression has significantly lower training time and cost than a Neural Network.
- Logistic Regression has significantly lower inference time than a Neural Network when running the model and making predictions.