Introduction

This report examines the performance of an Intrusion Detection System (IDS) that distinguishes between ‘attack’ and ‘normal’ network traffic. The IDS was developed by training a deep neural network on the KDDCUP’99 dataset. Here, ‘deep neural network’ refers to a feedforward artificial neural network with two or more hidden layers. Learning is supervised: the researcher defines the ‘attack’ labels associated with the training examples. The report is structured as follows: dataset description, experimental design, results, and conclusions, including the limitations of the study.

Dataset Description

The KDDCUP’99 dataset contains network traffic data with 42 features and four types of attack, DoS (37%), probe (1.4%), R2L (0.7%) and U2R (0.03%), in addition to 'normal' connections (60.3%). Duplicate records were removed to prevent biased estimates and reduce the size of the dataset. Sixteen highly correlated features were dropped and replaced by four principal components (PCA). Categorical features were one-hot encoded, and feature crosses were created to test for interactions, giving 112 features across 145,586 rows. Descriptive statistics showed that many features had extreme skew and kurtosis, which was addressed by applying a cube-root transform. The dataset was z-score normalised to put features on a common scale and help smooth optimisation, then split into training, validation and test sets (70%/15%/15%). Linear separability was tested with the simplex method, which found no solution, indicating that the classes are not linearly separable. Additional features were also derived from five k-means clusters.
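The transform, normalisation and split steps described above can be sketched as follows. This is a minimal standard-library illustration on a made-up feature column, not the project's actual preprocessing code. The helper names (`cube_root`, `zscore_params`, `split_indices`) are hypothetical, and fitting the z-score statistics on the training portion only is an assumption made here, not something the report states.

```python
import math
import random
import statistics


def cube_root(x):
    """Signed cube root, so negative values are handled as well."""
    return math.copysign(abs(x) ** (1.0 / 3.0), x)


def zscore_params(column):
    """Mean and population standard deviation of one feature column."""
    mu = statistics.fmean(column)
    sigma = statistics.pstdev(column)
    # Guard against constant columns, which would give a zero divisor.
    return mu, (sigma if sigma > 0 else 1.0)


def split_indices(n_rows, fractions=(0.70, 0.15, 0.15), seed=0):
    """Shuffle row indices and cut them into train/validation/test parts."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    n_train = int(round(fractions[0] * n_rows))
    n_val = int(round(fractions[1] * n_rows))
    return idx[:n_train], idx[n_train:n_train + n_val], idx[n_train + n_val:]


# Toy single-feature column (made-up values, heavily right-skewed).
raw = [0.0, 1.0, 8.0, 27.0, 64.0, 125.0, 1000.0, -8.0, 1.0, 0.0,
       8.0, 27.0, 1.0, 0.0, 64.0, 8.0, 1.0, 0.0, 27.0, 1.0]

# 1. Cube-root transform to reduce skew and kurtosis.
transformed = [cube_root(v) for v in raw]

# 2. 70%/15%/15% split of the row indices.
train_idx, val_idx, test_idx = split_indices(len(transformed))

# 3. z-score normalisation using training-set statistics (assumption).
mu, sigma = zscore_params([transformed[i] for i in train_idx])
standardised = [(v - mu) / sigma for v in transformed]
```

Computing the mean and standard deviation on the training rows alone avoids leaking information from the validation and test sets; if the statistics were instead computed on the full dataset before splitting, step 3 would simply move ahead of step 2.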

Experiment Design

Neural networks were trained using TensorFlow in Python. To determine the best-performing network architecture and hyperparameters, a random subset of 10,000 rows was first drawn to approximate the full dataset and decrease computation time. Grid search was used to test different parameter combinations, with 10-fold cross-validation used to measure prediction accuracy and cross-entropy as the loss. Three tests were conducted, beginning with a broad search over many combinations and then narrowing down. The first test (see Table 1) covered 216 combinations for 2-layer networks, varying the number of neurons per layer, the optimiser, the number of epochs and the transfer function. Using the best-performing number of neurons per layer and epochs, the second test searched 6 combinations involving three additional optimisers. The third test took the best-performing hyperparameters and topology and searched 5 learning rates. The tests were timed to assess efficiency. In addition to accuracy, precision, recall and F1 score were computed from the confusion matrix.
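The search procedure can be sketched with the standard library alone. The code below is an illustration of grid search with k-fold cross-validation, not the project's code: `k_fold_indices`, `grid_search` and `dummy_score` are hypothetical names, and `dummy_score` is a stand-in for training a network on the training fold and returning its validation accuracy. The neuron counts and optimiser names are taken from Table 1; the epoch and transfer-function values are placeholders, since the report does not list the values searched.

```python
import itertools
import random


def k_fold_indices(n_rows, k=10, seed=0):
    """Yield (train_idx, val_idx) pairs; each row is held out exactly once."""
    idx = list(range(n_rows))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]


def grid_search(grid, score_fn, n_rows, k=10):
    """Return (best_params, best_mean_score) over every combination in grid."""
    names = list(grid)
    folds = list(k_fold_indices(n_rows, k))  # same folds for every combination
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = [score_fn(params, tr, va) for tr, va in folds]
        mean_score = sum(scores) / len(scores)
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score


grid = {
    "l1_neurons": [20, 50, 100],                                  # from Table 1
    "l2_neurons": [15, 40, 50],                                   # from Table 1
    "optimiser": ["traingd", "traingda", "traingdm", "traingdx"],  # from Table 1
    "epochs": [30, 60, 90],                # placeholder values (assumption)
    "transfer_fn": ["logsig", "tansig"],   # placeholder names (assumption)
}


def dummy_score(params, train_idx, val_idx):
    """Stand-in for: train on train_idx, return accuracy on val_idx."""
    return (params["l1_neurons"] + params["l2_neurons"]) / 150.0


best_params, best_score = grid_search(grid, dummy_score, n_rows=10_000)
```

Reusing the same folds for every combination keeps the comparison between hyperparameter settings fair, because each setting is scored on identical held-out rows.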

Table 1. Test 1: average accuracy by optimiser for combinations of layer 1 (rows) and layer 2 (columns) neuron counts

Optimiser	traingd			traingda			traingdm			traingdx
L1_neurons \ L2_neurons	15	40	50	15	40	50	15	40	50	15	40	50
20	0.51	0.73	0.76	0.71	0.85	0.87	0.50	0.72	0.75	0.67	0.81	0.84
50	0.62	0.71	0.79	0.77	0.84	0.89	0.62	0.71	0.77	0.74	0.81	0.86
100	0.58	0.81	0.82	0.77	0.90	0.90	0.59	0.80	0.78	0.72	0.88	0.86

Results

The best-performing network from the grid search tests used 90 epochs, 100 neurons in layer 1, 50 neurons in layer 2, the resilient backpropagation optimiser, logarithmic sigmoid transfer functions and a learning rate of 0.001. Training on the dataset with this configuration and predicting on the test set yielded an overall accuracy of 99.7%. Removing the cluster and PCA columns reduced this to 99.6%, suggesting they assisted with generalisation. Removing the feature-cross columns did not alter performance. Overall precision was 80% and recall was 81%; because the two are nearly equal, the F1 score was close to both at 80.6%. Recall is emphasised because failing to detect an attack (a false negative) has a greater impact than flagging normal traffic as an attack (a false positive). Because of the class imbalance and the smaller number of training examples for some attack types, recall varied by class: DoS had the highest at 99.9%, followed by probe at 96%, R2L at 92% and U2R at 67%, meaning that 33% of U2R attacks may be missed. Testing different threshold values did not improve recall. Owing to the class imbalance, recall on U2R also varied between tests.
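As an illustration of how per-class precision, recall and F1 follow from a confusion matrix, here is a minimal sketch. The three-class matrix contains made-up counts, not the project's results, and `per_class_metrics` is a hypothetical helper. The report does not state how its overall precision and recall were averaged across classes, so no averaging is shown.

```python
def per_class_metrics(confusion, labels):
    """confusion[i][j] = count of true class i predicted as class j."""
    metrics = {}
    for k, label in enumerate(labels):
        tp = confusion[k][k]
        fn = sum(confusion[k]) - tp                 # this class, predicted as another
        fp = sum(row[k] for row in confusion) - tp  # other classes, predicted as this
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


def accuracy(confusion):
    """Fraction of all examples that lie on the diagonal."""
    correct = sum(confusion[k][k] for k in range(len(confusion)))
    total = sum(sum(row) for row in confusion)
    return correct / total


# Made-up counts for illustration only (rows: true class, columns: predicted).
labels = ["normal", "dos", "u2r"]
confusion = [
    [90, 8, 2],
    [5, 95, 0],
    [1, 0, 2],
]

metrics = per_class_metrics(confusion, labels)
overall_accuracy = accuracy(confusion)
```

In this toy matrix the rare class has only three examples, so one missed attack drops its recall to two thirds while overall accuracy stays above 90%. This is the same pattern as the gap between the 99.7% accuracy and the 67% U2R recall reported above.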

Conclusions

The best-performing model achieved 99.7% accuracy in predicting whether traffic was normal or an attack, in line with benchmark performance in the literature [1]. Recall and precision were approximately 80% each; however, recall for minority classes could have been improved through additional testing. For instance, focal loss could be used instead of cross-entropy loss to address class imbalance by down-weighting the contribution of well-classified examples. Another strategy would be to undersample the majority class. Benchmarking the results against other algorithms could also provide additional insight. Alternative methods that may be more efficient for optimising network hyperparameters include genetic algorithms and Bayesian optimisation. Interpretability of the model was limited but could have been improved by using relative feature importance methods.
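To make the focal-loss suggestion concrete, the sketch below compares the two losses for a single example, given the probability the model assigns to the true class. It uses the standard focal loss form, -alpha * (1 - p)^gamma * log(p). The `gamma` and `alpha` defaults are illustrative choices, not values tested in this project.

```python
import math

EPS = 1e-12  # lower clamp so log(0) is never evaluated


def cross_entropy(p_true):
    """Cross-entropy loss for one example; p_true is the true-class probability."""
    return -math.log(max(p_true, EPS))


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss: cross-entropy scaled by (1 - p_true) ** gamma."""
    p = max(p_true, EPS)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


# A well-classified example (p = 0.9) and a poorly classified one (p = 0.1).
easy_ratio = focal_loss(0.9) / cross_entropy(0.9)  # (1 - 0.9)^2, about 0.01
hard_ratio = focal_loss(0.1) / cross_entropy(0.1)  # (1 - 0.1)^2, about 0.81
```

With `gamma=0` the focal loss reduces to plain cross-entropy. With `gamma=2` the well-classified example keeps about 1% of its cross-entropy loss while the poorly classified one keeps about 81%, so training concentrates on hard examples, which for this dataset would tend to come from the minority attack classes.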
