This project is based on the example given in:
https://towardsdatascience.com/a-tutorial-using-spark-for-big-data-an-example-to-predict-customer-churn-9078ac9a1e85
We analyze 19 GB of data, taken from the link below:
https://www.kaggle.com/mryanm/luflow-network-intrusion-detection-data-set
This dataset describes potential cases of malicious cyber intrusion.
The dataset contains many entries. Each entry describes the characteristics of a potential cyber threat and carries a label with one of the following values:
- Malicious
- Benign
- Outlier
Our goal is to train a model that can predict the labels of entries similar to those in these files.
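
Before a classifier can be trained, the three text labels usually have to be turned into numeric class indices. The following is a minimal sketch of that step in plain Python. It is an illustration only, not the project's actual code: the names LABEL_TO_INDEX, encode_label and encode_labels are hypothetical, the index order is an arbitrary assumption, and the labels are lowercased first because the exact casing used in the data files is not stated here.

```python
# Hypothetical mapping from label text to class index.
# The ordering (0, 1, 2) is an arbitrary assumption for illustration.
LABEL_TO_INDEX = {"benign": 0, "malicious": 1, "outlier": 2}


def encode_label(raw_label: str) -> int:
    """Normalize a raw label string and return its class index."""
    key = raw_label.strip().lower()
    if key not in LABEL_TO_INDEX:
        # Fail loudly instead of silently inventing a new class.
        raise ValueError(f"unknown label: {raw_label!r}")
    return LABEL_TO_INDEX[key]


def encode_labels(raw_labels):
    """Encode an iterable of raw label strings into a list of class indices."""
    return [encode_label(label) for label in raw_labels]
```

For example, encode_labels(["Malicious", "Benign"]) maps each label to its index under the assumed ordering, and an unexpected label raises a ValueError. In a Spark pipeline the same role is typically played by a built-in label indexer rather than hand-written code.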
Another solution to this problem, implemented in Python, is available on the Kaggle platform (see the link below):
https://www.kaggle.com/houssembenlahmar/prediction-of-intrusion