This is an experiment based on the KDD 2009 paper "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs".
The paper is available at https://cseweb.ucsd.edu/~voelker/pubs/mal-url-kdd09.pdf
- Build the 9 feature sets described in the paper (Feature Comparison);
- Use several classifiers (Naive Bayes, SVM and Logistic Regression) to validate the results on each of the 9 feature sets (Classification Comparison).
The URL dataset comes from https://github.com/Anmol-Sharma/URL_CLASSIFICATION_SYSTEM/blob/master/train_dataset.csv. Benign URLs come from DMOZ and malicious URLs come from PhishTank.
The URL dataset is Data/train_dataset.csv. The blacklist is Data/url_blacklists.txt.
Get_registrar.py scrapes the list of legitimate registrars from ICANN; the result is saved as Data/registrar.txt.
Build the 9 feature sets as the paper does:
- Basic Feature Set
- Botnet Feature Set
- Blacklist Feature Set
- Blacklist + Botnet Feature Set
- Whois Feature Set
- Host-based Feature Set
- Lexical Feature Set
- Full (Lexical + Host-based) Feature Set
- Full except Blacklist + Whois Feature Set
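
As an illustration of what a lexical feature set can look like, here is a minimal sketch loosely following the paper's description (URL length, hostname length, number of dots, and bag-of-words tokens from the hostname and path). The function name `lexical_features` and the feature key names are assumptions made for this example; they are not taken from this repository's code.

```python
import re
from urllib.parse import urlparse


def lexical_features(url):
    """Return a dict of simple lexical features for one URL (illustrative only)."""
    original = url
    # urlparse only fills in the host when a scheme is present, so add one if missing.
    if "://" not in url:
        url = "http://" + url
    parsed = urlparse(url)
    host = parsed.hostname or ""
    path = parsed.path + ("?" + parsed.query if parsed.query else "")

    features = {
        "url_length": len(original),
        "host_length": len(host),
        "num_dots": original.count("."),
    }
    # Bag-of-words: one binary feature per hostname token (split on ".") ...
    for token in host.split("."):
        if token:
            features["host_token=" + token] = 1
    # ... and per path token (split on "/", "?", ".", "=", "-" and "_").
    for token in re.split(r"[/?.=\-_]", path):
        if token:
            features["path_token=" + token] = 1
    return features


example = lexical_features("http://www.example.com/login/index.php?id=1")
print(example)
```

Dicts like this can be turned into a sparse feature matrix with scikit-learn's `DictVectorizer` before being passed to a classifier.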
Classification uses scikit-learn (https://scikit-learn.org/stable/).
The classification code is in main.py.
Classifiers:
- Gaussian Naive Bayes
- Multinomial Naive Bayes
- Linear SVM
- RBF SVM
- L1-regularized Logistic Regression
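
The five classifiers listed above could be compared along these lines. This is only a sketch, not the code in main.py: it uses a synthetic feature matrix from `make_classification` as a stand-in for the real URL features, and the hyperparameters (such as `max_iter=10000` and the `liblinear` solver) are assumptions chosen so the example runs cleanly.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC, LinearSVC

# Synthetic stand-in for a feature matrix; the real features come from the URL data.
X, y = make_classification(n_samples=400, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale to [0, 1] because MultinomialNB requires non-negative feature values.
scaler = MinMaxScaler().fit(X_train)
X_train = scaler.transform(X_train)
# Test values can fall slightly outside the training range, so clip them.
X_test = scaler.transform(X_test).clip(0, 1)

classifiers = {
    "Gaussian Naive Bayes": GaussianNB(),
    "Multinomial Naive Bayes": MultinomialNB(),
    "Linear SVM": LinearSVC(max_iter=10000),
    "RBF SVM": SVC(kernel="rbf"),
    # The liblinear solver supports the L1 penalty.
    "L1-regularized Logistic Regression": LogisticRegression(
        penalty="l1", solver="liblinear"
    ),
}

error_rates = {}
for name, clf in classifiers.items():
    clf.fit(X_train, y_train)
    # score() returns accuracy, so the error rate is its complement.
    error_rates[name] = 1.0 - clf.score(X_test, y_test)
    print(f"{name}: error rate {error_rates[name]:.3f}")
```

Running the same loop once per feature set would give the classifier-by-feature-set comparison described at the top of this README.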
If you have any questions, please feel free to open an issue. And if you like the project, a star would be much appreciated. Thanks in advance!