URL_Reputation_Classification

1. Project Description

This is an experiment based on the KDD 2009 paper "Beyond Blacklists: Learning to Detect Malicious Web Sites from Suspicious URLs".

Build the 9 feature sets described in the paper (Feature Comparison);
Use several classifiers (Naive Bayes, SVM, and Logistic Regression) to validate the results on each of the 9 feature sets (Classification Comparison).

The URL dataset comes from https://github.com/Anmol-Sharma/URL_CLASSIFICATION_SYSTEM/blob/master/train_dataset.csv. Benign URLs come from DMOZ and malicious URLs come from PhishTank.

The URL dataset is Data/train_dataset.csv. The blacklist is Data/url_blacklists.txt.
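A blacklist can be turned into a binary feature by checking whether a URL's hostname appears in it. The snippet below is only a minimal sketch and not the code used in this repository: it assumes the blacklist file holds one host or URL per line, and the helper names (hostname_of, load_blacklist, in_blacklist) are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Matches a leading scheme such as "http://" or "ftp://".
SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")


def hostname_of(url):
    """Return the lowercase hostname of a URL, or '' if none can be parsed."""
    url = url.strip()
    # urlsplit only recognizes the host when a scheme or '//' prefix is present.
    if not SCHEME.match(url):
        url = "//" + url
    return (urlsplit(url).hostname or "").lower()


def load_blacklist(lines):
    """Build a set of hostnames from text lines.

    Assumption: one host or URL per line; blank lines and lines
    starting with '#' are skipped.
    """
    hosts = set()
    for line in lines:
        entry = line.strip()
        if not entry or entry.startswith("#"):
            continue
        host = hostname_of(entry)
        if host:
            hosts.add(host)
    return hosts


def in_blacklist(url, blacklist):
    """Binary feature: 1 if the URL's hostname is blacklisted, else 0."""
    host = hostname_of(url)
    return 1 if host and host in blacklist else 0
```

With a real file, the set could be built with something like load_blacklist(open("Data/url_blacklists.txt", encoding="utf-8")), provided the file matches the one-entry-per-line assumption.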

Get_registrar.py scrapes the list of legitimate registrars from ICANN; the result is saved as Data/registrar.txt.

The 9 feature sets are built as described in the paper.
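As an illustration of what one feature set might look like, here is a rough sketch of lexical features loosely following the paper: URL length, hostname length, number of dots, and bag-of-words tokens from the hostname and path. The function name lexical_features and the feature key names are assumptions made for this example, not the names used in this repository.

```python
import re
from urllib.parse import urlsplit

SCHEME = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*://")
# Path tokens are split on '/', '?', '.', '=', '-' and '_'.
PATH_DELIMS = re.compile(r"[/?.=\-_]+")


def lexical_features(url):
    """Return a dict of lexical features for a single URL."""
    raw = url.strip()
    # Add '//' so urlsplit can find the host when the scheme is missing.
    normalized = raw if SCHEME.match(raw) else "//" + raw
    parts = urlsplit(normalized)
    host = (parts.hostname or "").lower()
    path = parts.path
    if parts.query:
        path += "?" + parts.query

    features = {
        "url_length": len(raw),
        "host_length": len(host),
        "num_dots": raw.count("."),
    }
    # Bag-of-words: each distinct token becomes a binary feature.
    for token in host.split("."):
        if token:
            features["host_token=" + token] = 1
    for token in PATH_DELIMS.split(path):
        if token:
            features["path_token=" + token.lower()] = 1
    return features
```

Returning a plain dict keeps the sketch independent of any particular learning library; the dicts can later be converted into a sparse matrix.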

The classification code is in main.py.
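The comparison of the three classifiers could look roughly like the sketch below. This is not the content of main.py: it assumes scikit-learn is installed, that each URL is represented as a dict of non-negative features, and that MultinomialNB and LinearSVC stand in for "Naive Bayes" and "SVM". The function name compare_classifiers is hypothetical.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC


def compare_classifiers(train_feats, train_labels, test_feats, test_labels):
    """Fit each classifier on one feature set and return test accuracy per model.

    train_feats / test_feats are lists of feature dicts (one per URL);
    labels are 0 for benign and 1 for malicious.
    """
    # Turn the feature dicts into a sparse matrix; features that only
    # appear in the test set are ignored by transform().
    vectorizer = DictVectorizer()
    x_train = vectorizer.fit_transform(train_feats)
    x_test = vectorizer.transform(test_feats)

    # Assumed model choices; MultinomialNB needs non-negative features.
    classifiers = {
        "Naive Bayes": MultinomialNB(),
        "SVM": LinearSVC(),
        "Logistic Regression": LogisticRegression(max_iter=1000),
    }

    scores = {}
    for name, clf in classifiers.items():
        clf.fit(x_train, train_labels)
        scores[name] = clf.score(x_test, test_labels)
    return scores
```

Calling this function once per feature set yields an accuracy table with one row per feature set and one column per classifier.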

Classifier:

If you have any questions, please feel free to open an issue. If you like the project, a star would be appreciated. Thanks in advance!