Project Description
: The objective of this project is to classify documents into 9 categories using the large, uncompressed Microsoft Malware Classification Challenge dataset. The 9 malware categories are as follows:
- Ramnit
- Gatak
- Lollipop
- Kelihos_ver3
- Simda
- Tracur
- Kelihos_ver1
- Vundo
- Obfuscator.ACY
The files in the dataset contain only hexadecimal codes. The challenge is to design and develop a classification model that can assign around 2721 test documents to the 9 malware categories listed above.
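Because the files contain only hexadecimal codes, the "words" of a document are two-character byte tokens. The following is a minimal sketch of how such tokens could be extracted and weighted, not the project's actual code. It assumes each line begins with an address field followed by byte values, and that unreadable bytes appear as `??`; the function names, the sample lines, and the plain `tf * log(N / df)` weighting are all illustrative assumptions.

```python
import math
import re
from collections import Counter

# Assumption: a valid token is exactly two hexadecimal digits.
HEX_BYTE = re.compile(r"^[0-9A-Fa-f]{2}$")


def tokenize_hex_line(line):
    """Drop the leading address field and keep only valid two-digit hex tokens."""
    fields = line.split()
    return [f.upper() for f in fields[1:] if HEX_BYTE.match(f)]


def tokenize_document(lines):
    """Flatten all lines of one document into a single token list."""
    tokens = []
    for line in lines:
        tokens.extend(tokenize_hex_line(line))
    return tokens


def tf_idf(documents):
    """Return one {token: weight} dict per document, using tf * log(N / df)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())
        weights.append({
            token: (count / total) * math.log(n_docs / doc_freq[token])
            for token, count in counts.items()
        })
    return weights


# Hypothetical sample lines, made up for illustration.
doc_a = tokenize_document(["00401000 56 8D 44 24", "00401004 ?? ??"])
doc_b = tokenize_document(["00401000 56 00 00"])
weights = tf_idf([doc_a, doc_b])
```

In this toy example the token `56` occurs in both documents, so its weight is zero, while `00` occurs only in the second document and receives a positive weight. On the full dataset the same logic would need to run in parallel, for example across a Dask cluster.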
- Dask
- Google Cloud Platform (alternatively, you can use Coiled)
- Create a Dask cluster with a configuration sized for your dataset volume
- Connect to it through the web interface or SSH and open a Jupyter Notebook
- Parse the file and extract all the words
- Remove stopwords and punctuation
- Calculate TF-IDF values and create a dataframe
- Separate the dataset into training and testing datasets
- Take the training dataset and separate it by the target values
- Calculate statistical values such as the mean and standard deviation for the dataset
- Summarize the data by class
- Calculate the Gaussian Probability Density Function
- Estimate the class probabilities
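The classifier steps listed above (separating by target value, summarizing each class by mean and standard deviation, evaluating the Gaussian probability density, and estimating class probabilities) can be sketched in plain Python as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: the function names and toy feature values are hypothetical, the population standard deviation is used with a small floor to avoid division by zero, and scores are computed in log space for numerical stability.

```python
import math
from collections import defaultdict


def separate_by_class(rows, labels):
    """Group feature rows by their target label."""
    grouped = defaultdict(list)
    for row, label in zip(rows, labels):
        grouped[label].append(row)
    return grouped


def summarize(rows):
    """Return a (mean, std) pair for every feature column."""
    summaries = []
    for column in zip(*rows):
        mean = sum(column) / len(column)
        variance = sum((x - mean) ** 2 for x in column) / len(column)
        # Assumption: floor the std so constant features do not divide by zero.
        summaries.append((mean, max(math.sqrt(variance), 1e-6)))
    return summaries


def gaussian_log_pdf(x, mean, std):
    """Logarithm of the Gaussian probability density function."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def fit(rows, labels):
    """Summarize the data by class: {label: (prior, [(mean, std), ...])}."""
    grouped = separate_by_class(rows, labels)
    return {
        label: (len(group) / len(rows), summarize(group))
        for label, group in grouped.items()
    }


def predict(model, row):
    """Return the label with the highest log prior plus log likelihood."""
    scores = {}
    for label, (prior, summaries) in model.items():
        score = math.log(prior)
        for x, (mean, std) in zip(row, summaries):
            score += gaussian_log_pdf(x, mean, std)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy TF-IDF-like features, made up for illustration.
train_rows = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
train_labels = ["Ramnit", "Ramnit", "Gatak", "Gatak"]
model = fit(train_rows, train_labels)
prediction = predict(model, [0.85, 0.15])
```

Here `prediction` is `"Ramnit"`, because the test row sits at the mean of the Ramnit training rows. With real TF-IDF features each row would have one column per token.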
We estimated the class probabilities of the test documents with the trained Naive Bayes classifier and obtained an accuracy of around 66%. After switching the classifier to Logistic Regression, accuracy rose to almost 84%.
Please see the CONTRIBUTORS file for more details.
This project is licensed under the MIT License. See the LICENSE file for details.