dsp-uga/gregory-p1

Malware Classification

Team - Gregory

Project Description: The objective of this project is to classify documents into 9 malware categories using the large, uncompressed Microsoft Malware Classification Challenge dataset. The 9 malware categories are as follows:

  • Ramnit
  • Gatak
  • Lollipop
  • Kelihos_ver3
  • Simda
  • Tracur
  • Kelihos_ver1
  • Vundo
  • Obfuscator.ACY

The files in the dataset contain only hexadecimal codes. The challenge is to design and develop a classification model that can classify around 2721 test documents into the 9 malware categories listed above.
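As a rough illustration of how hexadecimal files can be turned into features, the sketch below tokenizes a hex dump and computes TF-IDF weights in plain Python. It assumes each line starts with an address field followed by two-character hex bytes, and that `??` marks an unreadable byte; the function names `tokenize_bytes_text` and `tf_idf` and the sample lines are hypothetical and are not taken from this repository or the dataset.

```python
import math
from collections import Counter


def tokenize_bytes_text(text):
    """Return the hex byte tokens of a hex dump, skipping the first field of each line."""
    tokens = []
    for line in text.splitlines():
        parts = line.split()
        # Assumption: the first field on each line is an address, not data.
        for tok in parts[1:]:
            # Assumption: "??" marks a byte that could not be read.
            if tok != "??":
                tokens.append(tok.upper())
    return tokens


def tf_idf(docs):
    """Take a list of token lists and return one {token: weight} dict per document."""
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each token once per document
    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            tok: (cnt / total) * math.log(n_docs / doc_freq[tok])
            for tok, cnt in counts.items()
        })
    return weights


# Made-up sample lines for illustration only.
doc_a = tokenize_bytes_text("00401000 56 8D 44 24\n00401010 ?? 56 00 00")
doc_b = tokenize_bytes_text("00401000 00 00 00 00")
scores = tf_idf([doc_a, doc_b])
```

A token that appears in every document gets weight zero under this unsmoothed formula, which is why library implementations often add smoothing to the inverse document frequency.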

Installation

Approach

  • Create a Dask cluster sized for your dataset volume
  • Connect to it through the web interface or SSH and open Jupyter Notebook
  • Parse the files and extract all the words
  • Remove stopwords and punctuation
  • Calculate TF-IDF values and create a dataframe
  • Separate the dataset into training and testing datasets
  • Take the training dataset and separate it by the target values
  • Calculate summary statistics, such as the mean and standard deviation, for the dataset
  • Summarize the data by class
  • Calculate the Gaussian Probability Density Function
  • Estimate the class probabilities
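
The last five steps describe a Gaussian Naive Bayes classifier. A minimal sketch of those steps in plain Python follows; it is not the project's actual code. The function names, the toy feature values, and the small floor added to the standard deviation are assumptions made for illustration, and the sketch works with log densities rather than raw probabilities to avoid numerical underflow.

```python
import math
from collections import defaultdict


def summarize_by_class(X, y):
    """Group rows by label and return {label: (prior, [(mean, std), ...])}."""
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append(row)
    summaries = {}
    for label, rows in groups.items():
        stats = []
        for column in zip(*rows):
            mean = sum(column) / len(column)
            var = sum((v - mean) ** 2 for v in column) / len(column)
            # Assumption: a tiny floor keeps the density finite for constant features.
            stats.append((mean, math.sqrt(var) + 1e-9))
        summaries[label] = (len(rows) / len(X), stats)
    return summaries


def log_gaussian_pdf(x, mean, std):
    """Log of the Gaussian probability density function."""
    return -0.5 * math.log(2 * math.pi * std ** 2) - (x - mean) ** 2 / (2 * std ** 2)


def predict(summaries, row):
    """Return the label with the highest log prior plus summed log densities."""
    best_label, best_score = None, -math.inf
    for label, (prior, stats) in summaries.items():
        score = math.log(prior)
        for x, (mean, std) in zip(row, stats):
            score += log_gaussian_pdf(x, mean, std)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy two-feature rows standing in for TF-IDF values (made up for illustration).
X_train = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y_train = ["Ramnit", "Ramnit", "Gatak", "Gatak"]
model = summarize_by_class(X_train, y_train)
predicted = predict(model, [0.15, 0.85])
```

The naive independence assumption is what allows the per-feature log densities to be summed for each class.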

Improvements

We estimated the class probabilities of the test documents with the trained Naive Bayes classifier and obtained an accuracy of around 66%. Switching the classifier to Logistic Regression raised the accuracy to almost 84%.
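
For readers unfamiliar with the second classifier, the sketch below shows a plain multinomial (softmax) logistic regression trained by stochastic gradient descent, together with an accuracy helper. It is only an illustration: the project's actual implementation and hyperparameters are not stated here, and the function names, learning rate, epoch count, and toy data are all assumptions.

```python
import math


def softmax(scores):
    """Convert raw class scores into probabilities in a numerically stable way."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def class_scores(W, b, row):
    """Linear score for each class: dot(W[k], row) + b[k]."""
    return [sum(w * x for w, x in zip(W[k], row)) + b[k] for k in range(len(W))]


def train_softmax_regression(X, y, n_classes, lr=0.5, epochs=500):
    """Fit weights with per-sample gradient descent on the cross-entropy loss."""
    n_features = len(X[0])
    W = [[0.0] * n_features for _ in range(n_classes)]
    b = [0.0] * n_classes
    for _ in range(epochs):
        for row, label in zip(X, y):
            probs = softmax(class_scores(W, b, row))
            for k in range(n_classes):
                # Gradient of the loss with respect to the score of class k.
                grad = probs[k] - (1.0 if k == label else 0.0)
                for j in range(n_features):
                    W[k][j] -= lr * grad * row[j]
                b[k] -= lr * grad
    return W, b


def predict_class(W, b, row):
    """Return the index of the class with the highest score."""
    scores = class_scores(W, b, row)
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy, linearly separable data (made up for illustration); labels are class indices.
X_toy = [[0.1, 0.9], [0.2, 0.8], [0.9, 0.1], [0.8, 0.2]]
y_toy = [0, 0, 1, 1]
W, b = train_softmax_regression(X_toy, y_toy, n_classes=2)
toy_predictions = [predict_class(W, b, row) for row in X_toy]
toy_accuracy = accuracy(y_toy, toy_predictions)
```

Unlike Gaussian Naive Bayes, this model does not assume that features are independent within a class, which is one possible reason a discriminative classifier can score higher on correlated features.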

Contributions

Please see the CONTRIBUTORS file for more details.

Authors

License

This project is licensed under the MIT License. See the LICENSE file for details.

References
