The goal of this project is to develop a machine learning pipeline that classifies malware (input as byte strings) as accurately as possible.
These instructions describe the prerequisites and steps to get the project up and running.
This project can be set up on the Google Cloud Platform using its Dataproc service for batch processing. Learn about Dataproc here: https://cloud.google.com/dataproc/docs/concepts/overview
We recommend you set up a virtual environment and install the software listed in requirements.txt. We use Python version 3.7.
For features, we used basic word counts, which are, by default, adjusted using additive (Laplace) smoothing.
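As a rough illustration of additive smoothing on word counts, the sketch below estimates word probabilities for a single sample. It is not the project's actual feature code: the function name `smoothed_word_probs`, the whitespace-separated hex token format, the toy vocabulary, and the default `alpha=1.0` are all assumptions made for the example.

```python
from collections import Counter


def smoothed_word_probs(tokens, vocabulary, alpha=1.0):
    """Estimate P(word) with additive smoothing over a fixed vocabulary.

    Each count is increased by `alpha`, so words that never occur in
    `tokens` still receive a small non-zero probability.
    """
    vocab = set(vocabulary)
    # Count only tokens that belong to the vocabulary.
    counts = Counter(t for t in tokens if t in vocab)
    total = sum(counts.values())
    denominator = total + alpha * len(vocab)
    return {word: (counts[word] + alpha) / denominator for word in vocab}


# Hypothetical sample: bytes written as whitespace-separated hex "words".
sample = "4D 5A 90 00 4D 5A"
vocab = ["00", "4D", "5A", "90", "FF"]
probs = smoothed_word_probs(sample.split(), vocab)
```

With `alpha=1.0` this is Laplace (add-one) smoothing. The word `FF` never appears in the sample, yet it still gets probability 1/11 rather than zero, which keeps a single unseen word from zeroing out a product of probabilities downstream.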
The output is a list of numerical predictions, each corresponding to a virus, which we feed to an online scoring app.
- datasets: contains much smaller subsets of our final training and testing datasets for setup and initial experiments
- features: csv files containing the features we find for our malware data
- notebooks: Jupyter notebook Python files, .jnb
- output: files with results as output from experiments
- main: master project branch for tested, working code accepted via pull requests
- zain: Meekail's development branch
- vance: Jonathan's development branch
- shihan: Shihan's development branch
See the Contributors file for more details.
This project is licensed under the MIT License. See LICENSE for more details.