MALCLEANSE

Overview

This repository is associated with our FSE'25 paper titled Mitigating Emergent Malware Label Noise in DNN-Based Android Malware Detection. In this paper, we first explore the characteristics of label noise in real-world Android malware datasets, which we refer to as EM label noise. Then, we design this tool to remove EM label noise. We start by assessing the model's uncertainty regarding the samples in the dataset, and based on these uncertainties, we use anomaly detection algorithms to eliminate this label noise.

We use AVPASS to obfuscate malware, and the state-of-the-art (SOTA) tool are Cleanlab and MalWhiteout, MalWhiteout includes 3 malware detectors: Drebin, CSBD, and MalScan.

Prerequisites:

The codes depend on Python 3.8.10. Before using the tool, some dependencies need to be installed. The list is as follows:

 tensorflow==2.9.1
 tensorflow-probability==0.17.0
 numpy==1.24.1
 scikit-learn==0.23.0
 pandas>=1.0.4
 cleanlab==1.0.1
 androguard==3.3.5
 absl-py==0.8.1
 python-magic-bin==0.4.14
 seaborn
 Pillow
 pyelftools
 capstone
 python-magic
 pyyaml
 multiprocessing-logging

Usage

0. Hyperparameters:

📊 noise_type: this parameter specifies the type of noise in the training data along with relevant configuration details. The naming convention follows the format:

1️⃣ Random Noise

Example: random_1_drebin_10
Explanation:
- random: Indicates that the noise type is random noise.
- 1: Represents the first experiment (each experiment is conducted three times for stability).
- drebin: Specifies the source dataset, in this case, the Drebin dataset. other optional: malradar
- 10: Indicates the noise ratio is 10%.

2️⃣ EM Noise

Example: thr_1_18
Explanation:
- thr represents EMLN; thr_variation represents EMLNO.
- 1: Represents the first experiment.
- 18: Indicates the VirusTotal threshold is set to 18.

📊 model_type: The types of uncertainty estimation models available for our experiments are options `vanilla`, `bayesian`, `mcdropout`, `deepensemble`.

1. Estimate uncertainty：We use Variational Bayesian Inference (VBI) to perform 5-fold cross-validation on the dataset with label noise to evaluate the uncertainty of each sample. Specifically, we generate 10 prediction probabilities for each sample. The differences among these 10 prediction probabilities represent the uncertainty for that sample.

 cd Training 

 python3 train_noise_model.py -mt vanilla -nt thr_1_10

2. Label Noise detection：We calculate the average cross-entropy of these 10 prediction probabilities and use it as a quantification metric for uncertainty. Then, using anomaly detection algorithms, the anomalous samples are identified as label noise samples.

 cd myexperiments

 python3 my_tool.py -nt thr_1_10 -detection_model if

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

MALCLEANSE

Overview

Prerequisites:

Usage

0. Hyperparameters:

📊 noise_type: this parameter specifies the type of noise in the training data along with relevant configuration details. The naming convention follows the format:

1️⃣ Random Noise

2️⃣ EM Noise

📊 model_type: The types of uncertainty estimation models available for our experiments are options `vanilla`, `bayesian`, `mcdropout`, `deepensemble`.

2. Label Noise detection：We calculate the average cross-entropy of these 10 prediction probabilities and use it as a quantification metric for uncertainty. Then, using anomaly detection algorithms, the anomalous samples are identified as label noise samples.

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 26 Commits
AVPASS		AVPASS
MalWhiteout		MalWhiteout
Training		Training
myexperiments		myexperiments
README.md		README.md

Dirtyboy1029/MALCLEANSE

Folders and files

Latest commit

History

Repository files navigation

MALCLEANSE

Overview

Prerequisites:

Usage

0. Hyperparameters:

📊 noise_type: this parameter specifies the type of noise in the training data along with relevant configuration details. The naming convention follows the format:

1️⃣ Random Noise

2️⃣ EM Noise

📊 model_type: The types of uncertainty estimation models available for our experiments are options vanilla, bayesian, mcdropout, deepensemble.

2. Label Noise detection：We calculate the average cross-entropy of these 10 prediction probabilities and use it as a quantification metric for uncertainty. Then, using anomaly detection algorithms, the anomalous samples are identified as label noise samples.

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

📊 model_type: The types of uncertainty estimation models available for our experiments are options `vanilla`, `bayesian`, `mcdropout`, `deepensemble`.

Packages