Volt

Problem Statement

The job of security analysts has become much more difficult as the projects they support have grown from a single machine to many systems, each deployed with its own configuration, software, versions, and containers. Before a system can be considered secure and ready for production, a platform security analyst must justify or mitigate the vulnerability findings reported against it. As the number of systems grows, the effort and complexity of writing those justifications grow sharply. To reduce that effort, we would like a tool that identifies situations where one mitigation would resolve multiple related CVEs and that assists in building the associated reports.

Project Objective

Create an Apache Spark application that aids cybersecurity analysts in the justification and adjudication process by applying TF-IDF and TF-IDF vectorization to provide keyword search and similarity search across CVEs.

Project Description

The project uses the NVD (National Vulnerability Database), a dataset of over 248k Common Vulnerabilities and Exposures (CVEs) that continues to grow; at its current size it is 1.2 GB of text data. The Spark application pre-processes the dataset with tokenization, normalization, TF-IDF, and TF-IDF vectorization. Because the pipeline is reusable, it can extend beyond the NVD dataset to data from other applications and services, such as TwistLock, Fortify, and SonarQube vulnerability reports, ultimately creating an extensible framework for ML and further big data applications. Additionally, the TF-IDF vectors can be used for document similarity search with cosine similarity, which takes advantage of the parallelization provided by Spark. Computing these similarities on demand would be expensive without the horizontal scaling that Spark provides.
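The pipeline described above (tokenization, normalization, TF-IDF vectorization, cosine similarity) can be illustrated with a minimal pure-Python sketch. This is not the project's Spark code: the function names and the three sample descriptions are hypothetical, and the smoothed IDF form, log((N + 1) / (df + 1)), is chosen here because it is the form documented for Spark MLlib's IDF.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Normalize to lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw term count times smoothed IDF, log((N + 1) / (df + 1)).
        vectors.append(
            {t: c * math.log((n_docs + 1) / (df[t] + 1)) for t, c in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine similarity of two sparse vectors stored as dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptions for illustration; not real NVD entries.
docs = [
    "Memory leak in the VM hypervisor allows denial of service",
    "Hypervisor memory leak lets a guest VM exhaust host memory",
    "Cross-site scripting in the web admin login form",
]
vectors = tfidf_vectors(docs)
sim_related = cosine_similarity(vectors[0], vectors[1])
sim_unrelated = cosine_similarity(vectors[0], vectors[2])
print(sim_related > sim_unrelated)
```

The two hypervisor descriptions share several weighted terms, so they score higher against each other than against the unrelated web finding. In Spark, the pairwise similarity step is the part that benefits from being distributed across executors.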

Application Flow

[Figure: application flow diagram]

Data Engine

[Figure: data engine diagram]

REST Service

[Figure: REST service diagram]

Requirements

  • Java 17
  • Apache Spark 3.5.1
  • Python 3.10

Python Packages

pip install pyspark
pip install nltk
pip install wordcloud

Running the Application Stack

First, open the Jupyter notebook:

jupyter lab ./VoltCVESolver.ipynb 

In JupyterLab:

Run -> Run all Cells

Wait for the notebook to complete.

Start the REST service:

./gradlew run

Using the Application

Once the application is started, wait 10 seconds for the REST service to complete initialization.

Using Postman or curl, issue a request to: localhost:7070/query?query=<keyword(s)/cve>

Examples

Keyword Search

Uses top TF-IDF score.

curl --location 'localhost:7070/query?query=vm'
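One plausible reading of "top TF-IDF score" is that documents are ranked by the weight they assign to the keyword. The sketch below shows that idea only; the function name, the placeholder document IDs, and the weights are hypothetical, and the service's actual ranking logic may differ.

```python
def rank_by_keyword(weights_by_doc, keyword, top_n=3):
    """Rank documents by the TF-IDF weight they assign to one keyword."""
    keyword = keyword.lower()
    scored = [
        (doc_id, weights[keyword])
        for doc_id, weights in weights_by_doc.items()
        if keyword in weights
    ]
    # Highest weight first; documents without the keyword are excluded.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical precomputed TF-IDF weights (term -> weight) per document.
weights = {
    "DOC-A": {"vm": 0.58, "memory": 0.29},
    "DOC-B": {"vm": 0.29, "leak": 0.69},
    "DOC-C": {"xss": 0.69},
}
ranked = rank_by_keyword(weights, "vm")
print(ranked)  # [('DOC-A', 0.58), ('DOC-B', 0.29)]
```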

Multi-Keyword Search

curl --location 'localhost:7070/query?query=vm%20memory%20leak'

CVE-to-CVE Search

curl --location 'localhost:7070/query?query=CVE-2023-34034'
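The same requests can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names are hypothetical, the host, port, and path are taken from the curl examples above, and the JSON parsing assumes the service returns JSON like the files in sample_output.

```python
import json
import urllib.parse
import urllib.request

# Host, port, and path as used in the curl examples.
BASE_URL = "http://localhost:7070/query"


def build_query_url(query, base_url=BASE_URL):
    """Build the request URL, percent-encoding spaces as %20."""
    params = urllib.parse.urlencode({"query": query}, quote_via=urllib.parse.quote)
    return f"{base_url}?{params}"


def run_query(query, timeout=30):
    """Send the query and parse the body as JSON (response format assumed)."""
    with urllib.request.urlopen(build_query_url(query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Prints the URL only; call run_query(...) once the REST service is up.
    print(build_query_url("vm memory leak"))
```

Calling run_query("CVE-2023-34034") with the service running would perform the CVE-to-CVE search shown above.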

Sample Output

Sample JSON output for each query type is provided in the sample_output directory.

  • Keyword Search: Keyword search with the keyword 'vm': keyword_search-vm.json

  • Multi-Keyword Search: With keywords 'vm memory leak': multi-keyword_search-vm-memory-leak.json

  • CVE Search: Search for the most similar CVE to 'CVE-2023-34055': cve_search-CVE-2023-34055.json