.ipynb_checkpoints
.vscode
app
conf
docs
gradle/wrapper
sample_output
.env
.gitignore
README.md
VoltCVESolver.ipynb
gradle.properties
gradlew
gradlew.bat
settings.gradle

README.md

Problem Statement

The job of security analysts has become much more difficult as the scale of the projects they support has grown from one machine to many systems, each deployed with its own configurations, software, versions, and containers. A platform security analyst must justify and mitigate the vulnerability findings reported against these systems to verify that they are secure and ready for production. As the number of systems has grown, the effort and complexity of justifying the findings have grown sharply. To reduce the effort required to write these justifications, we would like a tool that identifies situations where one mitigation would resolve multiple related CVEs and assists in building the associated reports.

Project Objective

Create an Apache Spark application that aids cybersecurity analysts in the justification and adjudication process by applying TF-IDF and TF-IDF vectorization to provide keyword search and similarity search across CVEs.

Project Description

The project uses the National Vulnerability Database (NVD), a dataset of more than 248k Common Vulnerabilities and Exposures (CVEs) that continues to grow; at its current size it is 1.2GB of text data. The Spark application pre-processes the dataset with tokenization, normalization, TF-IDF, and TF-IDF vectorization. Because the pipeline is reusable, it can extend beyond the NVD dataset to vulnerability reports from other applications and services such as TwistLock, Fortify, and SonarQube, ultimately providing an extensible framework for ML and further big data applications. The TF-IDF vectors also support document similarity search with cosine similarity, and Spark parallelizes the cosine similarity computation. Computing these similarities on demand would be expensive without the horizontal scaling that Spark provides.
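To make the TF-IDF and cosine similarity steps concrete, here is a minimal pure-Python sketch of the underlying math. It is not the project's Spark pipeline: the function names and the three sample descriptions are hypothetical, tokenization is reduced to lowercasing and whitespace splitting, and the smoothed IDF formula `log((N + 1) / (df + 1))` is assumed because it is the form Spark MLlib documents for its `IDF` estimator.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for real tokenization and normalization."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(doc) for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Raw term count times smoothed IDF (assumed formula, see the lead-in).
        vectors.append(
            {term: count * math.log((n_docs + 1) / (df[term] + 1))
             for term, count in tf.items()}
        )
    return vectors


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors; 0.0 if either is all zeros."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptions, not taken from the NVD dataset.
docs = [
    "memory leak in vm hypervisor",
    "vm hypervisor memory leak allows denial of service",
    "sql injection in login form",
]
vectors = tf_idf_vectors(docs)
print(cosine_similarity(vectors[0], vectors[1]))  # related descriptions
print(cosine_similarity(vectors[0], vectors[2]))  # unrelated descriptions
```

The first two descriptions share most of their terms, so they score higher than the unrelated pair. Comparing one CVE against every other CVE repeats this computation once per document, which is the work that Spark distributes across workers.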

Application Flow

Data Engine

REST Service

Requirements

Java 17
Apache Spark 3.5.1
Python 3.10

Python Packages

pip install pyspark
pip install nltk
pip install wordcloud

Running the Application Stack

First open the Jupyter Notebook

jupyter lab ./VoltCVESolver.ipynb

In JupyterLab:

Run -> Run All Cells

Wait for the notebook to complete.

Start the REST service

./gradlew run

Using the Application

Once the application has started, wait about 10 seconds for the REST service to complete initialization.

Using Postman or curl, issue a request to: localhost:7070/query?query=<keyword(s)/cve>

Examples

Keyword Search

Uses the top TF-IDF score for the keyword.

curl --location 'localhost:7070/query?query=vm'

Multi-Keyword Search

curl --location 'localhost:7070/query?query=vm%20memory%20leak'

CVE-to-CVE Search

curl --location 'localhost:7070/query?query=CVE-2023-34034'
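The curl requests above can also be issued from Python. The following is a minimal sketch using only the standard library; the helper names `build_query_url` and `run_query` are hypothetical, the host, port, and `query` parameter are taken from the curl examples, and the assumption that the response body is JSON is based on the JSON sample output shipped with the repository.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Host, port, and path as used in the curl examples.
BASE_URL = "http://localhost:7070/query"


def build_query_url(query, base_url=BASE_URL):
    """Percent-encode the query so spaces become %20, as in the curl examples."""
    return f"{base_url}?query={quote(query, safe='')}"


def run_query(query, timeout=30):
    """Send the request and parse the body as JSON (assumes a JSON response)."""
    with urlopen(build_query_url(query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Print the URLs only; call run_query(...) once the REST service is running.
    for query in ("vm", "vm memory leak", "CVE-2023-34034"):
        print(build_query_url(query))
```

Encoding the query in one place keeps multi-keyword searches correct without hand-writing `%20` in each request.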

Sample Output

Sample JSON for each query type is provided in the sample_output directory.

Keyword Search: keyword 'vm': keyword_search-vm.json
Multi-Keyword Search: keywords 'vm memory leak': multi-keyword_search-vm-memory-leak.json
CVE Search: search for the most similar CVE to 'CVE-2023-34055': cve_search-CVE-2023-34055.json
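To inspect the sample files without starting the service, a short script along these lines may help. It is a sketch only: the helper names are hypothetical, and because the JSON schema is not described here, the code reports just the top-level shape of each file and assumes nothing about field names.

```python
import json
from pathlib import Path


def load_sample_outputs(directory="sample_output"):
    """Parse every *.json file in the directory, keyed by file name."""
    return {
        path.name: json.loads(path.read_text(encoding="utf-8"))
        for path in sorted(Path(directory).glob("*.json"))
    }


def summarize(payload):
    """Describe the top-level shape without assuming any particular schema."""
    if isinstance(payload, dict):
        return f"object with keys {sorted(payload)}"
    if isinstance(payload, list):
        return f"array of {len(payload)} items"
    return type(payload).__name__


if __name__ == "__main__":
    # Run from the repository root so that sample_output resolves correctly.
    for name, payload in load_sample_outputs().items():
        print(f"{name}: {summarize(payload)}")
```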
