The job of security analysts has become much more difficult as the scale of the projects they support has grown from one machine to many systems, each deployed with its own configurations, software, versions, and containers. A platform security analyst must justify and mitigate vulnerability findings to verify that systems are secure and ready for production. As the number of systems involved has grown, the effort and complexity involved in justifying those findings has increased sharply. To reduce the effort required to write these justifications, we would like a tool that identifies situations where one mitigation would resolve multiple related CVEs and assists in building the associated reports.
Create an Apache Spark application that aids cybersecurity analysts in the justification and adjudication process by applying techniques such as TF-IDF vectorization to provide keyword search and similarity search across CVEs.
The project uses the National Vulnerability Database (NVD), a dataset of over 248,000 Common Vulnerabilities and Exposures (CVEs) that continues to grow; at its current size it is 1.2 GB of text data. The Spark application pre-processes the dataset with tokenization, normalization, and TF-IDF vectorization. Because the pipeline is reusable, it extends beyond the NVD dataset to vulnerability reports from other applications and services such as TwistLock, Fortify, and SonarQube, ultimately creating an extensible framework for ML and further big data applications. Additionally, the TF-IDF vectors support document similarity search with cosine similarity, which leverages the parallelization provided by Spark. Computing these similarities on demand would be expensive without the horizontal scaling that Spark provides.
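To make the TF-IDF and cosine similarity steps concrete, here is a minimal single-machine sketch in plain Python. It is not the project's Spark pipeline: the function names, the smoothed IDF formula, and the three sample descriptions are all assumptions chosen for illustration. It only shows the computation that the Spark application would distribute across a cluster.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Normalize to lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per input document."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: how many documents contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF (an assumption for this sketch): log((N + 1) / (df + 1)).
        vectors.append(
            {t: c * math.log((n_docs + 1) / (df[t] + 1)) for t, c in tf.items()}
        )
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical descriptions, not taken from the NVD.
corpus = {
    "doc-a": "memory leak in vm hypervisor allows denial of service",
    "doc-b": "vm hypervisor memory leak causes host crash",
    "doc-c": "cross site scripting in web login form",
}
ids = list(corpus)
vecs = dict(zip(ids, tfidf_vectors(list(corpus.values()))))

# Rank every other document by similarity to doc-a, most similar first.
ranked = sorted(
    ((other, cosine(vecs["doc-a"], vecs[other])) for other in ids if other != "doc-a"),
    key=lambda pair: pair[1],
    reverse=True,
)
print(ranked)
```

Every document must be compared against the query vector, so the cost grows with the size of the corpus; this is the part that benefits from running in parallel.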
- Java 17
- Apache Spark 3.5.1
- Python 3.10
pip install pyspark
pip install nltk
pip install wordcloud
jupyter lab ./VoltCVESolver.ipynb
In Jupyter lab:
Run -> Run all Cells
Wait for the notebook to complete.
./gradlew run
Once the application has started, wait 10 seconds for the REST service to complete initialization.
Using Postman or curl, issue a request to:
localhost:7070/query?query=<keyword(s)/cve>
Keyword searches are ranked by top TF-IDF score.
curl --location 'localhost:7070/query?query=vm'
curl --location 'localhost:7070/query?query=vm%20memory%20leak'
curl --location 'localhost:7070/query?query=CVE-2023-34034'
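The same requests can be issued from a script. The sketch below uses only the Python standard library; the helper names `build_query_url` and `query_service` are hypothetical, the `http` scheme is an assumption, and the response is assumed to be JSON because the sample outputs are JSON files. The response schema is not assumed.

```python
import json
from urllib.parse import quote, urlencode
from urllib.request import urlopen

# Host, port, and path as given above; the http scheme is an assumption.
BASE_URL = "http://localhost:7070/query"


def build_query_url(query):
    """Percent-encode the query so spaces become %20, as in the curl examples."""
    return BASE_URL + "?" + urlencode({"query": query}, quote_via=quote)


def query_service(query, timeout=30):
    """Send the request and parse the body as JSON (requires the service to be running)."""
    with urlopen(build_query_url(query), timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Print the URLs only; call query_service(...) once the REST service is up.
for q in ("vm", "vm memory leak", "CVE-2023-34034"):
    print(build_query_url(q))
```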
Sample JSON output for each query type is provided in the sample_output directory.
- Keyword Search: Keyword search with the keyword 'vm': keyword_search-vm.json
- Multi-Keyword Search: With keywords 'vm memory leak': multi-keyword_search-vm-memory-leak.json
- CVE Search: Search for the most similar CVE to 'CVE-2023-34055': cve_search-CVE-2023-34055.json