Code Graph Analysis Pipeline - Commands

Start an Analysis

An analysis is started with the script analyze.sh. To run all analysis steps, simply execute the following command:

./../../scripts/analysis/analyze.sh

👉 See scripts/examples/analyzeAxonFramework.sh as an example script that combines all the above steps for a Java project.
👉 See scripts/examples/analyzeReactRouter.sh as an example script that combines all the above steps for a Typescript project.
👉 See Code Structure Analysis Pipeline on how to do this within a GitHub Actions workflow.

Command Line Options

The analyze.sh command comes with these command line options:

  • --report Csv only generates CSV reports. This speeds up the report generation and doesn't depend on Python, Jupyter Notebook or any other related dependencies. The default value is All to generate all reports. Jupyter will only generate Jupyter Notebook reports. DatabaseCsvExport exports the whole graph database as a CSV file (performance intense; check if there are security concerns first).

  • --profile Neo4jv4 uses the older long-term-support (June 2023) version v4.4.x of Neo4j and suitable compatible versions of the plugins and jQAssistant. Neo4jv5 will explicitly select the newest (June 2023) version 5.x of Neo4j. Without setting a profile, the newest versions will be used. Other profiles can be found in the directory scripts/profiles.

  • --profile Neo4jv5-continue-on-scan-errors is based on the default profile (Neo4jv5) but uses the jQAssistant configuration template template-neo4jv5-jqassistant-continue-on-error.yaml to continue on scan errors instead of failing fast. This is temporarily useful when there is a known error that needs to be ignored. It is still recommended to use the default profile and fail fast if something is wrong. Other profiles can be found in the directory scripts/profiles.

  • --profile Neo4jv5-low-memory is based on the default profile (Neo4jv5) but uses only half of the memory (RAM) as configured in template-neo4j-low-memory.conf. This is useful for the analysis of smaller codebases with fewer resources. Other profiles can be found in the directory scripts/profiles.

  • --explore activates the "explore" mode where no reports are generated. Furthermore, Neo4j won't be stopped at the end of the script and will therefore continue running. This makes it easy to just set everything up but then use the running Neo4j server to explore the data manually.

Notes

  • Be sure to use Java 17 for Neo4j v5 and Java 11 for Neo4j v4
  • Use your own initial Neo4j password
  • For more details have a look at the script analyze.sh
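The notes above can be sketched as a small pre-flight check. This helper is hypothetical and not part of the repository's scripts; it only encodes the Java requirements stated above:

```shell
# Hypothetical pre-flight sketch (not part of the repository's scripts):
# Neo4j v5 needs Java 17, Neo4j v4 needs Java 11.
required_java_for_neo4j() {
  case "$1" in
    5*) echo 17 ;;
    4*) echo 11 ;;
    *)  echo "unknown" ;;
  esac
}

# Use your own initial Neo4j password instead of this placeholder.
export NEO4J_INITIAL_PASSWORD="change-me"
echo "Neo4j 5.x requires Java $(required_java_for_neo4j 5.9.0)"
```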

Examples

Start an analysis with CSV reports only

If only the CSV reports are needed, which are the result of Cypher queries and don't require any further dependencies (like Python), the analysis can be sped up with:

./../../scripts/analysis/analyze.sh --report Csv

Start an analysis with Jupyter reports only

If only the Jupyter reports are needed, e.g. when the CSV reports have already been generated, this can be done with:

./../../scripts/analysis/analyze.sh --report Jupyter

Start an analysis with PDF generation

Note: Generating a PDF from a Jupyter notebook using nbconvert takes some time and might even fail due to a timeout error.

ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION=true ./../../scripts/analysis/analyze.sh

Start an analysis without importing git log data

To speed up the analysis and get a smaller data footprint, you can switch off the git log data import of the "source" directory (if present) with IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none" as shown below, or choose IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" to reduce the data size by only importing monthly grouped changes instead of all commits.

IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none" ./../../scripts/analysis/analyze.sh

Only run setup and explore the Graph manually

To prepare everything for the analysis (including installation, configuration and preparation queries) and explore the graph manually without report generation, use this command:

./../../scripts/analysis/analyze.sh --explore

Generate Markdown References

Generate Cypher Reference

Change into the cypher directory e.g. with cd cypher and then execute the script generateCypherReference.sh with the following command:

./../scripts/documentation/generateCypherReference.sh

Generate Script Reference

Change into the scripts directory e.g. with cd scripts and then execute the script generateScriptReference.sh with the following command:

./documentation/generateScriptReference.sh

Generate Environment Variable Reference

Change into the scripts directory e.g. with cd scripts and then execute the script generateEnvironmentVariableReference.sh with the following command:

./documentation/generateEnvironmentVariableReference.sh

Validate Links in Markdown

The following command shows how to use markdown-link-check to check the links in Markdown files like README.md:

npx --yes markdown-link-check --quiet --progress --config=markdown-lint-check-config.json README.md COMMANDS.md GETTING_STARTED.md INTEGRATION.md CHANGELOG.md

Manual Setup

The manual setup is only documented for completeness. It isn't needed, since the analysis script also covers the download, installation and configuration of all needed tools.

If any of the scripts are not allowed to be executed, use chmod +x ./scripts/ followed by the script file name to grant execute permission.

Setup Neo4j Graph Database

Use setupNeo4j.sh to download Neo4j and install the plugins APOC and Graph Data Science. This script requires the environment variable NEO4J_INITIAL_PASSWORD to be set. It sets the initial password with a temporary NEO4J_HOME environment variable to not interfere with a possibly globally installed Neo4j installation.
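The temporary NEO4J_HOME pattern can be illustrated as follows (the path is illustrative): setting the variable only for a single command makes it visible to that command alone, leaving a globally installed Neo4j untouched.

```shell
# Setting NEO4J_HOME only for one command (illustrative path):
# the child process sees the value, the calling shell's environment stays unchanged.
NEO4J_HOME=/tmp/analysis/tools/neo4j sh -c 'echo "NEO4J_HOME=$NEO4J_HOME"'
echo "afterwards: NEO4J_HOME=${NEO4J_HOME:-<unset>}"
```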

Start Neo4j Graph Database

Use startNeo4j.sh to start the locally installed Neo4j Graph database. It runs the script with a temporary NEO4J_HOME environment variable to not interfere with a possibly globally installed Neo4j installation.

Setup jQAssistant Java Code Analyzer

Use setupJQAssistant.sh to download jQAssistant.

Download Maven Artifacts to analyze

Use downloadMavenArtifact.sh with the following options (the first three are mandatory) to download a Maven artifact into the artifacts directory:

  • -g <maven group id>
  • -a <maven artifact name>
  • -v <maven artifact version>
  • -t <maven artifact type (optional, defaults to jar)>
  • -d <target directory for the downloaded file (optional, defaults to "artifacts")>
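Here is a hypothetical invocation with illustrative Maven coordinates; the small helper below is not part of the repository and only sketches how the -t default ("jar") shapes the downloaded file name:

```shell
# Sketch of the resulting file name given the defaults (-t defaults to "jar"):
artifact_file_name() { # usage: artifact_file_name <artifact> <version> [type]
  echo "${1}-${2}.${3:-jar}"
}

# A hypothetical download into the default "artifacts" directory
# (coordinates are illustrative; adjust them to your project):
# ./scripts/downloadMavenArtifact.sh -g org.axonframework -a axon-messaging -v 4.9.0
artifact_file_name axon-messaging 4.9.0   # -> axon-messaging-4.9.0.jar
```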

Sort out jar files containing external libraries

After collecting all the Java artifacts, it might be necessary to sort out external libraries that you don't want to analyze directly. For that you can use sortOutExternalJavaJarFiles.sh. It needs to be started in the directory of the jar files ("artifacts") of your analysis workspace and will create a new directory called "ignored-jars" beside the "artifacts" directory so that those jars don't get analyzed.

Here is an example that can be started from your temp analysis workspace and that will filter out all jar files that don't contain any org.neo4j package:

cd artifacts; ./../../../scripts/sortOutExternalJavaJarFiles.sh org.neo4j

Download Typescript project to analyze

Use downloadTypescriptProject.sh with the following options to download a Typescript project using git clone and prepare it for analysis:

  • --url Git clone URL (required)
  • --version Version of the project
  • --tag Tag to switch to after "git clone" (optional, default = version)
  • --project Name of the project/repository (optional, default = clone url file name without .git extension)
  • --packageManager One of "npm", "pnpm" or "yarn". (optional, default = "npm")

Here is an example:

./../../downloadTypescriptProject.sh \
  --url https://github.com/remix-run/react-router.git \
  --version 6.24.0 \
  --tag "react-router@6.24.0" \
  --packageManager pnpm

Reset the database and scan the java artifacts

Use resetAndScan.sh to scan the local artifacts directory with the previously downloaded Java artifacts and write the data into the local Neo4j Graph database using jQAssistant. It also uses some jQAssistant "concepts" to enhance the data further with relationships between artifacts and packages.

Be aware that this script deletes all previous relationships and nodes in the local Neo4j Graph database.

Import git data

Use importGit.sh to import git data into the Graph. It uses git log to extract commits, their authors and the names of the files changed with them. These are stored in an intermediate CSV file and are then imported into Neo4j with the following schema:

(Git:Log:Author)-[:AUTHORED]->(Git:Log:Commit)-[:CONTAINS_CHANGED]->(Git:Log:File)
(Git:Log:Commit)-[:HAS_PARENT]->(Git:Log:Commit)
(Git:Repository)-[:HAS_COMMIT]->(Git:Log:Commit)
(Git:Repository)-[:HAS_AUTHOR]->(Git:Log:Author)
(Git:Repository)-[:HAS_FILE]->(Git:Log:File)

👉 Note: Commit messages containing [bot] are filtered out to ignore changes made by bots.
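With the schema above in place, aggregations over the git data become straightforward. Here is a hedged Cypher sketch; the node labels and relationships come from the schema above, but the author property name is an assumption, so inspect a node first if unsure:

```cypher
// Sketch: authors ordered by the number of distinct files they changed.
// The property "name" is an assumption - check your imported data first.
MATCH (author:Git:Log:Author)-[:AUTHORED]->(:Git:Log:Commit)-[:CONTAINS_CHANGED]->(file:Git:Log:File)
RETURN author.name AS author, count(DISTINCT file) AS changedFiles
ORDER BY changedFiles DESC
LIMIT 10
```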

Import aggregated git log

Instead of importing every single commit, changes can be grouped by month including their commit count. This is sufficient in many cases and reduces the data size and processing time significantly. To do this, set the environment variable IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT to aggregated. If you don't want to set the environment variable globally, you can also prepend it to the command like this (inside the analysis workspace directory contained within temp):

IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../scripts/importGit.sh

Here is the resulting schema:

(Git:Log:Author)-[:AUTHORED]->(Git:Log:ChangeSpan)-[:CONTAINS_CHANGED]->(Git:Log:File)
(Git:Repository)-[:HAS_CHANGE_SPAN]->(Git:Log:ChangeSpan)
(Git:Repository)-[:HAS_AUTHOR]->(Git:Log:Author)
(Git:Repository)-[:HAS_FILE]->(Git:Log:File)

Parameters

The optional parameter --source directory-path-to-the-source-folder-containing-git-repositories can be used to select a different directory for the repositories. By default, the source directory within the analysis workspace directory is used. This command only needs the git history to be present. Therefore, git clone --bare is sufficient. If the source directory is also used for code analysis (like for Typescript), then a full git clone is of course needed. Additionally, if you want to focus on a specific version or branch, use --branch branch-name to check out the branch and --single-branch to exclude other branches before importing the git log data.

Environment Variable

  • IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT supports the values none, aggregated, full and plugin (default). With it, you can switch off the git import (none), import aggregated data for a smaller memory footprint (aggregated), import all commits with git log in a simple way (full) or let a plugin take care of the git data (plugin; an empty value also selects this default).

Resolving git files to code files

After git log data has been imported successfully, Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo.

You can use List_unresolved_git_files.cypher to find code files that couldn't be matched to git file names and List_ambiguous_git_files.cypher to find ambiguously resolved git files. If you have any idea on how to improve this feel free to open an issue.
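The idea behind the resolution step can be sketched in Cypher roughly like this. Note that the labels and property names below are assumptions and the actual query in Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher is more elaborate:

```cypher
// Rough sketch of file-name based resolution (simplified; labels and
// properties are assumptions, the real query handles more cases).
MATCH (gitFile:Git:Log:File)
MATCH (codeFile:File)
WHERE codeFile.fileName ENDS WITH gitFile.fileName
MERGE (gitFile)-[:RESOLVES_TO]->(codeFile)
```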

Database Queries

Cypher Shell

With the cypher-shell CLI provided by Neo4j, a query stored in a file can be executed with the following command. Be sure to replace path/to/local/neo4j and password with your settings.

cat ./cypher/Get_Graph_Data_Science_Library_Version.cypher | path/to/local/neo4j/bin/cypher-shell -u neo4j -p password --format plain

Query parameters can be added with the option --param. Here is an example:

cat ./cypher/Get_Graph_Data_Science_Library_Version.cypher | path/to/local/neo4j/bin/cypher-shell -u neo4j -p password --format plain --param '{a: 1}'

For a full list of options use the help function:

path/to/local/neo4j/bin/cypher-shell --help

HTTP API

Use executeQuery.sh to execute a Cypher query from the file given as an argument. It uses curl and jq to access the HTTP API of Neo4j. Here is an example:

./scripts/executeQuery.sh ./cypher/Get_Graph_Data_Science_Library_Version.cypher

Query parameters can be added as arguments after the file name. Here is an example:

./scripts/executeQuery.sh ./cypher/Get_Graph_Data_Science_Library_Version.cypher a=1

executeQueryFunctions.sh

The script executeQueryFunctions.sh contains functions to simplify the call of executeQuery.sh for different purposes. For example, execute_cypher_summarized prints out the results on the console in a summarized manner and execute_cypher_expect_results fails when there are no results.

The script also provides an API abstraction that defaults to HTTP, but can easily be switched to cypher-shell.

Query parameters can be added as arguments after the file name. Here is an example:

source "${SCRIPTS_DIR}/executeQueryFunctions.sh"
execute_cypher ./cypher/Get_Graph_Data_Science_Library_Version.cypher a=1

Stop Neo4j

Use stopNeo4j.sh to stop the locally running Neo4j Graph Database. It does nothing if the database is already stopped. It runs the script with a temporary NEO4J_HOME environment variable to not interfere with a possibly globally installed Neo4j installation.

Jupyter Notebook

Create a report with executeJupyterNotebookReport.sh

The script executeJupyterNotebookReport.sh combines data availability validation and Jupyter Notebook execution (both described below) into a single command.

Here is an example on how to run the report Wordcloud.ipynb:

./scripts/executeJupyterNotebookReport.sh --jupyterNotebook Wordcloud.ipynb

Data Availability Validation

Jupyter Notebooks can have additional custom tags within their metadata section. Opening these files with a text editor reveals them, typically at the end of the file. Some editors also support editing them directly. Here, the optional metadata property code_graph_analysis_pipeline_data_validation is used to specify which data validation query in the cypher/Validation directory should be used. Without this property, the data validation step is skipped. If a validation is specified, it will be executed before the Jupyter Notebook is executed. If the query returns at least one result, the validation is considered successful. Otherwise, the Jupyter Notebook will not be executed.

This is helpful for Jupyter Notebook reports that are specific to a programming language or have other specific data prerequisites. The Notebook will be skipped if there is no data available, which would otherwise lead to confusing and distracting reports with empty tables and figures.
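As a sketch, the tag can be inspected with jq (which the pipeline already uses elsewhere). The notebook content and the validation name "ValidateExampleData" below are made-up stand-ins for illustration:

```shell
# Create a minimal stand-in notebook carrying the optional validation tag
# ("ValidateExampleData" is a made-up name for illustration).
cat > /tmp/ExampleReport.ipynb <<'EOF'
{
  "cells": [],
  "metadata": { "code_graph_analysis_pipeline_data_validation": "ValidateExampleData" },
  "nbformat": 4,
  "nbformat_minor": 5
}
EOF

# Print the tag, or a note when it is absent (then the validation step is skipped).
jq -r '.metadata.code_graph_analysis_pipeline_data_validation // "no validation tag (step skipped)"' /tmp/ExampleReport.ipynb
```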

You can search for the messages Validation succeeded or Validation failed inside the log to find out which Notebook was skipped and why.

Execute a Notebook with executeJupyterNotebook.sh

executeJupyterNotebook.sh executes a Jupyter Notebook on the command line and optionally converts it to different formats like Markdown and PDF. It takes care of setting up the environment and uses nbconvert under the hood to execute the notebook and convert it to other file formats.

Here is an example on how to use executeJupyterNotebook.sh to run Wordcloud.ipynb:

./scripts/executeJupyterNotebook.sh ./jupyter/Wordcloud.ipynb

Manually setup the environment using Conda

Conda provides package, dependency, and environment management for any language. Here, it is used to set up the environment for Jupyter Notebooks.

  • Setup environment

    conda create --name codegraph jupyter numpy matplotlib nbconvert nbconvert-webpdf
    conda activate codegraph

    or by using the environment file ./jupyter/environment.yml:

    conda env create --file ./jupyter/environment.yml
    conda activate codegraph
  • Export full environment.yml

    conda env export --name codegraph > full-codegraph-environment.yml
  • Export only explicit environment.yml

    conda env export --from-history --name codegraph | grep -v "^prefix: " > explicit-codegraph-environment.yml

Executing Jupyter Notebooks with nbconvert

nbconvert converts Jupyter Notebooks to other static formats including HTML, LaTeX, PDF, Markdown, reStructuredText, and more.

  • Install pandoc used by nbconvert for LaTeX support (Mac)

    brew install pandoc mactex
  • Start Jupyter Notebook

    jupyter notebook
  • Create new Notebook with executed cells

    jupyter nbconvert --to notebook --execute ./jupyter/first-neo4j-tryout.ipynb
  • Convert Notebook with executed cells to PDF

    jupyter nbconvert --to pdf ./jupyter/first-neo4j-tryout.nbconvert.ipynb

Other Commands

Information about a process that listens to a specific local port

ps -p $( lsof -t -i:7474 -sTCP:LISTEN )

Kill process that listens to a specific local port

kill -9 $( lsof -t -i:7474 -sTCP:LISTEN )

Memory Estimation

Reference: Neo4j memory estimation

NEO4J_HOME=tools/neo4j-community-4.4.20 tools/neo4j-community-4.4.20/bin/neo4j-admin memrec