- Start an Analysis
- Generate Markdown References
- Validate Links in Markdown
- Manual Setup
- Database Queries
- Stop Neo4j
- Jupyter Notebook
- References
- Other Commands
An analysis is started with the script analyze.sh. To run all analysis steps, simply execute the following command:
./../../scripts/analysis/analyze.sh
👉 See scripts/examples/analyzeAxonFramework.sh as an example script that combines all the above steps for a Java Project.
👉 See scripts/examples/analyzeReactRouter.sh as an example script that combines all the above steps for a Typescript Project.
👉 See Code Structure Analysis Pipeline on how to do this within a GitHub Actions Workflow.
The analyze.sh command comes with these command line options:
- `--report Csv` only generates CSV reports. This speeds up the report generation and doesn't depend on Python, Jupyter Notebook or any other related dependencies. The default value is `All` to generate all reports. `Jupyter` will only generate Jupyter Notebook reports. `DatabaseCsvExport` exports the whole graph database as a CSV file (performance intense, check if there are security concerns first).
- `--profile Neo4jv4` uses the older long term support (June 2023) version v4.4.x of Neo4j and suitable compatible versions of plugins and jQAssistant. `Neo4jv5` will explicitly select the newest (June 2023) version 5.x of Neo4j. Without setting a profile, the newest versions will be used. Other profiles can be found in the directory scripts/profiles.
- `--profile Neo4jv5-continue-on-scan-errors` is based on the default profile (`Neo4jv5`) but uses the jQAssistant configuration template template-neo4jv5-jqassistant-continue-on-error.yaml to continue on scan errors instead of failing fast. This is temporarily useful when there is a known error that needs to be ignored. It is still recommended to use the default profile and fail fast if there is something wrong. Other profiles can be found in the directory scripts/profiles.
- `--profile Neo4jv5-low-memory` is based on the default profile (`Neo4jv5`) but uses only half of the memory (RAM) as configured in template-neo4j-low-memory.conf. This is useful for the analysis of smaller codebases with fewer resources. Other profiles can be found in the directory scripts/profiles.
- `--explore` activates the "explore" mode where no reports are generated. Furthermore, Neo4j won't be stopped at the end of the script and will therefore continue running. This makes it easy to set everything up and then use the running Neo4j server to explore the data manually.
- Be sure to use Java 17 for Neo4j v5 and Java 11 for Neo4j v4
- Use your own initial Neo4j password
- For more details have a look at the script analyze.sh
If only the CSV reports are needed, which are the result of Cypher queries and don't need any further dependencies (like Python), the analysis can be sped up with:
./../../scripts/analysis/analyze.sh --report Csv
If only the Jupyter reports are needed, e.g. when the CSV reports have already been generated, this can be done with:
./../../scripts/analysis/analyze.sh --report Jupyter
Note: Generating a PDF from a Jupyter Notebook using nbconvert takes some time and might even fail due to a timeout error. It can optionally be enabled with:
ENABLE_JUPYTER_NOTEBOOK_PDF_GENERATION=true ./../../scripts/analysis/analyze.sh
To speed up the analysis and get a smaller data footprint, you can switch off the git log data import of the "source" directory (if present) with IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none" as shown below, or choose IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" to reduce the data size by only importing monthly grouped changes instead of all commits.
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="none" ./../../scripts/analysis/analyze.sh
To prepare everything for analysis (installation, configuration and preparation queries) and explore the graph manually without report generation, use this command:
./../../scripts/analysis/analyze.sh --explore
Change into the cypher directory e.g. with cd cypher
and then execute the script generateCypherReference.sh with the following command:
./../scripts/documentation/generateCypherReference.sh
Change into the scripts directory e.g. with cd scripts
and then execute the script generateScriptReference.sh with the following command:
./documentation/generateScriptReference.sh
Change into the scripts directory e.g. with cd scripts
and then execute the script generateEnvironmentVariableReference.sh with the following command:
./documentation/generateEnvironmentVariableReference.sh
The following command shows how to use markdown-link-check to check the links in Markdown files like README.md:
npx --yes markdown-link-check --quiet --progress --config=markdown-lint-check-config.json README.md COMMANDS.md GETTING_STARTED.md INTEGRATION.md CHANGELOG.md
The manual setup is only documented for completeness. It isn't needed since the analysis also covers download, installation and configuration of all needed tools.
If any of the scripts are not allowed to be executed, use chmod +x ./scripts/ followed by the script file name to grant execution permission.
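For example, to make the Neo4j setup script described below executable:

```shell
chmod +x ./scripts/setupNeo4j.sh
```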
Use setupNeo4j.sh to download Neo4j and install the plugins APOC and Graph Data Science.
This script requires the environment variable NEO4J_INITIAL_PASSWORD to be set. It sets the initial password with a temporary NEO4J_HOME
environment variable to not interfere with a possibly globally installed Neo4j installation.
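A minimal sketch of how this could look, assuming the script is called from the repository root and that you choose your own secure password:

```shell
# Assumption: the scripts read the initial password from this environment variable (see above)
export NEO4J_INITIAL_PASSWORD="change-me-to-a-secure-password"
./scripts/setupNeo4j.sh
```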
Use startNeo4j.sh to start the locally installed Neo4j Graph database.
It runs the script with a temporary NEO4J_HOME
environment variable to not interfere with a possibly globally installed Neo4j installation.
Use setupJQAssistant.sh to download jQAssistant.
Use downloadMavenArtifact.sh with the following options (see the example after this list) to download a Maven artifact into the artifacts directory:

- `-g <maven group id>` (mandatory)
- `-a <maven artifact name>` (mandatory)
- `-v <maven artifact version>` (mandatory)
- `-t <maven artifact type>` (optional, defaults to jar)
- `-d <target directory for the downloaded file>` (optional, defaults to "artifacts")
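Here is a sketch with placeholder Maven coordinates; replace them with the artifact you actually want to analyze, and adjust the relative path, which assumes an analysis workspace directory inside temp:

```shell
./../../scripts/downloadMavenArtifact.sh -g org.example -a example-library -v 1.0.0
```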
After collecting all the Java artifacts it might be needed to sort out external libraries you don't want to analyze directly. For that you can use sortOutExternalJavaJarFiles.sh. It needs to be started in the directory of the jar files ("artifacts") of your analysis workspace and will create a new directory called "ignored-jars" beside the "artifacts" directory so that those jars don't get analyzed.
Here is an example that can be started from your temp analysis workspace and that will filter out all jar files that don't contain any `org.neo4j` package:
cd artifacts; ./../../../scripts/sortOutExternalJavaJarFiles.sh org.neo4j
Use downloadTypescriptProject.sh with the following options to download a Typescript project using git clone and prepare it for analysis:
- `--url` Git clone URL (required)
- `--version` Version of the project
- `--tag` Tag to switch to after "git clone" (optional, default = version)
- `--project` Name of the project/repository (optional, default = clone URL file name without .git extension)
- `--packageManager` One of "npm", "pnpm" or "yarn" (optional, default = "npm")
Here is an example:
./../../downloadTypescriptProject.sh \
--url https://github.com/remix-run/react-router.git \
--version 6.24.0 \
--tag "[email protected]" \
--packageManager pnpm
Use resetAndScan.sh to scan the local artifacts directory with the previously downloaded Java artifacts and write the data into the local Neo4j Graph database using jQAssistant. It also uses some jQAssistant "concepts" to enhance the data further with relationships between artifacts and packages.
Be aware that this script deletes all previous relationships and nodes in the local Neo4j Graph database.
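A minimal invocation sketch, assuming it is run from an analysis workspace directory inside temp and that NEO4J_INITIAL_PASSWORD is set (adjust the relative path to your setup):

```shell
./../../scripts/resetAndScan.sh
```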
Use importGit.sh to import git data into the Graph.
It uses git log
to extract commits, their authors and the names of the files changed with them. These are stored in an intermediate CSV file and are then imported into Neo4j with the following schema:
(Git:Log:Author)-[:AUTHORED]->(Git:Log:Commit)-[:CONTAINS_CHANGED]->(Git:Log:File)
(Git:Log:Commit)-[:HAS_PARENT]->(Git:Log:Commit)
(Git:Repository)-[:HAS_COMMIT]->(Git:Log:Commit)
(Git:Repository)-[:HAS_AUTHOR]->(Git:Log:Author)
(Git:Repository)-[:HAS_FILE]->(Git:Log:File)
👉 Note: Commit messages containing `[bot]` are filtered out to ignore changes made by bots.
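As a sketch of how the imported data could then be queried (using the cypher-shell invocation described further below; the name property on author nodes is an assumption), the following lists the ten authors with the most commits:

```shell
# Assumes Author nodes carry a "name" property; adjust to your data if needed
echo "MATCH (author:Git:Log:Author)-[:AUTHORED]->(commit:Git:Log:Commit)
      RETURN author.name AS author, count(commit) AS commits
      ORDER BY commits DESC LIMIT 10" \
  | path/to/local/neo4j/bin/cypher-shell -u neo4j -p password --format plain
```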
Instead of importing every single commit, changes can be grouped by month including their commit count. This is in many cases sufficient and reduces data size and processing time significantly. To do this, set the environment variable IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT to `aggregated`. If you don't want to set the environment variable globally, then you can also prepend the command with it like this (inside the analysis workspace directory contained within temp):
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT="aggregated" ./../../scripts/importGit.sh
Here is the resulting schema:
(Git:Log:Author)-[:AUTHORED]->(Git:Log:ChangeSpan)-[:CONTAINS_CHANGED]->(Git:Log:File)
(Git:Repository)-[:HAS_CHANGE_SPAN]->(Git:Log:ChangeSpan)
(Git:Repository)-[:HAS_AUTHOR]->(Git:Log:Author)
(Git:Repository)-[:HAS_FILE]->(Git:Log:File)
The optional parameter `--source directory-path-to-the-source-folder-containing-git-repositories` can be used to select a different directory for the repositories. By default, the `source` directory within the analysis workspace directory is used. This command only needs the git history to be present. Therefore, `git clone --bare` is sufficient. If the `source` directory is also used for code analysis (like for Typescript), then a full git clone is of course needed. Additionally, if you want to focus on a specific version or branch, use `--branch branch-name` to check out the branch and `--single-branch` to exclude other branches before importing the git log data.
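For example (a sketch with placeholder values; the source directory and branch name depend on your repositories):

```shell
./../../scripts/importGit.sh --source my-repositories --branch main --single-branch
```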
IMPORT_GIT_LOG_DATA_IF_SOURCE_IS_PRESENT supports the values `none`, `aggregated`, `full` and `plugin` (default). With it, you can switch off the git import (`none`), import aggregated data for a smaller memory footprint (`aggregated`), import all commits with git log in a simple way (`full`), or let a plugin take care of the git data (`plugin` = `""` = default).
After git log data has been imported successfully, Add_RESOLVES_TO_relationships_to_git_files_for_Java.cypher is used to try to resolve the imported git file names to code files. This first attempt will cover most cases, but not all of them. With this approach it is, for example, not possible to distinguish identical file names in different Java jars from the git source files of a mono repo.
You can use List_unresolved_git_files.cypher to find code files that couldn't be matched to git file names and List_ambiguous_git_files.cypher to find ambiguously resolved git files. If you have any idea on how to improve this, feel free to open an issue.
With the cypher-shell CLI provided by Neo4j, a query stored in a file can simply be executed with the following command. Be sure to replace `path/to/local/neo4j` and `password` with your settings.
cat ./cypher/Get_Graph_Data_Science_Library_Version.cypher | path/to/local/neo4j/bin/cypher-shell -u neo4j -p password --format plain
Query parameters can be added with the option `--param`. Here is an example:
cat ./cypher/Get_Graph_Data_Science_Library_Version.cypher | path/to/local/neo4j/bin/cypher-shell -u neo4j -p password --format plain --param "{a: 1}"
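To illustrate how such a parameter is referenced inside a query, here is a minimal inline sketch (instead of one of the provided Cypher files) that reads the parameter a via $a:

```shell
echo 'RETURN $a AS value' | path/to/local/neo4j/bin/cypher-shell -u neo4j -p password --format plain --param "{a: 1}"
```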
For a full list of options use the help function:
path/to/local/neo4j/bin/cypher-shell --help
Use executeQuery.sh to execute a Cypher query from the file given as an argument.
It uses curl
and jq
to access the HTTP API of Neo4j.
Here is an example:
./scripts/executeQuery.sh ./cypher/Get_Graph_Data_Science_Library_Version.cypher
Query parameters can be added as arguments after the file name. Here is an example:
./scripts/executeQuery.sh ./cypher/Get_Graph_Data_Science_Library_Version.cypher a=1
The script executeQueryFunctions.sh contains functions to simplify the
call of executeQuery.sh for different purposes. For example, execute_cypher_summarized
prints out the results on the console in a summarized manner and execute_cypher_expect_results
fails when there are no results.
The script also provides an API abstraction that defaults to HTTP, but can easily be switched to cypher-shell.
Query parameters can be added as arguments after the file name. Here is an example:
source "${SCRIPTS_DIR}/executeQueryFunctions.sh"
execute_cypher ./cypher/Get_Graph_Data_Science_Library_Version.cypher a=1
Use stopNeo4j.sh to stop the locally running Neo4j Graph Database. It does nothing if the database is already stopped. It runs the script with a temporary NEO4J_HOME
environment variable to not interfere with a possibly globally installed Neo4j installation.
The script executeJupyterNotebookReport.sh combines:
- creating a directory within the "reports" directory
- data availability validation using executeQueryFunctions.sh
- executing and converting the given Notebook using executeJupyterNotebook.sh
Here is an example on how to run the report Wordcloud.ipynb:
./scripts/executeJupyterNotebookReport.sh --jupyterNotebook Wordcloud.ipynb
Jupyter Notebooks can have additional custom tags within their metadata section. Opening these files with a text editor reveals this metadata, typically at the end of the file. Some editors also support editing it directly. Here, the optional metadata property code_graph_analysis_pipeline_data_validation is used to specify which data validation query in the cypher/Validation directory should be used. Without this property, the data validation step is skipped. If a validation is specified, it will be executed before the Jupyter Notebook is executed. If the query has at least one result, the validation is seen as successful. Otherwise, the Jupyter Notebook will not be executed.
This is helpful for Jupyter Notebook reports that are specific to a programming language or other specific data prerequisites. The Notebook will be skipped if there is no data available which would otherwise lead to confusing and distracting reports with empty tables and figures.
You can search for the messages "Validation succeeded" or "Validation failed" in the log to find out which Notebook was skipped and for which reason.
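Since a Jupyter Notebook is a JSON file, the metadata property described above can also be inspected on the command line. Here is a small sketch using jq, assuming the property sits directly in the notebook-level metadata object (the Wordcloud notebook is just an example and may not define it):

```shell
jq '.metadata.code_graph_analysis_pipeline_data_validation' ./jupyter/Wordcloud.ipynb
```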
executeJupyterNotebook.sh executes a Jupyter Notebook on the command line and converts it to different formats like Markdown and (optionally) PDF. It takes care of setting up the environment and uses nbconvert under the hood to execute the notebook and convert it to other file formats.
Here is an example of how to use executeJupyterNotebook.sh to run Wordcloud.ipynb:
./scripts/executeJupyterNotebook.sh ./jupyter/Wordcloud.ipynb
Conda provides package, dependency, and environment management for any language. Here, it is used to set up the environment for Jupyter Notebooks.
- Setup environment: `conda create --name codegraph jupyter numpy matplotlib nbconvert nbconvert-webpdf` followed by `conda activate codegraph`, or by using the environment file codegraph-environment.yml: `conda env create --file ./jupyter/environment.yml` followed by `conda activate codegraph`
- Export the full environment.yml: `conda env export --name codegraph > full-codegraph-environment.yml`
- Export only the explicit environment.yml: `conda env export --from-history --name codegraph | grep -v "^prefix: " > explicit-codegraph-environment.yml`
nbconvert converts Jupyter Notebooks to other static formats including HTML, LaTeX, PDF, Markdown, reStructuredText, and more.
- Install pandoc, which is used by nbconvert for LaTeX support (Mac): `brew install pandoc mactex`
- Start Jupyter Notebook: `jupyter notebook`
- Create a new Notebook with executed cells: `jupyter nbconvert --to notebook --execute ./jupyter/first-neo4j-tryout.ipynb`
- Convert a Notebook with executed cells to PDF: `jupyter nbconvert --to pdf ./jupyter/first-neo4j-tryout.nbconvert.ipynb`
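Analogously, an executed Notebook can be converted to Markdown with nbconvert (a small additional sketch):

```shell
jupyter nbconvert --to markdown ./jupyter/first-neo4j-tryout.nbconvert.ipynb
```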
- Conda
- jQAssistant
- Jupyter Notebook
- Jupyter Notebook - Using as a command line tool
- Jupyter Notebook - Installing TeX for PDF conversion
- Jupyter Notebook Format - Metadata
- Integrate Neo4j with Jupyter Notebook
- Hello World
- Managing environments with Conda
- Neo4j - Download
- Neo4j - HTTP API
- How to Use Conda With Github Actions
- Older database download link (neo4j community)
Show or kill a Neo4j process that is still listening on its default HTTP port 7474:
ps -p $( lsof -t -i:7474 -sTCP:LISTEN )
kill -9 $( lsof -t -i:7474 -sTCP:LISTEN )
Reference: Neo4j memory estimation
NEO4J_HOME=tools/neo4j-community-4.4.20 tools/neo4j-community-4.4.20/bin/neo4j-admin memrec