This is an ongoing personal project to practice building a pipeline that feeds a Neo4J database with unstructured data from PDFs containing (fictional) crime reports, and then uses Graph RAG to query the database in natural language.
The pipeline is based on the Neo4J article "Enhancing the Accuracy of RAG Applications With Knowledge Graphs".
The Graph RAG part is based on the YouTube tutorial "Langchain & Neo4j: Query Your Graph Database in Natural Language".
Both parts of the project were adapted to use a locally hosted Neo4J database (Docker) and a locally hosted LLM (Ollama).
Stack: Python, LangChain, Ollama, Neo4J, Docker
To run this project you'll need:
- Docker installed and running on your machine (docker-compose.yml file included in the repository).
- Ollama installed and running on your machine, and a model downloaded.
- A Python environment with the required packages installed. You can install them with:
pip install -r requirements.txt
- A .env file with the following variables:
NEO4J_URI=bolt://localhost:7687
NEO4J_USERNAME=neo4j
NEO4J_PASSWORD=neo4j
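As a rough illustration of how these variables could be read, here is a minimal standard-library sketch. The helper names `parse_env` and `load_env` are hypothetical, and this is not necessarily how the project's scripts load their settings; a package such as python-dotenv would be a common alternative.

```python
from pathlib import Path


def parse_env(text: str) -> dict:
    """Parse KEY=VALUE lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        # Split on the first "=" only, so values may contain "=" themselves.
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip().strip('"').strip("'")
    return settings


def load_env(path: str = ".env") -> dict:
    """Read the .env file at `path`; return an empty dict if it is missing."""
    env_file = Path(path)
    return parse_env(env_file.read_text()) if env_file.exists() else {}
```

Calling `load_env()` from the repository root would then give a dictionary with the `NEO4J_URI`, `NEO4J_USERNAME` and `NEO4J_PASSWORD` keys shown above.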
pipeline.py -> main script to run the pipeline.
- It extracts text from PDFs in the files folder.
- Sends the text to the local LLM to extract entities and relationships.
- To use a I needed to build a custom chat_prompt, as pointed out in this StackOverflow topic.
- I chose to also build my own Pydantic class and examples, instead of using the library's default, to align the model to the crime-related theme.
- Inserts the extracted entities and relationships into the Neo4J database.
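To make the last step more concrete, the sketch below shows one way an extracted relationship could be turned into a Cypher MERGE statement. It is an assumption-laden illustration, not the code in pipeline.py: the `Entity` and `Relationship` dataclasses stand in for the project's Pydantic class, the `name` property is an assumed schema, and the project may well rely on a LangChain helper for the insert instead.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    name: str   # e.g. "John Doe"
    label: str  # e.g. "Person"


@dataclass(frozen=True)
class Relationship:
    source: Entity
    target: Entity
    type: str   # e.g. "SUSPECT_IN"


def safe_identifier(value: str) -> str:
    """Labels and relationship types cannot be passed as Cypher parameters,
    so reduce them to ASCII letters, digits and underscores before interpolating."""
    cleaned = re.sub(r"[^A-Za-z0-9_]+", "_", value.strip()).strip("_")
    if not cleaned or cleaned[0].isdigit():
        raise ValueError(f"unusable identifier: {value!r}")
    return cleaned


def merge_statement(rel: Relationship):
    """Return a Cypher MERGE query plus its parameters for one relationship."""
    query = (
        f"MERGE (a:{safe_identifier(rel.source.label)} {{name: $source}}) "
        f"MERGE (b:{safe_identifier(rel.target.label)} {{name: $target}}) "
        f"MERGE (a)-[:{safe_identifier(rel.type)}]->(b)"
    )
    # Node names stay as parameters, so they are never interpolated into the query.
    return query, {"source": rel.source.name, "target": rel.target.name}
```

MERGE is used instead of CREATE so that running the pipeline twice over the same PDFs does not duplicate nodes or relationships.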
After running the pipeline script, check out the Neo4J database at http://localhost:7474/browser/ with the following query:
MATCH (n)-[r]->(m)
RETURN n, r, m
You should see all the entities and relationships extracted from the PDFs.
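The same check can be scripted instead of run in the browser. The following is a minimal sketch, assuming the official `neo4j` Python driver is installed and the credentials from the .env file above; the function names are hypothetical, and the query is a variant of the one above that returns labels and relationship types rather than whole nodes.

```python
from collections import Counter

TRIPLES_QUERY = """
MATCH (n)-[r]->(m)
RETURN labels(n) AS source_labels, type(r) AS rel_type, labels(m) AS target_labels
"""


def summarize_relationships(session) -> Counter:
    """Count relationships by (source labels, type, target labels).

    `session` is anything with a `run(query)` method that yields
    mapping-like records, such as a neo4j driver session.
    """
    counts = Counter()
    for record in session.run(TRIPLES_QUERY):
        source = ":".join(record["source_labels"])
        target = ":".join(record["target_labels"])
        counts[(source, record["rel_type"], target)] += 1
    return counts


def print_summary(uri="bolt://localhost:7687", user="neo4j", password="neo4j"):
    """Connect to the local database and print one line per relationship pattern."""
    # Imported here so the helper above can be used without the driver installed.
    from neo4j import GraphDatabase

    with GraphDatabase.driver(uri, auth=(user, password)) as driver:
        with driver.session() as session:
            for triple, count in summarize_relationships(session).most_common():
                print(count, triple)
```

A summary like this gives a quick sense of which entity types and relationship types the LLM actually produced.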
Results using Llama3-8B model:
graph_rag.py -> main script to run the Graph RAG Q&A.
- It queries the Neo4J database with a natural language question.
- It returns the answer in natural language based on the result of the query.
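The two steps above follow a common pattern: ask the LLM to write a Cypher query, run it, then ask the LLM to phrase the result as an answer. The sketch below illustrates that flow with plain callables. All names and prompts here are hypothetical; LangChain offers a ready-made chain for this pattern (GraphCypherQAChain), and graph_rag.py may use that rather than anything like this code.

```python
from typing import Callable

CYPHER_PROMPT = (
    "Translate the question into a Cypher query for this graph schema.\n"
    "Schema: {schema}\nQuestion: {question}\nCypher:"
)
ANSWER_PROMPT = (
    "Answer the question using only these query results.\n"
    "Question: {question}\nResults: {rows}\nAnswer:"
)


def answer_question(
    question: str,
    schema: str,
    llm: Callable[[str], str],
    run_cypher: Callable[[str], list],
) -> str:
    """Generate Cypher with the LLM, run it, then let the LLM word the answer."""
    # Step 1: natural language question -> Cypher query.
    cypher = llm(CYPHER_PROMPT.format(schema=schema, question=question)).strip()
    # Step 2: execute the generated query against the graph.
    rows = run_cypher(cypher)
    # Step 3: query results -> natural language answer.
    return llm(ANSWER_PROMPT.format(question=question, rows=rows)).strip()
```

Here `llm` could wrap a call to the local Ollama model and `run_cypher` a Neo4J session; passing them in as functions keeps the flow easy to test without either service running.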
Right now, questions need to use the same wording as the entities and relationships stored in the database. I'm working on a way to make the questions more flexible...
Results using Llama3-8B model: