LangGraph Information Retrieval System

A Retrieval-Augmented Generation (RAG) system designed to answer questions about LangGraph by retrieving and synthesizing information from multiple sources.

Overview

This project creates a conversational AI assistant capable of answering technical questions about LangGraph by:

  1. Processing and extracting text from multiple document sources
  2. Chunking documents using Docling's advanced document understanding capabilities
  3. Creating and storing embeddings in a vector database (LanceDB)
  4. Retrieving relevant information based on user queries
  5. Generating comprehensive answers using retrieved context

[Figure: running example of the chat interface]

Features

  • Advanced Document Processing: Leverages Docling for intelligent document understanding and chunking
  • Multiple Embedding Model Support: Compares and evaluates different embedding models
  • Optimized Retrieval Parameters: Analysis to determine the optimal k value for each embedding model
  • Interactive Chat Interface: Clean Streamlit UI for conversational interaction

Data Sources

The system integrates information from:

  • GitHub Repository: the official LangGraph GitHub repository
  • Technical blogs and articles about LangGraph from Galileo, LinkedIn, Medium, and Towards Data Science
  • Technical documentation and tutorials

Architecture

1. Document Processing Pipeline

Document Sources → Text Extraction → Hybrid Chunking → Embeddings → Vector Storage
  • Text Extraction: Docling converts various document formats (PDF, HTML, Markdown) into a unified format
  • Hybrid Chunking: Smart chunking that preserves document structure and semantic coherence
  • Embedding Generation: Multiple embedding models evaluated for optimal performance
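
A condensed sketch of this pipeline in code (the source URL and table name are illustrative, and error handling is omitted):

import lancedb
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# 1. Text extraction: Docling normalizes PDF/HTML/Markdown into one document model
doc = DocumentConverter().convert("https://example.com/langgraph-article.html").document

# 2. Hybrid chunking: structure-aware, semantically coherent chunks
chunks = [c.text for c in HybridChunker().chunk(doc)]

# 3. Embedding generation (text-embedding-3-large, one of the models evaluated below)
resp = client.embeddings.create(model="text-embedding-3-large", input=chunks)

# 4. Vector storage in LanceDB
db = lancedb.connect("./lancedb")
db.create_table("chunks", data=[
    {"text": text, "vector": item.embedding}
    for text, item in zip(chunks, resp.data)
])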

2. Retrieval System

User Query → Query Embedding → Similarity Search → Context Retrieval → Answer Generation
  • Vector Search: Fast similarity search through LanceDB
  • Context Aggregation: Combines multiple relevant chunks with source metadata
  • Answer Generation: OpenAI model synthesizes answers using retrieved context
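
A matching retrieval sketch (k, the chat model, and the prompt wording are assumptions, not the repo's exact code):

import lancedb
from openai import OpenAI

client = OpenAI()
table = lancedb.connect("./lancedb").open_table("chunks")

def answer(query: str, k: int = 5) -> str:
    # Embed the query with the same model used for the document chunks
    qvec = client.embeddings.create(
        model="text-embedding-3-large", input=[query]
    ).data[0].embedding
    # Fast similarity search through LanceDB
    hits = table.search(qvec).limit(k).to_list()
    context = "\n\n".join(h["text"] for h in hits)
    # Synthesize an answer grounded in the retrieved context
    chat = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any OpenAI chat model would do
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return chat.choices[0].message.content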

Getting Started

Prerequisites

  1. Python 3.8+
  2. OpenAI API key

Installation

  1. Clone the repository:
git clone https://github.com/OranDanon/RAG-application.git
cd RAG-application
  2. Install dependencies:
pip install -r requirements.txt
  3. Set up environment variables:
# Create a .env file with your OpenAI API key
echo "OPENAI_API_KEY=your_api_key_here" > .env

Usage

Run each script in sequence to build and use the system:

  1. Create Embeddings:
python open_ai_embed.py   # Create and store embeddings
  2. Launch Chat Interface:
streamlit run chatbot.py  # Start the Streamlit chat application

Then open your browser and navigate to http://localhost:8501.

Evaluation

Creating synthetic dataset

Using Claude 3.7 Sonnet (with extended thinking) and the project's .md source files, I wrote the following prompt to generate the synthetic dataset:

Please help me create a comprehensive Q&A dataset to evaluate a Document-based Conversational AI System for LangGraph Information Retrieval (RAG ChatBot). Generate 50 questions per batch (to maintain quality and avoid message size limitations) that follow this specific JSON structure:

{
  "questions": [
    {
      "id": "unique-question-identifier",
      "section": "Specific document section or topic",
      "question": "User query that requires information from the documents",
      "context": "Exact text excerpt from the document containing the answer",
      "difficulty": "easy/medium/hard",
      "answer_type": "factoid/descriptive/procedural",
      "question_type": "technical/conceptual/comparative"
    }
  ]
}

Distribution requirements:

  • Difficulty: 30% easy, 40% medium, 30% hard
  • Answer types: 40% factoid (short, direct answers), 40% descriptive (explanations), 20% procedural (how-to)
  • Question types: 50% technical (implementation), 30% conceptual (understanding), 20% comparative (when applicable)

Question characteristics:

  • Ensure questions span all major LangGraph topics proportionally to their coverage in the documents
  • Make questions increasingly complex across difficulty levels:
    • Easy: Direct information retrieval from a single paragraph
    • Medium: Synthesizing information from multiple paragraphs in the same section
    • Hard: Requiring deeper understanding, inference, or connecting concepts
  • Include questions that test:
    • Core LangGraph concepts and terminology
    • Implementation details and code patterns
    • Best practices and application scenarios
    • Differences from other frameworks (where documented)

For each question:

  1. Extract the precise document text in the "context" field that contains the answer
  2. Ensure the question is answerable solely from that context
  3. Vary question formulations (what, how, why, compare, explain, etc.)
  4. Assign a unique ID that indicates the topic area and difficulty.

I'll ask you to generate 2 batches of 50 questions each to reach 100 total questions. For each batch, please focus on different sections.
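
To sanity-check a generated batch against the distribution requirements above, a short script can tally the labeled fields; a hedged sketch (questions.json is a hypothetical filename for one saved batch):

import json
from collections import Counter

with open("questions.json") as f:  # hypothetical path for one generated batch
    questions = json.load(f)["questions"]

for field in ("difficulty", "answer_type", "question_type"):
    counts = Counter(q[field] for q in questions)
    shares = {label: f"{n / len(questions):.0%}" for label, n in counts.items()}
    print(field, shares)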

Embedding Models

We evaluated several embedding models to find the optimal configuration:

  • OpenAI Embeddings (text-embedding-3-large)
  • Sentence Transformers
  • Open Source Alternatives

Our analysis (shown in the figure below) demonstrates that text-embedding-3-large combined with a cross-encoder re-ranking strategy achieves the best performance across all recall@k metrics.

Key findings:

  • OpenAI embeddings are strongest at mid-range k (recall@5, recall@10), while BAAI/bge-small-en-v1.5 leads at recall@1 and recall@3 (see the table below)
  • The optimal k value varies by model and by the target recall level
  • Adding a cross-encoder for re-ranking significantly improves precision

[Figure: recall@k comparison across embedding models]

Model                     Recall@1  Recall@3  Recall@5  Recall@10  Recall@25  Recall@50
text-embedding-3-small    0.703     0.836     0.902     0.956      0.985      1.000
all-mpnet-base-v2         0.636     0.803     0.864     0.927      0.990      1.000
all-MiniLM-L6-v2          0.683     0.841     0.886     0.932      0.985      1.000
BAAI/bge-small-en-v1.5    0.725     0.851     0.883     0.931      0.976      1.000

Best model at each cutoff:

  • Recall@1: BAAI/bge-small-en-v1.5 (0.7254)
  • Recall@3: BAAI/bge-small-en-v1.5 (0.8508)
  • Recall@5: text-embedding-3-small (0.9017)
  • Recall@10: text-embedding-3-small (0.9559)
  • Recall@25: all-mpnet-base-v2 (0.9898)
  • Recall@50: text-embedding-3-small (1.0000)
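
For reference, recall@k here is the fraction of evaluation questions whose gold context chunk appears among the top-k retrieved chunks. A minimal sketch of the computation (the data layout is illustrative, not the repo's actual evaluation code):

def recall_at_k(results, k):
    """results: list of (gold_chunk_id, ranked_chunk_ids) pairs, one per question."""
    hits = sum(1 for gold, ranked in results if gold in ranked[:k])
    return hits / len(results)

# Toy example with two questions
results = [("c1", ["c1", "c7", "c3"]), ("c2", ["c9", "c4", "c2"])]
for k in (1, 3):
    print(f"recall@{k} = {recall_at_k(results, k):.2f}")  # 0.50, then 1.00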

Document Chunking Strategy

The system uses Docling's HybridChunker, which:

  1. Preserves document structure (headings, paragraphs, tables)
  2. Maintains semantic coherence within chunks
  3. Optimizes chunk size for the specific embedding model
  4. Retains metadata and hierarchical relationships
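
For instance, the chunker can be aligned with a specific embedding model's tokenizer so chunks stay within that model's input window. A sketch with assumed parameter values (the repo's exact settings are not shown here):

from docling.chunking import HybridChunker

# Align chunking with one of the evaluated embedding models (illustrative values)
chunker = HybridChunker(
    tokenizer="sentence-transformers/all-MiniLM-L6-v2",  # tokenizer of the target model
    max_tokens=256,    # assumption: cap chunks below the model's context window
    merge_peers=True,  # merge undersized sibling chunks to keep semantic coherence
)
# chunker.chunk(doc) then yields chunks that retain headings and other metadata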

Future Improvements

  • Figure Interpretation: Use LLMs to translate figures and charts into textual descriptions
  • Cross-Encoder Re-Ranking: Fold re-ranking into the live retrieval path for improved precision (see the sketch after this list)
  • Advanced Retrieval Variants: contextual_rag, Graph RAG, Light RAG, and Path RAG (for richer relations, e.g., cross-document Q&A)
  • Deeper Evaluation: Use the 100-question Q&A dataset to measure system performance, e.g., with rubrics to score generated answers; such tests could also show whether prebuilt chains (e.g., langchain.chains) would be beneficial
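
A minimal re-ranking sketch using sentence-transformers (the model name is a common public cross-encoder, not necessarily what this project would adopt):

from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, chunks, top_n=5):
    # Score each (query, chunk) pair jointly, then keep the best-scoring chunks
    scores = reranker.predict([(query, chunk) for chunk in chunks])
    ranked = sorted(zip(chunks, scores), key=lambda pair: pair[1], reverse=True)
    return [chunk for chunk, _ in ranked[:top_n]]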

Evaluation Dataset and Metrics

The system was evaluated using a custom dataset of 100 questions covering:

  • Different difficulty levels (easy, medium, hard)
  • Various answer types (factoid, descriptive, procedural)
  • Different question types (technical, conceptual, comparative)

Performance metrics include:

  • Relevance of retrieved contexts
  • Answer accuracy and comprehensiveness
  • Response time and efficiency

License

This project is licensed under the MIT License - see the LICENSE file for details.

Acknowledgments

  • Docling for document processing capabilities
  • LanceDB for vector storage
  • OpenAI for embedding and language models
