A Retrieval-Augmented Generation (RAG) system designed to answer questions about LangGraph by retrieving and synthesizing information from multiple sources.
This project creates a conversational AI assistant capable of answering technical questions about LangGraph by:
- Processing and extracting text from multiple document sources
- Chunking documents using Docling's advanced document understanding capabilities
- Creating and storing embeddings in a vector database (LanceDB)
- Retrieving relevant information based on user queries
- Generating comprehensive answers using retrieved context
- Advanced Document Processing: Leverages Docling for intelligent document understanding and chunking
- Multiple Embedding Model Support: Compares and evaluates different embedding models
- Optimized Retrieval Parameters: Analysis to determine the optimal k value for each embedding model
- Interactive Chat Interface: Clean Streamlit UI for conversational interaction
The system integrates information from:
- GitHub: the official LangGraph repository
- Technical blogs and articles about LangGraph from Galileo, LinkedIn, Medium, and Towards Data Science
- Technical documentation and tutorials
Document Sources → Text Extraction → Hybrid Chunking → Embeddings → Vector Storage
- Text Extraction: Docling converts various document formats (PDF, HTML, Markdown) into a unified format
- Hybrid Chunking: Smart chunking that preserves document structure and semantic coherence
- Embedding Generation: Multiple embedding models evaluated for optimal performance
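To make this concrete, here is a minimal sketch of what the ingestion stage can look like with Docling, OpenAI embeddings, and LanceDB. The source URL, database path, and table name are illustrative assumptions, not the repository's exact code.

```python
# Minimal ingestion sketch (illustrative; not the repo's exact code).
# Assumes docling, lancedb, and openai are installed and OPENAI_API_KEY is set.
import lancedb
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter
from openai import OpenAI

client = OpenAI()

# 1. Text extraction: Docling converts the source into a unified document
doc = DocumentConverter().convert("https://langchain-ai.github.io/langgraph/").document

# 2. Hybrid chunking: structure-aware, semantically coherent chunks
chunks = list(HybridChunker().chunk(dl_doc=doc))
texts = [chunk.text for chunk in chunks]

# 3. Embedding generation + vector storage in LanceDB
embeddings = client.embeddings.create(model="text-embedding-3-small", input=texts)
db = lancedb.connect("data/lancedb")  # assumed local DB path
db.create_table(
    "langgraph_docs",  # assumed table name
    data=[{"text": t, "vector": e.embedding} for t, e in zip(texts, embeddings.data)],
)
```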
User Query → Query Embedding → Similarity Search → Context Retrieval → Answer Generation
- Vector Search: Fast similarity search through LanceDB
- Context Aggregation: Combines multiple relevant chunks with source metadata
- Answer Generation: OpenAI model synthesizes answers using retrieved context
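Correspondingly, a sketch of the query stage might look like the following; the chat model name and prompt wording are assumptions for illustration, and the table name matches the ingestion sketch above.

```python
# Minimal query-stage sketch (model name and prompt are assumptions).
import lancedb
from openai import OpenAI

client = OpenAI()
table = lancedb.connect("data/lancedb").open_table("langgraph_docs")

def answer(query: str, k: int = 5) -> str:
    # Embed the query with the same model used at indexing time
    q_vec = client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding

    # Vector search in LanceDB, then aggregate the top-k chunks as context
    hits = table.search(q_vec).limit(k).to_list()
    context = "\n\n".join(hit["text"] for hit in hits)

    # Answer generation grounded in the retrieved context
    chat = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model choice
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"},
        ],
    )
    return chat.choices[0].message.content
```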
- Python 3.8+
- OpenAI API key
- Clone the repository:

```bash
git clone https://github.com/OranDanon/RAG-application.git
cd RAG-application
```

- Install dependencies:

```bash
pip install -r requirements.txt
```

- Set up environment variables:

```bash
# Create a .env file with your OpenAI API key
echo "OPENAI_API_KEY=your_api_key_here" > .env
```
Run each script in sequence to build and use the system:
- Create embeddings:

```bash
python open_ai_embed.py  # create and store embeddings
```

- Launch the chat interface:

```bash
streamlit run chatbot.py  # start the Streamlit chat application
```
Then open your browser and navigate to http://localhost:8501
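For reference, a minimal Streamlit chat loop of the kind `chatbot.py` implements could look like this; the exact session handling in the repository may differ, and `answer()` refers to the hypothetical retrieval function sketched earlier.

```python
# Minimal Streamlit chat skeleton (chatbot.py's actual logic may differ).
import streamlit as st

st.title("LangGraph RAG Assistant")  # assumed title

if "messages" not in st.session_state:
    st.session_state.messages = []

# Replay the conversation so far
for msg in st.session_state.messages:
    with st.chat_message(msg["role"]):
        st.markdown(msg["content"])

if prompt := st.chat_input("Ask about LangGraph"):
    st.session_state.messages.append({"role": "user", "content": prompt})
    with st.chat_message("user"):
        st.markdown(prompt)

    reply = answer(prompt)  # hypothetical RAG function from the sketch above
    st.session_state.messages.append({"role": "assistant", "content": reply})
    with st.chat_message("assistant"):
        st.markdown(reply)
```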
Using Claude 3.7 Sonnet (extended thinking) and the project's .md source files, I wrote the following prompt to generate the synthetic dataset:
Please help me create a comprehensive Q&A dataset to evaluate a Document-based Conversational AI System for LangGraph Information Retrieval (RAG ChatBot). Generate 50 questions per batch (to maintain quality and avoid message size limitations) that follow this specific JSON structure:
```json
{
  "questions": [
    {
      "id": "unique-question-identifier",
      "section": "Specific document section or topic",
      "question": "User query that requires information from the documents",
      "context": "Exact text excerpt from the document containing the answer",
      "difficulty": "easy/medium/hard",
      "answer_type": "factoid/descriptive/procedural",
      "question_type": "technical/conceptual/comparative"
    }
  ]
}
```
Distribution requirements:
- Difficulty: 30% easy, 40% medium, 30% hard
- Answer types: 40% factoid (short, direct answers), 40% descriptive (explanations), 20% procedural (how-to)
- Question types: 50% technical (implementation), 30% conceptual (understanding), 20% comparative (when applicable)
Question characteristics:
- Ensure questions span all major LangGraph topics proportionally to their coverage in the documents
- Make questions increasingly complex across difficulty levels:
  - Easy: Direct information retrieval from a single paragraph
  - Medium: Synthesizing information from multiple paragraphs in the same section
  - Hard: Requiring deeper understanding, inference, or connecting concepts
- Include questions that test:
  - Core LangGraph concepts and terminology
  - Implementation details and code patterns
  - Best practices and application scenarios
  - Differences from other frameworks (where documented)
For each question:
- Extract the precise document text in the "context" field that contains the answer
- Ensure the question is answerable solely from that context
- Vary question formulations (what, how, why, compare, explain, etc.)
- Assign a unique ID that indicates the topic area and difficulty

I'll ask you to generate 2 batches of 50 questions each to reach 100 total questions. For each batch, please focus on different sections.
We evaluated several embedding models to find the optimal configuration:
- OpenAI embeddings (`text-embedding-3-large`, `text-embedding-3-small`)
- Sentence Transformers models
- Open-source alternatives

Our analysis (shown in the graph) demonstrates that `text-embedding-3-large` with a combined cross-encoder re-ranking strategy achieves the best performance across all recall@k metrics.
Key findings:
- OpenAI embeddings consistently outperform open-source alternatives
- The optimal k value for retrieval varies by model across the tested range (k = 1-50); see the recall@k table below
- Adding a cross-encoder for re-ranking significantly improves precision
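To make the re-ranking finding concrete, here is a sketch of cross-encoder re-ranking with `sentence-transformers`; the model name `cross-encoder/ms-marco-MiniLM-L-6-v2` is an assumed choice, not necessarily the one used in this evaluation.

```python
# Cross-encoder re-ranking sketch (model choice is an assumption).
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, candidates: list[str], top_k: int = 5) -> list[str]:
    # Score each (query, chunk) pair jointly; slower than bi-encoders
    # but more precise, hence applied only to the retrieved candidate set.
    scores = reranker.predict([(query, c) for c in candidates])
    ranked = sorted(zip(candidates, scores), key=lambda p: p[1], reverse=True)
    return [c for c, _ in ranked[:top_k]]
```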
| Model | Recall@1 | Recall@3 | Recall@5 | Recall@10 | Recall@25 | Recall@50 |
|---|---|---|---|---|---|---|
| text-embedding-3-small | 0.703 | 0.836 | 0.902 | 0.956 | 0.985 | 1.000 |
| all-mpnet-base-v2 | 0.636 | 0.803 | 0.864 | 0.927 | 0.990 | 1.000 |
| all-MiniLM-L6-v2 | 0.683 | 0.841 | 0.886 | 0.932 | 0.985 | 1.000 |
| BAAI/bge-small-en-v1.5 | 0.725 | 0.851 | 0.883 | 0.931 | 0.976 | 1.000 |
Best model for each recall@k:
- Recall@1: BAAI/bge-small-en-v1.5 (0.7254)
- Recall@3: BAAI/bge-small-en-v1.5 (0.8508)
- Recall@5: text-embedding-3-small (0.9017)
- Recall@10: text-embedding-3-small (0.9559)
- Recall@25: all-mpnet-base-v2 (0.9898)
- Recall@50: text-embedding-3-small (1.0000)
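Recall@k here can be computed as the fraction of questions whose gold context chunk appears in the top-k retrieved results; a minimal sketch, with illustrative function and variable names:

```python
# Recall@k sketch over the 100-question dataset (names are illustrative).
def recall_at_k(retrieved: list[list[str]], gold: list[str], k: int) -> float:
    """retrieved[i] is the ranked list of chunk ids for question i;
    gold[i] is the id of the chunk holding that question's "context"."""
    hits = sum(1 for ranked, g in zip(retrieved, gold) if g in ranked[:k])
    return hits / len(gold)
```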
The system uses Docling's HybridChunker which:
- Preserves document structure (headings, paragraphs, tables)
- Maintains semantic coherence within chunks
- Optimizes chunk size for the specific embedding model
- Retains metadata and hierarchical relationships
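A small usage sketch of the chunker follows; the `max_tokens` value and the input file path are assumptions for illustration.

```python
# HybridChunker usage sketch (max_tokens and input path are assumptions).
from docling.chunking import HybridChunker
from docling.document_converter import DocumentConverter

doc = DocumentConverter().convert("docs/langgraph_overview.md").document
chunker = HybridChunker(max_tokens=512, merge_peers=True)

for chunk in chunker.chunk(dl_doc=doc):
    # chunk.text is the body; chunk.meta retains headings and provenance
    print(chunk.meta.headings, chunk.text[:80])
```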
- Figure Interpretation: Use LLMs to translate figures and charts into textual descriptions
- Implement cross-encoder re-ranking for improved retrieval precision
- Implement contextual RAG
- Explore Graph RAG, LightRAG, and PathRAG for more advanced relations (e.g., cross-document Q&A)
- Enhance evaluation with the 100-question Q&A dataset to measure system performance, e.g., using rubrics to score generation; such tests may help determine whether prebuilt chains (e.g., `langchain.chains`) would be beneficial
The system was evaluated using a custom dataset of 100 questions covering:
- Different difficulty levels (easy, medium, hard)
- Various answer types (factoid, descriptive, procedural)
- Different question types (technical, conceptual, comparative)
Performance metrics include:
- Relevance of retrieved contexts
- Answer accuracy and comprehensiveness
- Response time and efficiency
This project is licensed under the MIT License - see the LICENSE file for details.
- Docling for document processing capabilities
- LanceDB for vector storage
- OpenAI for embedding and language models