AyuMind: Chatbot using RAG (Pebblo loader enabled) #2

Open · wants to merge 2 commits into `main`
1 change: 1 addition & 0 deletions .env.example
@@ -0,0 +1 @@
GOOGLE_API_KEY=""
3 changes: 3 additions & 0 deletions .gitignore
@@ -158,3 +158,6 @@ cython_debug/
# and can be added to the global gitignore or merged into this file. For a more nuclear
# option (not recommended) you can uncomment the following to ignore the entire idea folder.
#.idea/


per_dir/
100 changes: 99 additions & 1 deletion README.md
@@ -1 +1,99 @@
# pebblo-hackathons
## Overview

This repository contains a Streamlit application that combines LangChain and Google Generative AI into an interactive, retrieval-augmented (RAG) chatbot. The chatbot answers user queries from a pre-loaded PDF document: the document is ingested through Pebblo's safe loader, split into chunks, and stored in a vector database for efficient retrieval, while Google's safety settings govern the generated responses. A condensed sketch of this pipeline follows the feature list below.

## Features

- **Google Generative AI Integration**: Uses the `ChatGoogleGenerativeAI` model (`gemini-1.5-pro`) to generate responses.
- **Pebblo Safe Loading**: Wraps the PDF loader in `PebbloSafeLoader` so document ingestion is visible to Pebblo's data-governance reporting.
- **PDF Document Processing**: Loads the PDF and splits it into chunks for efficient retrieval.
- **Vector Database**: Persists the chunks in a `Chroma` vector database for fast, accurate query retrieval.
- **Streamlit Interface**: Provides a simple, interactive chat interface.
- **Safety Settings**: Controls harmful content via Google's `HarmCategory`/`HarmBlockThreshold` safety settings.
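
The sketch below condenses the pipeline that `app.py` implements. It is illustrative only: the PDF path, Pebblo app name, and query are placeholders, and it assumes `GOOGLE_API_KEY` is set in the environment.

```python
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI
from langchain_community.document_loaders import PyPDFLoader, PebbloSafeLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.chains import RetrievalQA

# Load the PDF through Pebblo for data-governance visibility
docs = PebbloSafeLoader(PyPDFLoader("docs/sample.pdf"), name="demo").load()

# Split the document text into large, slightly overlapping chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=50)
chunks = splitter.split_text("\n\n".join(d.page_content for d in docs))

# Embed the chunks and build a retrieval-augmented QA chain
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectordb = Chroma.from_texts(chunks, embeddings, persist_directory="per_dir")
qa = RetrievalQA.from_chain_type(
    llm=ChatGoogleGenerativeAI(model="gemini-1.5-pro"),
    retriever=vectordb.as_retriever(),
)
print(qa.invoke({"query": "What is a dosha?"})["result"])
```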

## Installation

### Prerequisites

Ensure you have the following installed:

- Python 3.8+
- [Streamlit](https://streamlit.io/)
- [Google Generative AI](https://developers.generativeai.google)
- [LangChain](https://langchain.readthedocs.io/)
- [python-dotenv](https://pypi.org/project/python-dotenv/)

Note: `PebbloSafeLoader` reports document ingestion to a local Pebblo server; see the Pebblo documentation for how to install and run it.

### Setup

1. Clone the repository:
```bash
git clone https://github.com/yourusername/your-repo-name.git
cd your-repo-name
```

2. Create a virtual environment and activate it:
```bash
python -m venv venv
source venv/bin/activate # On Windows use `venv\Scripts\activate`
```

3. Install the required dependencies:
```bash
pip install -r requirements.txt
```

4. Set up your environment variables by creating a `.env` file in the root directory:
```
GOOGLE_API_KEY=your_google_api_key
```

5. Place the PDF document in the `docs` directory:
```
docs/Merged_Ayurbeat_Everyday_Ayurveda.pdf
```

## Usage

1. Run the Streamlit application:
```bash
streamlit run app.py
```

2. The application will start by checking if a vector database already exists. If not, it will process the provided PDF document, split it into chunks, and create a vector database.

3. Once the setup is complete, the application will load the vector database and present a chat interface.

4. Enter your query in the chat input, and the chatbot will respond based on the content of the processed PDF.
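
For a quick sanity check outside Streamlit, you can query the persisted vector store directly. A minimal sketch, assuming the `per_dir` database was already built by a previous run (the query string is a placeholder):

```python
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings
from dotenv import load_dotenv

load_dotenv()  # GOOGLE_API_KEY is needed for the embedding calls

embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")
vectordb = Chroma(persist_directory="per_dir", embedding_function=embeddings)

# Inspect the chunks the retriever would hand to the model
for doc in vectordb.similarity_search("daily ayurvedic routine", k=3):
    print(doc.page_content[:200], "\n---")
```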

## File Structure

- **app.py**: Main script containing the Streamlit application logic.
- **docs/**: Directory containing the PDF document to be processed.
- **per_dir/**: Directory where the vector database is persisted (git-ignored).
- **requirements.txt**: Python dependencies.
- **.env**: Environment variables, including the Google API key (see `.env.example`).

## Configuration

- **Google API Key**: Required for Google Generative AI; store it in the `.env` file.
- **PDF Loader**: Currently hard-coded to a specific PDF; modify the path in `app.py` to use a different document (see the sketch below).
- **Safety Settings**: Configurable via the `HarmCategory` and `HarmBlockThreshold` parameters passed to `ChatGoogleGenerativeAI` (also shown below).
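
For example, the relevant lines in `app.py` could be adapted as follows. This is a sketch: `docs/my_document.pdf` is a placeholder, and `BLOCK_MEDIUM_AND_ABOVE` is one of the stricter thresholds the API accepts.

```python
from langchain_community.document_loaders import PyPDFLoader, PebbloSafeLoader
from langchain_google_genai import ChatGoogleGenerativeAI, HarmBlockThreshold, HarmCategory

# Swap in your own PDF; PebbloSafeLoader wraps any LangChain loader
pdf_loader = PebbloSafeLoader(PyPDFLoader("docs/my_document.pdf"), name="Final")

# Block dangerous content at medium probability and above,
# instead of deferring to the API's default behaviour
model = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    safety_settings={
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.BLOCK_MEDIUM_AND_ABOVE,
    },
)
```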

## Dependencies

The project relies on the following Python packages:

- `langchain`
- `langchain-community`
- `langchain-chroma`
- `chromadb`
- `langchain-google-genai`
- `google-generativeai`
- `streamlit`
- `python-dotenv`
- `pypdf`

Ensure all dependencies are installed using the provided `requirements.txt`.

## License

This project is licensed under the MIT License. See the `LICENSE` file for more details.

---

By following the instructions in this README, you can set up and run the AI-powered chatbot that processes and responds to user queries based on the content of a PDF document. The chatbot is designed to provide safe, contextually relevant information while ensuring user interactions are positive and constructive.
101 changes: 101 additions & 0 deletions app.py
@@ -0,0 +1,101 @@
from langchain_chroma import Chroma
from langchain_google_genai import GoogleGenerativeAIEmbeddings, ChatGoogleGenerativeAI, HarmBlockThreshold, HarmCategory
import google.generativeai as genai
from langchain_community.document_loaders import PyPDFLoader, PebbloSafeLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
from langchain.callbacks.manager import CallbackManager
from langchain.chains import RetrievalQA
from dotenv import load_dotenv
import streamlit as st
import os


# Set the PWD environment variable to the current working directory;
# the path joins below build on it
os.environ['PWD'] = os.getcwd()

# Load environment variables (GOOGLE_API_KEY) from the .env file
load_dotenv()

# Directory where the Chroma vector database is persisted between runs
persist_directory = os.path.join(os.environ['PWD'], 'per_dir')
embeddings = GoogleGenerativeAIEmbeddings(model="models/embedding-001")

# Initialize session state for the chat history
if "history" not in st.session_state:
    st.session_state.history = []

# Configure the Google Generative AI client
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# Callback manager that streams model tokens to stdout
callback_manager = CallbackManager([StreamingStdOutCallbackHandler()])

# Initialize the ChatGoogleGenerativeAI model with streaming output and
# safety settings; HARM_BLOCK_THRESHOLD_UNSPECIFIED defers to the API's
# default blocking behaviour for dangerous content
model = ChatGoogleGenerativeAI(
    model="gemini-1.5-pro",
    safety_settings={
        HarmCategory.HARM_CATEGORY_DANGEROUS_CONTENT: HarmBlockThreshold.HARM_BLOCK_THRESHOLD_UNSPECIFIED,
    },
    callback_manager=callback_manager
)

# Build the vector database on the first run; reuse it afterwards
if not os.path.exists(persist_directory):
    with st.spinner('🚀 Starting your bot. This might take a while'):
        # Data pre-processing: load the PDF through PebbloSafeLoader so the
        # ingestion is visible to Pebblo's data-governance reporting
        pdf_loader = PebbloSafeLoader(
            PyPDFLoader(os.path.join(os.environ['PWD'], 'docs', 'Merged_Ayurbeat_Everyday_Ayurveda.pdf')),
            name="Final"
        )
        pdf_documents = pdf_loader.load()

        # Split the loaded text into large, slightly overlapping chunks
        splitter = RecursiveCharacterTextSplitter(chunk_size=10000, chunk_overlap=50)
        pdf_context = "\n\n".join(str(p.page_content) for p in pdf_documents)
        pdfs = splitter.split_text(pdf_context)

        print("Data Processing Complete")

        # Create the vector database from the processed text chunks;
        # langchain_chroma persists automatically when persist_directory is
        # set, so the old vectordb.persist() call is not needed (and no
        # longer exists on the Chroma class)
        vectordb = Chroma.from_texts(pdfs, embeddings, persist_directory=persist_directory)

        print("Vector DB Creation Complete\n")
else:
    # Load the existing vector database
    vectordb = Chroma(persist_directory=persist_directory, embedding_function=embeddings)
    print("Vector DB Loaded\n")

# Build the retrieval QA chain that answers queries from the vector store
query_chain = RetrievalQA.from_chain_type(
    llm=model,
    retriever=vectordb.as_retriever()
)

# Display the chat history
for msg in st.session_state.history:
    with st.chat_message(msg['role']):
        st.markdown(msg['content'])

# Handle user input
prompt = st.chat_input("Say something")
if prompt:
    # Keep the stored and displayed history clean; the guardrail
    # instruction below is appended only to the query sent to the model
    st.session_state.history.append({
        'role': 'user',
        'content': prompt
    })

    # Display the user message in the chat
    with st.chat_message("user"):
        st.markdown(prompt)

    guardrail = (". Don't give harmful advice or anything that can hurt "
                 "someone's feelings or emotions or anything outside of "
                 "the context of this chat.")

    # Generate and display the assistant's response
    with st.spinner('💡 Thinking'):
        response = query_chain.invoke({"query": prompt + guardrail})

    st.session_state.history.append({
        'role': 'assistant',
        'content': response['result']
    })

    with st.chat_message("assistant"):
        st.markdown(response['result'])
Binary file added docs/Merged_Ayurbeat_Everyday_Ayurveda.pdf
Binary file not shown.
9 changes: 9 additions & 0 deletions requirements.txt
@@ -0,0 +1,9 @@
langchain
langchain-community
chromadb
langchain-chroma
streamlit
google-generativeai
langchain-google-genai
python-dotenv
pypdf