Stirling B🤖t is a chatbot that answers frequently asked questions (FAQs) about the University of Stirling. It scrapes relevant information from the university's website and extracts additional content from PDF documents, then uses Natural Language Processing (NLP) tools to answer user queries conversationally, for a seamless and informative experience.
Access the FAQ Chatbot here: Stirling Bot
- Enhance User Experience: Provide a convenient and readily accessible way to find university-related information.
- Showcase Technical Skills: Demonstrate web scraping, Large Language Model (LLM) and NLP capabilities using Python.
- Build an Interactive Interface: Develop a Streamlit application for a seamless user experience.
- Ensure Accuracy and High Performance: Utilize Retrieval-Augmented Generation (RAG) techniques and advanced Language Model capabilities to deliver precise and efficient responses.
The project implements a robust ETL (Extract, Transform, Load) pipeline to ensure dynamic data handling.
- Web Scraping: Data is extracted from the University of Stirling website.
- PDF Data Extraction: Relevant information is extracted from PDF documents.
- Data Cleaning: Remove unnecessary spaces, HTML tags, links, etc.
- Data Transformation: Use GPT to transform the data into question-answer pairs.
- Data Storage: Save the generated question-answer pairs in a CSV file.
- Data Processing: Chunk the processed CSV data and generate embeddings.
- Data Loading: Load the embeddings into a vector database.
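The extraction and cleaning steps above can be sketched roughly as follows. This is a minimal illustration, not the project's actual scraper: the URL is a placeholder, and the real pipeline may strip different elements.

```python
import re

import requests
from bs4 import BeautifulSoup


def clean_html(html: str) -> str:
    """Strip tags, scripts, and extra whitespace from raw HTML."""
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style"]):  # drop non-content elements
        tag.decompose()
    text = soup.get_text(separator=" ")
    return re.sub(r"\s+", " ", text).strip()  # collapse runs of whitespace


def scrape_page(url: str) -> str:
    """Fetch a page (placeholder URL) and return its cleaned text."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    return clean_html(response.text)
```

The cleaned text is what gets handed to GPT for transformation into question-answer pairs.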
- Python: The backbone of the entire project, providing a versatile and powerful environment for development.
- Beautiful Soup: Efficiently extracts FAQ data from the university website.
- Requests: Handles HTTP requests to web pages or APIs to fetch HTML content or other data.
- LangChain: A framework for building LLM applications, used here for text processing, document management, and chatbot orchestration.
- Pinecone: A vector database for efficient text retrieval and search.
- NLTK: Natural Language Toolkit, useful for working with human language data.
- OpenAI: Utilizes a large language model (LLM) to generate human-like responses for enhanced conversational fluency.
- Rouge: A set of metrics for evaluating automatic summarization and machine translation.
- BERTScore: A metric for evaluating text generation models based on BERT embeddings.
- Streamlit: A framework for creating interactive web apps to present the chatbot interface.
- Pandas: A powerful library for data manipulation and processing.
- Matplotlib: A comprehensive library for creating static, animated, and interactive visualizations in Python.
- PyPDF2: Reads and extracts text from PDF documents.
- python-dotenv: Manages environment variables, loading them from a `.env` file.
- Isort: Automatically sorts imports in Python files to maintain a consistent and organized import structure.
- Black: Enforces a consistent coding style through automatic code formatting.
- Flake8: Checks for compliance with Python coding standards, ensuring clean and error-free code.
- Rich: Enhances the command-line interface and debugging output with rich text and formatting.
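The chunking step from the pipeline above can be illustrated with a simple fixed-size, overlapping splitter. The sizes below are illustrative defaults, not the parameters the project actually uses:

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into overlapping character chunks ready for embedding."""
    if size <= overlap:
        raise ValueError("chunk size must be larger than the overlap")
    chunks = []
    start = 0
    step = size - overlap
    while start < len(text):
        chunks.append(text[start:start + size])
        start += step
    return chunks
```

Overlap keeps a sentence that straddles a chunk boundary visible in both neighbouring chunks, which improves retrieval from the vector database.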
- Clone this repository:

  ```shell
  git clone https://github.com/edward-mike/multi-language-faq-chatbot.git
  ```

- Create a virtual environment (recommended) and activate it:

  ```shell
  python -m venv venv
  source venv/bin/activate  # On Windows use `venv\Scripts\activate`
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  ```

- Create a `.env` file. Refer to `.env.example` for the required configuration.

- Run the app:

  ```shell
  streamlit run main.py
  ```

- Copy and paste the local URL http://localhost:8501 into your browser.
You can connect with me on LinkedIn.
To all contributors of open source and free software used in this project, thank you👏
- Implement Multi-language Support: Extend the system to handle queries in multiple languages.
- Advanced LLM Evaluation: Develop more sophisticated methods for evaluating large language model outputs.