RAG model on the Medium blogs and articles dataset. A chatbot interface to help improve bot response and information accuracy.
The project implements RAG capabilities in a chat interface, creating an AI assistant that answers the user's query with more precise context and higher accuracy. For this project, I have used the Medium blogs and articles dataset. The project is based on the research paper Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks.
This project is deployed on my Hugging Face Space. Have fun!
Note
Currently, the capabilities of the model deployed on the Hugging Face Space are limited due to hardware resources.
In the project, I have used a dataset of Medium articles and blogs, which is available on Kaggle. The data was scraped from Medium's website. It comes in .csv format with more than 100 columns. For this project, I had to modify the dataset, extract the relevant information, and combine it into two columns, 'title' and 'text', separated by a tab '\t'.
Important
If you want to use the original code, your data must have two columns named 'title' and 'text', separated by a tab '\t'.
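# Save your data as a tab-separated .csv file
import pandas as pd

df = pd.read_csv('./path/to/your/file')
# Modify your DataFrame as you like, keeping only the 'title' and 'text' columns.
modified_df = df[['title', 'text']]
# index=False keeps the row index out of the file, so it has exactly two columns.
modified_df.to_csv('./path/to/your/csv', sep='\t', index=False)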
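Before building the knowledge base, it can help to confirm that the saved file really is tab-separated and carries both columns. The following is a minimal sketch using only the standard library; the helper name `check_tsv_header` and the in-memory sample are hypothetical, and it assumes the file starts with a header row, as pandas writes by default.

```python
import csv
import io

# Column names required by this README.
REQUIRED_COLUMNS = ("title", "text")


def check_tsv_header(handle):
    """Return the required column names missing from a tab-separated file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, [])  # empty file -> empty header
    return [name for name in REQUIRED_COLUMNS if name not in header]


# Hypothetical in-memory sample standing in for a real file on disk.
sample = io.StringIO("title\ttext\nWhat is RAG?\tRetrieval plus generation.\n")
print("missing columns:", check_tsv_header(sample))
```

For a file on disk, the same helper could be called as `check_tsv_header(open(path, newline="", encoding="utf-8"))`. A comma-separated file would report both columns as missing, because its header is read as a single field.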
The RAG (Retrieval-Augmented Generation) model has two major components: a retriever and a generator. The retriever extracts information from the knowledge base that is relevant to the user's query and feeds it to the generator together with the query. The generator then uses that retrieved information as context for the query to generate a relevant answer. With the help of the retrieved information, the LLM used for generation can produce better and more up-to-date results. This addresses two major drawbacks of LLMs: hallucination and outdated information. With the RAG model, we only need to update the knowledge base (so that updated information can be retrieved) and feed the retrieved passages to the generator. This saves computational resources and the cost of retraining the model.
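The retrieve-then-generate flow can be sketched in a few lines. This is a toy illustration, not the project's implementation: the retriever here ranks passages by simple word overlap instead of dense embeddings, the generator is a placeholder function, and all names (`retrieve`, `build_prompt`, `answer`, `KNOWLEDGE_BASE`) are hypothetical.

```python
def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real tokenizer."""
    return set(text.lower().split())


def retrieve(query, knowledge_base, k=2):
    """Return the k passages sharing the most words with the query."""
    query_tokens = tokenize(query)
    ranked = sorted(
        knowledge_base,
        key=lambda passage: len(query_tokens & tokenize(passage["text"])),
        reverse=True,
    )
    return ranked[:k]


def build_prompt(query, passages):
    """Place the retrieved passages before the question as context."""
    context = "\n".join(f"- {p['title']}: {p['text']}" for p in passages)
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


def answer(query, knowledge_base, generate, k=2):
    """Retrieve context, then hand the combined prompt to the generator."""
    return generate(build_prompt(query, retrieve(query, knowledge_base, k)))


# Made-up knowledge base with the same 'title' and 'text' fields as the dataset.
KNOWLEDGE_BASE = [
    {"title": "Intro to RAG", "text": "RAG combines a retriever and a generator"},
    {"title": "Cooking pasta", "text": "Boil water and add salt"},
    {"title": "FAISS basics", "text": "FAISS builds an index for similarity search"},
]


def echo_generator(prompt):
    """Placeholder for an LLM call; it only reports the prompt size."""
    return f"[generator received {len(prompt)} characters of prompt]"


print(answer("what is a retriever", KNOWLEDGE_BASE, echo_generator, k=1))
```

Updating the assistant's knowledge in this sketch means editing `KNOWLEDGE_BASE` only; the generator function is untouched, which is the cost saving described above.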
We have to create a knowledge base from which the retriever can fetch information relevant to the query (input). The original paper uses FAISS (Facebook AI Similarity Search) indexing to create its knowledge base. In this project, I have used the same method for my dataset.
import faiss
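To give an intuition for what a similarity index does, here is a pure-Python sketch of exact inner-product search over a handful of vectors. It does not call FAISS; it only mimics, by brute force, the kind of query a flat inner-product index answers. The vectors are made up, and the function names are hypothetical.

```python
import heapq


def inner_product(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def search(index_vectors, query_vector, k=2):
    """Return (score, position) pairs for the k highest inner products."""
    scored = (
        (inner_product(vector, query_vector), position)
        for position, vector in enumerate(index_vectors)
    )
    return heapq.nlargest(k, scored)


# Made-up 2-dimensional "passage embeddings"; real embeddings are much larger.
passage_vectors = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
query_vector = [1.0, 0.2]

for score, position in search(passage_vectors, query_vector, k=2):
    print(f"passage {position} scored {score:.2f}")
```

A library such as FAISS exists because this brute-force loop becomes too slow once the knowledge base holds many high-dimensional passage vectors.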
You can create a FAISS index for your own dataset by using the Python file use_own_knowledge_dataset.py. This Python file is from the original research paper. As mentioned before, if you are using a dataset in .csv format, **it must be a tab-separated .csv file with columns "title" and "text"**, or you can modify the code according to your needs.
python ./use_own_knowledge_dataset.py --csv_path path/to/my_csv --output_dir path/to/my_knowledge_dataset
In the project, I have used facebook/rag-sequence-nq for the retriever, which initializes the question encoder with facebook/dpr-question_encoder-single-nq-base, and google/gemma-2-2b-it for the generator.
If you want to see the step-by-step implementation, you can access my Google Colab notebook.
-- coming soon --