- Clone the repository:

  ```bash
  git clone https://github.com/mage-ai/rag-project
  cd rag-project
  ```

- Navigate to the `rag-project/llm` directory and add `spacy` to `requirements.txt`.
- Then update the `Dockerfile` found in the `rag-project` directory with the following:

  ```dockerfile
  RUN python -m spacy download en_core_web_sm
  ```

- Run `./scripts/start.sh`

Once started, go to http://localhost:6789/
For more setup information, refer to these instructions
In this section, we cover the ingestion of documents from a single data source.
Once data is ingested, we break it into manageable chunks.
The Q&A data is already chunked - the texts are small and easy to process and index. But other datasets might not be (book texts, transcripts, etc.).
In this video, we will talk about turning large texts into smaller documents, i.e. chunking.
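As an illustration of the idea (not the exact Mage block used in the video), a sliding-window chunker with overlap might look like this; the `chunk_size` and `overlap` defaults are assumptions, not values prescribed by the course:

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into overlapping character-based chunks.

    chunk_size and overlap are illustrative defaults; overlap keeps
    some shared context between neighbouring chunks.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        start += chunk_size - overlap
    return chunks

# A 2500-character text yields chunks starting at 0, 800, 1600, 2400.
chunks = chunk_text("a" * 2500)
```

Token- or sentence-based splitting works the same way; only the unit being counted changes.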
Tokenization is a crucial step in text processing and preparing the data for effective retrieval.
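In the pipeline this step uses spaCy's `en_core_web_sm` model (installed in the setup above); as a rough sketch of what tokenization produces, here is a naive regex-based stand-in:

```python
import re

def tokenize(text):
    # Naive tokenizer: split into word tokens and punctuation tokens.
    # The real pipeline uses spaCy's en_core_web_sm, which also handles
    # contractions, abbreviations, and other edge cases.
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("Chunking splits large texts.")
# → ['Chunking', 'splits', 'large', 'texts', '.']
```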
Embedding data translates text into numerical vectors that can be processed by models.
Previously we used sentence transformers for that. In this video we show a different strategy for it.
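One common strategy (and the one spaCy's `doc.vector` uses) is mean pooling: average the token vectors to get a single document vector. A toy sketch with made-up 3-dimensional vectors follows; real model vectors are much higher-dimensional:

```python
# Toy word vectors for illustration only; a real pipeline would
# look these up in spaCy's model.
WORD_VECTORS = {
    "data":  [1.0, 0.0, 2.0],
    "query": [0.0, 2.0, 0.0],
}

def embed(tokens, dim=3):
    """Mean-pool token vectors into one document vector.

    Unknown tokens fall back to a zero vector of the same dimension.
    """
    vectors = [WORD_VECTORS.get(t, [0.0] * dim) for t in tokens]
    return [sum(vals) / len(vectors) for vals in zip(*vectors)]

doc_vector = embed(["data", "query"])  # → [0.5, 1.0, 1.0]
```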
After processing, data needs to be exported for storage so that it can be retrieved for better contextualization of user queries.
Here we will save the embeddings to Elasticsearch.

Please make sure to use the name given to your Elasticsearch service in your Docker Compose file, followed by the port, as the connection string, e.g.:

```
http://<docker-compose-service-name>:<port>
http://elasticsearch:9200
```
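A sketch of the export step using the `elasticsearch` Python client's bulk helper; the index name and field names here are assumptions, not the exact values used in the course pipeline:

```python
def build_actions(chunks, embeddings, index_name="documents"):
    """Yield bulk-indexing actions pairing each chunk with its vector."""
    for chunk, vector in zip(chunks, embeddings):
        yield {
            "_index": index_name,
            "_source": {"text": chunk, "embedding": vector},
        }

def export_to_elasticsearch(chunks, embeddings):
    from elasticsearch import Elasticsearch
    from elasticsearch.helpers import bulk

    # Connection string: Docker Compose service name + port, as above.
    es = Elasticsearch("http://elasticsearch:9200")
    bulk(es, build_actions(chunks, embeddings))

actions = list(build_actions(["chunk one"], [[0.1, 0.2]]))
```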
After exporting the chunks and embeddings, we can run sample queries against the index to check that relevant documents are retrieved.
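Such a test can run a k-nearest-neighbours query against the index; a minimal sketch of building the request body, where the `embedding` field name and the value of `k` are assumptions:

```python
def knn_query(query_vector, field="embedding", k=5):
    """Build an Elasticsearch knn search body for a query embedding."""
    return {
        "knn": {
            "field": field,
            "query_vector": query_vector,
            "k": k,
            "num_candidates": 10 * k,  # candidate pool per shard
        },
        "_source": ["text"],
    }

body = knn_query([0.1, 0.2, 0.3])
# The body is then passed to the Elasticsearch client's search call.
```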
Automation is key to maintaining and updating your system. This section demonstrates how to schedule and trigger daily runs for your data pipelines, ensuring up-to-date and consistent data processing.
See here.
- First link goes here
- Notes by Abiodun: Mage RAG error fixes
- Did you take notes? Add them above this line (Send a PR with links to your notes)