Data Preparation in RAG

Getting started

  1. Clone the repository:
git clone https://github.com/mage-ai/rag-project
cd rag-project
  2. Navigate to the rag-project/llm directory and add spacy to requirements.txt.
  3. Update the Dockerfile found in the rag-project directory with the following line:
RUN python -m spacy download en_core_web_sm
  4. Run
`./scripts/start.sh`

Once started, go to http://localhost:6789/

For more setup information, refer to these instructions

0. Module overview

1. Ingest

In this section, we cover the ingestion of documents from a single data source.
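As a rough sketch, an ingestion step can be as simple as downloading the raw documents from one source and returning them as a list of dictionaries. The URL and the JSON shape below are placeholder assumptions, not the module's actual data source:

```python
import requests

# Placeholder URL and JSON shape -- substitute the actual document source used in the module.
DOCS_URL = "https://example.com/documents.json"

def ingest_documents(url: str = DOCS_URL) -> list[dict]:
    """Fetch the raw documents from a single source and return them as a list of dicts."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()
```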

2. Chunk

Once data is ingested, we break it into manageable chunks.

The Q&A data is already chunked - the texts are small and easy to process and index. But other datasets might not be (book texts, transcripts, etc).

In this video, we will talk about turning large texts into smaller documents - i.e. chunking.

Code
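For illustration, here is a minimal sliding-window chunker. The chunk size and overlap are arbitrary assumptions, and the module's own code may chunk differently (for example, by sentences):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a long text into overlapping, fixed-size character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # overlap keeps some context across chunk boundaries
    return chunks
```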

3. Tokenization

Tokenization is a crucial step in text processing and preparing the data for effective retrieval.

Code
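As one possible approach, the spaCy model installed during setup can be used to tokenize the chunks; whether the module's code uses spaCy for this exact step is an assumption:

```python
import spacy

# en_core_web_sm is the model downloaded in the Dockerfile during setup.
nlp = spacy.load("en_core_web_sm")

def tokenize(text: str) -> list[str]:
    """Split a chunk into lowercase word tokens, dropping punctuation and whitespace."""
    doc = nlp(text)
    return [token.text.lower() for token in doc if not token.is_punct and not token.is_space]
```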

4. Embed

Embedding data translates text into numerical vectors that can be processed by models.

Previously we used sentence transformers for this. In this video, we show a different strategy.

Code
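One possible strategy (a sketch, not necessarily the one shown in the video) is to reuse the spaCy model from the setup step and take the document vector it produces:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def embed_chunk(text: str) -> list[float]:
    """Turn one chunk of text into a fixed-size numerical vector."""
    doc = nlp(text)
    return doc.vector.tolist()  # vector length depends on the model (96 for en_core_web_sm)
```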

5. Export

After processing, data needs to be exported for storage so that it can be retrieved for better contextualization of user queries.

Here we will save the embeddings to Elasticsearch.

Make sure to use the name given to your Elasticsearch service in your Docker Compose file, followed by its port, as the connection string, i.e. http://<docker-compose-service-name>:<port>, for example:

http://elasticsearch:9200

Code
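A minimal export sketch, assuming the Elasticsearch 8 Python client and the illustrative index name and mapping below; the dims value must match the length of your embedding vectors:

```python
from elasticsearch import Elasticsearch

# Connection string: docker-compose service name + port, as noted above.
es = Elasticsearch("http://elasticsearch:9200")

INDEX = "documents"  # illustrative index name

mapping = {
    "properties": {
        "text": {"type": "text"},
        "embedding": {
            "type": "dense_vector",
            "dims": 96,  # must equal the embedding length produced in the previous step
            "index": True,
            "similarity": "cosine",
        },
    }
}

if not es.indices.exists(index=INDEX):
    es.indices.create(index=INDEX, mappings=mapping)

def export_chunks(chunks: list[str], embeddings: list[list[float]]) -> None:
    """Index each chunk together with its embedding so it can be retrieved later."""
    for i, (text, vector) in enumerate(zip(chunks, embeddings)):
        es.index(index=INDEX, id=i, document={"text": text, "embedding": vector})
```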

6. Retrieval: Test Vector Search Query

After exporting the chunks and embeddings, we can test the search query to retrieve relevant documents on sample queries.

Code
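Continuing the sketch above (same es client, INDEX, and embed_chunk assumptions), a test vector search could look like this with the Elasticsearch 8 kNN query:

```python
query = "how do I run the pipeline?"   # illustrative sample query
query_vector = embed_chunk(query)      # must use the same embedding step as indexing

results = es.search(
    index=INDEX,
    knn={
        "field": "embedding",
        "query_vector": query_vector,
        "k": 5,
        "num_candidates": 100,
    },
)

for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"][:80])
```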

7. Trigger Daily Runs

Automation is key to maintaining and updating your system. This section demonstrates how to schedule and trigger daily runs for your data pipelines, ensuring up-to-date and consistent data processing.

Homework

See here.

Notes