Data Preparation in RAG

Getting started

  1. Clone the repository:
git clone https://github.com/mage-ai/rag-project
cd rag-project
  2. Navigate to the rag-project/llm directory and add spacy to requirements.txt.
  3. Update the Dockerfile found in the rag-project directory with the following line:
RUN python -m spacy download en_core_web_sm
  4. Run
`./scripts/start.sh`

Once started, go to http://localhost:6789/

For more setup information, refer to these instructions

0. Module overview

1. Ingest

In this section, we cover the ingestion of documents from a single data source.
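As a rough sketch, an ingestion step can be as simple as downloading the raw documents from one source and returning them as a list of dictionaries. The URL and the JSON shape below are placeholder assumptions, not the module's actual data source:

```python
import requests

# Placeholder URL and JSON shape -- substitute the actual document source used in the module.
DOCS_URL = "https://example.com/documents.json"

def ingest_documents(url: str = DOCS_URL) -> list[dict]:
    """Fetch the raw documents from a single source and return them as a list of dicts."""
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    return response.json()
```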

2. Chunk

Once data is ingested, we break it into manageable chunks.

The Q&A data is already chunked - the texts are small and easy to process and index. But other datasets might not be (book texts, transcripts, etc).

In this video, we will talk about turning large texts into smaller documents - i.e. chunking.

Code
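For illustration, here is a minimal sliding-window chunker. The chunk size and overlap are arbitrary assumptions, and the module's own code may chunk differently (for example, by sentences):

```python
def chunk_text(text: str, chunk_size: int = 1000, overlap: int = 200) -> list[str]:
    """Split a long text into overlapping, fixed-size character chunks."""
    chunks = []
    start = 0
    while start < len(text):
        end = start + chunk_size
        chunks.append(text[start:end])
        if end >= len(text):
            break
        start = end - overlap  # overlap keeps some context across chunk boundaries
    return chunks
```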

3. Tokenization

Tokenization is a crucial step in text processing and preparing the data for effective retrieval.

Code
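As one possible approach, the spaCy model installed during setup can be used to tokenize the chunks; whether the module's code uses spaCy for this exact step is an assumption:

```python
import spacy

# en_core_web_sm is the model downloaded in the Dockerfile during setup.
nlp = spacy.load("en_core_web_sm")

def tokenize(text: str) -> list[str]:
    """Split a chunk into lowercase word tokens, dropping punctuation and whitespace."""
    doc = nlp(text)
    return [token.text.lower() for token in doc if not token.is_punct and not token.is_space]
```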

4. Embed

Embedding data translates text into numerical vectors that can be processed by models.

Previously we used sentence transformers for this. In this video, we show a different strategy.

Code
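One possible strategy (a sketch, not necessarily the one shown in the video) is to reuse the spaCy model from the setup step and take the document vector it produces:

```python
import spacy

nlp = spacy.load("en_core_web_sm")

def embed_chunk(text: str) -> list[float]:
    """Turn one chunk of text into a fixed-size numerical vector."""
    doc = nlp(text)
    return doc.vector.tolist()  # vector length depends on the model (96 for en_core_web_sm)
```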

5. Export

After processing, data needs to be exported for storage so that it can be retrieved for better contextualization of user queries.

Here we will save the embeddings to Elasticsearch.

Make sure to use the name given to your Elasticsearch service in your Docker Compose file, followed by its port, as the connection string, i.e. http://<docker-compose-service-name>:<port>, for example:

http://elasticsearch:9200

Code
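A minimal export sketch, assuming the Elasticsearch 8 Python client and the illustrative index name and mapping below; the dims value must match the length of your embedding vectors:

```python
from elasticsearch import Elasticsearch

# Connection string: docker-compose service name + port, as noted above.
es = Elasticsearch("http://elasticsearch:9200")

INDEX = "documents"  # illustrative index name

mapping = {
    "properties": {
        "text": {"type": "text"},
        "embedding": {
            "type": "dense_vector",
            "dims": 96,  # must equal the embedding length produced in the previous step
            "index": True,
            "similarity": "cosine",
        },
    }
}

if not es.indices.exists(index=INDEX):
    es.indices.create(index=INDEX, mappings=mapping)

def export_chunks(chunks: list[str], embeddings: list[list[float]]) -> None:
    """Index each chunk together with its embedding so it can be retrieved later."""
    for i, (text, vector) in enumerate(zip(chunks, embeddings)):
        es.index(index=INDEX, id=i, document={"text": text, "embedding": vector})
```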

6. Retrieval: Test Vector Search Query

After exporting the chunks and embeddings, we can test the search query to retrieve relevant documents on sample queries.

Code
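Continuing the sketch above (same es client, INDEX, and embed_chunk assumptions), a test vector search could look like this with the Elasticsearch 8 kNN query:

```python
query = "how do I run the pipeline?"   # illustrative sample query
query_vector = embed_chunk(query)      # must use the same embedding step as indexing

results = es.search(
    index=INDEX,
    knn={
        "field": "embedding",
        "query_vector": query_vector,
        "k": 5,
        "num_candidates": 100,
    },
)

for hit in results["hits"]["hits"]:
    print(hit["_score"], hit["_source"]["text"][:80])
```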

7. Trigger Daily Runs

Automation is key to maintaining and updating your system. This section demonstrates how to schedule and trigger daily runs for your data pipelines, ensuring up-to-date and consistent data processing.

Homework

See here.

Notes