This repository contains two Jupyter notebooks that form a pipeline for document ingestion, chunking, and language model interaction. These notebooks allow you to convert documents (like PDFs), split them into smaller chunks, and send those chunks to a large language model (such as Mistral-7B Instruct) for processing or text generation.
- `injest-splitter.ipynb`: a notebook that sends chunks of text to a large language model for completion or generation tasks. It uses the Mistral-7B Instruct model via an API.
- `injest-local.ipynb`: a notebook that ingests documents, splits them into chunks, and prepares them for further processing by the language model.
- Python 3.10+
- Jupyter Notebook
- Libraries (installed by the notebooks):
  - `docling`
  - `quackling`
  - `llama-index`
  - `semantic-router`
  - `semantic-chunkers`
  - `rich`
- Clone the repository:

  ```bash
  git clone <repository_url>
  cd <repository_folder>
  ```

- Install the required libraries: run the first few code cells in either notebook to install the dependencies automatically via `%pip`.
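The install cells amount to something like the following sketch (the notebooks may pin specific versions):

```python
# Install the libraries listed above from inside a notebook cell.
%pip install docling quackling llama-index semantic-router semantic-chunkers rich
```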
The `injest-local.ipynb` notebook ingests a document and splits it into chunks:

- Load your document (PDF) via the `source` variable.
- The `DocumentConverter` and `DoclingPDFReader` process the document into a chunked format.
- The chunking itself is handled by `RollingWindowSplitter` and `StatisticalChunker`, which store the chunks in the `splits` and `chunks` variables.

You can modify the chunking parameters, including `min_split_tokens` and `max_split_tokens`, to fit your needs.
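For illustration, a splitter configuration might look roughly like this. This is a sketch, not the notebook's exact code: the `HuggingFaceEncoder` choice, the token bounds, and the `document_text` placeholder are assumptions to adapt.

```python
from semantic_router.encoders import HuggingFaceEncoder
from semantic_router.splitters import RollingWindowSplitter

# Encoder that scores semantic similarity between adjacent windows
# (assumed choice; the notebook may use a different encoder).
encoder = HuggingFaceEncoder()

# The token bounds control how small or large each chunk may be.
splitter = RollingWindowSplitter(
    encoder=encoder,
    min_split_tokens=100,  # assumption: tune for your document
    max_split_tokens=500,  # assumption: tune for your document
)

# `document_text` would come from the docling conversion step.
document_text = "full text of the converted document..."
splits = splitter([document_text])
```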
If you want to save the chunked output, you can write it to a file. Here's an example of saving the chunks to a JSON file:

```python
import json

# Serialize each chunk to a plain dict and write the list to disk.
with open("chunked_data.json", "w") as f:
    json.dump([chunk.to_dict() for chunk in chunks], f)
```
Once the document is chunked, use the `injest-splitter.ipynb` notebook to send those chunks to a large language model.

- Load the chunked data (either by running the chunking notebook first or by loading a previously saved file).
- The notebook uses the OpenLLM library to interact with the model and streams responses for each chunk.

You can modify the `max_tokens` and `timeout` settings to control the model's output length and response time.
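As a rough sketch of the request loop, here is one way to stream responses, assuming the `LLM_URL` endpoint is OpenAI-compatible (OpenLLM can serve one). The notebook itself uses the OpenLLM client, so the client library, model name, and chunk key below are all assumptions:

```python
import json
import os

from openai import OpenAI  # assumption: OpenAI-compatible endpoint

client = OpenAI(base_url=os.environ["LLM_URL"], api_key=os.environ["API_KEY"])

# Load chunks previously saved by injest-local.ipynb.
with open("chunked_data.json") as f:
    chunks = json.load(f)

for chunk in chunks:
    # Stream a completion per chunk; tune max_tokens/timeout as needed.
    stream = client.chat.completions.create(
        model="mistralai/Mistral-7B-Instruct-v0.2",  # assumption: model name
        messages=[{"role": "user", "content": chunk["content"]}],  # "content" key is assumed
        max_tokens=512,
        timeout=60,
        stream=True,
    )
    for event in stream:
        delta = event.choices[0].delta.content
        if delta:
            print(delta, end="", flush=True)
    print()
```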
- Step 1: Run `injest-local.ipynb` to ingest and chunk the document.
- Step 2: Save the chunked data to a file (optional).
- Step 3: Run `injest-splitter.ipynb` to send the chunks to the language model and receive the responses.
Both notebooks require environment variables to connect to the language model API:

- `API_KEY`: your API key for accessing the model.
- `LLM_URL`: the base URL for the language model API.

You can load these environment variables from a `.env` file or set them directly in the notebook.

Example `.env` file:

```
API_KEY=your-api-key
LLM_URL=your-llm-url
```
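If you use a `.env` file, a loading cell could look like this sketch; it assumes `python-dotenv` is installed, which is not in the dependency list above:

```python
import os

from dotenv import load_dotenv  # assumption: pip install python-dotenv

load_dotenv()  # reads .env from the working directory into the process env

api_key = os.environ["API_KEY"]
llm_url = os.environ["LLM_URL"]
```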
- Adjust the chunking parameters in `injest-local.ipynb` for your document.
- Modify the prompt generation and response handling in `injest-splitter.ipynb` to fit your needs.
This project is licensed under the Apache License 2.0 - see the LICENSE file for details.