Merge pull request #17 from ArturOle/readme_update
Update README.md
Showing 1 changed file with 14 additions and 14 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,28 +1,28 @@ | ||
# Ragger - from raw files to efficient Semantic Search (WIP) | ||
# ContextSearch - from raw files to efficient Semantic Search (WIP) | ||
|
||
Start date 22.08.2024 | ||
|
||
|
||
## Overview | ||
|
||
This project aims to develop easy-in-use automated system for Semantic Serching through the files. Two main scenarios for which the project wants to expand is personal semantic searching (searching for informations in the books, documents, and other documents), and as a interface for advanced grounding methods for Large Language Models (LLMs) projects line Retrieval Augmented Generation (RAG) or LoRA/QLoRA (Low-Rank Adaptation). As the main goal is avability, the system will support various ways of interactions from GUI, through CLI with Click, and REST API with FastAPI to gRPC with protobuf. | ||
This project aims to develop an easy-to-use automated system for semantic search over files. The project targets two main scenarios: personal semantic search (finding information in books and other documents), and serving as an interface for advanced grounding methods in Large Language Model (LLM) projects, such as Retrieval Augmented Generation (RAG) or LoRA/QLoRA (Low-Rank Adaptation). As the main goal is accessibility, the system will support several ways of interacting with it: a GUI, a CLI built with Click, a REST API built with FastAPI, and gRPC with protobuf. | ||
|
||
The Context Search functionality searches for documents in the given directories and read them for further processing. When a scanned document occures in the set, Tesseract 5 OCR is used to extract the text from the image. Later, the data is trnsformed, preprocessed and submitted to the Neo4j database. During the retrieval phase, the exact chunk of text is returned together with the simillarity score. | ||
The Context Search functionality searches for documents in the given directories and reads them for further processing. When a scanned document occurs in the set, Tesseract 5 OCR is used to extract the text from the image. Later, the data is transformed, preprocessed, and submitted to the Neo4j database. During the retrieval phase, the exact chunk of text is returned together with the similarity score. | ||
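The retrieval step described above (return the best-matching chunk together with its similarity score) can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the project's implementation: `chunk_text`, `embed`, and `retrieve` are hypothetical names, the bag-of-words `embed` is a toy stand-in for a real embedding model, and the chunks live in a plain list rather than in Neo4j.

```python
import math
from collections import Counter


def chunk_text(text: str, size: int = 8) -> list[str]:
    """Split text into chunks of at most `size` words (hypothetical helper)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(word.strip(".,").lower() for word in text.split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query: str, chunks: list[str]) -> tuple[str, float]:
    """Return the best-matching chunk together with its similarity score."""
    query_vec = embed(query)
    scored = [(chunk, cosine_similarity(query_vec, embed(chunk))) for chunk in chunks]
    return max(scored, key=lambda pair: pair[1])


document = (
    "Graph databases store nodes and relationships. "
    "Optical character recognition extracts text from scanned images."
)
chunks = chunk_text(document, size=7)
best_chunk, score = retrieve("text from scanned images", chunks)
print(best_chunk, round(score, 3))
```

In a real deployment the count vectors would be replaced by dense embeddings and the linear scan by a database query; the shape of the result (chunk plus score) is the part that mirrors the description above.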
|
||
![alt text](./docs/images/FirstGraphFromAutomaticGeneration.png) | ||
First Graph created with automatic generation. Graph of three ai-generated articles (pink node), text chunks with embedding (orange nodes) and keywords/tags with embeddings (blue nodes). | ||
First graph created with automatic generation: three AI-generated articles (pink nodes), text chunks with embeddings (orange nodes), and keywords/tags with embeddings (blue nodes). | ||
|
||
## Features | ||
- **Automatic file processing**: One function to submit all files, one function to search for answers. | ||
- **Automatic file processing**: One function to submit all files and one function to search for answers. | ||
- **OCR Integration**: Ragger doesn't care if you throw a real or scanned PDF into it. He is here to provide. | ||
- **Neo4j Database Storage**: Store the extracted embeddings in a Neo4j graph database for efficient querying and retrieval. Stay tuned for some interesting optimizations that are planned. | ||
|
||
## Future Plans | ||
- **Julia port**: As Julia programing lenguage annouced the improved exeutable compilation into small files, it will be a perfect use for production deployment of the system. | ||
- **Interfaces**: Currently Context Search is designed to be used in the form of Python package, but the goal is to make it as accessible as possible so that even non-technical users will be able to search through their documents. According to the development plan, the REST API will be the first to be implemented. | ||
- **Julia port**: Since the Julia programming language has announced improved compilation into small executables, it would be a good fit for production deployment of the system. | ||
- **Interfaces**: Currently, Context Search is designed to be used as a Python package, but the goal is to make it as accessible as possible so that even non-technical users can search through their documents. According to the development plan, the REST API will be the first to be implemented. | ||
- **File type expansion**: The project is starting with minimal complexity: one major file type (PDF), one database, simple execution paths, and high modularity. As the system matures, support will expand to other document formats (docx, md, pptx). | ||
- **Interface for LLM-grounding**: Even though the proejct started as strictly tool for LLM, the intrest about the sematnic search possibility amongs my friends and family made me focus more on this side of the proejct. However, the plan did not changed, system will support direct LLM integration. | ||
- **Database independence**: The end goal is to prepare adapters and adapter creation manual for integrating different databases. The chages in the industry make it curently impossible to predict what will be the new state-of-the-art when it comes to vector and graph databases. | ||
- **Interface for LLM-grounding**: Even though the project started strictly as a tool for LLMs, the interest in semantic search amongst my friends and family made me focus more on this side of the project. However, the plan has not changed: the system will support direct LLM integration. | ||
- **Database independence**: The end goal is to prepare adapters and adapter creation manuals for integrating different databases. The changes in the industry make it currently impossible to predict what will be the new state-of-the-art when it comes to vector and graph databases. | ||
|
||
## Getting Started | ||
### Prerequisites | ||
|
@@ -44,21 +44,21 @@ The Context Search functionality searches for documents in the given directories | |
|
||
### Build the project | ||
|
||
The project currently is not mature enough to be sumbitted to PYPI, thats why the preferable way of running Ragger is to clone the solution, run `pip install .` and modify the config to match your system needs. If you have tesseract and poppler in the PATH, the Ragger will read it by itself. | ||
The project is currently not mature enough to be submitted to PyPI, so the preferred way of running Ragger is to clone the repository, run `pip install .`, and modify the config to match your system's needs. If you have tesseract and poppler in the PATH, Ragger will find them by itself. | ||
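Before editing the config, it can help to confirm that the external tools are actually visible on the PATH. The sketch below is an assumption about how such a check could look and is not taken from the project: `find_tools` is a hypothetical helper, and `pdftoppm` is used here as a representative Poppler executable.

```python
import shutil
from typing import Optional


def find_tools(names: tuple[str, ...] = ("tesseract", "pdftoppm")) -> dict[str, Optional[str]]:
    """Map each executable name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


for tool, path in find_tools().items():
    print(f"{tool}: {path or 'not found on PATH'}")
```

If either tool is reported as missing, its location would need to be set explicitly in the config instead.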
|
||
### Alternative build (Docker) | ||
|
||
Use docker compose files which are ready to use without any tinkering with config file. | ||
Use the provided Docker Compose files, which are ready to use without any tinkering with the config file. | ||
|
||
### Run semantic search | ||
|
||
Refer to `examples/example_submit.py` on how to upload files to the database and to `examples/example_retrieve.py` on how to retrieve data. The extensive tutorial is "in-progress". | ||
Refer to `examples/example_submit.py` on how to upload files to the database and to `examples/example_retrieve.py` on how to retrieve data. The extensive tutorial is "in progress". | ||
|
||
## Contributing | ||
Please, hold on with the contribution until the first major release. Feel free to fork and initize discussion if you want to. Always happy to hear some voices of reason! | ||
Please hold off on contributions until the first major release. Feel free to fork and initiate a discussion if you want to. Always happy to hear some voices of reason! | ||
|
||
## License | ||
This project is licensed under the GPL-3.0 License. See the LICENSE file for details. | ||
|
||
## Contact | ||
For any questions or suggestions, please approach me directly, open issue or e-mail me via mail [email protected]. | ||
For any questions or suggestions, please approach me directly, open an issue, or e-mail me at [email protected]. |