We're Lumina. We've built a search engine that's 5x more relevant than Google Scholar. You can check us out at lumina.sh. We achieved this by bringing state-of-the-art search technology (the best in dense and sparse vector embeddings) to academic research.
While search is one problem, sourcing high-quality data is another. We needed to process millions of PDFs in-house to build Lumina, and we found that existing solutions for extracting structured information from PDFs were too slow and too expensive ($$ per page).
Chunk my docs provides a self-hostable solution that leverages state-of-the-art (SOTA) vision models for segment extraction and OCR, unifying the output through a Rust Actix server. This setup processes PDFs and extracts segments at approximately 5 pages per second on a single NVIDIA L4 instance, offering a cost-effective, scalable path to high-accuracy bounding-box segment extraction and OCR. Models are available for both GPU and CPU environments. Try the UI on chunkr.ai!
Docs: https://docs.chunkr.ai/introduction
- Go to chunkr.ai
- Make an account and copy your API key
- Create a task:
```bash
curl -X POST https://api.chunkr.ai/api/v1/task \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: ${YOUR_API_KEY}" \
  -F "file=@/path/to/your/file" \
  -F "model=HighQuality" \
  -F "target_chunk_length=512" \
  -F "ocr_strategy=Auto"
```
- Poll your created task (a combined create-and-poll sketch follows these steps):
```bash
curl -X GET https://api.chunkr.ai/api/v1/task/${TASK_ID} \
  -H "Authorization: ${YOUR_API_KEY}"
```
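Since tasks run asynchronously, a small script can create a task, capture its id, and poll until it finishes. Below is a minimal sketch assuming the response JSON exposes `task_id` and `status` fields and that `Succeeded`/`Failed` are the terminal statuses; these names are assumptions, so verify them against the API docs. Requires `jq`.

```bash
#!/usr/bin/env bash
# Sketch only: "task_id", "status", and the terminal values
# "Succeeded"/"Failed" are assumptions about the response shape;
# confirm them in the API docs before relying on this.
set -euo pipefail

# Create a task and capture its id from the JSON response.
TASK_ID=$(curl -s -X POST https://api.chunkr.ai/api/v1/task \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: ${YOUR_API_KEY}" \
  -F "file=@/path/to/your/file" \
  -F "model=HighQuality" \
  -F "target_chunk_length=512" \
  -F "ocr_strategy=Auto" | jq -r '.task_id')

# Poll every 2 seconds until the task reaches a terminal status.
while true; do
  STATUS=$(curl -s "https://api.chunkr.ai/api/v1/task/${TASK_ID}" \
    -H "Authorization: ${YOUR_API_KEY}" | jq -r '.status')
  echo "task ${TASK_ID}: ${STATUS}"
  case "${STATUS}" in
    Succeeded|Failed) break ;;
  esac
  sleep 2
done
```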
- To self-host, you'll need Kubernetes (K8s) and Docker.
- Follow the steps in self-deployment.md
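Before starting, it can save time to confirm both prerequisites are available on your machine (standard CLI installs assumed):

```bash
# Quick sanity check for the prerequisites named above.
docker --version          # Docker CLI
kubectl version --client  # Kubernetes CLI
```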
This project is dual-licensed:
- GNU Affero General Public License v3.0 (AGPL-3.0)
- Commercial License
To use Chunkr privately without complying with the AGPL-3.0 license terms, you can contact us or visit our website.