We're Lumina. We've built a search engine that's 5x more relevant than Google Scholar. You can check us out at lumina.sh. We achieved this by bringing state-of-the-art search technology (the best in dense and sparse vector embeddings) to academic research.
While search is one problem, sourcing high-quality data is another. We needed to process millions of PDFs in-house to build Lumina, and we found that existing solutions for extracting structured information from PDFs were too slow and too expensive ($$ per page).
Chunk my docs provides a self-hostable solution that leverages state-of-the-art (SOTA) vision models for segment extraction and OCR, unifying the output through a Rust Actix server. This setup processes PDFs and extracts segments at approximately 5 pages per second on a single NVIDIA L4 instance, offering a cost-effective, scalable path to high-accuracy bounding-box segment extraction and OCR. Models are available for both GPU and CPU environments. Try the UI on chunkr.ai!
Docs: https://docs.chunkr.ai/introduction
- Go to chunkr.ai
- Make an account and copy your API key
- Create a task:
```bash
curl -X POST https://api.chunkr.ai/api/v1/task \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: ${YOUR_API_KEY}" \
  -F "file=@/path/to/your/file" \
  -F "model=HighQuality" \
  -F "target_chunk_length=512" \
  -F "ocr_strategy=Auto"
```
- Poll your created task (a combined create-and-poll sketch follows these steps):
```bash
curl -X GET https://api.chunkr.ai/api/v1/task/${TASK_ID} \
  -H "Authorization: ${YOUR_API_KEY}"
```
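Since tasks run asynchronously, a small script can create a task, capture its id, and poll until it finishes. Below is a minimal sketch assuming the response JSON exposes `task_id` and `status` fields and that `Succeeded`/`Failed` are the terminal statuses; these names are assumptions, so verify them against the API docs. Requires `jq`.

```bash
#!/usr/bin/env bash
# Sketch only: "task_id", "status", and the terminal values
# "Succeeded"/"Failed" are assumptions about the response shape;
# confirm them in the API docs before relying on this.
set -euo pipefail

# Create a task and capture its id from the JSON response.
TASK_ID=$(curl -s -X POST https://api.chunkr.ai/api/v1/task \
  -H "Content-Type: multipart/form-data" \
  -H "Authorization: ${YOUR_API_KEY}" \
  -F "file=@/path/to/your/file" \
  -F "model=HighQuality" \
  -F "target_chunk_length=512" \
  -F "ocr_strategy=Auto" | jq -r '.task_id')

# Poll every 2 seconds until the task reaches a terminal status.
while true; do
  STATUS=$(curl -s "https://api.chunkr.ai/api/v1/task/${TASK_ID}" \
    -H "Authorization: ${YOUR_API_KEY}" | jq -r '.status')
  echo "task ${TASK_ID}: ${STATUS}"
  case "${STATUS}" in
    Succeeded|Failed) break ;;
  esac
  sleep 2
done
```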
- To self-host, you'll need Kubernetes (K8s) and Docker.
- Follow the steps in self-deployment.md
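Before starting, it can save time to confirm both prerequisites are available on your machine (standard CLI installs assumed):

```bash
# Quick sanity check for the prerequisites named above.
docker --version          # Docker CLI
kubectl version --client  # Kubernetes CLI
```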
This project is dual-licensed:
- GNU Affero General Public License v3.0 (AGPL-3.0)
- Commercial License
To use Chunkr privately without complying with the AGPL-3.0 license terms, you can contact us or visit our website.