This is a proof-of-concept tool for transforming messy text documents into structured data using local large language models (LLMs) and a JSON schema.
It has a single Python dependency, `llama-cpp-python`. Install it with `pip install -r requirements.txt`.
Then you need to do a few things:

- Download a GGUF model you'll use. Perhaps Mixtral 8x7B?
- Split your source text file into individual records. You can use the `splitdoc.py` script to help do that.
- Create a JSON schema describing the fields you'd like the model to extract from each text record. You can see an example in `schema.json`, and a sketch just after this list.
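For illustration, a minimal schema might look something like the following. The `name` and `date` fields here are hypothetical; describe whatever fields your documents actually contain.

```json
{
  "type": "object",
  "properties": {
    "name": {
      "type": "string",
      "description": "Full name of the person in the record"
    },
    "date": {
      "type": "string",
      "description": "Date of the event, YYYY-MM-DD"
    }
  },
  "required": ["name"]
}
```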
Once you've done this, you simply need to transform each record into JSON using the `llm_extract.py` script:
`./llm_extract.py --model ../llama/models/mistral-7b-v0.1.Q5_K_M.gguf data/records/document_1.json data/records_json/document_1.records.json`
This will iterate over each record inside `document_1.json`, ask the LLM to extract a record matching the JSON schema in `schema.json`, and write it out to `document_1.records.json`.
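Conceptually, the loop looks something like the sketch below. This is not the actual `llm_extract.py` source; the prompt wording, the assumption that a record file is a JSON array of strings, and the `max_tokens`/`stop` values are all illustrative.

```python
import json
from llama_cpp import Llama

llm = Llama(model_path="../llama/models/mistral-7b-v0.1.Q5_K_M.gguf", n_ctx=2048)
schema = json.dumps(json.load(open("schema.json")))

# Assumes the record file is a JSON array of text strings.
records = json.load(open("data/records/document_1.json"))

results = []
for text in records:
    # Completion-style prompt: show the schema, then the record, then
    # leave the model to complete the JSON.
    prompt = (
        f"Extract a JSON object matching this schema:\n{schema}\n\n"
        f"Text:\n{text}\n\nJSON:"
    )
    out = llm(prompt, max_tokens=512, stop=["\n\n"])
    # Real code should handle malformed model output here.
    results.append(json.loads(out["choices"][0]["text"]))

with open("data/records_json/document_1.records.json", "w") as f:
    json.dump(results, f, indent=2)
```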
You can see a full example pipeline of how I've turned a PDF into structured data in the `Makefile` as well.
The script will trim any text records that are longer than the context length specified via `--n_ctx`. If your documents are too long, try a model capable of a larger context or manually truncate them yourself. The default strategy trims from the end, leaving the last 50 tokens.
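If you do truncate manually, one approach (matching my reading of the default strategy; the 50-token tail comes from above, the rest is illustrative) is to drop tokens from the end while preserving the record's final 50 tokens, where `llm` is a `llama_cpp.Llama` instance:

```python
def truncate(llm, text: str, budget: int, tail: int = 50) -> str:
    """Trim `text` to `budget` tokens, keeping its final `tail` tokens."""
    tokens = llm.tokenize(text.encode("utf-8"))
    if len(tokens) <= budget:
        return text
    kept = tokens[: budget - tail] + tokens[-tail:]
    return llm.detokenize(kept).decode("utf-8", errors="ignore")
```

Note that the budget needs to leave room for the prompt template and the model's generated output, not just the record text.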
The prompt is baked into the `llm_extract.py` script and is oriented toward completion models. If you want to use instruct models, or models that require a specific format, you'll need to change the prompt yourself.
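For example, Mistral's instruct-tuned models expect an `[INST] ... [/INST]` wrapper, so an adapted prompt builder might look something like this (the wording is mine, not the script's):

```python
def build_instruct_prompt(schema_json: str, text: str) -> str:
    # [INST]/[/INST] is the Mistral Instruct format; other model families
    # (ChatML, Alpaca, etc.) expect different wrappers.
    return (
        "[INST] Extract a JSON object matching this schema:\n"
        f"{schema_json}\n\n"
        f"Text:\n{text} [/INST]"
    )
```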
Some models are better at this kind of task than others; if you aren't getting the results you want, try another model. You may also get better results by giving an example: the current prompt is zero-shot, and adding a single worked example makes it one-shot. An example can help your model learn how to extract details from your specific text, but it can also cause errors (e.g., the model starts to repeat the example rather than pulling from the document).
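As a sketch, a one-shot variant just prepends a worked example ahead of the real record. The example text and output below are entirely made up; substitute something representative of your documents.

```python
EXAMPLE = (
    "Text:\nJane Doe spoke at the council meeting on March 3, 2021.\n\n"
    'JSON: {"name": "Jane Doe", "date": "2021-03-03"}\n\n'
)

def build_one_shot_prompt(schema_json: str, text: str) -> str:
    # A model prone to parroting may echo "Jane Doe" here instead of
    # extracting from `text`; that's the failure mode mentioned above.
    return (
        f"Extract a JSON object matching this schema:\n{schema_json}\n\n"
        f"{EXAMPLE}"
        f"Text:\n{text}\n\nJSON:"
    )
```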