AutoBench is an automation-centric benchmark designed to evaluate Large Models (LLMs and VLMs) for Intelligent Document Processing (IDP).
In addition to assessing LM performance on structured information extraction and reporting key metrics, it uniquely emphasizes prediction confidence scores to enhance automation in IDP pipelines.
This focus on confidence scores is what sets AutoBench apart: they are essential for automating IDP workflows, enabling systems to:
- Automate validation of high-confidence extractions, minimizing manual review.
- Intelligently route low-confidence predictions for human-in-the-loop verification, streamlining exception handling.
- Optimize IDP pipelines by leveraging confidence to prioritize document routing and manual intervention.
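As a rough illustration of the last point (this helper is not part of AutoBench; the field names and the 0.9 threshold are hypothetical), a downstream IDP pipeline might route extracted fields like this:

```python
# Hypothetical routing helper: split extracted fields into auto-approved vs.
# human-review buckets based on per-field confidence scores.
# The 0.9 threshold and the field names are illustrative, not AutoBench defaults.
from typing import Dict, List, Tuple

CONF_THRESHOLD = 0.9  # higher => fewer errors pass through, but less automation


def route_fields(conf_scores: Dict[str, float]) -> Tuple[List[str], List[str]]:
    auto_ok = [field for field, score in conf_scores.items() if score >= CONF_THRESHOLD]
    review = [field for field, score in conf_scores.items() if score < CONF_THRESHOLD]
    return auto_ok, review


auto_ok, review = route_fields({"invoice_number": 0.98, "total_amount": 0.71})
print("auto-approved:", auto_ok)  # ['invoice_number']
print("needs review:", review)    # ['total_amount']
```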
Figure: Benchmark Results – A comparison of confidence benchmarks and detailed performance metrics.
For a full analysis, see the detailed report.
To install the required dependencies, run:
pip install -e .
API keys and base URLs must be set using a `.env` file.

- Create the `.env` file: Copy `.env.example` to `.env` in the project root directory:

  cp .env.example .env

- Set API keys and base URLs: Edit the `.env` file and provide the necessary values for the models you intend to benchmark. Example `.env` file:

  OPENAI_API_KEY=sk-...                             # Your OpenAI API Key
  CLAUDE_API_KEY=sk-ant-...                         # Your Anthropic Claude API Key
  QWEN2_API_BASE_URL=http://your-qwen2-api:8000/v1  # API Base URL for Qwen2
  GPT4V_API_BASE_URL=https://api.openai.com/v1      # API Base URL for GPT-4V
  # Add API keys and base URLs for other models as needed

Important: Ensure both API keys and base URLs are correctly set for each model before running benchmarks. Refer to `.env.example` for the required variable names.
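If you want to verify that the variables are visible before launching a benchmark, the quick check below uses `python-dotenv`; whether AutoBench itself loads the file this way is an assumption, and the variable names simply mirror `.env.example`.

```python
# Sanity check that .env variables are visible to Python.
# Assumes `pip install python-dotenv`; variable names mirror .env.example.
import os

from dotenv import load_dotenv

load_dotenv()  # reads .env from the current working directory

for var in ("OPENAI_API_KEY", "CLAUDE_API_KEY", "QWEN2_API_BASE_URL", "GPT4V_API_BASE_URL"):
    print(f"{var}: {'set' if os.getenv(var) else 'MISSING'}")
```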
AutoBench uses a publicly available dataset hosted on the Hugging Face Hub. You can download it using the provided `download_dataset.sh` script.
- Make the script executable: Open your terminal, navigate to the `tools/` directory, and make the `download_dataset.sh` script executable:

  chmod +x download_dataset.sh

- Run the script: Execute the script from the `tools/` directory:

  ./download_dataset.sh
The script will:

- Check if the `data/` directory already exists and, if it does, ask whether you want to remove and re-download the dataset.
- Download the `nanonets/nn-auto-bench-ds` dataset from the Hugging Face Hub to the `data/` directory in the project root.

After successful execution, the dataset will be located in the `data/` subdirectory.
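If you prefer not to use the shell script (for example, on a platform without bash), the same dataset can be fetched directly with the `huggingface_hub` Python library. This is an alternative sketch rather than part of the repository's tooling, and it assumes the downloaded files can simply live under `data/` in the project root.

```python
# Alternative to tools/download_dataset.sh: fetch the dataset via huggingface_hub.
# Assumes `pip install huggingface_hub`; the target directory mirrors the script's data/ layout.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nanonets/nn-auto-bench-ds",  # dataset used by AutoBench
    repo_type="dataset",
    local_dir="data",                     # data/ directory in the project root
)
```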
The benchmarking process is executed using `benchmark.py`.
Run the benchmark script with the following command:
python tools/benchmark.py <model_name> --input_file <path_to_input_jsonl_file> [options]
Before running benchmarks, ensure you have downloaded the dataset using the `download_dataset.sh` script as described in the "Dataset Download" section above. The `--input_file` argument to `benchmark.py` should then point to the appropriate JSONL file within the downloaded dataset directory (e.g., `data/metadata.jsonl`).
- `<model_name>`: The model to evaluate (`qwen2`, `gpt4v`, `gpt4o`, etc.).
- `--input_file <path>`: Path to the JSONL dataset containing the input data.
- `--max_workers <int>`: Number of worker threads (default: 16).
- `--few_shot <int>`: Number of few-shot examples (default: 1).
- `--conf_score_method <string>`: Method for computing confidence scores (`prob`, `yes_no`, or `consistency`; default: `prob`). See the sketch after the example command below.
- `--limit <int>`: Number of document samples to benchmark.
python tools/benchmark.py gpt4o --input_file data/metadata.jsonl --max_workers 32 --few_shot 1 --conf_score_method prob --limit 10
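For intuition only, the sketch below shows one common way a log-probability-based confidence score (in the spirit of `prob`) can be computed: average the per-token log probabilities returned by an OpenAI-compatible API and exponentiate back to a 0-1 score. This illustrates the general technique under that assumption; it is not the exact logic used in this repository.

```python
# Illustrative "prob"-style confidence: aggregate per-token log probabilities
# (as returned by OpenAI-compatible APIs with logprobs enabled) into a 0-1 score.
# This is NOT the repository's implementation, just the general idea.
import math
from typing import List


def logprob_confidence(token_logprobs: List[float]) -> float:
    """Geometric-mean token probability, i.e. exp(mean of log-probs)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))


# Example: three tokens with probabilities of roughly 0.95, 0.90, and 0.99.
print(round(logprob_confidence([-0.051, -0.105, -0.010]), 3))  # ~0.946
```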
Benchmark results are saved as JSONL files in the `results/` directory, following the naming convention:

benchmark_results_<model_name>_<dataset_name>_<layout>_<conf_score_method>_<timestamp>.jsonl
Each result entry includes:
- Execution time and API usage.
- Input paths and annotations.
- Prompts and raw model responses.
- Performance metrics: `parsing_accuracy`, `predicted_field_conf_scores`, `file_accuracy`, etc.
Summary metrics are also printed to the console.
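As a small post-processing sketch, the snippet below averages one of the metrics listed above across a results file. The metric field names follow this README, but the exact per-entry schema is an assumption, so missing keys are skipped.

```python
# Average a metric (e.g., parsing_accuracy) over a benchmark results JSONL file.
# Field names follow this README; the exact entry schema is assumed, so missing
# keys are skipped rather than counted as zero.
import json
from pathlib import Path


def average_metric(results_path: str, metric: str = "parsing_accuracy") -> float:
    values = []
    for line in Path(results_path).read_text().splitlines():
        if not line.strip():
            continue
        entry = json.loads(line)
        if metric in entry:
            values.append(float(entry[metric]))
    return sum(values) / len(values) if values else 0.0


# Hypothetical file name following the documented naming convention:
# print(average_metric("results/benchmark_results_gpt4o_<dataset>_<layout>_prob_<timestamp>.jsonl"))
```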
Note on Result Variability: Due to the inherent stochastic nature of Large Language Models, slight variations in benchmark results may be observed across different runs. For reference, our benchmark results are available in the `results/` folder, providing a consistent baseline for comparison.
The repository is organized as follows:
- `nnautobench/config/` – Configuration files (`config.py`).
- `nnautobench/inference/` – Inference logic (`predictor.py`).
- `nnautobench/models/` – Vision-Language model implementations.
- `nnautobench/utils/` – Utility functions for JSON handling, metric computation, and prompt management.
- `results/` – Directory for storing benchmark outputs.
- `tools/benchmark.py` – Main script for running benchmarks.
The benchmark was conducted using the following model versions. Links to our benchmark result files are provided for each model to facilitate result verification and comparison. Note: Due to the stochastic nature of LLMs, your benchmark runs may exhibit slight variations from our provided results.
| Model Name | Specific Version | Model Type | Benchmark Results |
|---|---|---|---|
| Qwen2 | Qwen2.5_72B | LLM | Qwen2.5 (prob) |
| Pixtral | Pixtral-12B-2409 | VLM | Pixtral (prob) |
| GPT-4V | gpt-4o-2024-11-20 | LLM | GPT4V |
| GPT-4o | gpt-4o-2024-11-20 | LLM | GPT4o (prob) |
| DSv3 | deepseekv3 | LLM | DeepSeekV3 |
| Gemini Flash 2 | gemini-2.0-flash | LLM | Gemini Flash 2.0 (prob) |
| Claude 3.5 | claude-3-5-sonnet-20241022 | LLM | Claude 3.5 (prob) |
| Claude 3.7 | claude-3-7-sonnet-20250219 | LLM | Claude 3.7 (prob) |
| Mistral Large | mistral-large-latest | LLM | Mistral Large (prob) |
| Nanonets | nanonets-internal-model | Prop. | Nanonets |
- Add more models to the benchmark.
- Add more confidence scoring methods.
Reach out to us at [email protected] with any questions or feedback.
AutoBench provides an enhanced approach to evaluating Large Models for automating document intelligence tasks.