AutoBench: Benchmarking Automation for Intelligent Document Processing (IDP) with confidence

AutoBench is an automation-centric benchmark designed to evaluate large language models (LLMs) and vision-language models (VLMs) for Intelligent Document Processing (IDP).
In addition to assessing model performance on structured information extraction and reporting key metrics, it uniquely emphasizes prediction confidence scores to enhance automation in IDP pipelines.

What sets AutoBench apart is its focus on confidence scores, essential for automating IDP workflows. Confidence scores enable systems to:

  • Automate validation of high-confidence extractions, minimizing manual review.
  • Intelligently route low-confidence predictions for human-in-the-loop verification, streamlining exception handling.
  • Optimize IDP pipelines by leveraging confidence to prioritize document routing and manual intervention.
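
The sketch below illustrates this routing idea. It is not part of AutoBench itself; the 0.90 threshold and the field names are hypothetical.

# Hypothetical sketch of confidence-based routing; the 0.90 threshold and the
# field names are illustrative and not part of AutoBench.
CONFIDENCE_THRESHOLD = 0.90

def route_fields(extracted_fields):
    """Split extracted fields into auto-accepted and human-review buckets."""
    auto_accepted, needs_review = {}, {}
    for name, (value, confidence) in extracted_fields.items():
        if confidence >= CONFIDENCE_THRESHOLD:
            auto_accepted[name] = value      # straight-through processing
        else:
            needs_review[name] = value       # route to human-in-the-loop review
    return auto_accepted, needs_review

auto, review = route_fields({
    "invoice_number": ("INV-1043", 0.98),
    "total_amount": ("1250.00", 0.72),
})
print("auto-accepted:", auto)   # {'invoice_number': 'INV-1043'}
print("needs review:", review)  # {'total_amount': '1250.00'}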

Figure: Benchmark Results – A comparison of confidence benchmarks and detailed performance metrics.
For a full analysis, visit this detailed report.

Setup

1. Install Dependencies

To install the required dependencies, run:

pip install -e .

2. Configure API Keys and Base URLs

API keys and base URLs must be set in a .env file.

Steps to Configure:

  1. Create the .env file: Copy .env.example to .env in the project root directory.

    cp .env.example .env
  2. Set API keys and base URLs: Edit the .env file and provide the necessary values for the models you intend to benchmark.

    Example .env file:

    OPENAI_API_KEY=sk-...  # Your OpenAI API Key
    CLAUDE_API_KEY=sk-ant-...  # Your Anthropic Claude API Key
    QWEN2_API_BASE_URL=http://your-qwen2-api:8000/v1  # API Base URL for Qwen2
    GPT4V_API_BASE_URL=https://api.openai.com/v1  # API Base URL for GPT-4V
    # Add API keys and base URLs for other models as needed
    

    Important: Ensure both API keys and base URLs are correctly set for each model before running benchmarks. Refer to .env.example for required variable names.
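
For reference, the snippet below is a minimal sketch of how such variables are typically consumed from a .env file, assuming the python-dotenv package; the exact loading mechanism inside nn-auto-bench (see nnautobench/config/config.py) may differ.

# Minimal sketch, assuming python-dotenv; nn-auto-bench's actual loading
# mechanism (see nnautobench/config/config.py) may differ.
import os
from dotenv import load_dotenv

load_dotenv()  # reads the .env file from the current working directory

openai_key = os.getenv("OPENAI_API_KEY")
gpt4v_base_url = os.getenv("GPT4V_API_BASE_URL")
if not openai_key:
    raise RuntimeError("OPENAI_API_KEY is not set; check your .env file")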

3. Dataset Download

AutoBench uses a publicly available dataset on Hugging Face Hub. You can download the dataset using the provided download_dataset.sh script.

Dataset Download Steps:

  1. Make the script executable: Open your terminal, navigate to the tools/ directory, and make the download_dataset.sh script executable:

    chmod +x download_dataset.sh
  2. Run the script: Execute the script from the tools/ directory:

    ./download_dataset.sh

    The script will:

    • Check whether the data/ directory already exists and, if it does, ask whether you want to remove it and re-download the dataset.
    • Download the nanonets/nn-auto-bench-ds dataset from Hugging Face Hub to the data/ directory in the project root.

    After successful execution, the dataset will be located in the data/ subdirectory.
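
If you prefer not to use the shell script, the same dataset can be fetched directly with the huggingface_hub Python client. The sketch below is an equivalent manual download; it does not replicate the script's existing-directory check.

# Sketch of a manual download with huggingface_hub; unlike download_dataset.sh,
# it does not check for (or remove) an existing data/ directory.
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="nanonets/nn-auto-bench-ds",
    repo_type="dataset",
    local_dir="data",  # places the dataset in data/ in the project root
)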

4. Run Benchmarks

The benchmarking process is executed using benchmark.py.

Usage

Run the benchmark script with the following command:

python tools/benchmark.py <model_name> --input_file <path_to_input_jsonl_file> [options]

Before running benchmarks, ensure you have downloaded the dataset using the download_dataset.sh script as described in the "Dataset Download" section above. The --input_file argument in the benchmark.py command should then point to the appropriate JSONL file within the downloaded dataset directory (e.g., data/metadata.jsonl).

Arguments:

  • <model_name>: The model to evaluate (qwen2, gpt4v, gpt4o, etc.).
  • --input_file <path>: Path to the JSONL dataset containing input data.

Optional Parameters:

  • --max_workers <int>: Number of worker threads (default: 16).
  • --few_shot <int>: Number of few-shot examples (default: 1).
  • --conf_score_method <string>: Method for computing confidence scores (prob, yes_no, or consistency; default: prob). A conceptual sketch of the prob method follows this list.
  • --limit <int>: Number of document samples to benchmark.
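
The scoring methods are implemented in nnautobench/utils/. As a rough intuition for the default prob method, a field's confidence can be derived from the token log-probabilities the model returns for its answer, along the lines of this illustrative sketch (not the benchmark's exact implementation):

# Illustrative sketch of a prob-style confidence score; AutoBench's actual
# implementation in nnautobench/utils/ may aggregate log-probabilities differently.
import math

def prob_confidence(token_logprobs):
    """Geometric-mean probability of the answer tokens, used as a confidence score."""
    if not token_logprobs:
        return 0.0
    avg_logprob = sum(token_logprobs) / len(token_logprobs)
    return math.exp(avg_logprob)

# e.g. log-probabilities returned by the model API for an extracted value's tokens
print(prob_confidence([-0.05, -0.10, -0.02]))  # ~0.945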

Example:

python tools/benchmark.py gpt4o --input_file data/metadata.jsonl --max_workers 32 --few_shot 1 --conf_score_method prob --limit 10

Output

Benchmark results are saved as JSONL files in the results/ directory, following the naming convention:

benchmark_results_<model_name>_<dataset_name>_<layout>_<conf_score_method>_<timestamp>.jsonl

Each result entry includes:

  • Execution time and API usage.
  • Input paths and annotations.
  • Prompts and raw model responses.
  • Performance metrics: parsing_accuracy, predicted_field_conf_scores, file_accuracy, etc.

Summary metrics are also printed to the console.
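
Because each result file is plain JSONL, a quick post-hoc summary can also be computed with a few lines of Python. The sketch below assumes the per-entry metric names listed above:

# Sketch for summarizing a results file; assumes entries carry the
# parsing_accuracy and file_accuracy fields listed above.
import json
import sys

path = sys.argv[1]  # e.g. a results/benchmark_results_*.jsonl file
with open(path) as f:
    entries = [json.loads(line) for line in f if line.strip()]

for metric in ("parsing_accuracy", "file_accuracy"):
    values = [e[metric] for e in entries if metric in e]
    if values:
        print(f"{metric}: {sum(values) / len(values):.4f} over {len(values)} entries")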

Note on Result Variability: Due to the inherent stochastic nature of Large Language Models, slight variations in benchmark results may be observed across different runs. For reference, our benchmark results are available in the results folder, providing a consistent baseline for comparison.

Code Structure

The repository is organized as follows:

  • nnautobench/config/ – Configuration files (config.py).
  • nnautobench/inference/ – Inference logic (predictor.py).
  • nnautobench/models/ – Vision-Language model implementations.
  • nnautobench/utils/ – Utility functions for JSON handling, metric computations, and prompt management.
  • results/ – Directory for storing benchmark outputs.
  • tools/benchmark.py – Main script for running benchmarks.

Model Versions and Benchmark Results

The benchmark was conducted using the following model versions. Links to our benchmark result files are provided for each model to facilitate result verification and comparison. Note: Due to the stochastic nature of LLMs, your benchmark runs may exhibit slight variations from our provided results.

Model Name      | Specific Version           | Model Type | Benchmark Results
Qwen2           | Qwen2.5_72B                | LLM        | Qwen2.5 (prob)
Pixtral         | Pixtral-12B-2409           | VLM        | Pixtral (prob)
GPT-4V          | gpt-4o-2024-11-20          | LLM        | GPT4V
GPT-4o          | gpt-4o-2024-11-20          | LLM        | GPT4o (Prob)
DSv3            | deepseekv3                 | LLM        | DeepSeekV3
Gemini Flash 2  | gemini-2.0-flash           | LLM        | Gemini Flash 2.0 (prob)
Claude 3.5      | claude-3-5-sonnet-20241022 | LLM        | Claude 3.5 (prob)
Claude 3.7      | claude-3-7-sonnet-20250219 | LLM        | Claude 3.7 (prob)
Mistral Large   | mistral-large-latest       | LLM        | Mistral Large (prob)
Nanonets        | nanonets-internal-model    | Prop.      | Nanonets

Future Improvements

  • Add more models to the benchmark.
  • Add more confidence scoring methods.

Reach out to us at [email protected] with any questions or feedback.


AutoBench provides an enhanced approach to evaluating large models for the automation of document intelligence tasks.
