Skip to content

Commit

Permalink
Invoice Extraction (#84)
Browse files Browse the repository at this point in the history
* Create temp

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/invoice_extractions/temp

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/invoice_extractions directory

Signed-off-by: Surya Deep Singh <[email protected]>

* Create Invoices Extraction

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoices Extraction

Signed-off-by: Surya Deep Singh <[email protected]>

* Create Test

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

* Create test

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Invoices/test

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Test

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Invoices/12855.pdf

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Invoices/2006969.pdf

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Invoices/6900026063.pdf

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Invoices/6900026069.pdf

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Invoices/6905212892.pdf

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Invoices/904000640.pdf

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Invoices/PL_IERPIC_MISSING.pdf

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

* Update vanilla_notebooks.txt

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

* Update README.md

Signed-off-by: Surya Deep Singh <[email protected]>

* Update README.md

Signed-off-by: Surya Deep Singh <[email protected]>

* Update Granite_Recipes_Invoices.ipynb

Signed-off-by: Surya Deep Singh <[email protected]>

* Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb

Signed-off-by: Surya Deep Singh <[email protected]>

* Add files via upload

Signed-off-by: Surya Deep Singh <[email protected]>

---------

Signed-off-by: Surya Deep Singh <[email protected]>
Co-authored-by: BJ Hargrave <[email protected]>
  • Loading branch information
SinghSuryaDeep and bjhargrave authored Jan 9, 2025
1 parent 075a03d commit b42e55b
Show file tree
Hide file tree
Showing 8 changed files with 383 additions and 0 deletions.
1 change: 1 addition & 0 deletions .github/notebook_lists/vanilla_notebooks.txt
Original file line number Diff line number Diff line change
Expand Up @@ -10,3 +10,4 @@ recipes/Granite_Guardian/Granite_Guardian_Detailed_Guide.ipynb
recipes/Granite_Guardian/Granite_Guardian_Usage_Governance_Workflow.ipynb
recipes/Granite_Guardian/Granite_Guardian_Watsonx_Usage.ipynb
recipes/AI-Agents/travel_planner_agent.ipynb
recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb
4 changes: 4 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -18,6 +18,10 @@ The "Recipes" in the Granite Snack Cookbook showcase the essential capabilities
<a target="_blank" href="https://colab.research.google.com/github/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/Function-Calling/Function_Calling.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>
1. [Invoice Extraction](recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb)
<a target="_blank" href="https://colab.research.google.com/github/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb">
<img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
</a>

### Compound Systems

Expand Down
378 changes: 378 additions & 0 deletions recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb
Original file line number Diff line number Diff line change
@@ -0,0 +1,378 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "766392dd-c422-424e-a752-36209712ba2f",
"metadata": {},
"source": [
"# Invoice Data Extraction using IBM Granite LLM from watsonx\n",
"Author: [@Surya Deep Singh](https://www.linkedin.com/in/surya-deep-singh-b9b94813a/)\n"
]
},
{
"cell_type": "markdown",
"id": "05da18f3-f2a4-42ab-b4b6-fa910140560c",
"metadata": {},
"source": [
"## Description"
]
},
{
"cell_type": "markdown",
"id": "40c261d0-1b1a-4333-9f22-a4a90102934b",
"metadata": {},
"source": [
"This Jupyter Notebook is designed to process and extract key details from invoices stored as PDFs. It uses IBM watsonx's granite-3-8b-instruct LLM to interpret the invoice data and extract specific fields such as invoice number, total amounts and customer details, etc. Extracted data is validated and saved into a structured DataFrame for further analysis."
]
},
{
"cell_type": "markdown",
"id": "23ecec96-cb84-4a33-9c36-f2670970f8e5",
"metadata": {},
"source": [
"## Step 1: Setup and Installation"
]
},
{
"cell_type": "markdown",
"id": "841608fe-6ffd-41b7-a0b4-0de98a6f1869",
"metadata": {},
"source": [
"We start by installing the required dependencies. This includes libraries for document conversion, IBM watsonx LLM integration, and data handling."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1104fe3c-46ff-4e3b-9781-6aa87f20037f",
"metadata": {},
"outputs": [],
"source": [
"!pip install -q git+https://github.com/ibm-granite-community/utils \\\n",
" docling==2.14.0 \\\n",
" langchain==0.2.12 \\\n",
" langchain-ibm==0.1.11 \\\n",
" langchain-community==0.2.11 \\\n",
" langchain-core==0.2.28 \\\n",
" ibm-watsonx-ai==1.1.2 \\\n",
" transformers==4.47.1\n"
]
},
{
"cell_type": "markdown",
"id": "8772cd59-7ba2-40f2-adc5-5e07467d4876",
"metadata": {},
"source": [
"## Step 2. Import Libraries and Load Environment Variables"
]
},
{
"cell_type": "markdown",
"id": "fde5509a-e557-4896-8ee8-f41c5146ee01",
"metadata": {},
"source": [
"Import necessary libraries and load API credentials and configurations using the dotenv library."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d85822d3-6b83-4a93-840f-f3d9001bf9df",
"metadata": {},
"outputs": [],
"source": [
"from docling.document_converter import DocumentConverter\n",
"from langchain_ibm import WatsonxLLM\n",
"from langchain_core.prompts import PromptTemplate\n",
"import os\n",
"import json\n",
"import pandas as pd\n",
"import re\n",
"import os\n",
"import requests\n",
"from dotenv import load_dotenv, dotenv_values \n",
"load_dotenv()\n",
"from ibm_granite_community.notebook_utils import get_env_var"
]
},
{
"cell_type": "markdown",
"id": "10cdbf8b-a495-4159-a2fc-f49803610e73",
"metadata": {},
"source": [
"## Step 3: Define the InvoiceProcessor Class"
]
},
{
"cell_type": "markdown",
"id": "cc609a2b-8344-4bea-9adc-164fc3e4623d",
"metadata": {},
"source": [
"The InvoiceProcessor class handles the processing of invoices, including document conversion, text extraction using IBM Docling and processing using Granite LLM"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9821baf1-2b02-4983-ae09-466442007ba2",
"metadata": {},
"outputs": [],
"source": [
"# Define the InvoiceProcessor class\n",
"class InvoiceProcessor:\n",
" def __init__(self, ibm_cloud_api_key, project_id, watson_url):\n",
" self.llm = WatsonxLLM(\n",
" model_id='ibm/granite-3-8b-instruct',\n",
" apikey=ibm_cloud_api_key,\n",
" project_id=project_id,\n",
" params={\n",
" \"decoding_method\": \"greedy\",\n",
" \"max_new_tokens\": 8000,\n",
" \"min_new_tokens\": 1,\n",
" \"repetition_penalty\": 1.01\n",
" },\n",
" url=watson_url\n",
" )\n",
" self.converter = DocumentConverter()\n",
"\n",
" \n",
"\n",
" def extract_invoice_data(self, source):\n",
" result = self.converter.convert(source)\n",
" markdown_output = result.document.export_to_markdown()\n",
" \n",
"\n",
" #prompt_template = PromptTemplate.from_template('''\n",
" prompt_template = PromptTemplate(\n",
" input_variables=[\"DOCUMENT\"],\n",
" template='''\n",
" <|start_of_role|>System<|end_of_role|> You are an AI assistant for processing invoices. Based on the provided invoice data, extract the 'Invoice Number', 'Total Net Amount', 'Total VAT or TAX or GST Amount', 'Total Amount' , 'Invoice Date', 'Purchase Order Number' and 'Customer number', without the currency values.\n",
"\n",
" |Instructions|\n",
" Identify and extract the following information:\n",
" - **Invoice Number**: The unique identifier for the invoice.\n",
" - **Net Amount**: The Total Net Amount indicated on the invoice.\n",
" - **VAT or TAX or GST Amount**: The Total VAT or TAX or GST Amount indicated on the invoice.\n",
" - **Total Amount**: The Total Cost indicated on the invoice.\n",
" - **Invoice Date**: The date the invoice was issued.\n",
" - **Purchase Order Number**: The unique identifier for the purchase order.\n",
" - **Customer Number**: The unique identifier for the customer.\n",
"\n",
" Invoice Data:\n",
" {DOCUMENT}\n",
"\n",
"\n",
" Strictly provide the extracted information in the following JSON format:\n",
"\n",
" ```json\n",
" {{\n",
" \"invoice_number\": \"extracted_invoice_number\",\n",
" \"net_amount\": \"extracted_new_amount\",\n",
" \"vat_or_tax_or_gst_amount\" : \"extracted_vat_or_tax_or_gst_amount\",\n",
" \"total_amount\": \"extracted_total_amount\",\n",
" \"invoice_date\": \"extracted_invoice_date\",\n",
" \"purchase_order_number\": \"extracted_purchase_order_number\",\n",
" \"customer_number\": \"extracted_customer_number\"\n",
" }}\n",
"\n",
" <|end_of_text|>\n",
"\n",
" <|start_of_role|>assistant<|end_of_role|>\n",
" ''')\n",
"\n",
" prompt = prompt_template.format(DOCUMENT=str(markdown_output).strip())\n",
" answer = self.llm.invoke(prompt)\n",
" #print(answer)\n",
"\n",
" json_string = re.search(r'\\{.*\\}', answer, re.DOTALL).group(0).replace('\\n', '')\n",
" data = json.loads(json_string)\n",
"\n",
" try:\n",
" net_amount = round(float(data['net_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n",
" vat_or_tax_or_gst_amount = round(float(data['vat_or_tax_or_gst_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n",
" total_amount = round(float(data['total_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n",
"\n",
" data['Validation'] = 'correct' if round(net_amount + vat_or_tax_or_gst_amount, 2) == total_amount else 'check'\n",
" print(\"Processed -- \", source)\n",
" except (ValueError, KeyError):\n",
" data['Validation'] = 'check'\n",
"\n",
" return data\n",
"\n",
" def process_invoices(self, folder_path):\n",
" columns = ['File_Name', 'Invoice_Number', 'Net_Amount', 'TAX_Amount', 'Total_Amount', 'Validation', 'Invoice_Date', 'Purchase_Order_Number', 'Customer_Number']\n",
" df_invoice = pd.DataFrame(columns=columns)\n",
"\n",
" for filename in os.listdir(folder_path):\n",
" if filename.endswith('.pdf'):\n",
" pdf_path = os.path.join(folder_path, filename)\n",
" try:\n",
" data = self.extract_invoice_data(pdf_path)\n",
" data['FileName'] = filename\n",
"\n",
" new_row = {\n",
" 'File_Name': data['FileName'],\n",
" 'Invoice_Number': data['invoice_number'],\n",
" 'Net_Amount': data['net_amount'],\n",
" 'TAX_Amount': data['vat_or_tax_or_gst_amount'],\n",
" 'Total_Amount': data['total_amount'],\n",
" 'Validation': data['Validation'],\n",
" 'Invoice_Date': data['invoice_date'],\n",
" 'Purchase_Order_Number': data['purchase_order_number'],\n",
" 'Customer_Number': data['customer_number']\n",
" }\n",
"\n",
" df_invoice = pd.concat([df_invoice, pd.DataFrame([new_row])], ignore_index=True)\n",
" except Exception:\n",
" pass\n",
"\n",
" return df_invoice\n"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9c706488-d739-476b-8fd8-98f7ebba9756",
"metadata": {},
"outputs": [],
"source": [
"\n",
"def setup_directory(directory):\n",
" \"\"\"\n",
" Ensure the specified directory exists. Create it if it doesn't.\n",
" \"\"\"\n",
" os.makedirs(directory, exist_ok=True)\n",
" print(f\"Directory '{directory}' is ready.\")\n",
"\n",
"def download_files(file_list, base_url, directory):\n",
" \"\"\"\n",
" Download files from a given base URL into a specified directory.\n",
" \"\"\"\n",
" for file_name in file_list:\n",
" file_url = base_url + file_name\n",
" local_file_path = os.path.join(directory, file_name)\n",
" try:\n",
" # Download the file\n",
" response = requests.get(file_url)\n",
" if response.status_code == 200:\n",
" # Save to the specified directory\n",
" with open(local_file_path, \"wb\") as file:\n",
" file.write(response.content)\n",
" print(f\"Downloaded: {file_name}\")\n",
" else:\n",
" print(f\"Failed to download {file_name}. Status code: {response.status_code}\")\n",
" except Exception as e:\n",
" print(f\"Error downloading {file_name}: {e}\")\n",
"\n",
"def delete_files(directory):\n",
" \"\"\"\n",
" Delete all files in the specified directory.\n",
" \"\"\"\n",
" try:\n",
" for file_name in os.listdir(directory):\n",
" file_path = os.path.join(directory, file_name)\n",
" if os.path.isfile(file_path):\n",
" os.remove(file_path)\n",
" print(f\"Deleted: {file_name}\")\n",
" except Exception as e:\n",
" print(f\"Error deleting files: {e}\")\n",
"\n",
"def cleanup_directory(directory):\n",
" \"\"\"\n",
" Remove the directory if it is empty.\n",
" \"\"\"\n",
" try:\n",
" os.rmdir(directory)\n",
" print(f\"Directory '{directory}' removed successfully.\")\n",
" except OSError as e:\n",
" print(f\"Error removing directory '{directory}': {e}\")\n",
"\n",
"def download_invoice():\n",
" \"\"\"\n",
" Main workflow for downloading, processing, and cleaning up files.\n",
" \"\"\"\n",
" # List of file names\n",
" files = [\n",
" \"6900026063.pdf\",\n",
" \"6900026069.pdf\",\n",
" \"6905212892.pdf\",\n",
" \"904000640.pdf\",\n",
" \"PL_IERPIC_MISSING.pdf\"\n",
" ]\n",
"\n",
" # Base URL for the raw files\n",
" #base_url = \"https://raw.githubusercontent.com/SinghSuryaDeep/Granite_Recipes_Invoices/main/Invoices/\"\n",
" base_url = \"https://raw.githubusercontent.com/SinghSuryaDeep/granite-snack-cookbook/refs/heads/main/recipes/Invoice-Extraction/Invoices/\"\n",
" \n",
"\n",
" \n",
" #base_url = \"https://raw.githubusercontent.com/Surya-Deep-Singh/Granite_Recipes_Invoices\"\n",
" data_dir = \"data\"\n",
" setup_directory(data_dir)\n",
"\n",
" # Step 2: Download the files\n",
" print(\"Downloading files...\")\n",
" download_files(files, base_url, data_dir)\n",
"\n",
"\n"
]
},
{
"cell_type": "markdown",
"id": "088b6d7d-5fae-4497-87a8-4df073475a7e",
"metadata": {},
"source": [
"## Step 4. Initialize and Process Invoices"
]
},
{
"cell_type": "markdown",
"id": "2cec1921-e6b2-4fdf-bcb4-d7102656039c",
"metadata": {},
"source": [
"Set up the necessary credentials for IBM watsonx, process invoices using the InvoiceProcessor class, and store the results in a pandas DataFrame."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1972be9b-59e6-427b-b071-c3047bb15666",
"metadata": {},
"outputs": [],
"source": [
"if __name__ == \"__main__\":\n",
" download_invoice()\n",
" # Main script to initialize and process invoices\n",
" ibm_cloud_api_key = get_env_var('WATSONX_APIKEY')\n",
" project_id = get_env_var('WATSONX_PROJECT_ID')\n",
" watson_url = get_env_var('WATSONX_URL')\n",
" folder_path = os.getenv('folder_path')\n",
" invoice_processor = InvoiceProcessor(ibm_cloud_api_key, project_id, watson_url)\n",
" df_invoice = invoice_processor.process_invoices(\"data\")\n",
" delete_files('data')\n",
"df_invoice"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file added recipes/Invoice-Extraction/Invoices/904000640.pdf
Binary file not shown.
Binary file not shown.

0 comments on commit b42e55b

Please sign in to comment.