-
Notifications
You must be signed in to change notification settings - Fork 28
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
* Create temp Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/invoice_extractions/temp Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/invoice_extractions directory Signed-off-by: Surya Deep Singh <[email protected]> * Create Invoices Extraction Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoices Extraction Signed-off-by: Surya Deep Singh <[email protected]> * Create Test Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Create test Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/test Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Test Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/12855.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/2006969.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/6900026063.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/6900026069.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/6905212892.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/904000640.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/PL_IERPIC_MISSING.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Update vanilla_notebooks.txt Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Update README.md Signed-off-by: Surya Deep Singh <[email protected]> * Update README.md Signed-off-by: Surya Deep Singh <[email protected]> * Update Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> --------- Signed-off-by: Surya Deep Singh <[email protected]> Co-authored-by: BJ Hargrave <[email protected]>
- Loading branch information
1 parent
075a03d
commit b42e55b
Showing
8 changed files
with
383 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
378 changes: 378 additions & 0 deletions
378
recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,378 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"id": "766392dd-c422-424e-a752-36209712ba2f", | ||
"metadata": {}, | ||
"source": [ | ||
"# Invoice Data Extraction using IBM Granite LLM from watsonx\n", | ||
"Author: [@Surya Deep Singh](https://www.linkedin.com/in/surya-deep-singh-b9b94813a/)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "05da18f3-f2a4-42ab-b4b6-fa910140560c", | ||
"metadata": {}, | ||
"source": [ | ||
"## Description" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "40c261d0-1b1a-4333-9f22-a4a90102934b", | ||
"metadata": {}, | ||
"source": [ | ||
"This Jupyter Notebook is designed to process and extract key details from invoices stored as PDFs. It uses IBM watsonx's granite-3-8b-instruct LLM to interpret the invoice data and extract specific fields such as invoice number, total amounts and customer details, etc. Extracted data is validated and saved into a structured DataFrame for further analysis." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "23ecec96-cb84-4a33-9c36-f2670970f8e5", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 1: Setup and Installation" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "841608fe-6ffd-41b7-a0b4-0de98a6f1869", | ||
"metadata": {}, | ||
"source": [ | ||
"We start by installing the required dependencies. This includes libraries for document conversion, IBM watsonx LLM integration, and data handling." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "1104fe3c-46ff-4e3b-9781-6aa87f20037f", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"!pip install -q git+https://github.com/ibm-granite-community/utils \\\n", | ||
" docling==2.14.0 \\\n", | ||
" langchain==0.2.12 \\\n", | ||
" langchain-ibm==0.1.11 \\\n", | ||
" langchain-community==0.2.11 \\\n", | ||
" langchain-core==0.2.28 \\\n", | ||
" ibm-watsonx-ai==1.1.2 \\\n", | ||
" transformers==4.47.1\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "8772cd59-7ba2-40f2-adc5-5e07467d4876", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 2. Import Libraries and Load Environment Variables" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "fde5509a-e557-4896-8ee8-f41c5146ee01", | ||
"metadata": {}, | ||
"source": [ | ||
"Import necessary libraries and load API credentials and configurations using the dotenv library." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "d85822d3-6b83-4a93-840f-f3d9001bf9df", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"from docling.document_converter import DocumentConverter\n", | ||
"from langchain_ibm import WatsonxLLM\n", | ||
"from langchain_core.prompts import PromptTemplate\n", | ||
"import os\n", | ||
"import json\n", | ||
"import pandas as pd\n", | ||
"import re\n", | ||
"import os\n", | ||
"import requests\n", | ||
"from dotenv import load_dotenv, dotenv_values \n", | ||
"load_dotenv()\n", | ||
"from ibm_granite_community.notebook_utils import get_env_var" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "10cdbf8b-a495-4159-a2fc-f49803610e73", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 3: Define the InvoiceProcessor Class" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "cc609a2b-8344-4bea-9adc-164fc3e4623d", | ||
"metadata": {}, | ||
"source": [ | ||
"The InvoiceProcessor class handles the processing of invoices, including document conversion, text extraction using IBM Docling and processing using Granite LLM" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "9821baf1-2b02-4983-ae09-466442007ba2", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"# Define the InvoiceProcessor class\n", | ||
"class InvoiceProcessor:\n", | ||
" def __init__(self, ibm_cloud_api_key, project_id, watson_url):\n", | ||
" self.llm = WatsonxLLM(\n", | ||
" model_id='ibm/granite-3-8b-instruct',\n", | ||
" apikey=ibm_cloud_api_key,\n", | ||
" project_id=project_id,\n", | ||
" params={\n", | ||
" \"decoding_method\": \"greedy\",\n", | ||
" \"max_new_tokens\": 8000,\n", | ||
" \"min_new_tokens\": 1,\n", | ||
" \"repetition_penalty\": 1.01\n", | ||
" },\n", | ||
" url=watson_url\n", | ||
" )\n", | ||
" self.converter = DocumentConverter()\n", | ||
"\n", | ||
" \n", | ||
"\n", | ||
" def extract_invoice_data(self, source):\n", | ||
" result = self.converter.convert(source)\n", | ||
" markdown_output = result.document.export_to_markdown()\n", | ||
" \n", | ||
"\n", | ||
" #prompt_template = PromptTemplate.from_template('''\n", | ||
" prompt_template = PromptTemplate(\n", | ||
" input_variables=[\"DOCUMENT\"],\n", | ||
" template='''\n", | ||
" <|start_of_role|>System<|end_of_role|> You are an AI assistant for processing invoices. Based on the provided invoice data, extract the 'Invoice Number', 'Total Net Amount', 'Total VAT or TAX or GST Amount', 'Total Amount' , 'Invoice Date', 'Purchase Order Number' and 'Customer number', without the currency values.\n", | ||
"\n", | ||
" |Instructions|\n", | ||
" Identify and extract the following information:\n", | ||
" - **Invoice Number**: The unique identifier for the invoice.\n", | ||
" - **Net Amount**: The Total Net Amount indicated on the invoice.\n", | ||
" - **VAT or TAX or GST Amount**: The Total VAT or TAX or GST Amount indicated on the invoice.\n", | ||
" - **Total Amount**: The Total Cost indicated on the invoice.\n", | ||
" - **Invoice Date**: The date the invoice was issued.\n", | ||
" - **Purchase Order Number**: The unique identifier for the purchase order.\n", | ||
" - **Customer Number**: The unique identifier for the customer.\n", | ||
"\n", | ||
" Invoice Data:\n", | ||
" {DOCUMENT}\n", | ||
"\n", | ||
"\n", | ||
" Strictly provide the extracted information in the following JSON format:\n", | ||
"\n", | ||
" ```json\n", | ||
" {{\n", | ||
" \"invoice_number\": \"extracted_invoice_number\",\n", | ||
" \"net_amount\": \"extracted_new_amount\",\n", | ||
" \"vat_or_tax_or_gst_amount\" : \"extracted_vat_or_tax_or_gst_amount\",\n", | ||
" \"total_amount\": \"extracted_total_amount\",\n", | ||
" \"invoice_date\": \"extracted_invoice_date\",\n", | ||
" \"purchase_order_number\": \"extracted_purchase_order_number\",\n", | ||
" \"customer_number\": \"extracted_customer_number\"\n", | ||
" }}\n", | ||
"\n", | ||
" <|end_of_text|>\n", | ||
"\n", | ||
" <|start_of_role|>assistant<|end_of_role|>\n", | ||
" ''')\n", | ||
"\n", | ||
" prompt = prompt_template.format(DOCUMENT=str(markdown_output).strip())\n", | ||
" answer = self.llm.invoke(prompt)\n", | ||
" #print(answer)\n", | ||
"\n", | ||
" json_string = re.search(r'\\{.*\\}', answer, re.DOTALL).group(0).replace('\\n', '')\n", | ||
" data = json.loads(json_string)\n", | ||
"\n", | ||
" try:\n", | ||
" net_amount = round(float(data['net_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n", | ||
" vat_or_tax_or_gst_amount = round(float(data['vat_or_tax_or_gst_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n", | ||
" total_amount = round(float(data['total_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n", | ||
"\n", | ||
" data['Validation'] = 'correct' if round(net_amount + vat_or_tax_or_gst_amount, 2) == total_amount else 'check'\n", | ||
" print(\"Processed -- \", source)\n", | ||
" except (ValueError, KeyError):\n", | ||
" data['Validation'] = 'check'\n", | ||
"\n", | ||
" return data\n", | ||
"\n", | ||
" def process_invoices(self, folder_path):\n", | ||
" columns = ['File_Name', 'Invoice_Number', 'Net_Amount', 'TAX_Amount', 'Total_Amount', 'Validation', 'Invoice_Date', 'Purchase_Order_Number', 'Customer_Number']\n", | ||
" df_invoice = pd.DataFrame(columns=columns)\n", | ||
"\n", | ||
" for filename in os.listdir(folder_path):\n", | ||
" if filename.endswith('.pdf'):\n", | ||
" pdf_path = os.path.join(folder_path, filename)\n", | ||
" try:\n", | ||
" data = self.extract_invoice_data(pdf_path)\n", | ||
" data['FileName'] = filename\n", | ||
"\n", | ||
" new_row = {\n", | ||
" 'File_Name': data['FileName'],\n", | ||
" 'Invoice_Number': data['invoice_number'],\n", | ||
" 'Net_Amount': data['net_amount'],\n", | ||
" 'TAX_Amount': data['vat_or_tax_or_gst_amount'],\n", | ||
" 'Total_Amount': data['total_amount'],\n", | ||
" 'Validation': data['Validation'],\n", | ||
" 'Invoice_Date': data['invoice_date'],\n", | ||
" 'Purchase_Order_Number': data['purchase_order_number'],\n", | ||
" 'Customer_Number': data['customer_number']\n", | ||
" }\n", | ||
"\n", | ||
" df_invoice = pd.concat([df_invoice, pd.DataFrame([new_row])], ignore_index=True)\n", | ||
" except Exception:\n", | ||
" pass\n", | ||
"\n", | ||
" return df_invoice\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "9c706488-d739-476b-8fd8-98f7ebba9756", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"\n", | ||
"def setup_directory(directory):\n", | ||
" \"\"\"\n", | ||
" Ensure the specified directory exists. Create it if it doesn't.\n", | ||
" \"\"\"\n", | ||
" os.makedirs(directory, exist_ok=True)\n", | ||
" print(f\"Directory '{directory}' is ready.\")\n", | ||
"\n", | ||
"def download_files(file_list, base_url, directory):\n", | ||
" \"\"\"\n", | ||
" Download files from a given base URL into a specified directory.\n", | ||
" \"\"\"\n", | ||
" for file_name in file_list:\n", | ||
" file_url = base_url + file_name\n", | ||
" local_file_path = os.path.join(directory, file_name)\n", | ||
" try:\n", | ||
" # Download the file\n", | ||
" response = requests.get(file_url)\n", | ||
" if response.status_code == 200:\n", | ||
" # Save to the specified directory\n", | ||
" with open(local_file_path, \"wb\") as file:\n", | ||
" file.write(response.content)\n", | ||
" print(f\"Downloaded: {file_name}\")\n", | ||
" else:\n", | ||
" print(f\"Failed to download {file_name}. Status code: {response.status_code}\")\n", | ||
" except Exception as e:\n", | ||
" print(f\"Error downloading {file_name}: {e}\")\n", | ||
"\n", | ||
"def delete_files(directory):\n", | ||
" \"\"\"\n", | ||
" Delete all files in the specified directory.\n", | ||
" \"\"\"\n", | ||
" try:\n", | ||
" for file_name in os.listdir(directory):\n", | ||
" file_path = os.path.join(directory, file_name)\n", | ||
" if os.path.isfile(file_path):\n", | ||
" os.remove(file_path)\n", | ||
" print(f\"Deleted: {file_name}\")\n", | ||
" except Exception as e:\n", | ||
" print(f\"Error deleting files: {e}\")\n", | ||
"\n", | ||
"def cleanup_directory(directory):\n", | ||
" \"\"\"\n", | ||
" Remove the directory if it is empty.\n", | ||
" \"\"\"\n", | ||
" try:\n", | ||
" os.rmdir(directory)\n", | ||
" print(f\"Directory '{directory}' removed successfully.\")\n", | ||
" except OSError as e:\n", | ||
" print(f\"Error removing directory '{directory}': {e}\")\n", | ||
"\n", | ||
"def download_invoice():\n", | ||
" \"\"\"\n", | ||
" Main workflow for downloading, processing, and cleaning up files.\n", | ||
" \"\"\"\n", | ||
" # List of file names\n", | ||
" files = [\n", | ||
" \"6900026063.pdf\",\n", | ||
" \"6900026069.pdf\",\n", | ||
" \"6905212892.pdf\",\n", | ||
" \"904000640.pdf\",\n", | ||
" \"PL_IERPIC_MISSING.pdf\"\n", | ||
" ]\n", | ||
"\n", | ||
" # Base URL for the raw files\n", | ||
" #base_url = \"https://raw.githubusercontent.com/SinghSuryaDeep/Granite_Recipes_Invoices/main/Invoices/\"\n", | ||
" base_url = \"https://raw.githubusercontent.com/SinghSuryaDeep/granite-snack-cookbook/refs/heads/main/recipes/Invoice-Extraction/Invoices/\"\n", | ||
" \n", | ||
"\n", | ||
" \n", | ||
" #base_url = \"https://raw.githubusercontent.com/Surya-Deep-Singh/Granite_Recipes_Invoices\"\n", | ||
" data_dir = \"data\"\n", | ||
" setup_directory(data_dir)\n", | ||
"\n", | ||
" # Step 2: Download the files\n", | ||
" print(\"Downloading files...\")\n", | ||
" download_files(files, base_url, data_dir)\n", | ||
"\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "088b6d7d-5fae-4497-87a8-4df073475a7e", | ||
"metadata": {}, | ||
"source": [ | ||
"## Step 4. Initialize and Process Invoices" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"id": "2cec1921-e6b2-4fdf-bcb4-d7102656039c", | ||
"metadata": {}, | ||
"source": [ | ||
"Set up the necessary credentials for IBM watsonx, process invoices using the InvoiceProcessor class, and store the results in a pandas DataFrame." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": null, | ||
"id": "1972be9b-59e6-427b-b071-c3047bb15666", | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"if __name__ == \"__main__\":\n", | ||
" download_invoice()\n", | ||
" # Main script to initialize and process invoices\n", | ||
" ibm_cloud_api_key = get_env_var('WATSONX_APIKEY')\n", | ||
" project_id = get_env_var('WATSONX_PROJECT_ID')\n", | ||
" watson_url = get_env_var('WATSONX_URL')\n", | ||
" folder_path = os.getenv('folder_path')\n", | ||
" invoice_processor = InvoiceProcessor(ibm_cloud_api_key, project_id, watson_url)\n", | ||
" df_invoice = invoice_processor.process_invoices(\"data\")\n", | ||
" delete_files('data')\n", | ||
"df_invoice" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.9" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 5 | ||
} |
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.
Binary file not shown.