Invoice Extraction (#84)

* Create temp Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/invoice_extractions/temp Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/invoice_extractions directory Signed-off-by: Surya Deep Singh <[email protected]> * Create Invoices Extraction Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoices Extraction Signed-off-by: Surya Deep Singh <[email protected]> * Create Test Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Create test Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/test Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Test Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/12855.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/2006969.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/6900026063.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/6900026069.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/6905212892.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/904000640.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Invoices/PL_IERPIC_MISSING.pdf Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Update vanilla_notebooks.txt Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> * Update README.md Signed-off-by: Surya Deep Singh <[email protected]> * Update README.md Signed-off-by: Surya Deep Singh <[email protected]> * Update Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Delete recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb Signed-off-by: Surya Deep Singh <[email protected]> * Add files via upload Signed-off-by: Surya Deep Singh <[email protected]> --------- Signed-off-by: Surya Deep Singh <[email protected]> Co-authored-by: BJ Hargrave <[email protected]>
ibm-granite-community · Jan 9, 2025 · b42e55b · b42e55b
1 parent 075a03d
commit b42e55b
Show file tree

Hide file tree

Showing 8 changed files with 383 additions and 0 deletions.
diff --git a/.github/notebook_lists/vanilla_notebooks.txt b/.github/notebook_lists/vanilla_notebooks.txt
@@ -10,3 +10,4 @@ recipes/Granite_Guardian/Granite_Guardian_Detailed_Guide.ipynb
 recipes/Granite_Guardian/Granite_Guardian_Usage_Governance_Workflow.ipynb
 recipes/Granite_Guardian/Granite_Guardian_Watsonx_Usage.ipynb
 recipes/AI-Agents/travel_planner_agent.ipynb
+recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb
diff --git a/README.md b/README.md
@@ -18,6 +18,10 @@ The "Recipes" in the Granite Snack Cookbook showcase the essential capabilities
    <a target="_blank" href="https://colab.research.google.com/github/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/Function-Calling/Function_Calling.ipynb">
    <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
    </a>
+1. [Invoice Extraction](recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb)
+   <a target="_blank" href="https://colab.research.google.com/github/ibm-granite-community/granite-snack-cookbook/blob/main/recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb">
+   <img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/>
+   </a>
 
 ### Compound Systems
 

diff --git a/recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb b/recipes/Invoice-Extraction/Granite_Recipes_Invoices.ipynb
@@ -0,0 +1,378 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "766392dd-c422-424e-a752-36209712ba2f",
+   "metadata": {},
+   "source": [
+    "# Invoice Data Extraction using IBM Granite LLM from watsonx\n",
+    "Author: [@Surya Deep Singh](https://www.linkedin.com/in/surya-deep-singh-b9b94813a/)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "05da18f3-f2a4-42ab-b4b6-fa910140560c",
+   "metadata": {},
+   "source": [
+    "## Description"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "40c261d0-1b1a-4333-9f22-a4a90102934b",
+   "metadata": {},
+   "source": [
+    "This Jupyter Notebook is designed to process and extract key details from invoices stored as PDFs. It uses IBM watsonx's granite-3-8b-instruct LLM to interpret the invoice data and extract specific fields such as invoice number, total amounts and customer details, etc. Extracted data is validated and saved into a structured DataFrame for further analysis."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "23ecec96-cb84-4a33-9c36-f2670970f8e5",
+   "metadata": {},
+   "source": [
+    "## Step 1: Setup and Installation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "841608fe-6ffd-41b7-a0b4-0de98a6f1869",
+   "metadata": {},
+   "source": [
+    "We start by installing the required dependencies. This includes libraries for document conversion, IBM watsonx LLM integration, and data handling."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1104fe3c-46ff-4e3b-9781-6aa87f20037f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -q git+https://github.com/ibm-granite-community/utils \\\n",
+    "    docling==2.14.0 \\\n",
+    "    langchain==0.2.12 \\\n",
+    "    langchain-ibm==0.1.11 \\\n",
+    "    langchain-community==0.2.11 \\\n",
+    "    langchain-core==0.2.28 \\\n",
+    "    ibm-watsonx-ai==1.1.2 \\\n",
+    "    transformers==4.47.1\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "8772cd59-7ba2-40f2-adc5-5e07467d4876",
+   "metadata": {},
+   "source": [
+    "## Step 2. Import Libraries and Load Environment Variables"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fde5509a-e557-4896-8ee8-f41c5146ee01",
+   "metadata": {},
+   "source": [
+    "Import necessary libraries and load API credentials and configurations using the dotenv library."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "d85822d3-6b83-4a93-840f-f3d9001bf9df",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from docling.document_converter import DocumentConverter\n",
+    "from langchain_ibm import WatsonxLLM\n",
+    "from langchain_core.prompts import PromptTemplate\n",
+    "import os\n",
+    "import json\n",
+    "import pandas as pd\n",
+    "import re\n",
+    "import os\n",
+    "import requests\n",
+    "from dotenv import load_dotenv, dotenv_values \n",
+    "load_dotenv()\n",
+    "from ibm_granite_community.notebook_utils import get_env_var"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "10cdbf8b-a495-4159-a2fc-f49803610e73",
+   "metadata": {},
+   "source": [
+    "## Step 3: Define the InvoiceProcessor Class"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "cc609a2b-8344-4bea-9adc-164fc3e4623d",
+   "metadata": {},
+   "source": [
+    "The InvoiceProcessor class handles the processing of invoices, including document conversion, text extraction using IBM Docling and processing using Granite LLM"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9821baf1-2b02-4983-ae09-466442007ba2",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Define the InvoiceProcessor class\n",
+    "class InvoiceProcessor:\n",
+    "    def __init__(self, ibm_cloud_api_key, project_id, watson_url):\n",
+    "        self.llm = WatsonxLLM(\n",
+    "            model_id='ibm/granite-3-8b-instruct',\n",
+    "            apikey=ibm_cloud_api_key,\n",
+    "            project_id=project_id,\n",
+    "            params={\n",
+    "                \"decoding_method\": \"greedy\",\n",
+    "                \"max_new_tokens\": 8000,\n",
+    "                \"min_new_tokens\": 1,\n",
+    "                \"repetition_penalty\": 1.01\n",
+    "            },\n",
+    "            url=watson_url\n",
+    "        )\n",
+    "        self.converter = DocumentConverter()\n",
+    "\n",
+    "        \n",
+    "\n",
+    "    def extract_invoice_data(self, source):\n",
+    "        result = self.converter.convert(source)\n",
+    "        markdown_output =  result.document.export_to_markdown()\n",
+    "      \n",
+    "\n",
+    "        #prompt_template = PromptTemplate.from_template('''\n",
+    "        prompt_template = PromptTemplate(\n",
+    "            input_variables=[\"DOCUMENT\"],\n",
+    "            template='''\n",
+    "            <|start_of_role|>System<|end_of_role|> You are an AI assistant for processing invoices. Based on the provided invoice data, extract the 'Invoice Number', 'Total Net Amount', 'Total VAT or TAX or GST Amount', 'Total Amount' , 'Invoice Date', 'Purchase Order Number' and 'Customer number', without the currency values.\n",
+    "\n",
+    "            |Instructions|\n",
+    "            Identify and extract the following information:\n",
+    "            - **Invoice Number**: The unique identifier for the invoice.\n",
+    "            - **Net Amount**: The Total Net Amount indicated on the invoice.\n",
+    "            - **VAT or TAX or GST Amount**: The Total VAT or TAX or GST Amount indicated on the invoice.\n",
+    "            - **Total Amount**: The Total Cost indicated on the invoice.\n",
+    "            - **Invoice Date**: The date the invoice was issued.\n",
+    "            - **Purchase Order Number**: The unique identifier for the purchase order.\n",
+    "            - **Customer Number**: The unique identifier for the customer.\n",
+    "\n",
+    "            Invoice Data:\n",
+    "            {DOCUMENT}\n",
+    "\n",
+    "\n",
+    "            Strictly provide the extracted information in the following JSON format:\n",
+    "\n",
+    "            ```json\n",
+    "            {{\n",
+    "              \"invoice_number\": \"extracted_invoice_number\",\n",
+    "              \"net_amount\": \"extracted_new_amount\",\n",
+    "              \"vat_or_tax_or_gst_amount\" : \"extracted_vat_or_tax_or_gst_amount\",\n",
+    "              \"total_amount\": \"extracted_total_amount\",\n",
+    "              \"invoice_date\": \"extracted_invoice_date\",\n",
+    "              \"purchase_order_number\": \"extracted_purchase_order_number\",\n",
+    "              \"customer_number\": \"extracted_customer_number\"\n",
+    "            }}\n",
+    "\n",
+    "            <|end_of_text|>\n",
+    "\n",
+    "            <|start_of_role|>assistant<|end_of_role|>\n",
+    "        ''')\n",
+    "\n",
+    "        prompt = prompt_template.format(DOCUMENT=str(markdown_output).strip())\n",
+    "        answer = self.llm.invoke(prompt)\n",
+    "        #print(answer)\n",
+    "\n",
+    "        json_string = re.search(r'\\{.*\\}', answer, re.DOTALL).group(0).replace('\\n', '')\n",
+    "        data = json.loads(json_string)\n",
+    "\n",
+    "        try:\n",
+    "            net_amount = round(float(data['net_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n",
+    "            vat_or_tax_or_gst_amount = round(float(data['vat_or_tax_or_gst_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n",
+    "            total_amount = round(float(data['total_amount'].replace(\",\", \"\").replace(\"$\", \"\").strip()), 2)\n",
+    "\n",
+    "            data['Validation'] = 'correct' if round(net_amount + vat_or_tax_or_gst_amount, 2) == total_amount else 'check'\n",
+    "            print(\"Processed -- \", source)\n",
+    "        except (ValueError, KeyError):\n",
+    "            data['Validation'] = 'check'\n",
+    "\n",
+    "        return data\n",
+    "\n",
+    "    def process_invoices(self, folder_path):\n",
+    "        columns = ['File_Name', 'Invoice_Number', 'Net_Amount', 'TAX_Amount', 'Total_Amount', 'Validation', 'Invoice_Date', 'Purchase_Order_Number', 'Customer_Number']\n",
+    "        df_invoice = pd.DataFrame(columns=columns)\n",
+    "\n",
+    "        for filename in os.listdir(folder_path):\n",
+    "            if filename.endswith('.pdf'):\n",
+    "                pdf_path = os.path.join(folder_path, filename)\n",
+    "                try:\n",
+    "                    data = self.extract_invoice_data(pdf_path)\n",
+    "                    data['FileName'] = filename\n",
+    "\n",
+    "                    new_row = {\n",
+    "                        'File_Name': data['FileName'],\n",
+    "                        'Invoice_Number': data['invoice_number'],\n",
+    "                        'Net_Amount': data['net_amount'],\n",
+    "                        'TAX_Amount': data['vat_or_tax_or_gst_amount'],\n",
+    "                        'Total_Amount': data['total_amount'],\n",
+    "                        'Validation': data['Validation'],\n",
+    "                        'Invoice_Date': data['invoice_date'],\n",
+    "                        'Purchase_Order_Number': data['purchase_order_number'],\n",
+    "                        'Customer_Number': data['customer_number']\n",
+    "                    }\n",
+    "\n",
+    "                    df_invoice = pd.concat([df_invoice, pd.DataFrame([new_row])], ignore_index=True)\n",
+    "                except Exception:\n",
+    "                    pass\n",
+    "\n",
+    "        return df_invoice\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "9c706488-d739-476b-8fd8-98f7ebba9756",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "def setup_directory(directory):\n",
+    "    \"\"\"\n",
+    "    Ensure the specified directory exists. Create it if it doesn't.\n",
+    "    \"\"\"\n",
+    "    os.makedirs(directory, exist_ok=True)\n",
+    "    print(f\"Directory '{directory}' is ready.\")\n",
+    "\n",
+    "def download_files(file_list, base_url, directory):\n",
+    "    \"\"\"\n",
+    "    Download files from a given base URL into a specified directory.\n",
+    "    \"\"\"\n",
+    "    for file_name in file_list:\n",
+    "        file_url = base_url + file_name\n",
+    "        local_file_path = os.path.join(directory, file_name)\n",
+    "        try:\n",
+    "            # Download the file\n",
+    "            response = requests.get(file_url)\n",
+    "            if response.status_code == 200:\n",
+    "                # Save to the specified directory\n",
+    "                with open(local_file_path, \"wb\") as file:\n",
+    "                    file.write(response.content)\n",
+    "                print(f\"Downloaded: {file_name}\")\n",
+    "            else:\n",
+    "                print(f\"Failed to download {file_name}. Status code: {response.status_code}\")\n",
+    "        except Exception as e:\n",
+    "            print(f\"Error downloading {file_name}: {e}\")\n",
+    "\n",
+    "def delete_files(directory):\n",
+    "    \"\"\"\n",
+    "    Delete all files in the specified directory.\n",
+    "    \"\"\"\n",
+    "    try:\n",
+    "        for file_name in os.listdir(directory):\n",
+    "            file_path = os.path.join(directory, file_name)\n",
+    "            if os.path.isfile(file_path):\n",
+    "                os.remove(file_path)\n",
+    "                print(f\"Deleted: {file_name}\")\n",
+    "    except Exception as e:\n",
+    "        print(f\"Error deleting files: {e}\")\n",
+    "\n",
+    "def cleanup_directory(directory):\n",
+    "    \"\"\"\n",
+    "    Remove the directory if it is empty.\n",
+    "    \"\"\"\n",
+    "    try:\n",
+    "        os.rmdir(directory)\n",
+    "        print(f\"Directory '{directory}' removed successfully.\")\n",
+    "    except OSError as e:\n",
+    "        print(f\"Error removing directory '{directory}': {e}\")\n",
+    "\n",
+    "def download_invoice():\n",
+    "    \"\"\"\n",
+    "    Main workflow for downloading, processing, and cleaning up files.\n",
+    "    \"\"\"\n",
+    "    # List of file names\n",
+    "    files = [\n",
+    "        \"6900026063.pdf\",\n",
+    "        \"6900026069.pdf\",\n",
+    "        \"6905212892.pdf\",\n",
+    "        \"904000640.pdf\",\n",
+    "        \"PL_IERPIC_MISSING.pdf\"\n",
+    "    ]\n",
+    "\n",
+    "    # Base URL for the raw files\n",
+    "    #base_url = \"https://raw.githubusercontent.com/SinghSuryaDeep/Granite_Recipes_Invoices/main/Invoices/\"\n",
+    "    base_url = \"https://raw.githubusercontent.com/SinghSuryaDeep/granite-snack-cookbook/refs/heads/main/recipes/Invoice-Extraction/Invoices/\"\n",
+    "  \n",
+    "\n",
+    "    \n",
+    "    #base_url = \"https://raw.githubusercontent.com/Surya-Deep-Singh/Granite_Recipes_Invoices\"\n",
+    "    data_dir = \"data\"\n",
+    "    setup_directory(data_dir)\n",
+    "\n",
+    "    # Step 2: Download the files\n",
+    "    print(\"Downloading files...\")\n",
+    "    download_files(files, base_url, data_dir)\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "088b6d7d-5fae-4497-87a8-4df073475a7e",
+   "metadata": {},
+   "source": [
+    "## Step 4. Initialize and Process Invoices"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2cec1921-e6b2-4fdf-bcb4-d7102656039c",
+   "metadata": {},
+   "source": [
+    "Set up the necessary credentials for IBM watsonx, process invoices using the InvoiceProcessor class, and store the results in a pandas DataFrame."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "1972be9b-59e6-427b-b071-c3047bb15666",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "if __name__ == \"__main__\":\n",
+    "    download_invoice()\n",
+    "    # Main script to initialize and process invoices\n",
+    "    ibm_cloud_api_key = get_env_var('WATSONX_APIKEY')\n",
+    "    project_id = get_env_var('WATSONX_PROJECT_ID')\n",
+    "    watson_url = get_env_var('WATSONX_URL')\n",
+    "    folder_path = os.getenv('folder_path')\n",
+    "    invoice_processor = InvoiceProcessor(ibm_cloud_api_key, project_id, watson_url)\n",
+    "    df_invoice = invoice_processor.process_invoices(\"data\")\n",
+    "    delete_files('data')\n",
+    "df_invoice"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/recipes/Invoice-Extraction/Invoices/6900026063.pdf b/recipes/Invoice-Extraction/Invoices/6900026063.pdf
diff --git a/recipes/Invoice-Extraction/Invoices/6900026069.pdf b/recipes/Invoice-Extraction/Invoices/6900026069.pdf
diff --git a/recipes/Invoice-Extraction/Invoices/6905212892.pdf b/recipes/Invoice-Extraction/Invoices/6905212892.pdf
diff --git a/recipes/Invoice-Extraction/Invoices/904000640.pdf b/recipes/Invoice-Extraction/Invoices/904000640.pdf
diff --git a/recipes/Invoice-Extraction/Invoices/PL_IERPIC_MISSING.pdf b/recipes/Invoice-Extraction/Invoices/PL_IERPIC_MISSING.pdf