
Receipt Information Extraction using LayoutLM

Introduction

This repository is an attempt to implement Microsoft's LayoutLM model on the SROIE2019 dataset of receipts. The focus of this project is specifically Task 3 of the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction: extracting key information, specifically the company, address, date, and total, from scanned receipts.

Approach

This repository approaches the task as a token classification problem. For each word on the scanned receipt image, the model attempts to classify whether it is part of the company, address, date, or total sequence. If it doesn't belong to any of these fields, we classify it as "Other". At the end, we sequence all the words with the same class together to produce the complete extracted fields.
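As a rough illustration, here is a minimal sketch of that regrouping step; the words and labels lists are hypothetical model outputs, not the repository's actual data structures.

from collections import defaultdict

def group_fields(words, labels):
    # Collect the words of each predicted class in reading order,
    # then join them into one string per field.
    fields = defaultdict(list)
    for word, label in zip(words, labels):
        if label != "other":
            fields[label].append(word)
    return {field: " ".join(tokens) for field, tokens in fields.items()}

# Hypothetical classifier output for a tiny receipt:
words = ["STARBUCKS", "COFFEE", "TOTAL", "RM", "8.50"]
labels = ["company", "company", "other", "other", "total"]
print(group_fields(words, labels))
# {'company': 'STARBUCKS COFFEE', 'total': '8.50'}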

An example of the same task on a different dataset can be found in LayoutLM's own repository.

Dataset

Thanks to zzzDavid's repository on the same topic, I was able to get the dataset:

Google Drive | Baidu NetDisk

Data cleaning/preprocessing

A big problem with this dataset is that it is not in a form that can be readily fed into the model. Specifically:

  1. The OCR result is provided at line level (meaning that the bounding boxes cover an entire line), while the model requires token-level OCR
  2. The dataset doesn't provide token-level labels

Through a process of trial and error, I arrived at the following steps to clean and preprocess the data for the model. For the first problem, I process the line-level OCR provided by the dataset to get a token-level bounding box for each word. Some observations:

  • most of the fonts used are monospaced, meaning that every character in that font family has the same width
  • words on the same line usually share the same font and size

From these two observations, we make the assumption that all characters (including spaces) on the same line have the same width. Therefore, we can calculate the width of one character from the width of the line's bounding box. From this, we can calculate the width of the bounding box for each word based on how many characters that word consists of. Of course, this will not reproduce the ground-truth OCR for every word (example shown below). However, the bounding boxes' relative positions are good enough, since what we ultimately want is each word's relative 2D position in the scanned image.
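Here is a minimal sketch of that calculation; the (x0, y0, x1, y1) box format and the function name are assumptions for illustration, not the repository's actual code.

def split_line_box(line_text, line_box):
    # line_box is assumed to be (x0, y0, x1, y1) for the whole line.
    x0, y0, x1, y1 = line_box
    words = line_text.split()
    # All characters on the line (including spaces) are assumed to share
    # one width: the line width divided by the character count.
    char_width = (x1 - x0) / len(line_text)
    word_boxes, cursor = [], x0
    for word in words:
        word_width = len(word) * char_width
        word_boxes.append((word, (round(cursor), y0, round(cursor + word_width), y1)))
        cursor += word_width + char_width  # advance past the separating space
    return word_boxes

print(split_line_box("TOTAL RM 8.50", (100, 40, 360, 60)))
# [('TOTAL', (100, 40, 200, 60)), ('RM', (220, 40, 260, 60)), ('8.50', (280, 40, 360, 60))]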

For the second problem, the cleaning process is a bit more complicated and involves some manual labelling. For each image, I assume that all words on one line belong to the same field. If a line contains one of the extracted information fields as a substring, or vice versa, I (programmatically) label that entire line with that field's name. This assumption is far more problematic than the first and resulted in a lot of mislabelled data. I recorded all the mismatched labels in a text file and fixed them manually in the label file.
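A minimal sketch of this matching heuristic, with hypothetical names (the fields dictionary maps each target field to its ground-truth string from the extract file):

def label_line(line_text, fields):
    # fields is e.g. {"company": ..., "address": ..., "date": ..., "total": ...}
    text = line_text.strip().lower()
    if not text:
        return "other"
    for name, value in fields.items():
        value = value.strip().lower()
        # A line matches a field if either string contains the other.
        if value and (value in text or text in value):
            return name
    return "other"

fields = {"company": "STARBUCKS COFFEE SDN BHD", "total": "8.50"}
print(label_line("STARBUCKS COFFEE", fields))              # 'company'
print(label_line("THANK YOU, PLEASE COME AGAIN", fields))  # 'other'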

The code to clean/preprocess the data as outlined above is in preprocess_receipt.py. Before you can run the script, you will need to download the data into /data and split it into 3 folders, namely images/, annotations/, and extract/, which contain the images, their OCR results, and their extracted information respectively. You will also need to change the extracted information files' extension from .txt to .json (this can be done programmatically using Python; renaming a file to the appropriate extension does not corrupt it). I have already cleaned the first 22,000 words and put them into formatted .txt files. You can use these files to train and test the model yourself, following the instructions for the Sequence Labelling task mentioned as one of the examples in LayoutLM's own repository.
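For the extension change, a few lines of Python along these lines would do the renaming, assuming the data/extract layout described above:

from pathlib import Path

# Renaming only changes the extension; the file contents stay intact.
for txt_file in Path("data/extract").glob("*.txt"):
    txt_file.rename(txt_file.with_suffix(".json"))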

Training

We will follow the same training procedure outlined in the LayoutLM Sequence Labelling example, using the .txt files we have created. These files can be found under /data in this repository.

Results

With relatively little training data (~20,000 words and bounding boxes), the model is able to achieve ~91.5% token-level accuracy on a ~2,000-word test set. We can then use the classification results and sequence the words of the same class back together to get the full extracted information for each receipt.

Model

The trained model can be found here.

Demo

This repository also contains a simple Flask application to demonstrate the model. The application takes the words and their bounding boxes from a collection of receipt images and puts the extracted information (from our model) into a Google Sheet. In a real-world scenario, this sheet could be processed further, eliminating the need for manual data entry.
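As a rough sketch of how the Sheets step could look using the official google-api-python-client and google-auth-oauthlib packages (all names here are illustrative, not the repository's actual code):

from google_auth_oauthlib.flow import InstalledAppFlow
from googleapiclient.discovery import build

SCOPES = ["https://www.googleapis.com/auth/spreadsheets"]

def write_receipts_to_sheet(rows, sheet_title):
    # First run opens a browser so you can authorize the app;
    # credentials.json is the file downloaded in the Installation steps.
    flow = InstalledAppFlow.from_client_secrets_file("credentials.json", SCOPES)
    creds = flow.run_local_server(port=0)
    service = build("sheets", "v4", credentials=creds)

    # Create a new spreadsheet, then append a header row plus one row
    # of extracted fields per receipt.
    sheet = service.spreadsheets().create(
        body={"properties": {"title": sheet_title}}
    ).execute()
    service.spreadsheets().values().append(
        spreadsheetId=sheet["spreadsheetId"],
        range="A1",
        valueInputOption="RAW",
        body={"values": [["company", "address", "date", "total"]] + rows},
    ).execute()
    return sheet["spreadsheetUrl"]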

Installation

  1. Clone this repository
git clone https://github.com/thongn98/receipt-information-extraction.git
  2. Make a virtual environment and activate it

You will need to navigate to where you want to put the environment files.

python3 -m venv [ENV-NAME]
source [ENV-NAME]/bin/activate
  3. Install requirements

Navigate back to this repository folder.

pip install -r requirements.txt
  4. Clone the LayoutLM repository

git clone https://github.com/microsoft/unilm.git

  5. Install LayoutLM
pip install unilm/layoutlm
  6. Download the credentials.json file for the Google Sheets API. You can find the instructions to do so here. You'll need to move this credentials file into this repository's root folder
  7. Create a folder for the model inside the repository root folder
mkdir model/
  8. Download the model from the link above and put all the files into model/
  9. Run the Flask app
python view.py

When you first run the application, you will need to authorize it to access the Google Sheets API with your Google account.

This will open port 5000 on your local machine. You can try the application out by going to http://127.0.0.1:5000/ in your browser. You can use /data/test_image.txt as the input, with a sheet name of your choosing. The app will process the .txt file and give back the URL of a Google Sheet with the extracted information. The UI of the application looks like this:

(Screenshot of the demo application UI)

File Structure

  • A file responsible for preprocessing the input data
  • A file responsible for predicting the labels for the input data
  • A file defining all miscellaneous helper functions
  • A file defining all the helper functions related to the Google Sheets API
  • view.py: defines all the Flask application views
  • preprocess_receipt.py: preprocesses the original data into .txt files for the model to train/predict
  • A folder containing all the assets for the Flask application
