This repository is an attempt to implement Microsoft's LayoutLM model on the SROIE2019 dataset of receipts. The focus of this project is Task 3 of the ICDAR 2019 Robust Reading Challenge on Scanned Receipts OCR and Information Extraction: extracting key information, specifically the company, address, date, and total, from scanned receipts.
This repository approaches the task as a token classification problem. For each word on the scanned receipt image, the model attempts to classify whether it is part of the company, address, date, or total sequence; if it doesn't belong to any of these fields, we want to classify it as "Other". At the end, we sequence all the words with the same class back together to produce the complete extracted fields.
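As an illustration, grouping classified words back into fields could look like the following sketch (the word/label lists here are made up, and the repository's actual code may differ):

```python
from collections import defaultdict

def group_fields(words, labels):
    """Group words that share a predicted class into complete field strings."""
    fields = defaultdict(list)
    for word, label in zip(words, labels):
        if label != "Other":
            fields[label].append(word)
    return {field: " ".join(tokens) for field, tokens in fields.items()}

words = ["STARBUCKS", "COFFEE", "TOTAL:", "12.80", "THANK", "YOU"]
labels = ["company", "company", "Other", "total", "Other", "Other"]
print(group_fields(words, labels))
# {'company': 'STARBUCKS COFFEE', 'total': '12.80'}
```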
An example of the same problem solved on a different dataset can be found in LayoutLM's own repository.
Thanks to zzzDavid's repository on the same topic, I was able to obtain the dataset.
A big problem with this dataset is that it is not in a form that can be fed directly into the model:
- The OCR result is provided at line level (meaning each bounding box covers an entire line), while the model requires token-level OCR
- The dataset doesn't provide token-level labels
Through a process of trial and error, I arrived at the following steps to clean and preprocess the data for the model. For the first problem, I process the line-level OCR provided with the data to get a token-level bounding box for each word. Some observations:
- Most of the fonts used are monospaced, meaning that every character in the font family has the same width
- Words on the same line usually share the same font and size
From these two observations, we make the assumption that all characters (including spaces) on the same line have the same width. Therefore, we can calculate the width of one character from the width of the line's bounding box, and from that, the width of each word's bounding box based on how many characters the word consists of. Of course, this will not produce the ground-truth OCR for every word (example shown below). However, the bounding boxes' relative positions are good enough, since what we ultimately want is each word's relative 2D position in the scanned image.
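A minimal sketch of this width heuristic (the function name and the (x_min, y_min, x_max, y_max) box format are my own, not necessarily those used in preprocess_receipt.py):

```python
def split_line_bbox(line_text, line_bbox):
    """Split a line-level bounding box into per-word boxes, assuming
    every character (including spaces) on the line has the same width."""
    x_min, y_min, x_max, y_max = line_bbox
    char_width = (x_max - x_min) / len(line_text)
    word_bboxes = []
    cursor = x_min
    for word in line_text.split(" "):
        word_width = char_width * len(word)
        word_bboxes.append((cursor, y_min, cursor + word_width, y_max))
        cursor += word_width + char_width  # advance past the word and one space
    return word_bboxes

# e.g. split_line_bbox("TOTAL 12.80", (100, 50, 320, 70))
# -> [(100, 50, 200, 70), (220, 50, 320, 70)]
```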
For the second problem, the cleaning process is a bit more complicated and involves some manual labelling. For each image, I assume that all words on one line belong to the same field. If a line contains one of the extracted information fields as a substring, or vice versa, I programmatically label that entire line with that field's name. This assumption is far more problematic than the previous one and resulted in a lot of mislabelled data. I recorded all the mismatched labels in a text file and fixed them manually in the label file.
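The substring-matching step could be sketched roughly as follows (the names are illustrative; the real logic, including the mismatch logging, lives in preprocess_receipt.py):

```python
def label_lines(lines, extracted):
    """Label every word on a line with a field name when the line text and
    the field value contain each other as substrings; otherwise "Other".

    lines: list of (line_text, words) pairs from the line-level OCR
    extracted: e.g. {"company": "STARBUCKS COFFEE", "total": "12.80", ...}
    """
    labelled = []
    for line_text, words in lines:
        label = "Other"
        for field, value in extracted.items():
            if value in line_text or line_text in value:
                label = field
                break
        labelled.extend((word, label) for word in words)
    return labelled
```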
The code to clean/preprocess the data as outlined above is in `preprocess_receipt.py`. Before you can run the script, you will need to download the data into `/data` and split it up into 3 folders, namely `images/`, `annotations/`, and `extract/`, which contain the images, their OCR results, and their extracted information respectively. You will also need to change the extracted information files' extension from `.txt` to `.json`; this can be done programmatically with Python, since renaming a file to the appropriate extension does not corrupt its contents.
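For instance, a minimal renaming sketch, assuming the folder layout described above:

```python
from pathlib import Path

# Rename every extracted-information file from .txt to .json.
# Only the extension changes; the file contents are untouched.
for path in Path("data/extract").glob("*.txt"):
    path.rename(path.with_suffix(".json"))
```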
I have done the honors of cleaning the first 22,000 words and putting them into formatted `.txt` files. You can use these files to train and test the model yourself by following the instructions for the Sequence Labelling task mentioned as one of the examples in LayoutLM's own repository.
We will follow the same training procedure outlined in the LayoutLM Sequence Labelling example, using the `.txt` files we have created. These files can be found under `/data` in this repository.
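For reference, the training invocation would look roughly like the following. The script path and flag names are taken from LayoutLM's seq_labeling example at the time of writing; treat them as an assumption and check that example's README for the authoritative command.

python unilm/layoutlm/examples/seq_labeling/run_seq_labeling.py --data_dir data --model_type layoutlm --model_name_or_path path/to/layoutlm-base-uncased --do_train --max_seq_length 512 --labels data/labels.txt --output_dir model/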
With relatively little training data (~20,000 words and bounding boxes), the model is able to achieve ~91.5% token-level accuracy on a ~2,000-word test set. We can then take the classification results and sequence the words of the same class back together to get the full extracted information for each receipt.
The trained model can be found here.
This repository also contains a simple Flask application to demonstrate the power of this model. The application takes the words and their bounding boxes from a collection of receipt images and puts the extracted information (from our model) into a Google Sheet. In a real-world scenario, this sheet can be processed further, eliminating the need for manual data entry.
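A rough sketch of the Sheets write using the official Python client is below. The spreadsheet ID and the `token.json` file (the authorized-user file produced after the first-run authorization) are placeholders, and the app's actual helper functions, described later in this README, may use a different flow.

```python
from google.oauth2.credentials import Credentials
from googleapiclient.discovery import build

# Append one row per receipt, with the fields extracted by the model.
creds = Credentials.from_authorized_user_file(
    "token.json", scopes=["https://www.googleapis.com/auth/spreadsheets"])
service = build("sheets", "v4", credentials=creds)
rows = [["STARBUCKS COFFEE", "123 MAIN ST", "01/02/2020", "12.80"]]
service.spreadsheets().values().append(
    spreadsheetId="YOUR_SPREADSHEET_ID",
    range="Sheet1!A1",
    valueInputOption="RAW",
    body={"values": rows},
).execute()
```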
- Clone this repository
git clone https://github.com/thongn98/receipt-information-extraction.git
- Make a virtual environment and activate it
You will need to navigate to where you want to put the environment files.
python3 -m venv [ENV-NAME]
source [ENV-NAME]/bin/activate
- Install requirements
Navigate back to this repository folder
pip install -r requirements.txt
- Install LayoutLM
Clone the LayoutLM repository
git clone https://github.com/microsoft/unilm.git
Then install it
pip install unilm/layoutlm
- Download the `credentials.json` file for the Google Sheets API
You can find the instructions to do so here. You'll need to move this credentials file into this repository's root folder.
- Create a folder for the model inside the repository root folder
mkdir model/
- Download the model from the link above and put all of its files into `model/`
- Run flask app
python view.py
When you first run this application, you will need to authorize it to use the Google Sheets API with your Google account.
This will open port 5000 on your local machine. You can try the application out by going to http://127.0.0.1:5000/ in your browser. You can use `/data/test_image.txt` as the input with a sheet name of your choosing. The app will process the `.txt` file and give back the Google Sheet URL with the extracted information. The UI of this application looks like this
This file is responsible for preprocessing the input data
This file is responsible for predicting the labels for the input data
This file is responsible for defining all miscellaneous helper functions
This file is responsible for defining all the helper functions related to Google Sheet API.
This file is responsible for defining all the flask application views
This file is responsible for preprocessing the original data into `.txt` files for the model to train/predict on
This folder contains all the assets for the flask application