Thai National Document Optical Character Recognition (THND OCR)

Tesseract OCR tools for reading Thai national documents, trained and fine-tuned on the TH Sarabun national font. Read this README for details of my process.

0. Information

0.1 Tool

0.2 Datasets

0.3 Project information

  • Ai Builders 2021
  • Kampanart Chaimooltan

01.Performance tested

I measured performance using the character error rate (CER) and the string lengths of the OCR output and the correct text, and wrote the test results to a .csv file.
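The metric above can be sketched as follows. This is a minimal illustration, assuming CER is the Levenshtein edit distance divided by the length of the correct text; the function names and the CSV column names are hypothetical and are not taken from the project's notebooks.

```python
import csv


def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, counted per code point."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(deletion, insertion, substitution))
        previous = current
    return previous[-1]


def character_error_rate(reference, hypothesis):
    """Edit distance normalised by the length of the correct text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def write_report(pairs, path):
    """Write one row per (correct text, OCR text) pair to a CSV file.

    The column names are an assumption made for this sketch.
    """
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(
            ["correct_text", "ocr_text", "correct_length", "ocr_length", "cer"]
        )
        for reference, hypothesis in pairs:
            writer.writerow([
                reference,
                hypothesis,
                len(reference),
                len(hypothesis),
                character_error_rate(reference, hypothesis),
            ])
```

Note that Python counts Thai vowel and tone marks as separate characters, so a misplaced mark counts as one error in this sketch.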

02.Generated datasets

I used the PIL library to create the datasets, rendering text in the TH Sarabun font at 72 px.

Link : https://www.kaggle.com/copninich/thaienglish-character-in-th-sarabun-font

03.Tested trained and fine-tuned Tesseract (default langdata_lstm)

Requirements

  1. langdata_lstm
  2. tesseract v.4
  3. tessdata_best

Download the files to your folder and extract them: https://drive.google.com/drive/folders/1ABo7ooO62Tb03RR_VvkdshRVG9vz23sl?usp=sharing

04.Trained and fine-tuned

Run the notebook script_basic.ipynb or script_config_error.ipynb.

Requirements

  1. langdata_lstm
  2. tesseract v.4
  3. tessdata_best

I customized tha.training_text with my own dataset of more than 1.9 million sentences.

05.Performance tested

report_performace_final.csv