Optical Character Recognition Post Processing

IITB-intern

Designed an algorithm for correcting words in post-OCR Sanskrit text through statistical language models, using edit distance against a known dictionary. Improved the accuracy of the algorithm by 10% by introducing morphological constraints from the grammar, using Sandhi rules. Explored machine learning and deep learning approaches, such as LSTMs and attention-based RNNs, to learn better features for error correction in Sanskrit OCR documents. Simulated the Subanta Prakaraṇam of the Aṣṭādhyāyī to synthesize an off-the-shelf dictionary. Adapted various auxiliary sources with plug-in classifiers and achieved the best F-score of 93.72 with the LSTM-based approach. Our paper was presented at the 17th World Sanskrit Conference (WSC), Vancouver, in July 2018.
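The dictionary-based correction step can be illustrated with a minimal sketch. This is not the project's implementation: the function names, the toy romanized lexicon, the unigram counts standing in for the statistical language model, and the `max_distance` threshold are all assumptions made for illustration. It also compares Unicode code points, so Devanagari input would likely need normalization or grapheme-level handling first.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance with unit-cost insert, delete, and substitute."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def correct_word(word: str, lexicon: Dict[str, int], max_distance: int = 2) -> str:
    """Return the closest lexicon entry, or the word itself if none is close.

    Ties on distance are broken by the higher (hypothetical) unigram count,
    then alphabetically so the result is deterministic.
    """
    if not lexicon or word in lexicon:
        return word
    best = min(lexicon,
               key=lambda w: (edit_distance(word, w), -lexicon[w], w))
    return best if edit_distance(word, best) <= max_distance else word


# Toy lexicon: romanized forms with made-up frequencies (assumption).
LEXICON = {"ramah": 50, "ramam": 30, "vanam": 20, "gacchati": 40}

if __name__ == "__main__":
    for token in ["ramab", "vanam", "xyzxyz"]:
        print(token, "->", correct_word(token, LEXICON))
```

In this sketch, a token already in the lexicon is left alone, and a token with no entry within `max_distance` edits is returned unchanged rather than forced to a poor match. Grammar-derived constraints such as Sandhi rules would plausibly enter as an extra filter or score on the candidate set.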
