Optical Character Recognition Post Processing

IITB-intern

Designed an algorithm for correcting words in post-OCR Sanskrit text, using statistical language models and edit distance against a known dictionary. Improved the algorithm's accuracy by 10% by introducing morphological constraints from the grammar, based on Sandhi. Explored machine learning and deep learning approaches, such as LSTM networks and attention-based RNNs, to improve the features used for error correction in Sanskrit OCR documents. Simulated the Subanta Prakaraṇam of the Aṣṭādhyāyī to synthesize an off-the-shelf dictionary. Adapted various auxiliary sources with plug-in classifiers and achieved a best F-score of 93.72 with the LSTM-based approach. Our paper was presented at the 17th World Sanskrit Conference (WSC), Vancouver, in July 2018.
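
The dictionary-based correction step can be illustrated with a minimal Python sketch. It is not the project's actual code: the function names `edit_distance` and `correct_word`, the `max_dist` threshold, the frequency tie-break, and the romanized toy dictionary are all assumptions made for illustration. It picks the dictionary word closest to an OCR token by Levenshtein distance and prefers the more frequent word when distances tie, which stands in for a simple unigram language model.

```python
from typing import Dict


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a two-row dynamic programme."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute ca with cb (or match)
            ))
        prev = curr
    return prev[-1]


def correct_word(word: str, freq: Dict[str, int], max_dist: int = 2) -> str:
    """Return the closest dictionary word, or the input if none is near enough.

    `freq` maps dictionary words to corpus counts (a hypothetical unigram
    model). `max_dist` is an assumed cut-off, not a value from the project.
    """
    if word in freq:
        return word
    best, best_key = word, None
    for cand, count in freq.items():
        d = edit_distance(word, cand)
        if d > max_dist:
            continue
        key = (d, -count)  # smaller distance first, then higher frequency
        if best_key is None or key < best_key:
            best, best_key = cand, key
    return best


if __name__ == "__main__":
    # Toy romanized entries with made-up counts, for illustration only.
    toy_freq = {"ramah": 50, "ramam": 30, "vanam": 20}
    print(correct_word("ramnh", toy_freq))
```

The sketch compares strings code point by code point. Devanagari input would likely need Unicode normalization and some handling of combining marks before the distances are meaningful, and the Sandhi constraints and LSTM models described above are not covered here.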