Optical Character Recognition Post Processing

IITB-intern

Designed an algorithm that corrects words in the OCR output of Sanskrit texts, using statistical language models and edit distance against a known dictionary. Improved the algorithm's accuracy by 10% by introducing morphological constraints from the grammar, using Sandhi. Explored machine learning and deep learning approaches, such as LSTMs and RNN-based attention models, to improve the features used for error correction in Sanskrit OCR documents. Simulated the Subanta Prakaraṇam of the Aṣṭādhyāyī to synthesize an off-the-shelf dictionary. Adapted various auxiliary sources with plug-in classifiers and achieved the best F-score, 93.72, with LSTM networks. Our paper was presented at the 17th World Sanskrit Conference (WSC), Vancouver, in July 2018.
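To make the dictionary-based correction step concrete, here is a minimal sketch of edit-distance lookup against a word list. It is not the code in OCR.cpp or editDistance.h: the function names `editDistance` and `correctWord`, the `maxDistance` cutoff, and the sample words are all hypothetical. It also assumes words are stored in a one-byte-per-character transliteration (for example SLP1), so that comparing bytes is the same as comparing characters; raw UTF-8 Devanagari would need code-point handling instead.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Levenshtein distance between two words, computed with a single rolling row.
// Assumption: one byte per character (e.g. an ASCII transliteration).
std::size_t editDistance(const std::string& a, const std::string& b) {
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) row[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];  // holds D[i-1][j-1]
        row[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t up = row[j];  // holds D[i-1][j]
            std::size_t cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            row[j] = std::min({row[j - 1] + 1,  // insertion
                               up + 1,          // deletion
                               diag + cost});   // substitution or match
            diag = up;
        }
    }
    return row[b.size()];
}

// Hypothetical helper: return the dictionary word closest to `word`,
// or `word` itself if no candidate is within `maxDistance` edits.
// Ties are resolved in favour of the earlier dictionary entry.
std::string correctWord(const std::string& word,
                        const std::vector<std::string>& dictionary,
                        std::size_t maxDistance) {
    std::string best = word;
    std::size_t bestDist = maxDistance + 1;
    for (const std::string& candidate : dictionary) {
        std::size_t d = editDistance(word, candidate);
        if (d < bestDist) {
            bestDist = d;
            best = candidate;
        }
    }
    return best;
}
```

For example, `correctWord("rAmaX", {"rAmaH", "devaH"}, 2)` returns `"rAmaH"`, since only one substitution separates the two. A uniform cost of 1 per edit is the simplest choice; an OCR-specific variant might instead weight substitutions by how often two characters are confused, which this sketch does not attempt.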

Folders and files (60 commits):

Data
fstRelated
sandhiRelated
OCR.cpp
README.md
confusions.h
editDistance.h
newConfusions.h

avaibh/Post-OCR-Processing
