[C-API] TessBaseAPIRecognize() segfaults when no language data file is available #779
Comments
@jflesch: The problem is that you are not using the C API as designed ;-): you forgot to check the return value. So instead of
you use something like this:
|
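A minimal sketch of that kind of return-value check, written in Python with ctypes since that is how pyocr talks to libtesseract. `TessBaseAPICreate`, `TessBaseAPIInit3` and `TessBaseAPIDelete` are C API functions; the helper names `load_libtesseract`, `create_api` and `TesseractInitError` are hypothetical and only serve the illustration. It assumes `TessBaseAPIInit3()` returns 0 on success and a non-zero value when the language data cannot be loaded.

```python
import ctypes
import ctypes.util


class TesseractInitError(RuntimeError):
    """Raised when libtesseract fails to initialise (hypothetical helper type)."""


def load_libtesseract():
    """Load libtesseract and declare the signatures of the calls used below."""
    name = ctypes.util.find_library("tesseract")
    if name is None:
        raise OSError("libtesseract not found")
    lib = ctypes.CDLL(name)
    lib.TessBaseAPICreate.argtypes = []
    lib.TessBaseAPICreate.restype = ctypes.c_void_p
    lib.TessBaseAPIInit3.argtypes = [
        ctypes.c_void_p, ctypes.c_char_p, ctypes.c_char_p,
    ]
    lib.TessBaseAPIInit3.restype = ctypes.c_int
    lib.TessBaseAPIDelete.argtypes = [ctypes.c_void_p]
    lib.TessBaseAPIDelete.restype = None
    return lib


def create_api(lib, datapath=None, lang="eng"):
    """Create and initialise a handle, refusing to continue if init fails."""
    handle = lib.TessBaseAPICreate()
    path = datapath.encode() if datapath is not None else None
    # Assumption: 0 means success, anything else means the data was not loaded.
    if lib.TessBaseAPIInit3(handle, path, lang.encode()) != 0:
        # Release the handle and stop here, so that no later call such as
        # TessBaseAPIRecognize() runs on an uninitialised engine.
        lib.TessBaseAPIDelete(handle)
        raise TesseractInitError(
            "could not initialise Tesseract for language %r" % lang
        )
    return handle
```

A caller would then do `handle = create_api(load_libtesseract())` and treat `TesseractInitError` as "no usable language data" rather than going on to recognition.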
Oops :-/ |
No problem. While testing your issue (thanks for the test case) I found and fixed another problem ;-) BTW: maybe it would be good if you joined efforts with @jbarlow83: he is wrapping Leptonica in Python in his project OCRmyPDF. Leptonica provides several great functions for OCR preprocessing (deskew, dewarp, background removal). |
Interesting. I did write my own image manipulation library. It contains reimplementations of unpaper's algorithms. The idea at the time was mainly to reduce dependency on stuff like OpenCV as much as possible because they made installing my project a lot more complicated (it was before Flatpak). Unpaper's unskewing is the only algo I haven't re-implemented yet. I'll have a look later at what leptonica and OCRmyPDF provide exactly. |
I spoke with a developer (not a Python one) and he was quite surprised that Tesseract users are using or looking for other libraries for OCR image preprocessing when everything really needed is in Leptonica ;-) |
Actually, to be honest, in Paperwork, I don't really worry much about accuracy. OCR is used for indexing documents only. Fuzzy searching takes care of not-entirely-accurate OCR results. So at the moment I don't even bother pre-processing images at all before passing them to Tesseract. But I'm wondering: is there an official recommendation for a set of pre-processing algorithms to apply before passing an image to Tesseract? Are there some algorithms that would make images more closely match the data that was used for training Tesseract (grayscale? pure b&w? unskewing? ...)? |
It depends on which OCR engine in 4.0.0 you want to use (LSTM or legacy).
Some recommendations are on the ImproveQuality page; I am not sure how many of them are a must for 4.0. Deskew, dewarp, noise removal and cropping should also be generally applicable to your project. |
Thanks :) |
@jflesch There's some overlap in what we're doing, but it's not duplication either. ocrmypdf focuses on doing PDF to OCR PDF. It necessarily includes a tesseract executable wrapper similar to pyocr, with some differences in approach (not that there is much going on here). I've never done much with libtesseract.
Leptonica is a large library and, as far as I know, its document analysis and cleanup filters are a fair bit more sophisticated than what unpaper has. I can't say I've studied it in great detail, but unpaper strikes me as generally following a homebrewed approach: it relies on a lot of assumptions ("there are either one or two columns of text") and basic methods like counting the number of black/white pixels in an area, and thresholds. Leptonica has features to do things like generate masks that cover all of the text and image regions on a page, which is done with morphology. Its author, Dan Bloomberg, has an academic background in the topic. (ocrmypdf uses unpaper too, but mainly for historical reasons: the original author added it and I have maintained it, but I've never really given it a hard look.)
My wrapper doesn't cover all of Leptonica, but it's fairly easy to pull in more functions. I'd certainly consider spinning it off to a new project.
(There is also a "pyleptonica" wrapper that is derived from parsing the Leptonica source, but it has not been updated for several years and is stuck in Python 2. In a way this is the right thing to do to wrap such a large library. However, pyleptonica uses its own custom C parser (!), and outputs a massive 2.6 MB Python script for its wrappers. I decided this was madness, especially the huge pure Python script. Anyway, it's there, and forking it and taming it might be an approach to consider.) |
Hello,
When working on openpaperwork/pyocr#51, someone reported the following issue to me:
When no language data file is available at all (not even English), TessBaseAPIRecognize() segfaults.
It also happens when TESSDATA_PREFIX is set to an invalid directory.
Tested with libtesseract 3.04.01 (Debian Sid).
Example code:
Stacktrace:
PS: This is a very low-priority issue for me. I can work around it really easily on the Python side.
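One possible shape for such a Python-side workaround is to look for the language data file before calling into libtesseract at all. This is only a sketch under stated assumptions: the function name `find_traineddata` is hypothetical, and it assumes the data file is named `<lang>.traineddata` and sits either directly in the `TESSDATA_PREFIX` directory or in a `tessdata` subdirectory of it. It does not cover the default data path compiled into libtesseract, which is used when `TESSDATA_PREFIX` is unset.

```python
import os


def find_traineddata(lang="eng", prefix=None):
    """Return the path of <lang>.traineddata, or None if it cannot be found.

    Hypothetical helper: only TESSDATA_PREFIX (or an explicit prefix) is
    searched, not the default path built into libtesseract.
    """
    if prefix is None:
        prefix = os.environ.get("TESSDATA_PREFIX")
    if not prefix:
        return None
    filename = lang + ".traineddata"
    # Assumption: the prefix is either the parent of "tessdata" or the
    # data directory itself, so both layouts are checked.
    candidates = [
        os.path.join(prefix, "tessdata", filename),
        os.path.join(prefix, filename),
    ]
    for candidate in candidates:
        if os.path.isfile(candidate):
            return candidate
    return None
```

A wrapper could call `find_traineddata(lang)` first and raise a regular Python exception when it returns `None`, instead of letting the recognition call run without data.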