Line segmentation in tables #190
Please don't use this – the old Ocropus nlbin is wrapped much better by
Please don't use this – it is buggy and the Ocropus projection-profile based deskewing is wrapped much better by
Please consider making this your first workflow step. For background on that choice, see OCR-D/ocrd_tesserocr#144. Your images look quite clean (except for the show-through), so I would expect Tesseract page segmentation to cope better on raw RGB (doing its own implicit binarization). You can add binarization after segmentation in OCR-D.
Please consider using a line segmenter producing dense polygon outlines (like
If you don't expect actual tables in your document, only multi-column layouts, you should use
I personally would not see it that way, but let's assume this is correct:
Yes, that's because for table regions you need another, intermediate workflow step between page and text region: a table segmentation (producing a text region for each cell, which can then be line-segmented). See OCR-D/ocrd_tesserocr#134 for how to do this. (Also, you might want to upvote OCR-D/spec#150 for a better specification and documentation in that respect.)
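To see whether a page is in that state, it can help to look at the PAGE-XML directly. The following is only a rough sketch, assuming the usual PAGE nesting where cells are `TextRegion` children of a `TableRegion` and lines are `TextLine` children of a `TextRegion`. The function name `table_cells_without_lines` and the sample document are hypothetical (the sample omits `Coords` and is not schema-valid); element names are matched by local name, so the PAGE namespace version does not matter.

```python
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def table_cells_without_lines(page_xml):
    """Return (table id, number of cells, ids of cells without TextLine) per table."""
    root = ET.fromstring(page_xml)
    report = []
    for table in root.iter():
        if local(table.tag) != "TableRegion":
            continue
        # Cells are the TextRegion elements directly below the table.
        cells = [c for c in table if local(c.tag) == "TextRegion"]
        # A cell counts as unsegmented if it has no TextLine child.
        unsegmented = [
            c.get("id")
            for c in cells
            if not any(local(child.tag) == "TextLine" for child in c)
        ]
        report.append((table.get("id"), len(cells), unsegmented))
    return report


# Hypothetical, simplified sample: one table with two cells, one empty table.
SAMPLE = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2019-07-15">
  <Page imageFilename="page.tif" imageWidth="100" imageHeight="100">
    <TableRegion id="t1">
      <TextRegion id="t1_c1"><TextLine id="t1_c1_l1"/></TextRegion>
      <TextRegion id="t1_c2"/>
    </TableRegion>
    <TableRegion id="t2"/>
  </Page>
</PcGts>"""

if __name__ == "__main__":
    for table_id, n_cells, missing in table_cells_without_lines(SAMPLE):
        print(table_id, n_cells, missing)
```

A table reported with zero cells (like `t2` in the sample) has not been through table segmentation yet, so a line segmenter has no text regions to work on there. A table with cells but missing lines points at the line segmentation step instead.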
Thank you for your recommendations! Running We tried a workflow with exactly
If you mean the workflow guide on ocr-d.de, there's a repo for that, but I have limited influence. Documentation seems to be a slow process. Perhaps we should start a wiki topic on that repo? I guess table processing first needs to be fully acknowledged by the spec.
Yes, Tesseract is not particularly good at recognizing column layouts in historical prints. Try
For one of our examples, http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24, we are struggling with line segmentation.
Of course, a three-column layout with ruled lines between the columns doesn't make things easy.
If we run e.g. the following workflow
we get a table for the main content. We find this reasonable. However, line segmentation is not performed for any of the table cells. This is independent of the line-segmentation processor, i.e. it happens with cis-ocropy-segment, too. Is this expected behaviour?