Line segmentation in tables #190

Closed
beckstefan opened this issue Sep 11, 2020 · 3 comments

@beckstefan

For one of our examples, http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24 , we are struggling with line segmentation.

Of course, the three-column layout with separator lines between the columns doesn't make things easy.

If we run e.g. the following workflow

'anybaseocr-binarize -I OCR-D-IMG -O OCR-D-BIN' \
'anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP' \
'anybaseocr-deskew -I OCR-D-CROP -O OCR-D-DESKEW' \
'tesserocr-segment-region -I OCR-D-DESKEW -O OCR-D-PAGE-SEG' \
'segment-repair -I OCR-D-PAGE-SEG -O OCR-D-PAGE-SEG-REPAIR -P plausibilize true' \
'tesserocr-deskew -I OCR-D-PAGE-SEG-REPAIR -O OCR-D-REG-DESKEW' \
'tesserocr-segment-line -I OCR-D-REG-DESKEW -O OCR-D-LINE-SEG' \
'tesserocr-recognize -I OCR-D-LINE-SEG -O OCR-D-OCR -P model Fraktur'

we get a table for the main content. We find this reasonable. However, line segmentation is not performed for any of the table cells. This is independent of the line-segmentation processor, i.e. it happens with cis-ocropy-segment, too.

Is this expected behaviour?

@bertsky
Collaborator

bertsky commented Sep 11, 2020

> anybaseocr-binarize

Please don't use this – the old Ocropus nlbin is wrapped much better by ocrd-cis-ocropy-binarize. Also, ocrd-olena-binarize is usually much better.

> anybaseocr-deskew

Please don't use this – it is buggy and the Ocropus projection-profile based deskewing is wrapped much better by ocrd-cis-ocropy-deskew.

> tesserocr-segment-region

Please consider making this your first workflow step. For background on that choice, see OCR-D/ocrd_tesserocr#144. Your images look quite clean (except for the show-through), so I would expect Tesseract page segmentation to cope better on raw RGB (doing its own implicit binarization). You can add binarization after segmentation in OCR-D.

> tesserocr-segment-line

Please consider using a line segmenter producing dense polygon outlines (like ocrd-cis-ocropy-segment on the region level) instead of coarse bounding boxes. With tight line spacing in blackletter fonts, bboxes necessarily have lots of overlap between neighbours.
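The recommendations above could be combined into a workflow roughly like the following. This is only a sketch, not something tested on this document: the file group names (OCR-D-SEG-REGION and so on), the choice of ocrd-olena-binarize over ocrd-cis-ocropy-binarize, and the order of binarization and deskewing after region segmentation are all assumptions. The `-P level-of-operation region` setting is how the "region level" is assumed to be selected; check it against `ocrd-cis-ocropy-segment --help`. The function only prints one task per line, in the same syntax as the workflow quoted in the question.

```shell
#!/bin/sh
# Sketch of a workflow following the advice above.
# Assumptions: all OCR-D-* file group names except OCR-D-IMG and OCR-D-OCR,
# the step order after region segmentation, and the level-of-operation parameter.
recommended_tasks() {
  cat <<'EOF'
tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-REGION
olena-binarize -I OCR-D-SEG-REGION -O OCR-D-BIN
cis-ocropy-deskew -I OCR-D-BIN -O OCR-D-DESKEW
cis-ocropy-segment -I OCR-D-DESKEW -O OCR-D-SEG-LINE -P level-of-operation region
tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR -P model Fraktur
EOF
}

# Print the task list for review; nothing is executed here.
recommended_tasks
```

Each printed line is meant to be passed as one quoted task argument to `ocrd process`, run inside a workspace that contains the OCR-D-IMG file group.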

> we get a table for the main content.

If you don't expect actual tables in your document, only multi-column layouts, you should use ocrd-tesserocr-segment-region -P find_tables false.

> We find this reasonable.

I personally would not see it that way, but let's assume this is correct:

> However, line segmentation is not performed for any of the table cells. This is independent of the line-segmentation processor, i.e. it happens with cis-ocropy-segment, too.

Yes, that's because for table regions you need another, intermediate workflow step between page and text region: a table segmentation (producing a text region for each cell, which can then be line-segmented). See OCR-D/ocrd_tesserocr#134 for how to do this.

(Also, you might want to upvote OCR-D/spec#150 for a better specification and documentation in that respect.)
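As a rough illustration of the intermediate step described above, a table segmentation task can be placed between region and line segmentation. This is a minimal sketch: the preprocessing steps are left out, and the intermediate file group names (OCR-D-SEG-REGION, OCR-D-SEG-TABLE, OCR-D-SEG-LINE) are made up for the example. See OCR-D/ocrd_tesserocr#134 for the actual discussion.

```shell
#!/bin/sh
# Sketch: region -> table -> line -> recognition.
# Assumption: intermediate file group names are illustrative only.
table_tasks() {
  cat <<'EOF'
tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-REGION
tesserocr-segment-table -I OCR-D-SEG-REGION -O OCR-D-SEG-TABLE
tesserocr-segment-line -I OCR-D-SEG-TABLE -O OCR-D-SEG-LINE
tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR -P model Fraktur
EOF
}

# Print the task list for review; nothing is executed here.
table_tasks
```

The table step is expected to add a text region for each cell, so that the following line segmenter has text regions to work on inside the table.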

@beckstefan
Author

Thank you for your recommendations!

Running tesserocr-segment-table yielded a line segmentation inside the table cells. Probably you should mention this processor in the workflow section.

We also tried a workflow with exactly ocrd-tesserocr-segment-region -P find_tables false. However, this produced paragraphs running across all three columns. Working on that is a separate matter, for which we will need to try some more workflows.

@bertsky
Collaborator

bertsky commented Sep 15, 2020

> Probably you should mention this processor in the workflow section.

If you mean the workflow guide on ocr-d.de, there's a repo for that, but I have limited influence. Documentation seems to be a slow process. Perhaps we should start a wiki topic on that repo?

I guess table processing first needs to be fully acknowledged by the spec.

> We also tried a workflow with exactly ocrd-tesserocr-segment-region -P find_tables false. However, this produced paragraphs running across all three columns. Working on that is a separate matter, for which we will need to try some more workflows.

Yes, Tesseract is not particularly good at recognizing column layouts in historical prints. Try ocrd-cis-ocropy-segment or one of the NN page segmenters. Also, see this fantastic overview of currently wrapped page segmenters by @jbarth-ubhd ...
