Line segmentation in tables #190

beckstefan · 2020-09-11T10:42:37Z

For one of our examples, http://tudigit.ulb.tu-darmstadt.de/show/Gue-11660-24, we are struggling with line segmentation.

Of course, a three-column layout with the columns separated by ruled lines doesn't make things easy.

If we run, e.g., the following workflow:

'anybaseocr-binarize -I OCR-D-IMG -O OCR-D-BIN' \
'anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP' \
'anybaseocr-deskew -I OCR-D-CROP -O OCR-D-DESKEW' \
'tesserocr-segment-region -I OCR-D-DESKEW -O OCR-D-PAGE-SEG' \
'segment-repair -I OCR-D-PAGE-SEG -O OCR-D-PAGE-SEG-REPAIR -P plausibilize true' \
'tesserocr-deskew -I OCR-D-PAGE-SEG-REPAIR -O OCR-D-REG-DESKEW' \
'tesserocr-segment-line -I OCR-D-REG-DESKEW -O OCR-D-LINE-SEG' \
'tesserocr-recognize -I OCR-D-LINE-SEG -O OCR-D-OCR -P model Fraktur'

we get a table for the main content. We find this reasonable. However, line segmentation is not performed for any of the table cells. This is independent of the line segmentation processor, i.e. it also happens with cis-ocropy-segment.

Is this expected behaviour?

bertsky · 2020-09-11T21:20:40Z

> anybaseocr-binarize

Please don't use this – the old Ocropus nlbin is wrapped much better by ocrd-cis-ocropy-binarize. Also, ocrd-olena-binarize is usually much better.

> anybaseocr-deskew

Please don't use this – it is buggy and the Ocropus projection-profile based deskewing is wrapped much better by ocrd-cis-ocropy-deskew.

> tesserocr-segment-region

Please consider making this your first workflow step. For background on that choice, see OCR-D/ocrd_tesserocr#144. Your images look quite clean (except for the show-through), so I would expect Tesseract page segmentation to cope better on raw RGB (doing its own implicit binarization). You can add binarization after segmentation in OCR-D.
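As a rough sketch of that ordering, the steps could look like the following. The output file group names (OCR-D-SEG-REGION, OCR-D-BIN) are placeholders chosen for this example, and the snippet only prints the steps instead of running them:

```shell
# Segment first on the raw image, binarize afterwards.
# The output file group names are placeholders, not taken from the original workflow.
set -- \
  'tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-REGION' \
  'cis-ocropy-binarize -I OCR-D-SEG-REGION -O OCR-D-BIN'

# Print the steps for review. To execute them inside an OCR-D workspace,
# pass the same arguments on: ocrd process "$@"
printf '%s\n' "$@"
```

The remaining steps (deskewing, line segmentation, recognition) would follow as before, reading from the binarized file group.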

> tesserocr-segment-line

Please consider using a line segmenter producing dense polygon outlines (like ocrd-cis-ocropy-segment on the region level) instead of coarse bounding boxes. With tight line spacing in blackletter fonts, bboxes necessarily have lots of overlap between neighbours.
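In the workflow from the original post, that would mean swapping the tesserocr-segment-line step for something like the step below. This is only a sketch: the file group names are reused from the original post, and the parameter name level-of-operation is an assumption that should be checked against the output of ocrd-cis-ocropy-segment --help.

```shell
# Replacement for the tesserocr-segment-line step (sketch).
# Assumption: the parameter is called level-of-operation and accepts "region".
set -- \
  'cis-ocropy-segment -I OCR-D-REG-DESKEW -O OCR-D-LINE-SEG -P level-of-operation region'

# Print the step for review; it would take the place of the
# tesserocr-segment-line entry in the ocrd process call.
printf '%s\n' "$@"
```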

> we get a table for the main content.

If you don't expect actual tables in your document, only multi-column layouts, you should use ocrd-tesserocr-segment-region -P find_tables false.

> We find this reasonable.

I personally would not see it that way, but let's assume this is correct:

> However, line segmentation is not performed for any of the table cells. This is independent of the line segmentation processor, i.e. it also happens with cis-ocropy-segment.

Yes, that's because for table regions you need another, intermediate workflow step between page and text region: a table segmentation (producing a text region for each cell, which can then be line-segmented). See OCR-D/ocrd_tesserocr#134 for how to do this.
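A minimal sketch of where that intermediate step would sit, assuming ocrd-tesserocr-segment-table as the table segmenter. The file group names are placeholders, and this is not the exact recipe from the linked issue:

```shell
# Page -> regions (incl. tables) -> table cells as text regions -> lines -> text.
# File group names are placeholders chosen for this example.
set -- \
  'tesserocr-segment-region -I OCR-D-IMG -O OCR-D-SEG-REGION' \
  'tesserocr-segment-table -I OCR-D-SEG-REGION -O OCR-D-SEG-TABLE' \
  'tesserocr-segment-line -I OCR-D-SEG-TABLE -O OCR-D-SEG-LINE' \
  'tesserocr-recognize -I OCR-D-SEG-LINE -O OCR-D-OCR -P model Fraktur'

# Print the steps for review. To execute them inside an OCR-D workspace,
# pass the same arguments on: ocrd process "$@"
printf '%s\n' "$@"
```

The point of the ordering is that the line segmenter only ever sees text regions, so the cells have to exist as text regions before it runs.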

(Also, you might want to upvote OCR-D/spec#150 for a better specification and documentation in that respect.)

beckstefan · 2020-09-15T15:01:06Z

Thank you for your recommendations!

Running tesserocr-segment-table yielded a line segmentation inside the table cells. Probably you should mention this processor in the workflow section.

We also tried a workflow with exactly ocrd-tesserocr-segment-region -P find_tables false. However, this produced paragraphs running across all three columns. Fixing that is a separate matter, for which we will need to try some more workflows.

bertsky · 2020-09-15T15:33:15Z

> Probably you should mention this processor in the workflow section.

If you mean the workflow guide on ocr-d.de, there's a repo for that, but I have limited influence. Documentation seems to be a slow process. Perhaps we should start a wiki topic on that repo?

I guess table processing first needs to be fully acknowledged by the spec.

> We also tried a workflow with exactly ocrd-tesserocr-segment-region -P find_tables false. However, this produced paragraphs running across all three columns. Fixing that is a separate matter, for which we will need to try some more workflows.

Yes, Tesseract is not particularly good at recognizing column layouts in historical prints. Try ocrd-cis-ocropy-segment or one of the NN page segmenters. Also, see this fantastic overview of currently wrapped page segmenters by @jbarth-ubhd ...

beckstefan closed this as completed Sep 15, 2020
