"Tesseract --psm Impact on Scanned Books with Two Facing Pages" / update Documentation #4389

Closed
dooha89 opened this issue Feb 21, 2025 · 1 comment

dooha89 commented Feb 21, 2025

Current Behavior

Understanding Tesseract OCR and --psm: Why Removing It Can Improve Accuracy for Scanned Books

Introduction

Tesseract OCR is a powerful tool for extracting text from images, but selecting the right parameters is critical for accuracy. One commonly misunderstood parameter is --psm (Page Segmentation Mode). In this guide, we'll discuss a real-world issue where --psm caused incorrect text extraction for scanned books and how removing it led to better results.

The Problem: Mixed Text from Two Facing Pages

Many scanned books and documents contain two facing pages in a single image. When such an image is processed with an explicit --psm value, Tesseract sometimes misinterprets the structure and jumbles the extracted text. Instead of reading one page at a time, it mixes text from both pages, producing:

Line 1 from right page + Line 1 from left page
Line 2 from right page + Line 2 from left page
...

This happens because an explicit --psm value forces Tesseract to assume a specific layout, which can conflict with the actual structure of the scanned document.

The Solution: Removing --psm

After --psm was removed, Tesseract processed the right page in order first, then moved on to the left page. This produced a natural reading order and a significantly better OCR result:

Line 1 from right page
Line 2 from right page
...
(Line 1 from left page follows after the right page is complete)

This confirms that in some cases, manually setting --psm can do more harm than good.
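The two orderings described above can be illustrated with a toy model. This is only a sketch of the symptom, not of Tesseract's actual layout analysis; the function names and sample lines are hypothetical.

```python
from itertools import zip_longest


def read_across_spread(first_page, second_page):
    """Toy model: treat the spread as one block and read each row straight across."""
    rows = zip_longest(first_page, second_page, fillvalue="")
    # Join the two halves of each row, skipping the empty filler on shorter pages.
    return [" ".join(part for part in row if part) for row in rows]


def read_page_by_page(first_page, second_page):
    """Toy model: finish the first page completely before starting the second."""
    return list(first_page) + list(second_page)


# Hypothetical sample lines; the right page is read first, as in the report.
right_page = ["R1", "R2", "R3"]
left_page = ["L1", "L2"]

print(read_across_spread(right_page, left_page))  # interleaved rows
print(read_page_by_page(right_page, left_page))   # natural reading order
```

Comparing the two printed lists against real OCR output is a quick way to tell which failure mode you are seeing.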

When to Avoid --psm

  • When processing scanned books or documents with two facing pages.
  • When text is misaligned or mixed in the OCR output.
  • When dealing with complex layouts where Tesseract's automatic handling works better.

When to Use --psm

There are cases where --psm is still useful, such as:

  • A single uniform block of printed text (--psm 6)
  • Sparse text (--psm 11)
  • Images containing only a single word (--psm 8)
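The guidance in the two lists above could be captured in a small lookup, sketched here. The content labels are hypothetical names chosen for illustration; only the --psm values (6, 11, 8) come from this post.

```python
# Hypothetical content labels mapped to the --psm values named in this post.
# None means "pass no --psm flag" and let Tesseract pick its default behaviour.
PSM_BY_CONTENT = {
    "facing_pages": None,
    "text_block": 6,
    "sparse_text": 11,
    "single_word": 8,
}


def psm_flag(content_kind: str) -> str:
    """Return the --psm fragment for a content label, or '' when none is wanted."""
    psm = PSM_BY_CONTENT[content_kind]  # raises KeyError for unknown labels
    return "" if psm is None else f"--psm {psm}"


print(repr(psm_flag("facing_pages")))
print(repr(psm_flag("sparse_text")))
```

The returned fragment can be prepended to any other options in the config string passed to the OCR call.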

Recommended OCR Settings

For scanned books or multi-column text, a safer approach is:

pytesseract.image_to_string(image, config='--oem 1 -c preserve_interword_spaces=1')

This avoids forcing a layout assumption while keeping Tesseract optimized for text extraction. Users can specify the language(s) as needed (e.g., -l eng, -l ara, or -l ara+eng).
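A slightly fuller sketch of the call above, assuming pytesseract and Pillow are installed and the tesseract binary is on the PATH. The helper names build_config and ocr_spread and their parameters are hypothetical; only pytesseract.image_to_string and the option strings come from the line above.

```python
def build_config(oem: int = 1, preserve_spaces: bool = True) -> str:
    """Assemble the config string without any --psm flag."""
    parts = [f"--oem {oem}"]
    if preserve_spaces:
        parts.append("-c preserve_interword_spaces=1")
    return " ".join(parts)


def ocr_spread(image_path: str, lang: str = "eng") -> str:
    """OCR one scanned image; lang may combine languages, e.g. 'ara+eng'."""
    # Imported lazily so build_config() can be used without the OCR dependencies.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang, config=build_config())


print(build_config())
```

Passing the language through the lang argument keeps the config string limited to engine options.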

Conclusion

This discovery highlights why experimentation is key when working with OCR. If your text output appears mixed or out of order, try removing --psm and letting Tesseract handle the layout automatically. Hopefully, this guide helps others facing similar issues!

Have you encountered other OCR challenges? Share your experience in the comments or discussion forums!

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found OpenMP 201511
Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7

Operating System

Windows 11, Ubuntu 24.04 Noble

Other Operating System

No response

uname -a

Linux omar 6.8.0-53-generic #55-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 17 15:37:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

No response

amitdo (Collaborator) commented Feb 22, 2025

You don't need to 'remove' it. You just don't need to add this command-line flag.

tesseract --help does not even mention it. You need to use tesseract --help-extra to see the info about psm. As part of this info, it is mentioned that 3|auto is the default value for psm. This means that not using the psm flag is equal to using --psm 3 or --psm auto.
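To make the equivalence concrete, here is a minimal sketch that builds the command line with and without the flag. The helper names and the file name scan.png are hypothetical; the default value 3 is the one stated above.

```python
DEFAULT_PSM = 3  # per `tesseract --help-extra`: 3|auto is the default


def tesseract_argv(image_path, psm=None):
    """Build a tesseract command line that writes the recognized text to stdout."""
    argv = ["tesseract", image_path, "stdout"]
    if psm is not None:
        argv += ["--psm", str(psm)]
    return argv


def effective_psm(psm=None):
    """The mode Tesseract ends up using when the flag is omitted or given."""
    return DEFAULT_PSM if psm is None else psm


print(tesseract_argv("scan.png"))              # no flag at all
print(tesseract_argv("scan.png", psm=3))       # explicit default
print(effective_psm() == effective_psm(3))     # both resolve to the same mode
```

Either argument list can be passed to subprocess.run to invoke the CLI directly, without pytesseract.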

We are not responsible for third party software documentation and support (pytesseract).

This is a bug tracker. I think your message is more suitable for our forum.

amitdo closed this as completed Feb 22, 2025