"Tesseract --psm Impact on Scanned Books with Two Facing Pages" / update Documentation #4389

dooha89 · 2025-02-21T04:34:15Z

Current Behavior

Understanding Tesseract OCR and --psm: Why Removing It Can Improve Accuracy for Scanned Books

Introduction

Tesseract OCR is a powerful tool for extracting text from images, but selecting the right parameters is critical for accuracy. One commonly misunderstood parameter is --psm (Page Segmentation Mode). In this guide, we'll discuss a real-world issue where --psm caused incorrect text extraction for scanned books and how removing it led to better results.

The Problem: Mixed Text from Two Facing Pages

Many scanned books and documents contain two facing pages in a single image. When such an image is processed with an explicit --psm value that does not match this layout, Tesseract can misinterpret the structure and produce jumbled text. Instead of reading one page at a time, it mixes text from both pages, extracting:

Line 1 from right page + Line 1 from left page
Line 2 from right page + Line 2 from left page
...

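This happens because most explicit --psm values tell Tesseract to assume a specific layout (for example, a single block of text), which can conflict with the actual structure of the scanned document. If the whole spread is treated as one block, each text line runs across both pages.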

The Solution: Removing `--psm`

With --psm removed, Tesseract processed the right page in order first, then moved on to the left page. This produced a natural reading order and a significantly better OCR result:

Line 1 from right page
Line 2 from right page
...
(Line 1 from left page follows after the right page is complete)

This suggests that in some cases, manually setting --psm can do more harm than good.
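To make the two reading orders concrete, here is a small illustration in plain Python. It does not call Tesseract at all; the function names and the sample lines are made up for this sketch, and it only models the two orderings described above.

```python
from itertools import zip_longest


def interleaved_order(right_page, left_page):
    """Model the jumbled output: each row is read straight across the spread."""
    rows = []
    for right_line, left_line in zip_longest(right_page, left_page, fillvalue=""):
        # Join both halves of the row, skipping an empty side if one page is shorter.
        rows.append(" ".join(part for part in (right_line, left_line) if part))
    return rows


def page_by_page_order(right_page, left_page):
    """Model the desired output: the whole right page, then the whole left page."""
    return list(right_page) + list(left_page)


right = ["R1", "R2"]
left = ["L1", "L2"]
print(interleaved_order(right, left))   # ['R1 L1', 'R2 L2']
print(page_by_page_order(right, left))  # ['R1', 'R2', 'L1', 'L2']
```

If your OCR output looks like the first list, the layout assumption is the likely culprit.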

When to Avoid `--psm`

When processing scanned books or documents with two facing pages.
When text is misaligned or mixed in the OCR output.
When dealing with complex layouts where Tesseract's automatic handling works better.

When to Use `--psm`

There are cases where --psm is still useful, such as:

A single uniform block of text (--psm 6)
Sparse text (--psm 11)
Images containing only a single word (--psm 8)
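One way to keep these choices explicit in code is a small lookup that only adds --psm when a non-default mode is wanted. This is a sketch: the mode numbers are the ones listed by tesseract --help-extra, but the layout labels and the psm_config helper are hypothetical names invented for illustration.

```python
# Mode numbers as listed by `tesseract --help-extra`.
# The dictionary keys are hypothetical labels, not Tesseract terms.
PSM_BY_LAYOUT = {
    "auto": 3,           # fully automatic page segmentation (the default)
    "single_column": 4,  # single column of text of variable sizes
    "single_block": 6,   # single uniform block of text
    "single_line": 7,    # single text line
    "single_word": 8,    # single word
    "sparse_text": 11,   # sparse text, in no particular order
}


def psm_config(layout="auto"):
    """Return the config fragment for a layout, or '' when the default is enough."""
    psm = PSM_BY_LAYOUT[layout]
    # Mode 3 is already the default, so no flag needs to be passed for it.
    return "" if psm == 3 else f"--psm {psm}"


print(psm_config("single_word"))  # --psm 8
print(repr(psm_config("auto")))   # ''
```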

Recommended OCR Settings

For scanned books or multi-column text, a safer approach is:

pytesseract.image_to_string(image, config='--oem 1 -c preserve_interword_spaces=1')

This avoids forcing a layout assumption: --oem 1 selects the LSTM engine, and preserve_interword_spaces=1 keeps the spacing between words. Users can specify the language(s) as needed (e.g., -l eng, -l ara, or -l ara+eng).
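For completeness, a slightly fuller version of that call might look like the following. It is a minimal sketch, assuming pytesseract and Pillow are installed and the Tesseract binary is on PATH; book_config, ocr_book_scan, and the file name in the usage note are hypothetical.

```python
def book_config(preserve_spaces=True):
    """Build the config string for book scans, deliberately without --psm."""
    parts = ["--oem", "1"]  # 1 = LSTM engine
    if preserve_spaces:
        parts += ["-c", "preserve_interword_spaces=1"]
    return " ".join(parts)


def ocr_book_scan(image_path, lang="eng"):
    """OCR one scanned image, leaving page segmentation at Tesseract's default."""
    # Imported here so the config helper above works without these packages.
    from PIL import Image
    import pytesseract

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(image, lang=lang, config=book_config())


print(book_config())  # --oem 1 -c preserve_interword_spaces=1
```

Calling ocr_book_scan("spread.png", lang="ara+eng") would then run the OCR itself, where spread.png stands in for your own scan.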

Conclusion

This discovery highlights why experimentation is key when working with OCR. If your text output appears mixed or out of order, try removing --psm and letting Tesseract handle the layout automatically. Hopefully, this guide helps others facing similar issues!

Have you encountered other OCR challenges? Share your experience in the comments or discussion forums!

Expected Behavior

No response

Suggested Fix

No response

tesseract -v

tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found OpenMP 201511
Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7

Operating System

Windows 11, Ubuntu 24.04 Noble

Other Operating System

No response

uname -a

Linux omar 6.8.0-53-generic #55-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 17 15:37:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux

Compiler

No response

CPU

No response

Virtualization / Containers

No response

Other Information

No response


amitdo · 2025-02-22T17:11:04Z

You don't need to 'remove' it. You just don't need to add this command line flag.

tesseract --help does not even mention it. You need to use tesseract --help-extra to see the info about psm. As part of this info, it is mentioned that 3|auto is the default value for psm. This means that not using the psm flag is equal to using --psm 3 or --psm auto.
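A quick way to check this equivalence on your own scan is to run the command line tool twice and compare the text. The following is only a sketch: it assumes the tesseract executable is on PATH, and tesseract_cmd and default_matches_psm3 are hypothetical helper names.

```python
import shutil
import subprocess


def tesseract_cmd(image_path, psm=None):
    """Build a tesseract CLI call that prints the recognized text to stdout."""
    cmd = ["tesseract", str(image_path), "stdout"]
    if psm is not None:
        cmd += ["--psm", str(psm)]
    return cmd


def default_matches_psm3(image_path):
    """Run tesseract without --psm and with --psm 3, then compare the output."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    outputs = [
        subprocess.run(
            tesseract_cmd(image_path, psm),
            capture_output=True,
            text=True,
            check=True,
        ).stdout
        for psm in (None, 3)
    ]
    return outputs[0] == outputs[1]
```

If the two outputs differ for the same image, something other than the psm setting is changing between runs.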

We are not responsible for third party software documentation and support (pytesseract).

This is a bug tracker. I think your message is more suitable for our forum.

amitdo closed this as completed Feb 22, 2025
