Understanding Tesseract OCR and --psm: Why Removing It Can Improve Accuracy for Scanned Books
Introduction
Tesseract OCR is a powerful tool for extracting text from images, but selecting the right parameters is critical for accuracy. One commonly misunderstood parameter is --psm (Page Segmentation Mode). In this guide, we'll discuss a real-world issue where --psm caused incorrect text extraction for scanned books and how removing it led to better results.
The Problem: Mixed Text from Two Facing Pages
Many scanned books and documents contain two facing pages in a single image. When such an image is processed with an explicit --psm value, Tesseract sometimes misinterprets the structure and jumbles the extracted text. Instead of reading one page at a time, it mixes text from both pages, extracting:
Line 1 from right page + Line 1 from left page
Line 2 from right page + Line 2 from left page
...
This happens because an explicit --psm value forces Tesseract to assume a specific layout, which can conflict with the actual structure of the scanned document.
The Solution: Removing --psm
With --psm removed, Tesseract processed the right page first, in order, and then moved to the left page. This resulted in a natural reading order and a significantly better OCR result:
Line 1 from right page
Line 2 from right page
...
(Line 1 from left page follows after the right page is complete)
This confirms that in some cases, manually setting --psm can do more harm than good.
When to Avoid --psm
When processing scanned books or documents with two facing pages.
When text is misaligned or mixed in the OCR output.
When dealing with complex layouts where Tesseract's automatic handling works better.
When to Use --psm
There are cases where --psm is still useful, such as:
A single uniform block of printed text (--psm 6)
Sparse text (--psm 11)
Images containing only a single word (--psm 8)
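As a rough illustration of the three cases above, the sketch below maps each layout to its --psm value and builds the matching command-line arguments. The names PSM_BY_LAYOUT and psm_args are hypothetical helpers made up for this example; they are not part of Tesseract or pytesseract.

```python
# Hypothetical helper: map the layouts listed above to Tesseract --psm values.
PSM_BY_LAYOUT = {
    "single_block": 6,   # one uniform block of text
    "sparse_text": 11,   # scattered text, in no particular order
    "single_word": 8,    # the image contains a single word
}


def psm_args(layout=None):
    """Return the --psm arguments for a layout, or [] to keep Tesseract's default."""
    if layout is None:
        return []
    return ["--psm", str(PSM_BY_LAYOUT[layout])]


print(psm_args("single_word"))
print(psm_args())
```

Returning an empty list when no layout is given keeps the flag out of the command entirely, which is the behavior this guide recommends for scanned books.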
Recommended OCR Settings
For scanned books or multi-column text, a safer approach is to run Tesseract without a --psm flag.
This avoids forcing a layout assumption while keeping Tesseract optimized for text extraction. Users can specify the language(s) as needed (e.g., -l eng, -l ara, or -l ara+eng).
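A minimal sketch of such an invocation, assuming the tesseract binary is on the PATH. The helper names build_cmd and run_ocr and the file names spread.png and spread are hypothetical and chosen only for illustration; the only options used are -l and the positional image and output-base arguments.

```python
import subprocess


def build_cmd(image, out_base, langs=("eng",)):
    """Build a tesseract command line that does not pass --psm."""
    # Multiple languages are joined with "+", e.g. ("ara", "eng") -> "ara+eng".
    return ["tesseract", image, out_base, "-l", "+".join(langs)]


def run_ocr(image, out_base, langs=("eng",)):
    """Run tesseract; the recognized text is written to <out_base>.txt."""
    subprocess.run(build_cmd(image, out_base, langs), check=True)


# Only the command is printed here; call run_ocr(...) to actually run OCR.
print(build_cmd("spread.png", "spread", ("ara", "eng")))
```

Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.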
Conclusion
This discovery highlights why experimentation is key when working with OCR. If your text output appears mixed or out of order, try removing --psm and letting Tesseract handle the layout automatically. Hopefully, this guide helps others facing similar issues!
Have you encountered other OCR challenges? Share your experience in the comments or discussion forums!
You don't need to 'remove' it. You just don't need to add this command line flag.
tesseract --help does not even mention it. You need to use tesseract --help-extra to see the info about psm. As part of this info, it is mentioned that 3|auto is the default value for psm. This means that not using the psm flag is equal to using --psm 3 or --psm auto.
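To make the equivalence concrete, here is a small sketch that works out which page segmentation mode a given argument list ends up with. effective_psm is a hypothetical helper written for this example, and it handles only numeric values and the auto alias mentioned above.

```python
DEFAULT_PSM = 3  # "3|auto" is the default, per tesseract --help-extra as quoted above


def effective_psm(argv):
    """Return the page segmentation mode that a tesseract argument list results in."""
    if "--psm" not in argv:
        return DEFAULT_PSM
    value = argv[argv.index("--psm") + 1]
    # Only the "auto" alias is handled; any other non-numeric name raises ValueError.
    return DEFAULT_PSM if value == "auto" else int(value)


print(effective_psm(["tesseract", "page.png", "out"]))
print(effective_psm(["tesseract", "page.png", "out", "--psm", "auto"]))
print(effective_psm(["tesseract", "page.png", "out", "--psm", "6"]))
```

The first two calls return the same mode, which is the point made above: leaving the flag out and passing --psm 3 or --psm auto select the same behavior.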
We are not responsible for third party software documentation and support (pytesseract).
This is a bug tracker. I think your message is more suitable for our forum.
Expected Behavior
No response
Suggested Fix
No response
tesseract -v
tesseract 5.3.4
leptonica-1.82.0
libgif 5.2.1 : libjpeg 8d (libjpeg-turbo 2.1.5) : libpng 1.6.43 : libtiff 4.5.1 : zlib 1.3 : libwebp 1.3.2 : libopenjp2 2.5.0
Found OpenMP 201511
Found libarchive 3.7.2 zlib/1.3 liblzma/5.4.5 bz2lib/1.0.8 liblz4/1.9.4 libzstd/1.5.5
Found libcurl/8.5.0 OpenSSL/3.0.13 zlib/1.3 brotli/1.1.0 zstd/1.5.5 libidn2/2.3.7 libpsl/0.21.2 (+libidn2/2.3.7) libssh/0.10.6/openssl/zlib nghttp2/1.59.0 librtmp/2.3 OpenLDAP/2.6.7
Operating System
Windows 11, Ubuntu 24.04 Noble
Other Operating System
No response
uname -a
Linux omar 6.8.0-53-generic #55-Ubuntu SMP PREEMPT_DYNAMIC Fri Jan 17 15:37:52 UTC 2025 x86_64 x86_64 x86_64 GNU/Linux
Compiler
No response
CPU
No response
Virtualization / Containers
No response
Other Information
No response