Is your feature request related to a problem? Please describe.
When users attempt to use unsupported image formats (like PDF) with Tesseract.js, they receive a cryptic error message:
Error in pixReadStream: Pdf reading is not supported
Error: Error attempting to read image.
This error message doesn't clearly tell users what formats are supported, leading to confusion and unnecessary debugging time.
Describe the solution you'd like
Add early validation of image formats before processing begins, with a clear error message that lists all supported formats. The error message should look like:
Error: Unsupported image format: pdf. Tesseract.js supports: png, jpg, bmp, pbm, webp, gif
This would:
Fail fast before unnecessary processing
Clearly indicate what went wrong
Show users which formats they can use instead
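To make the proposal concrete, here is a minimal sketch of what such an early check could look like. The names `SUPPORTED_FORMATS`, `getExtension`, and `assertSupportedFormat` are hypothetical and are not part of Tesseract.js; the format list is copied from the error message above, and the sketch only inspects string inputs that carry a file extension.

```javascript
// Hypothetical allowlist, copied from the error message proposed above.
const SUPPORTED_FORMATS = ['png', 'jpg', 'bmp', 'pbm', 'webp', 'gif'];

// Hypothetical helper: returns the lowercase extension of a path or URL
// string, or null when the input is not a string or has no extension.
function getExtension(input) {
  if (typeof input !== 'string') return null;
  // Drop any query string or fragment before looking for the extension.
  const path = input.split(/[?#]/)[0];
  const match = /\.([a-z0-9]+)$/i.exec(path);
  return match ? match[1].toLowerCase() : null;
}

// Throws before any processing starts when the extension is not on the list.
function assertSupportedFormat(input) {
  const ext = getExtension(input);
  // Inputs without a recognizable extension (buffers, data URLs, and so on)
  // are passed through unchanged.
  if (ext !== null && !SUPPORTED_FORMATS.includes(ext)) {
    throw new Error(
      `Unsupported image format: ${ext}. Tesseract.js supports: ${SUPPORTED_FORMATS.join(', ')}`
    );
  }
}
```

Calling `assertSupportedFormat('scan.pdf')` throws with the message shown above, while `assertSupportedFormat('photo.PNG')` returns normally. Because this is a strict allowlist, any extension missing from the list (for example `jpeg`) would be rejected by this sketch until it is added.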
Describe alternatives you've considered
Add format documentation to README (but users might not see it)
Update the existing error message in pixReadStream (but that happens too late in the process)
Add format validation in the example scripts (but that wouldn't help library users)
Additional context
I have a pull request ready to submit for review that implements this feature by adding format validation in createWorker.js, using the existing FORMATS constant from tests/constants.js.
Although there is a section in the FAQ that discusses PDF support, I agree that the subject comes up enough to warrant mentioning it in the README. I will edit it at some point in the next week. I also agree that it could be useful to have an error message that lists supported formats and/or links to the FAQ when users attempt to recognize a file that is clearly a .pdf.
Regarding input validation in general, I would only want to throw new errors in cases where we are extremely confident that the input would be rejected by Tesseract. For example, the case where the input is a file with a .pdf extension would qualify. Especially given the number of supported image formats and data types, throwing more input validation errors is inherently high-risk, as any bug or unforeseen edge-case that results in valid inputs being incorrectly rejected would break somebody's application. For example, it looks like the checks implemented in your PR cause several automated tests to fail due to incorrectly rejecting valid inputs.
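A conservative check along these lines could be sketched as follows. It rejects only inputs that are almost certainly PDFs and lets everything else through to Tesseract. The helper names `looksLikePdf` and `rejectPdfInput` are hypothetical, and the byte-signature check on binary input is an added assumption that goes beyond the .pdf extension case mentioned above (PDF files begin with the ASCII bytes `%PDF-`).

```javascript
// Hypothetical helper: returns true only when the input is very likely a PDF.
function looksLikePdf(input) {
  if (typeof input === 'string') {
    // Ignore any query string or fragment, then check for a .pdf extension.
    return /\.pdf$/i.test(input.split(/[?#]/)[0]);
  }
  if (input instanceof Uint8Array && input.length >= 5) {
    // PDF files begin with the ASCII signature "%PDF-".
    const signature = [0x25, 0x50, 0x44, 0x46, 0x2d];
    return signature.every((byte, i) => input[i] === byte);
  }
  // Any other input type is left for Tesseract to accept or reject.
  return false;
}

// Hypothetical guard: throws only for the high-confidence PDF case.
function rejectPdfInput(input) {
  if (looksLikePdf(input)) {
    throw new Error(
      'Unsupported image format: pdf. Tesseract.js supports: png, jpg, bmp, pbm, webp, gif'
    );
  }
}
```

Unlike an allowlist of extensions, this denylist approach cannot reject a valid image just because its extension or data type was not anticipated; the trade-off is that other unsupported formats still fall through to the existing pixReadStream error.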