win10 chinese filename. TesseractException: Error during processing page. #75

Open
dgq420377903 opened this issue Nov 13, 2017 · 7 comments

Comments

@dgq420377903

I run tess4j on Windows 10 and get TesseractException: Error during processing page.
Tesseract.createDocuments (Tesseract.java:565)
I think the cause is the Chinese file name, but I do not know how to solve it.

    File imageFile1 = new File(config.getOcrSrcDir(), "2014-中文名-13783.jpg");
    File pdfFile1 = new File(config.getOcrDesDir(), "2014-中文名-13783");
    ITesseract tess = new Tesseract();
    tess.setLanguage("chi_sim");
    try {
      List<RenderedFormat> formats = new ArrayList<RenderedFormat>();
      formats.add(RenderedFormat.PDF);
      String[] images = new String[] {imageFile1.getAbsolutePath()};
      String[] pdfs = new String[] {pdfFile1.getAbsolutePath()};
      tess.createDocuments(images, pdfs, formats);
    } catch (TesseractException e) {
      e.printStackTrace();
    }
@nguyenq
Owner

nguyenq commented Nov 13, 2017

That's more on the Java side than Tess4J. I suggest checking that the file exists before attempting OCR on the image file. If Java does not support the current file name, you may have to use a different naming scheme that Java supports.
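
A minimal sketch of such an existence check, using only the standard library. The class name, the helper name and the "ocr-src" directory are hypothetical; the file name is the one from the report above.

```java
import java.io.File;

public class InputFileCheck {
    /** Returns true if the file exists, is a regular file, and can be read by the JVM. */
    public static boolean isReadableFile(File file) {
        return file.isFile() && file.canRead();
    }

    public static void main(String[] args) {
        // Hypothetical source directory; substitute the real one.
        File imageFile = new File("ocr-src", "2014-中文名-13783.jpg");
        if (!isReadableFile(imageFile)) {
            System.err.println("Input image not found or unreadable: " + imageFile.getAbsolutePath());
            return;
        }
        System.out.println("Input image is visible to Java: " + imageFile.getAbsolutePath());
    }
}
```

If this check passes and OCR still fails, the file name is being handled correctly by Java itself and the problem lies further down the call chain.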

@nguyenq nguyenq closed this as completed Dec 14, 2017
@dgq420377903
Author

thx

@maherm

maherm commented Jun 21, 2018

This is not a Java issue. I'm having the same problem here using German umlauts (äüö) in paths. The files definitely exist and no other part of the software has a problem with them.
It seems more like an encoding problem somewhere along the way in JNA, when non-ASCII file names held in java.lang.String are converted into char* for TessBaseAPIProcessPages. After some googling I've already tried setting the "jna.encoding" property, without success.
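
To illustrate why the String to char* conversion is sensitive to encoding, here is a small sketch that prints the bytes of the same umlauts under two charsets. It does not show where the conversion goes wrong inside JNA or Tesseract; ISO-8859-1 is only an assumed stand-in for a single-byte Western code page, and the class and helper names are hypothetical.

```java
import java.nio.charset.StandardCharsets;

public class FilenameEncodingDemo {
    /** Formats bytes as space-separated uppercase hex, e.g. "C3 A4". */
    public static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String umlauts = "\u00E4\u00FC\u00F6"; // "äüö"

        // One byte per character in a single-byte Western charset.
        System.out.println(toHex(umlauts.getBytes(StandardCharsets.ISO_8859_1))); // prints "E4 FC F6"

        // Two bytes per character in UTF-8.
        System.out.println(toHex(umlauts.getBytes(StandardCharsets.UTF_8))); // prints "C3 A4 C3 BC C3 B6"
    }
}
```

If the Java side writes one of these byte sequences into the char* and the native side interprets it as the other, the path no longer names an existing file, which would be consistent with the call failing only for non-ASCII names.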

Platform: Windows 7
Java Version: 1.8.0_171-b11
tess4j Version: 4.0.3-SNAPSHOT (also tested in 2.0.1 with Tesseract 3.05.01)
tesseract Version: 4.0.0-beta.1.20180608

I'll try to provide a testcase for you to reproduce.

@maherm

maherm commented Jun 21, 2018

I added unit tests to help you reproduce the error at https://github.com/maherm/tess4j

@nguyenq
Owner

nguyenq commented Jul 4, 2018

@maherm I confirm your findings. TessBaseAPIProcessPages returns immediately when given a non-ASCII filename. It's something inside JNA.

An interim workaround I can see is to rename the file to an ASCII name (using File.createTempFile?) and rename it back afterwards, which is a bit of a hassle.

Or use the TessBaseAPIProcessPage method if you really need the TessResultRenderer API.
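
A rough sketch of the first workaround, as a variation that copies the image to an ASCII-named temporary file instead of renaming the original, so the source file is never touched. The class name, method name and "ocr-input-" prefix are hypothetical, and the OCR call itself is left as a comment.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardCopyOption;

public class AsciiTempCopy {
    /** Copies source to a temp file whose name is ASCII-only, keeping a plain ASCII extension if there is one. */
    public static Path copyToAsciiTemp(Path source) throws IOException {
        String fileName = source.getFileName().toString();
        int dot = fileName.lastIndexOf('.');
        String ext = dot >= 0 ? fileName.substring(dot) : "";
        if (!ext.matches("\\.[A-Za-z0-9]+")) {
            ext = ".tmp"; // fall back when the extension is missing or not plain ASCII
        }
        Path temp = Files.createTempFile("ocr-input-", ext);
        Files.copy(source, temp, StandardCopyOption.REPLACE_EXISTING);
        return temp;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: AsciiTempCopy <image>");
            return;
        }
        Path temp = copyToAsciiTemp(Paths.get(args[0]));
        try {
            // Pass temp.toString() to the OCR call here instead of the original path.
            System.out.println("ASCII-named copy: " + temp);
        } finally {
            Files.deleteIfExists(temp);
        }
    }
}
```

Two caveats: the system temp directory may itself contain non-ASCII characters (for example a Windows user name with umlauts), and the output path passed to createDocuments needs the same treatment, with the generated file moved to its final name afterwards.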

@maherm

maherm commented Jul 11, 2018

@nguyenq Thanks for having a look at this.

> An interim workaround I can see is to rename the file to an ASCII name (using File.createTempFile?) and rename it back afterwards, which is a bit of a hassle.

That is kind of what I do at the moment: making sure that no path containing non-ASCII characters is ever passed to tess4j. It's a rather ugly workaround, but it does the trick for now.
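
A guard of that kind could look roughly like the following sketch. The class and method names are hypothetical; the check relies only on the standard US-ASCII charset encoder.

```java
import java.nio.charset.StandardCharsets;

public class AsciiPathGuard {
    /** Returns true if every character of the path can be encoded as US-ASCII. */
    public static boolean isAscii(String path) {
        return StandardCharsets.US_ASCII.newEncoder().canEncode(path);
    }

    /** Returns the path unchanged, or throws if it contains non-ASCII characters. */
    public static String requireAscii(String path) {
        if (!isAscii(path)) {
            throw new IllegalArgumentException("Path contains non-ASCII characters: " + path);
        }
        return path;
    }

    public static void main(String[] args) {
        System.out.println(isAscii("C:\\ocr\\scan-13783.jpg"));            // prints "true"
        System.out.println(isAscii("C:\\ocr\\\u00E4\u00FC\u00F6.jpg"));    // "äüö.jpg", prints "false"
    }
}
```

Calling requireAscii on every input and output path before handing it to tess4j turns the silent native failure into an early, descriptive exception.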

I propose reopening this issue until it is fixed.

@nguyenq
Owner

nguyenq commented Nov 26, 2020
