win10 chinese filename. TesseractException: Error during processing page. #75

dgq420377903 · 2017-11-13T09:15:57Z

I am running Tess4J on Windows 10 and get TesseractException: Error during processing page.
The exception is thrown from Tesseract.createDocuments (Tesseract.java:565).
I think the cause is the Chinese file name, but I do not know how to fix it.

    File imageFile1 = new File(config.getOcrSrcDir(), "2014-中文名-13783.jpg");
    File pdfFile1 = new File(config.getOcrDesDir(), "2014-中文名-13783");
    ITesseract tess = new Tesseract();
    tess.setLanguage("chi_sim");
    try {
      List<RenderedFormat> formats = new ArrayList<RenderedFormat>();
      formats.add(RenderedFormat.PDF);
      String[] images = new String[] {imageFile1.getAbsolutePath()};
      String[] pdfs = new String[] {pdfFile1.getAbsolutePath()};
      tess.createDocuments(images, pdfs, formats);
    } catch (TesseractException e) {
      e.printStackTrace();
    }


nguyenq · 2017-11-13T13:36:11Z

That is more of a Java issue than a Tess4J one. I suggest checking that the file exists before attempting OCR on the image. If Java does not support the current file name, you may have to use a naming scheme that Java does support.
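A minimal sketch of the existence check suggested above. The class and the helper name requireReadable are hypothetical; the code uses only java.nio.file and does not call Tess4J. A passing check only shows that Java can see the file; it does not prove that the native library can open the same path.

```java
import java.io.FileNotFoundException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

class ImageFileCheck {
    // Hypothetical helper: fail early with a clear message if Java cannot read the file.
    static Path requireReadable(String fileName) throws FileNotFoundException {
        Path path = Paths.get(fileName).toAbsolutePath();
        if (!Files.isRegularFile(path) || !Files.isReadable(path)) {
            throw new FileNotFoundException("Cannot read image file: " + path);
        }
        return path;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 1) {
            System.err.println("usage: java ImageFileCheck <image-file>");
            return;
        }
        System.out.println("Readable from Java: " + requireReadable(args[0]));
    }
}
```

Calling requireReadable(imageFile1.getAbsolutePath()) before createDocuments would separate "Java cannot find the file" from "the native call fails on a file that exists".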

dgq420377903 · 2017-12-15T09:32:32Z

thx

maherm · 2018-06-21T14:00:53Z

This is not a Java issue. I have the same problem here with German umlauts (äüö) in paths. The files definitely exist, and no other part of the software has a problem with them.
It seems more like an encoding problem somewhere along the way in JNA, when non-ASCII file names are converted from java.lang.String to char* for TessBaseAPIProcessPages. After some googling I tried setting the "jna.encoding" property, without success.

Platform: Windows 7
Java Version: 1.8.0_171-b11
tess4j Version: 4.0.3-SNAPSHOT (also tested in 2.0.1 with Tesseract 3.05.01)
tesseract Version: 4.0.0-beta.1.20180608

I'll try to provide a testcase for you to reproduce.
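To illustrate the kind of encoding problem suspected above, here is a small sketch in plain Java. It does not use JNA or Tesseract and does not reproduce the actual bug; it only shows how a file name can be damaged when a String is turned into bytes with one charset and read back with another. The class and method names are hypothetical, and ISO-8859-1 is used merely as a stand-in for a single-byte Windows code page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class PathEncodingDemo {
    // Encode and decode with the same charset; unmappable characters become '?'.
    static String roundTrip(String name, Charset cs) {
        return new String(name.getBytes(cs), cs);
    }

    // Encode as UTF-8 but decode as ISO-8859-1, as a mismatched native side might.
    static String misdecode(String name) {
        return new String(name.getBytes(StandardCharsets.UTF_8), StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // Escapes keep this source file ASCII-only: the Chinese name from the
        // original report and a name made of German umlauts.
        String[] names = {"2014-\u4e2d\u6587\u540d-13783.jpg", "\u00e4\u00fc\u00f6.png"};
        Charset[] charsets = {
            StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1, StandardCharsets.US_ASCII
        };
        for (String name : names) {
            for (Charset cs : charsets) {
                String back = roundTrip(name, cs);
                System.out.println(cs + ": " + (back.equals(name) ? "intact" : "damaged -> " + back));
            }
            System.out.println("UTF-8 bytes read as ISO-8859-1: " + misdecode(name));
        }
    }
}
```

If the bytes handed to the native side and the charset the native side assumes disagree in a similar way, the resulting path would not match any existing file, which would be consistent with the immediate failure described in this thread.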

maherm · 2018-06-21T14:21:41Z

I added unit tests to help you reproduce the error at https://github.com/maherm/tess4j

nguyenq · 2018-07-04T23:27:36Z

@maherm I can confirm your findings. TessBaseAPIProcessPages returns immediately when given a non-ASCII file name. It's something inside JNA.

An interim workaround I can see is to rename the file to an ASCII name (using File.createTempFile?) and rename it back afterwards, which is a bit of a hassle.

Or use the TessBaseAPIProcessPage method if you really need the TessResultRenderer API.
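The first workaround could be sketched roughly as follows. This variant copies the image to an ASCII-only temporary path instead of renaming it in place, so the original file is never touched. All names here (AsciiPathWorkaround, OcrStep, ocrViaAsciiCopy) are hypothetical, and the OCR call itself is passed in as a callback so that the sketch has no Tess4J dependency. It assumes the OCR step writes its result to the output base plus ".pdf", as the snippet in the original report implies.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.stream.Stream;

class AsciiPathWorkaround {
    // Stand-in for the real OCR call; both paths handed to it are ASCII-only.
    interface OcrStep {
        void run(Path asciiImage, Path asciiOutputBase) throws Exception;
    }

    static boolean isAscii(String s) {
        return s.chars().allMatch(c -> c < 128);
    }

    static String extensionOf(Path file) {
        String name = file.getFileName().toString();
        int dot = name.lastIndexOf('.');
        return dot >= 0 ? name.substring(dot) : "";
    }

    // Copies the image to an ASCII-only temp path, runs OCR there,
    // then moves the resulting PDF to the wanted (possibly non-ASCII) target.
    static Path ocrViaAsciiCopy(Path image, Path targetPdf, OcrStep ocr) throws Exception {
        Path workDir = Files.createTempDirectory("ocr");
        try {
            // The temp directory itself may be non-ASCII, e.g. under a user profile.
            if (!isAscii(workDir.toString())) {
                throw new IOException("Temp directory is not ASCII-only: " + workDir);
            }
            String ext = extensionOf(image);
            Path tmpImage = workDir.resolve("input" + (isAscii(ext) ? ext : ""));
            Path tmpOutputBase = workDir.resolve("output");
            Files.copy(image, tmpImage);
            ocr.run(tmpImage, tmpOutputBase);
            return Files.move(workDir.resolve("output.pdf"), targetPdf,
                    StandardCopyOption.REPLACE_EXISTING);
        } finally {
            // Remove whatever is left in the work directory, then the directory.
            try (Stream<Path> leftovers = Files.list(workDir)) {
                for (Path p : (Iterable<Path>) leftovers::iterator) {
                    Files.deleteIfExists(p);
                }
            }
            Files.deleteIfExists(workDir);
        }
    }
}
```

With Tess4J, the callback would wrap the same call as in the original report, for example tess.createDocuments(new String[] {img.toString()}, new String[] {base.toString()}, formats). Copying costs some extra I/O per image, but it avoids leaving a file under the wrong name if the process dies between the two renames.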

maherm · 2018-07-11T08:01:41Z

@nguyenq Thanks for having a look at this.

> An interim workaround I can see is to rename the file to an ASCII name (using File.createTempFile?) and rename it back afterwards, which is a bit of a hassle.

That is kind of what I do at the moment: making sure that no path containing non-ASCII characters is ever passed to tess4j. It's a rather ugly workaround, but it does the trick for now.

I propose reopening this issue until it is fixed.

nguyenq · 2020-11-26T00:33:07Z

DanBloomberg/leptonica#537

nguyenq closed this as completed Dec 14, 2017

nguyenq reopened this Jul 11, 2018

nguyenq mentioned this issue Jul 25, 2018

detect orientation fail if image path contains Chinese #89

Closed

nguyenq mentioned this issue Jul 23, 2020

Tess4j - Error opening tessdata file by non-ASCII path #190

Open

mugitya26 mentioned this issue Jul 12, 2023

学習データを読み込めない場合がある mugitya26/Kakei-Bo#59

Closed
