warnings in processor output garble dump-json #1223

bertsky · 2024-05-14T23:34:11Z

We have…

ocrd process \
"cis-ocropy-binarize -I DEFAULT -O OCR-D-BIN"
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"
"skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li"
"skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page"
"tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page"
"cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P level-of-operation page"
"cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP"
"calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"
The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
09:00:38.996 ERROR ocrd.utils.get_ocrd_tool_json - ocrd-calamari-recognize --dump-json produced invalid JSON: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/usr/local/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1157, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/ocrd/cli/[process.py](http://process.py/)", line 31, in process_cli
run_tasks(mets, log_level, page_id, tasks, overwrite)
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 125, in run_tasks
validate_tasks(tasks, workspace, page_id, overwrite)
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 101, in validate_tasks
task.validate()
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 74, in validate
raise Exception(report.errors)
Exception: ["[] Additional properties are not allowed ('checkpoint_dir' was unexpected)"]

So because TF spew a warning about a CPU mismatch, the --dump-json output is broken and ocrd process cannot validate the workflow.

IMO this likely affects all processors in one form or another. When we instantiate a processor in the CLI decorator, there is no initLogging call for the --dump-json target, but TF can still do its own basicConfig, which will use stdout.

So perhaps we should do our own basicConfig(handlers=[logging.NullHandler()], force=True), or alternatively just call logging.disable(sys.maxsize) in core's ocrd_cli_wrap_processor (which will be before any processor-specific imports)?

The text was updated successfully, but these errors were encountered:

bertsky · 2024-05-16T09:53:01Z

To be precise, in this case the message from TF logging causing the hickup was not a warning (as the issue suggests) but an error. All our TF-based processors now suppress log messages below that level anyway. But my point is about log messages from imported modules in general, and if we find a solution here, then these special TF rules may not even be necessary anymore.

bertsky · 2024-05-16T17:17:30Z

So perhaps we should do our own basicConfig(handlers=[logging.NullHandler()], force=True), or alternatively just call logging.disable(sys.maxsize) in core's ocrd_cli_wrap_processor (which will be before any processor-specific imports)?

Duh! That by itself won't suffice, of course: when the processor wrapper gets used, TF usually is already imported and thus initialised.

So perhaps we should make sure that our processors don't make imports other than OCR-D related – until required to do process(). That would also speed up responses for --list-resources, --dump-json, --help a lot – TF and Pytorch take quite some time.

bertsky · 2024-05-16T17:25:36Z

So perhaps we should make sure that our processors don't make imports other than OCR-D related

…which in the case of ocrd_calamari is tricky: it wants to from tensorflow import __version__ to show that in the --version (as well as in the METS agent entry after processing).

bertsky · 2024-05-29T15:17:16Z

In another example, I get garbled JSON from the following error:

/ocrd_all/venv38/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "

bertsky · 2024-05-29T18:28:27Z

Regarding the urllib3 problem – for anyone who encounters this, the fix is to update requests as specified by core:

pip install -U "requests<2.30"

bertsky · 2024-06-10T09:27:04Z

So perhaps we should make sure that our processors don't make imports other than OCR-D related – until required to do process(). That would also speed up responses for --list-resources, --dump-json, --help a lot – TF and Pytorch take quite some time.

We should implement this along with the #322 API changes – so we don't need touch the processor code bases twice.

…which in the case of ocrd_calamari is tricky: it wants to from tensorflow import __version__ to show that in the --version (as well as in the METS agent entry after processing).

In this case it is still possible to initialise logging before importing Tensorflow, so even that should not be a problem.

As to the logging setup itself – it would not even have to be logging.disable(...) or basicConfig with no handlers or a null handler: AFAICS it would be fully sufficient if we did our normal initLogging prior to all other imports – as long as our ocrd_logging.config separates logging from stdout.

bertsky · 2024-11-12T13:52:16Z

@kba @MehmedGIT what do you think?

Should I move initLogging to the top of ocrd_cli_wrap_processor (covering all CLI use cases of a processor, not just processing)?

Only in v3, or also in v2 (since this looks similar in both versions), perhaps as part of #1288?

kba self-assigned this Jul 3, 2024

This was referenced Nov 20, 2024

run initLogging unconditionally #1295

Closed

ocrd_cli_wrap_processor: always do initLogging #1296

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

warnings in processor output garble dump-json #1223

warnings in processor output garble dump-json #1223

bertsky commented May 14, 2024

bertsky commented May 16, 2024

bertsky commented May 16, 2024

bertsky commented May 16, 2024

bertsky commented May 29, 2024

bertsky commented May 29, 2024

bertsky commented Jun 10, 2024

bertsky commented Nov 12, 2024

warnings in processor output garble dump-json #1223

warnings in processor output garble dump-json #1223

Comments

bertsky commented May 14, 2024

bertsky commented May 16, 2024

bertsky commented May 16, 2024

bertsky commented May 16, 2024

bertsky commented May 29, 2024

bertsky commented May 29, 2024

bertsky commented Jun 10, 2024

bertsky commented Nov 12, 2024