Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

warnings in processor output garble dump-json #1223

Open
bertsky opened this issue May 14, 2024 · 7 comments · May be fixed by #1296
Open

warnings in processor output garble dump-json #1223

bertsky opened this issue May 14, 2024 · 7 comments · May be fixed by #1296
Assignees

Comments

@bertsky
Copy link
Collaborator

bertsky commented May 14, 2024

We have…

ocrd process \
"cis-ocropy-binarize -I DEFAULT -O OCR-D-BIN"
"anybaseocr-crop -I OCR-D-BIN -O OCR-D-CROP"
"skimage-binarize -I OCR-D-CROP -O OCR-D-BIN2 -P method li"
"skimage-denoise -I OCR-D-BIN2 -O OCR-D-BIN-DENOISE -P level-of-operation page"
"tesserocr-deskew -I OCR-D-BIN-DENOISE -O OCR-D-BIN-DENOISE-DESKEW -P operation_level page"
"cis-ocropy-segment -I OCR-D-BIN-DENOISE-DESKEW -O OCR-D-SEG -P level-of-operation page"
"cis-ocropy-dewarp -I OCR-D-SEG -O OCR-D-SEG-LINE-RESEG-DEWARP"
"calamari-recognize -I OCR-D-SEG-LINE-RESEG-DEWARP -O OCR-D-OCR -P checkpoint_dir qurator-gt4histocr-1.0"
The TensorFlow library was compiled to use AVX instructions, but these aren't available on your machine.
09:00:38.996 ERROR ocrd.utils.get_ocrd_tool_json - ocrd-calamari-recognize --dump-json produced invalid JSON: Expecting value: line 1 column 1 (char 0)
Traceback (most recent call last):
File "/usr/local/bin/ocrd", line 8, in <module>
sys.exit(cli())
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1157, in call
return self.main(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1078, in main
rv = self.invoke(ctx)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1688, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 1434, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/local/lib/python3.8/site-packages/click/[core.py](http://core.py/)", line 783, in invoke
return __callback(*args, **kwargs)
File "/usr/local/lib/python3.8/site-packages/ocrd/cli/[process.py](http://process.py/)", line 31, in process_cli
run_tasks(mets, log_level, page_id, tasks, overwrite)
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 125, in run_tasks
validate_tasks(tasks, workspace, page_id, overwrite)
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 101, in validate_tasks
task.validate()
File "/usr/local/lib/python3.8/site-packages/ocrd/task_[sequence.py](http://sequence.py/)", line 74, in validate
raise Exception(report.errors)
Exception: ["[] Additional properties are not allowed ('checkpoint_dir' was unexpected)"]

So because TF spew a warning about a CPU mismatch, the --dump-json output is broken and ocrd process cannot validate the workflow.

IMO this likely affects all processors in one form or another. When we instantiate a processor in the CLI decorator, there is no initLogging call for the --dump-json target, but TF can still do its own basicConfig, which will use stdout.

So perhaps we should do our own basicConfig(handlers=[logging.NullHandler()], force=True), or alternatively just call logging.disable(sys.maxsize) in core's ocrd_cli_wrap_processor (which will be before any processor-specific imports)?

@bertsky
Copy link
Collaborator Author

bertsky commented May 16, 2024

To be precise, in this case the message from TF logging causing the hickup was not a warning (as the issue suggests) but an error. All our TF-based processors now suppress log messages below that level anyway. But my point is about log messages from imported modules in general, and if we find a solution here, then these special TF rules may not even be necessary anymore.

@bertsky
Copy link
Collaborator Author

bertsky commented May 16, 2024

So perhaps we should do our own basicConfig(handlers=[logging.NullHandler()], force=True), or alternatively just call logging.disable(sys.maxsize) in core's ocrd_cli_wrap_processor (which will be before any processor-specific imports)?

Duh! That by itself won't suffice, of course: when the processor wrapper gets used, TF usually is already imported and thus initialised.

So perhaps we should make sure that our processors don't make imports other than OCR-D related – until required to do process(). That would also speed up responses for --list-resources, --dump-json, --help a lot – TF and Pytorch take quite some time.

@bertsky
Copy link
Collaborator Author

bertsky commented May 16, 2024

So perhaps we should make sure that our processors don't make imports other than OCR-D related

…which in the case of ocrd_calamari is tricky: it wants to from tensorflow import __version__ to show that in the --version (as well as in the METS agent entry after processing).

@bertsky
Copy link
Collaborator Author

bertsky commented May 29, 2024

In another example, I get garbled JSON from the following error:

/ocrd_all/venv38/lib/python3.8/site-packages/requests/__init__.py:102: RequestsDependencyWarning: urllib3 (1.26.12) or chardet (5.2.0)/charset_normalizer (2.0.12) doesn't match a supported version!
  warnings.warn("urllib3 ({}) or chardet ({})/charset_normalizer ({}) doesn't match a supported "

@bertsky
Copy link
Collaborator Author

bertsky commented May 29, 2024

Regarding the urllib3 problem – for anyone who encounters this, the fix is to update requests as specified by core:

pip install -U "requests<2.30"

@bertsky
Copy link
Collaborator Author

bertsky commented Jun 10, 2024

So perhaps we should make sure that our processors don't make imports other than OCR-D related – until required to do process(). That would also speed up responses for --list-resources, --dump-json, --help a lot – TF and Pytorch take quite some time.

We should implement this along with the #322 API changes – so we don't need touch the processor code bases twice.

…which in the case of ocrd_calamari is tricky: it wants to from tensorflow import __version__ to show that in the --version (as well as in the METS agent entry after processing).

In this case it is still possible to initialise logging before importing Tensorflow, so even that should not be a problem.

As to the logging setup itself – it would not even have to be logging.disable(...) or basicConfig with no handlers or a null handler: AFAICS it would be fully sufficient if we did our normal initLogging prior to all other imports – as long as our ocrd_logging.config separates logging from stdout.

@kba kba self-assigned this Jul 3, 2024
@bertsky
Copy link
Collaborator Author

bertsky commented Nov 12, 2024

@kba @MehmedGIT what do you think?

Should I move initLogging to the top of ocrd_cli_wrap_processor (covering all CLI use cases of a processor, not just processing)?

Only in v3, or also in v2 (since this looks similar in both versions), perhaps as part of #1288?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants