warc-indexer needs file.encoding="UTF8" #311

Open
tokee opened this issue Aug 11, 2023 · 1 comment

@tokee
Collaborator

tokee commented Aug 11, 2023

@trym-b discovered that the warc-indexer needs the environment file encoding to be UTF-8, in order to produce Solr documents with ... UTF-8 encoding.

This can be achieved by setting

JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"

before calling the warc-indexer JAR, but the real solution is to explicitly set UTF-8 as the output encoding in the Java code where relevant. On a larger scale, using Forbidden APIs Checker would guard against variations of the problem, but experience says that enabling that check for a large project is a daunting task.
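
As a minimal sketch of what "explicitly set UTF-8 as the output encoding" could look like, the snippet below passes the charset to the writer instead of relying on the platform default. The class and method names (`ExplicitUtf8Example`, `writeUtf8`) are hypothetical and not taken from the warc-indexer code base; where the indexer actually needs such a change is not established in this issue.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

// Hypothetical example class, not part of warc-indexer.
public class ExplicitUtf8Example {

    // Encodes text with an explicitly named charset, so the result does not
    // depend on -Dfile.encoding or the OS locale.
    static byte[] writeUtf8(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        // Unicode escapes keep the source file itself encoding-neutral.
        String text = "bl\u00e5b\u00e6rgr\u00f8d";

        byte[] explicit = writeUtf8(text);

        // Fragile variant: String.getBytes() without arguments uses the
        // default charset, which before Java 18 follows file.encoding.
        byte[] platformDefault = text.getBytes();

        System.out.println("explicit UTF-8 bytes:  " + explicit.length);
        System.out.println("default charset bytes: " + platformDefault.length);
    }
}
```

Running this with and without `JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"` on an older JVM with a non-UTF-8 locale should show whether the two byte counts diverge. Calls without a charset argument, such as `getBytes()` or `new OutputStreamWriter(out)`, are the kind of usage that Forbidden APIs Checker is designed to flag.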

@anjackson
Contributor

I'm a bit surprised as I thought the SolrJ clients enforced UTF-8 when talking to Solr?! Do you know where in the Solr client API we are able to set the output charset?
