@trym-b discovered that the warc-indexer needs the JVM default file encoding (the file.encoding system property) to be UTF-8 in order to produce Solr documents with ... UTF-8 encoding.
This can be achieved by setting
JAVA_TOOL_OPTIONS="-Dfile.encoding=UTF8"
before calling the warc-indexer JAR, but the real solution is to explicitly set UTF-8 as the output encoding in the Java code where relevant. On a larger scale, using Forbidden APIs Checker would guard against variations of the problem, but experience says that enabling that check for a large project is a daunting task.
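As a rough illustration of what "explicitly set UTF-8 as the output encoding" could look like, here is a minimal sketch. It is not taken from the warc-indexer code base: the class `ExplicitUtf8Example` and the helper `encodeUtf8` are hypothetical names used only for this example. The idea is to pass `StandardCharsets.UTF_8` wherever bytes are produced from text, instead of relying on the platform default that `file.encoding` controls. (On JDK 18 and later the default charset is already UTF-8, so the difference only shows up on older runtimes or when the default is overridden.)

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical example class, not part of warc-indexer.
class ExplicitUtf8Example {

    // Encodes text through a Writer whose charset is given explicitly,
    // so the result does not depend on -Dfile.encoding.
    static byte[] encodeUtf8(String text) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8)) {
            writer.write(text);
        } catch (IOException e) {
            // Not expected for an in-memory stream; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        // Unicode escapes keep the source file itself independent of encoding.
        String text = "bl\u00e5b\u00e6r";

        byte[] explicit = encodeUtf8(text);
        // getBytes() without an argument uses the platform default charset.
        byte[] platformDependent = text.getBytes();

        System.out.println("Default charset: " + Charset.defaultCharset());
        System.out.println("Explicit UTF-8 byte count: " + explicit.length);
        System.out.println("Same bytes as platform default: "
                + Arrays.equals(explicit, platformDependent));
    }
}
```

Running this with and without `-Dfile.encoding=ISO-8859-1` on an older JVM should show whether the platform-dependent path diverges from the explicit one. Calls such as `String.getBytes()` and `new OutputStreamWriter(out)` without a charset argument are the kind of usage Forbidden APIs Checker is designed to flag.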
I'm a bit surprised as I thought the SolrJ clients enforced UTF-8 when talking to Solr?! Do you know where in the Solr client API we are able to set the output charset?