Support for Arabic language in warc-indexer -> Solr fields #291
Comments
This appears to be a problem with Apache Tika, as I get the same results using that directly...
Note that the |
I suspect this is down to the buffer size used by the CharsetDetector. There's a lot of cruft before the UTF-8 shows up. Not sure if this is really a bug or if we should find a way to configure a larger buffer/markLimit. Linking the example HTML file: https://gist.github.com/anjackson/5bf6945b8b557ace07f5cd1d64cbcc4f
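To illustrate the suspicion above, here is a toy sketch of how a detector that only inspects a limited prefix can miss UTF-8 content that appears late in a page. This is not Tika's algorithm: `PrefixDetectionDemo`, `guessCharset`, `samplePage`, and the windows-1252 fallback are all hypothetical names and assumptions made up for the illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PrefixDetectionDemo {
    // Assumed fallback when the inspected prefix shows no multi-byte evidence.
    static final Charset FALLBACK = Charset.forName("windows-1252");

    // Hypothetical toy detector: looks only at the first `limit` bytes and
    // guesses UTF-8 if it sees any byte with the high bit set.
    static Charset guessCharset(byte[] data, int limit) {
        int end = Math.min(limit, data.length);
        for (int i = 0; i < end; i++) {
            if ((data[i] & 0x80) != 0) {
                return StandardCharsets.UTF_8;
            }
        }
        return FALLBACK;
    }

    // Builds a fake page: plain ASCII padding followed by a short Arabic string.
    static byte[] samplePage(int asciiPadding) {
        String padding = "x".repeat(asciiPadding); // requires Java 11+
        String title = "\u0633\u064A\u062F\u0629";
        return (padding + title).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] page = samplePage(10_000);
        // Prefix shorter than the padding: the Arabic bytes are never seen.
        System.out.println(guessCharset(page, 8_000));
        // Whole page inspected: the multi-byte content is found.
        System.out.println(guessCharset(page, page.length));
    }
}
```

If the real cause is similar, a larger buffer/markLimit would let the detector reach the non-ASCII bytes before it commits to a guess.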
I am not sure if this is a duplicate of an existing issue.
When you harvest this URL:
https://www.youtube.com/watch?v=Hnrdfb6HiK0
The title field in Solr is:
title":"سيدة الصبر - المرأة العراقية - كريم العراقي - اØمد الثرواني - YouTube",
Other fields, such as keywords, have the same issue.
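The garbled title looks like UTF-8 bytes that were decoded as windows-1252. A minimal sketch reproducing that pattern with only the JDK follows; the class and method names (`MojibakeDemo`, `misdecode`, `repair`) are hypothetical, and the windows-1252 assumption is inferred from the characters in the output, not confirmed from warc-indexer or Tika internals.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {
    // Assumption: the wrong charset applied to the page was windows-1252.
    static final Charset WRONG = Charset.forName("windows-1252");

    // Encode as UTF-8, then decode those bytes with the wrong charset.
    static String misdecode(String original) {
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, WRONG);
    }

    // Reverse the mistake: recover the raw bytes, then decode them as UTF-8.
    // This only round-trips when every byte is defined in windows-1252.
    static String repair(String garbled) {
        byte[] raw = garbled.getBytes(WRONG);
        return new String(raw, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // First word of the title, written with escapes to avoid
        // depending on the source file encoding.
        String word = "\u0633\u064A\u062F\u0629";
        String garbled = misdecode(word);
        System.out.println(garbled);
        System.out.println(repair(garbled).equals(word));
    }
}
```

Repairing after the fact is lossy in general, because some bytes have no windows-1252 mapping, so fixing the detection step is the safer route.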