Heuristic fix of charset issues #301

Open
tokee opened this issue Nov 14, 2022 · 2 comments

Comments

@tokee
Collaborator

tokee commented Nov 14, 2022

Servers mess up, and at the Royal Danish Library we often encounter pages where the charset is set to one thing in the HTTP headers and another in the HTML, while the actual byte stream encodes non-ASCII characters in yet a third way. This is quite visible on Danish pages, as the characters æ, ø and å are common in Danish spelling.

The most common problems we see are "UTF-8 read as ISO-8859-1" and the other way around. At least for Danish they are reasonably easy to detect, as the end result is character combinations that are "never" used in real text. I would be surprised if such a heuristic wasn't already available somewhere on the net. We should perform this guessing during indexing and correct the problem.
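To make the idea concrete, here is a minimal Python sketch of both directions. It is not the project's implementation; the function names `fix_utf8_read_as_latin1` and `decode_guessing_charset` are hypothetical. It assumes strict ISO-8859-1 (a Windows-1252 variant would be needed where bytes 0x80-0x9F were mapped to printable characters) and only looks for two-byte UTF-8 sequences, which covers æ, ø and å.

```python
import re

# A UTF-8 two-byte sequence misread as ISO-8859-1 shows up as a character in
# U+00C2..U+00DF followed by one in U+0080..U+00BF (e.g. "Ã¸" for "ø").
# Such pairs are essentially never used in real Danish text.
MOJIBAKE = re.compile("[\u00c2-\u00df][\u0080-\u00bf]")


def fix_utf8_read_as_latin1(text: str) -> str:
    """Undo 'UTF-8 read as ISO-8859-1' if the text looks affected (hypothetical helper)."""
    if not MOJIBAKE.search(text):
        return text  # nothing suspicious, leave the text alone
    try:
        # Recover the original bytes, then decode them the way they were meant.
        return text.encode("iso-8859-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text  # mixed or ambiguous content: do not guess


def decode_guessing_charset(raw: bytes) -> str:
    """Handle the other direction at the byte level (hypothetical helper).

    ISO-8859-1 bytes for æ, ø, å are rarely valid UTF-8, so a strict UTF-8
    decode is tried first and ISO-8859-1 is used only as the fallback.
    """
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("iso-8859-1")
```

The second direction has to work on the raw bytes: once ISO-8859-1 bytes have been decoded as UTF-8 with replacement characters, the original letters are lost and cannot be recovered from the string.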

We have two fields in Solr: `content`, which holds the raw text content, and `text`, which is the catch-all search field and contains `content` along with the content of other fields. The solution could keep the raw content in `content`, faulty character encodings and all, while adding the corrected content to `text` for proper search. Or both fields could contain the corrected content. I don't know which is best.
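As a rough illustration of the first option, the following Python sketch keeps the raw text in `content` and puts only the corrected text into `text`. The function `build_solr_fields`, its parameters, and the assumption that `text` is multi-valued are all hypothetical; the real indexer and schema may look quite different.

```python
from typing import Callable, Dict, List


def build_solr_fields(
    raw_content: str,
    other_fields: List[str],
    fix: Callable[[str], str],
) -> Dict[str, object]:
    """Sketch of option 1: raw text in `content`, corrected text in `text`."""
    corrected = fix(raw_content)
    return {
        # Left untouched, faulty character encodings and all.
        "content": raw_content,
        # Catch-all search field: corrected content plus the other field values.
        "text": [corrected, *other_fields],
    }


# Usage with a stand-in fixer that only repairs one known bad pair.
doc = build_solr_fields(
    "fl\u00c3\u00b8de",
    ["Example title"],
    fix=lambda s: s.replace("\u00c3\u00b8", "\u00f8"),
)
```

The second option would simply assign `corrected` to `content` as well, which trades away the ability to see what was actually harvested.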

Ping @thomasegense as I know he's getting complaints about this.

@anjackson
Contributor

Can we get some test WARCs/files so we have examples to drive unit testing?

@thomasegense
Contributor

I cannot remember this issue. Maybe it is related to this one?
#291
