Node#visible_text use scrub to replace invalid UTF-8 sequences #76
Some pages cause an `invalid byte sequence in UTF-8` exception to be raised when calling `text.to_s.gsub(/\A[[:space:]&&[^\u00a0]]+/, '')`. Adding `scrub` prevents this.

Specific context:
It seems an HTML entity gets interpreted as "\xA0", or byte 160, which is not a valid UTF-8 sequence on its own. Using `charlock_holmes`, the encoding of the entire page is reported as `ISO-8859-1` with 54% confidence.
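
A minimal sketch of the failure and the fix, using only Ruby's standard library. The constant `LEADING_SPACE` and the helper `strip_leading_space` are hypothetical names chosen for illustration, not the library's actual code; only the regular expression and the `scrub` call come from the description above.

```ruby
# Pattern quoted in the description: leading whitespace, excluding U+00A0.
LEADING_SPACE = /\A[[:space:]&&[^\u00a0]]+/

# Hypothetical helper: scrub first, then strip leading whitespace.
# String#scrub replaces invalid byte sequences (with U+FFFD for UTF-8 strings).
def strip_leading_space(text)
  text.to_s.scrub.gsub(LEADING_SPACE, '')
end

raw = "  \xA0Hello"        # UTF-8 string containing a lone byte 160
puts raw.valid_encoding?   # false

begin
  raw.gsub(LEADING_SPACE, '')   # no scrub: regex matching rejects the string
rescue ArgumentError => e
  puts "raised: #{e.class}"
end

# With scrub, the invalid byte becomes U+FFFD and gsub succeeds.
puts strip_leading_space(raw).inspect

# The same byte is a valid non-breaking space when read as ISO-8859-1.
nbsp = "\xA0".dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)
puts nbsp == "\u00A0"
```

Scrubbing keeps the method from raising but turns the stray byte into a replacement character. If the page really is ISO-8859-1, transcoding it to UTF-8 before parsing would preserve the non-breaking space instead, though a 54% detection confidence may be too low to rely on.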