Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

encodings: decode utf-8 with errors='replace' when confident #421

Open
wants to merge 1 commit into
base: develop
Choose a base branch
from

Conversation

Rongronggg9
Copy link
Contributor

@Rongronggg9 Rongronggg9 commented Dec 24, 2023

"Confident" means "metadata of the document explicitly indicates that the encoding is UTF-8".

Background of the patch

When a UTF-8 feed has a few invalid characters but the rest is fine, feedparser will only parse it as iso-8859-2 (or other encodings detected by chardet, if installed), even if both the HTTP and XML headers explicitly indicate that its encoding is utf-8.

To handle it better, we should decode the feed as UTF-8 with errors='replace'.

Date: Sun, 24 Dec 2023 16:23:48 GMT
Server: Apache/2.0.59 (Win32) PHP/5.1.6
X-Powered-By: PHP/5.1.6
Cache-Control: no-store, no-cache, must-revalidate
Expires: Sun, 24 Dec 2023 16:38:48 GMT
Set-Cookie: REDACTED
P3P: REDACTED
Access-Control-Allow-Origin: *
Transfer-Encoding: chunked
Content-Type: application/rss+xml; charset=utf-8

@Rongronggg9 Rongronggg9 force-pushed the fix/encoding-confidence branch 2 times, most recently from 9ae8b27 to dd2d6bf Compare December 26, 2023 18:26
"Confident" means "metadata of the document explicitly indicates that
the encoding is UTF-8". This prevents feedparser from falling back to
other encodings when there are only tiny errors.
@Rongronggg9 Rongronggg9 marked this pull request as ready for review December 27, 2023 01:43
@butaford
Copy link

Please accept "Pull requests". Everything works as it should with him!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants