Use a specific TextFixerConfig instance to control ftfy's text fixing process. #298

Merged
2 commits merged into blindpandas:develop on Feb 6, 2025

cary-rowen
Collaborator

Link to issue number:

Fixed #297

Summary of the issue:

ftfy was converting Chinese full-width punctuation to ASCII equivalents, and altering other text, because of aggressive NFKC normalization and fixes that are not wanted here. The result was less readable output.

Description of how this pull request fixes the issue:

This PR adjusts ftfy's configuration to prevent incorrect text modifications:

  • Normalization is changed from NFKC to NFC, because NFKC replaces compatibility characters such as full-width forms with their ASCII counterparts.
  • fix_character_width, uncurl_quotes, fix_latin_ligatures, and unescape_html are now disabled.
  • These changes are applied to plain_text, pdf, fitz, and StructuredHtmlParser for consistent text processing.

This prevents ftfy from unintentionally altering punctuation, ligatures, quotes, and HTML entities.

Testing performed:

  • Verified that Chinese full-width punctuation (e.g., ,。、“”‘’) remains unchanged in plain_text, pdf, and StructuredHtmlParser output.
  • Confirmed that NFC normalization is used, preventing over-aggressive text changes.
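The difference between the two normalization forms can be checked with the standard library alone, independent of ftfy. This is a minimal sketch; the sample string and the variable names are made up for illustration.

```python
import unicodedata

# Contains U+FF0C FULLWIDTH COMMA and U+FF01 FULLWIDTH EXCLAMATION MARK.
SAMPLE = "你好\uff0c世界\uff01"

# NFC only composes canonically equivalent sequences, so the
# full-width punctuation is left alone.
nfc = unicodedata.normalize("NFC", SAMPLE)

# NFKC also applies compatibility mappings, which fold full-width
# forms into their ASCII counterparts.
nfkc = unicodedata.normalize("NFKC", SAMPLE)

print(nfc == SAMPLE)   # True
print(nfkc == SAMPLE)  # False
print(nfkc)            # 你好,世界!
```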

Known issues with pull request:

No known issues. Further testing and review are welcome.

@cary-rowen cary-rowen self-assigned this Jan 30, 2025
@cary-rowen cary-rowen merged commit 110e1f5 into blindpandas:develop Feb 6, 2025
5 of 6 checks passed
Development

Successfully merging this pull request may close these issues.

ftfy.fixText Incorrectly Converts Chinese Punctuation to English Equivalents