EREGCSC-2465 Extract text from RTF files #1122

Merged
merged 1 commit into from
Jan 3, 2024

Conversation

cgodwin1
Contributor

Resolves #2465

Description-

We want to extract text from RTF files.

This pull request changes...

  • Add the striprtf library for extracting text from RTF files.
  • Extract text with "errors=replace" so that bad encoding or corruption yields replacement characters instead of a failed extraction.
  • Update unit tests, adding checks for corrupted input.
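
The effect of "errors=replace" can be illustrated with a minimal Python sketch. The helper names `decode_lenient` and `extract_rtf_text` are hypothetical and are not taken from this PR's diff. The sketch assumes the file arrives as bytes and that the installed striprtf release exposes `rtf_to_text` with an `errors` keyword; the actual extractor may be structured differently.

```python
def decode_lenient(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, substituting U+FFFD for any undecodable byte."""
    return data.decode(encoding, errors="replace")


def extract_rtf_text(data: bytes) -> str:
    """Hypothetical extractor: strip RTF markup from raw file bytes."""
    # Imported lazily so decode_lenient stays usable without striprtf installed.
    from striprtf.striprtf import rtf_to_text

    # Assumption: errors="replace" is passed through so that invalid escaped
    # bytes inside the RTF body are replaced rather than raising.
    return rtf_to_text(decode_lenient(data), errors="replace")


# Two bytes here are not valid UTF-8; each becomes U+FFFD instead of raising.
corrupted = b"caf\xe9 \xff"
print(repr(decode_lenient(corrupted)))
```

With strict decoding the same input raises `UnicodeDecodeError`, which would abort extraction for the whole file; with replacement, the readable portion of the text is still indexed.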

Steps to manually verify this change...

  1. Verify unit tests pass.
  2. Log into admin panel and go to Uploaded Files.
  3. Upload an "rtf" file to the panel.
  4. Click "Get content" on the newly uploaded file.
  5. Refresh the page periodically and wait for extraction to complete.
  6. The extracted text should appear on the page under "Index populated".

@cgodwin1 cgodwin1 changed the base branch from main to 2254-extract-webpages December 27, 2023 21:52
@peggles2 peggles2 self-assigned this Jan 2, 2024
Contributor

@peggles2 peggles2 left a comment

LGTM 👍 as long as checks clear.

@cgodwin1 cgodwin1 merged commit 0cd6403 into 2254-extract-webpages Jan 3, 2024
18 of 19 checks passed
cgodwin1 added a commit that referenced this pull request Jan 3, 2024
* Add support for html, htm, xml, xhtml

* Remove inline script/style tags

* Add rtf extraction, update README.md (#1122)
cgodwin1 added a commit that referenced this pull request Jan 5, 2024
* Add near-universal text file extraction with bs4

* EREGCSC-2254 Extract text from webpages/HTML (#1121)

* Add support for html, htm, xml, xhtml

* Remove inline script/style tags

* Add rtf extraction, update README.md (#1122)

* EREGCSC-2444-A Remove Textract dependency (#1123)

* Add support for html, htm, xml, xhtml

* Add rtf extraction, update README.md

* Remove dependency on Textract for Excel files

* Remove textract dependency for docx files

* Remove textract dependency for pptx files

* Remove textract from requirements.txt
cgodwin1 added a commit that referenced this pull request Jan 9, 2024
* Add near-universal text file extraction with bs4

* Add support for html, htm, xml, xhtml

* Add rtf extraction, update README.md

* Remove dependency on Textract for Excel files

* Remove textract dependency for docx files

* Remove textract dependency for pptx files

* Remove textract from requirements.txt

* Clean up tests

* Switch back to handling bytes directly

* Move embedded file extraction

* Use openpyxl for xlsx as xlrd is deprecated

* mock _extract_embedded to eliminate interdependency of extractors during testing

* Add more logging

* EREGCSC-2254 Extract text from webpages/HTML (#1121)

* Add support for html, htm, xml, xhtml

* Remove inline script/style tags

* Add rtf extraction, update README.md (#1122)

* Improve logging

* Add testing for embedded file extraction

* Update README

* Merge cleanup

* Merge cleanup again

* Deploy text extractor