EREGCSC-2465 Extract text from RTF files #1122

cgodwin1 · 2023-12-27T21:51:36Z

Resolves #2465

Description

We want to extract text from RTF files.

This pull request changes...

Add the striprtf library for extracting text from RTF files.
Extract text with errors="replace" so that bad encoding or corruption yields replacement characters instead of a decoding error.
Update unit tests, including checks for corrupted input.
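
As a rough illustration of the extraction step, the sketch below decodes the uploaded bytes leniently and hands the result to striprtf. The names `decode_lossy` and `extract_rtf_text` are hypothetical and are not taken from this PR's diff. It also assumes the file is decoded as UTF-8 and that the installed striprtf release accepts an `errors` argument on `rtf_to_text`.

```python
def decode_lossy(data: bytes, encoding: str = "utf-8") -> str:
    """Decode bytes, substituting U+FFFD for anything that cannot be decoded."""
    return data.decode(encoding, errors="replace")


def extract_rtf_text(data: bytes) -> str:
    """Hypothetical helper: convert raw RTF bytes to plain text."""
    # Imported lazily so this sketch can be loaded without striprtf installed.
    from striprtf.striprtf import rtf_to_text

    # Assumption: rtf_to_text takes an `errors` keyword that is forwarded to
    # its own decoding of escaped characters inside the RTF body.
    return rtf_to_text(decode_lossy(data), errors="replace")
```

With this approach a stray byte such as `\xe9` in otherwise valid UTF-8 becomes U+FFFD rather than raising `UnicodeDecodeError`, so the rest of the document is still extracted.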

Steps to manually verify this change...

Verify unit tests pass.
Log in to the admin panel and go to Uploaded Files.
Upload an ".rtf" file to the panel.
Click "Get content" on the newly uploaded file.
Refresh the page periodically and wait for extraction to complete.
The extracted text should appear on the page under "Index populated".
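
For the unit-test step, a corruption check could look roughly like the sketch below. It is not the test added in this PR: `decode_lossy` is a hypothetical stand-in for the extractor's decoding step, and the sample bytes are invented for illustration.

```python
import unittest


def decode_lossy(data: bytes) -> str:
    # Stand-in for the extractor's decoding step (an assumption, not the PR's code).
    return data.decode("utf-8", errors="replace")


class CorruptInputTest(unittest.TestCase):
    # An RTF-like payload containing a byte that is invalid as UTF-8.
    corrupt = b"{\\rtf1 caf\xe9}"

    def test_strict_decoding_rejects_corrupt_bytes(self):
        with self.assertRaises(UnicodeDecodeError):
            self.corrupt.decode("utf-8")

    def test_lossy_decoding_keeps_readable_text(self):
        text = decode_lossy(self.corrupt)
        self.assertIn("caf", text)      # readable part survives
        self.assertIn("\ufffd", text)   # bad byte is replaced, not dropped
```

The first test documents why strict decoding is unsuitable for uploaded files; the second confirms that the lenient path still returns the readable text.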

peggles2

LGTM 👍 as long as checks clear.

* Add support for html, htm, xml, xhtml
* Remove inline script/style tags
* Add rtf extraction, update README.md (#1122)

* Add near-universal text file extraction with bs4
* EREGCSC-2254 Extract text from webpages/HTML (#1121)
* Add support for html, htm, xml, xhtml
* Remove inline script/style tags
* Add rtf extraction, update README.md (#1122)
* EREGCSC-2444-A Remove Textract dependency (#1123)
* Add support for html, htm, xml, xhtml
* Add rtf extraction, update README.md
* Remove dependency on Textract for Excel files
* Remove textract dependency for docx files
* Remove textract dependency for pptx files
* Remove textract from requirements.txt

* Add near-universal text file extraction with bs4
* Add support for html, htm, xml, xhtml
* Add rtf extraction, update README.md
* Remove dependency on Textract for Excel files
* Remove textract dependency for docx files
* Remove textract dependency for pptx files
* Remove textract from requirements.txt
* Clean up tests
* Switch back to handling bytes directly
* Move embedded file extraction
* Use openpyxl for xlsx as xlrd is deprecated
* mock _extract_embedded to eliminate interdependency of extractors during testing
* Add more logging
* EREGCSC-2254 Extract text from webpages/HTML (#1121)
* Add support for html, htm, xml, xhtml
* Remove inline script/style tags
* Add rtf extraction, update README.md (#1122)
* Improve logging
* Add testing for embedded file extraction
* Update README
* Merge cleanup
* Merge cleanup again
* Deploy text extractor

Add rtf extraction, update README.md

d77b280

cgodwin1 requested review from thwalker6, PhilR8 and peggles2 as code owners December 27, 2023 21:51

cgodwin1 temporarily deployed to dev December 27, 2023 21:51 — with GitHub Actions Inactive

cgodwin1 had a problem deploying to dev December 27, 2023 21:51 — with GitHub Actions Failure

cgodwin1 changed the base branch from main to 2254-extract-webpages December 27, 2023 21:52

peggles2 self-assigned this Jan 2, 2024

peggles2 approved these changes Jan 2, 2024


cgodwin1 merged commit 0cd6403 into 2254-extract-webpages Jan 3, 2024
18 of 19 checks passed

cgodwin1 had a problem deploying to dev January 3, 2024 18:09 — with GitHub Actions Failure

cgodwin1 added a commit that referenced this pull request Jan 3, 2024

EREGCSC-2254 Extract text from webpages/HTML (#1121)

d31b02d

* Add support for html, htm, xml, xhtml
* Remove inline script/style tags
* Add rtf extraction, update README.md (#1122)
