Parse docx in --doc mode #439

Merged: babenek merged 7 commits into Samsung:main from the docx branch on Nov 1, 2023

@babenek (Contributor) commented Oct 16, 2023

Description

Please include a summary of the change and which issue is fixed.

based on #441

  • Add a scanner that parses docx files into a text representation
  • Separate the "documents" file types so that their results are obtained from the file extractor
  • Refactor the test samples to a smaller size so that they fit in the fuzzing process
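
For readers unfamiliar with the format, here is a minimal sketch of what "parsing a docx into a text representation" can look like. It is not the code from this PR: it assumes only the Python standard library, and the names `docx_to_lines` and `make_sample_docx` are hypothetical helpers invented for illustration. A docx file is a zip archive whose main body lives in `word/document.xml`, so the sketch reads that member and joins the text runs of each paragraph into one line that a line-based scanner could then inspect.

```python
import io
import zipfile
import xml.etree.ElementTree as ET
from typing import List
from xml.sax.saxutils import escape

# WordprocessingML namespace used by word/document.xml
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_to_lines(data: bytes) -> List[str]:
    """Return the text of each non-empty paragraph of a docx given as bytes."""
    with zipfile.ZipFile(io.BytesIO(data)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    lines = []
    # Each <w:p> is a paragraph; its visible text is stored in <w:t> elements.
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        text = "".join(node.text or "" for node in paragraph.iter(f"{{{W_NS}}}t"))
        if text:
            lines.append(text)
    return lines


def make_sample_docx(paragraphs: List[str]) -> bytes:
    """Build a tiny in-memory archive with only word/document.xml (demo helper).

    This is enough for docx_to_lines, but it is not a complete docx that
    a word processor would open.
    """
    body = "".join(
        f"<w:p><w:r><w:t>{escape(p)}</w:t></w:r></w:p>" for p in paragraphs
    )
    xml = (
        '<?xml version="1.0" encoding="UTF-8"?>'
        f'<w:document xmlns:w="{W_NS}"><w:body>{body}</w:body></w:document>'
    )
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", xml)
    return buffer.getvalue()


if __name__ == "__main__":
    sample = make_sample_docx(["password = Example123", "", "second line"])
    for line in docx_to_lines(sample):
        print(line)
```

The sketch ignores headers, footers, footnotes and other parts of the archive; a real scanner would likely need to cover those as well.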

How has this been tested?

Please describe the tests that you ran to verify your changes.

  • UnitTest
  • Benchmark - no impact on benchmark

@codecov-commenter commented Oct 16, 2023

Codecov Report

Attention: 1 line in your changes is missing coverage. Please review.

Comparison is base (e20b010) 90.78% compared to head (af55e38) 90.83%.

❗ Current head af55e38 differs from pull request most recent head 031d582. Consider uploading reports for the commit 031d582 to get more accurate results

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #439      +/-   ##
==========================================
+ Coverage   90.78%   90.83%   +0.04%     
==========================================
  Files         125      126       +1     
  Lines        4179     4210      +31     
  Branches      662      666       +4     
==========================================
+ Hits         3794     3824      +30     
  Misses        252      252              
- Partials      133      134       +1     
Files                                             Coverage Δ
credsweeper/config/config.py                      100.00% <100.00%> (ø)
credsweeper/deep_scanner/deep_scanner.py          95.07% <100.00%> (+0.07%) ⬆️
credsweeper/file_handler/file_path_extractor.py   80.68% <100.00%> (+0.44%) ⬆️
credsweeper/deep_scanner/docx_scanner.py          96.15% <96.15%> (ø)


@babenek babenek marked this pull request as ready for review October 16, 2023 11:28
@babenek babenek requested a review from a team as a code owner October 16, 2023 11:28
@babenek babenek marked this pull request as draft October 20, 2023 03:08
@babenek babenek force-pushed the docx branch 2 times, most recently from a292344 to ec52ae1 Compare October 21, 2023 11:41
@babenek babenek marked this pull request as ready for review October 22, 2023 04:07
@babenek babenek marked this pull request as draft October 26, 2023 04:38
@babenek babenek marked this pull request as ready for review October 30, 2023 07:12
@csh519 (Collaborator) previously approved these changes Nov 1, 2023

Scanning of docx and pdf files with the --doc option has been added.

LGTM 👍

@csh519 (Collaborator) left a comment

Approve again.

@kmnls (Contributor) left a comment

Agree

@babenek babenek merged commit 7a838cf into Samsung:main Nov 1, 2023
29 checks passed
@babenek babenek deleted the docx branch November 1, 2023 10:36
5 participants