fix: PDFMinerToDocument initializes documents with content and meta #8708

Merged: 3 commits, Jan 13, 2025

julian-risch
Member

Related Issues

Proposed Changes:

  • Initialize the Document returned by PDFMinerToDocument with both content and meta, so that both are taken into account when the document ID is generated. Previously, the Document was initialized with content only and its meta was updated afterwards.
  • Extend a test case so that it fails under the previous behavior.
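
To illustrate why the order of initialization matters, here is a minimal sketch. It is not Haystack's implementation: `SketchDocument` is a hypothetical stand-in, and hashing a JSON dump with SHA-256 is an assumed scheme chosen only for the example. The one property taken from the description above is that the ID is derived when the document is initialized.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Any, Dict


@dataclass
class SketchDocument:
    """Hypothetical stand-in: the ID is derived once, at construction time."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = ""

    def __post_init__(self) -> None:
        # The ID only sees the content and meta that exist right now.
        if not self.id:
            payload = json.dumps(
                {"content": self.content, "meta": self.meta}, sort_keys=True
            )
            self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


text = "same extracted text"

# Old pattern: construct with content only, attach meta afterwards.
late_a = SketchDocument(content=text)
late_a.meta.update({"file_path": "a.pdf"})
late_b = SketchDocument(content=text)
late_b.meta.update({"file_path": "b.pdf"})

# New pattern: pass content and meta together at initialization.
early_a = SketchDocument(content=text, meta={"file_path": "a.pdf"})
early_b = SketchDocument(content=text, meta={"file_path": "b.pdf"})

# Meta added after construction never reaches the ID, so the IDs collide.
ids_collide_when_meta_added_late = late_a.id == late_b.id
# Meta passed at construction is part of the ID, so the IDs differ.
ids_differ_when_meta_passed_at_init = early_a.id != early_b.id
```

In this sketch, two files with identical extracted text end up with the same ID under the old pattern even though their final meta differs, which is the kind of behavior an extended test case can catch by comparing IDs.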

How did you test it?

Notes for the reviewer

Similar changes were made to PyPDFToDocument in #8698

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@julian-risch julian-risch requested a review from a team as a code owner January 12, 2025 16:16
@julian-risch julian-risch requested review from davidsbatista and removed request for a team January 12, 2025 16:16
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jan 12, 2025
@julian-risch julian-risch requested a review from a team as a code owner January 12, 2025 16:17
@julian-risch julian-risch requested review from dfokina and removed request for a team January 12, 2025 16:17
@coveralls
Collaborator

coveralls commented Jan 12, 2025

Pull Request Test Coverage Report for Build 12744717842

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • 31 unchanged lines in 2 files lost coverage.
  • Overall coverage increased (+0.04%) to 91.303%

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| components/embedders/azure_text_embedder.py | 9 | 71.74% |
| components/embedders/azure_document_embedder.py | 22 | 57.14% |

| Totals | Coverage Status |
|---|---|
| Change from base Build 12713355058 | 0.04% |
| Covered Lines | 8850 |
| Relevant Lines | 9693 |

💛 - Coveralls

Contributor

@davidsbatista davidsbatista left a comment

LGTM: just a small nit regarding a variable name

Co-authored-by: David S. Batista <[email protected]>
@julian-risch julian-risch enabled auto-merge (squash) January 13, 2025 09:57
@julian-risch julian-risch merged commit 642fa60 into main Jan 13, 2025
18 checks passed
@julian-risch julian-risch deleted the pdfminer-docid branch January 13, 2025 10:12
Labels
topic:tests type:documentation Improvements on the docs
Development

Successfully merging this pull request may close these issues.

PDFMinerToDocument updates Document's meta field after initializing it
3 participants