fix: PDFMinerToDocument initializes documents with content and meta #8708

julian-risch · 2025-01-12T16:16:14Z

Related Issues

Proposed Changes:

Initialize the Document returned by PDFMinerToDocument with both content and meta so that both are taken into account when the document ID is generated. Previously, the Document was initialized with the content only and the metadata was added afterwards, so the metadata did not affect the ID.
Extend a test case so that it fails under the previous behavior.
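
To illustrate why the order of initialization matters, here is a minimal sketch using only the standard library. `SketchDocument` is a hypothetical stand-in written for this example, not Haystack's actual `Document` class; it only assumes that the ID is derived once, at construction time, from a hash of content and meta. The file names are made up as well.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in for a document class; not Haystack's actual Document."""

    content: Optional[str] = None
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = field(default="")

    def __post_init__(self) -> None:
        # Assumption: the ID is computed once, at construction, from content AND meta.
        if not self.id:
            payload = f"{self.content}{self.meta}".encode("utf-8")
            self.id = hashlib.sha256(payload).hexdigest()


text = "Sample page text"
meta_a = {"file_path": "a.pdf"}  # made-up metadata for illustration
meta_b = {"file_path": "b.pdf"}

# Previous pattern: build from content only, attach meta afterwards.
# The ID has already been computed, so the metadata never influences it.
late_a = SketchDocument(content=text)
late_a.meta.update(meta_a)
late_b = SketchDocument(content=text)
late_b.meta.update(meta_b)

# Pattern described in this PR: pass content and meta together at construction.
early_a = SketchDocument(content=text, meta=meta_a)
early_b = SketchDocument(content=text, meta=meta_b)

print(late_a.id == late_b.id)    # True: same ID despite different metadata
print(early_a.id == early_b.id)  # False: metadata now contributes to the ID
```

Under these assumptions, a test that converts the same content with two different metadata dictionaries and asserts that the resulting IDs differ would fail with the late-update pattern and pass with the constructor pattern, which is the kind of check the extended test case is meant to provide.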

How did you test it?

Notes for the reviewer

Similar changes were made to PyPDFToDocument in #8698

Checklist

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2025-01-12T16:21:22Z

Pull Request Test Coverage Report for Build 12744717842

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

For more information on this, see Tracking coverage changes with pull request builds.
To avoid this issue with future PRs, see these Recommended CI Configurations.
For a quick fix, rebase this PR at GitHub. Your next report should be accurate.

Details

0 of 0 changed or added relevant lines in 0 files are covered.
31 unchanged lines in 2 files lost coverage.
Overall coverage increased (+0.04%) to 91.303%

Files with Coverage Reduction	New Missed Lines	%
components/embedders/azure_text_embedder.py	9	71.74%
components/embedders/azure_document_embedder.py	22	57.14%

Totals
Change from base Build 12713355058:	0.04%
Covered Lines:	8850
Relevant Lines:	9693

💛 - Coveralls

haystack/components/converters/pdfminer.py

davidsbatista

LGTM: just a small nit regarding a variable name

Co-authored-by: David S. Batista <[email protected]>

fix: PDFMinerToDocument initializes documents with content and meta

b216550

julian-risch requested a review from a team as a code owner January 12, 2025 16:16

julian-risch requested review from davidsbatista and removed request for a team January 12, 2025 16:16

github-actions bot added the topic:tests and type:documentation labels Jan 12, 2025

add release note

9aab67b

julian-risch requested a review from a team as a code owner January 12, 2025 16:17

julian-risch requested review from dfokina and removed request for a team January 12, 2025 16:17

davidsbatista reviewed Jan 13, 2025


haystack/components/converters/pdfminer.py (outdated, resolved)

davidsbatista reviewed Jan 13, 2025


haystack/components/converters/pdfminer.py (outdated, resolved)

davidsbatista approved these changes Jan 13, 2025


Apply suggestions from code review

fd0d6aa

Co-authored-by: David S. Batista <[email protected]>

julian-risch enabled auto-merge (squash) January 13, 2025 09:57

julian-risch merged commit 642fa60 into main Jan 13, 2025
18 checks passed

julian-risch deleted the pdfminer-docid branch January 13, 2025 10:12

This was referenced Jan 13, 2025

Document ID doesn't updated upon metadata update #8692

Open

fix: recreate document id if certain attributes are changed #8694

Closed
