DocumentSplitter updates Document's meta data after initializing the Document #8741

Closed
julian-risch opened this issue Jan 17, 2025 · 0 comments · Fixed by #8745
Labels
P2 Medium priority, add to the next sprint if no P1 available

Describe the bug
The _create_docs_from_splits method of the DocumentSplitter initializes a new Document and only afterward adds page_number, split_id, and split_idx_start to its meta data. As a result, the Document's ID is computed without taking the additional meta data into account. Documents that have the same content and differ only in page number receive the same Document ID and might therefore be unintentionally treated as duplicates at a later stage of the pipeline.

Instead of the current

meta = deepcopy(meta)
doc = Document(content=txt, meta=meta)
doc.meta["page_number"] = splits_pages[i]
doc.meta["split_id"] = i
doc.meta["split_idx_start"] = split_idx
documents.append(doc)

we should change the code to

meta = deepcopy(meta)
meta["page_number"] = splits_pages[i]
meta["split_id"] = i
meta["split_idx_start"] = split_idx
doc = Document(content=txt, meta=meta)
documents.append(doc)
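
To make the effect of the ordering concrete, here is a minimal, self-contained sketch. It does not use Haystack: `Doc` is a hypothetical stand-in that, like the behavior described above, derives its ID once at construction from content and meta. The hashing scheme, the helper names `build_after_init` and `build_before_init`, and the sample data are all assumptions made for illustration, not Haystack's actual implementation.

```python
import hashlib
from copy import deepcopy
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class Doc:
    """Hypothetical stand-in: the ID is fixed when the object is constructed."""

    content: str
    meta: Dict[str, Any] = field(default_factory=dict)
    id: str = field(init=False)

    def __post_init__(self) -> None:
        # Assumed scheme: hash content plus meta once; later changes to
        # meta do not update the ID.
        payload = f"{self.content}{sorted(self.meta.items())}"
        self.id = hashlib.sha256(payload.encode("utf-8")).hexdigest()


def build_after_init(texts: List[str], pages: List[int], meta: Dict[str, Any]) -> List[Doc]:
    """Current ordering: create the document, then add split meta data."""
    docs = []
    for i, txt in enumerate(texts):
        doc = Doc(content=txt, meta=deepcopy(meta))
        doc.meta["page_number"] = pages[i]
        doc.meta["split_id"] = i
        docs.append(doc)
    return docs


def build_before_init(texts: List[str], pages: List[int], meta: Dict[str, Any]) -> List[Doc]:
    """Proposed ordering: complete the meta data, then create the document."""
    docs = []
    for i, txt in enumerate(texts):
        copied_meta = deepcopy(meta)
        copied_meta["page_number"] = pages[i]
        copied_meta["split_id"] = i
        docs.append(Doc(content=txt, meta=copied_meta))
    return docs


texts = ["Same text.", "Same text."]  # identical content on two pages
pages = [1, 2]
meta = {"source": "report.pdf"}

buggy = build_after_init(texts, pages, meta)
fixed = build_before_init(texts, pages, meta)

print(buggy[0].id == buggy[1].id)  # True: IDs collide
print(fixed[0].id == fixed[1].id)  # False: IDs differ
```

With the current ordering, both splits end up with the correct page_number in their meta, yet they share one ID because the ID was computed before the page number was set. Building the meta first avoids the collision without changing how IDs are computed.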
nickprock added a commit to nickprock/haystack that referenced this issue Jan 17, 2025
julian-risch pushed a commit that referenced this issue Jan 20, 2025
…itter (#8745)

* updated DocumentSplitter

issue #8741

* release note

* updated DocumentSplitter

in the _create_docs_from_splits function, initialize a new variable copied_meta instead of overwriting meta

* added test

test_duplicate_pages_get_different_doc_id

* fix fmt

---------

Co-authored-by: Stefano Fiorucci