Cannot Upload docx document to Milvus Database because of DOCXMetadata #8727

Open
saikanov opened this issue Jan 16, 2025 · 1 comment
Labels
P2 Medium priority, add to the next sprint if no P1 available

saikanov commented Jan 16, 2025

Describe the bug
Cannot upload a DOCX file to the Milvus database because of DOCXMetadata.

Error message
TypeError: 'DOCXMetadata' object is not subscriptable

To Reproduce
Use a DOCX pipeline with Milvus as the vector database.

I had already fixed this by the time I posted. The problem is that the DOCXMetadata object cannot be indexed; once I understood that, I tried to pop (delete) that metadata entry, and it works fine.

After that I went to [haystack/components/converters/docx.py](https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/docx.py)

and edited the merged_metadata variable so that it does not include the DOCXMetadata:
merged_metadata = {**bytestream.meta, **metadata}

Now it works with the Pipeline.

What I want to ask is: what does DOCXMetadata do? Does it only cause an error with Milvus? And is it fine to leave it out to resolve my issue?

Thanks!
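
The workaround described above can also be applied on the user side, without editing docx.py. The following is a minimal sketch, assuming the extra metadata lives under the `docx` key of each document's `meta` (as the printed document further down this thread shows). `FakeDoc` and `drop_docx_meta` are hypothetical names, used so the snippet runs without Haystack installed; with real Haystack `Document` objects the same loop over `doc.meta` should apply.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a Haystack Document (content + meta only)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def drop_docx_meta(docs: List[FakeDoc], key: str = "docx") -> List[FakeDoc]:
    """Remove the DOCX metadata entry from each document's meta, in place."""
    for doc in docs:
        # pop() with a default is a no-op when the key is absent
        doc.meta.pop(key, None)
    return docs


# object() stands in for the DOCXMetadata instance
docs = [FakeDoc("Sample Docx File", {"file_path": "sample_docx.docx", "docx": object()})]
drop_docx_meta(docs)
print(docs[0].meta)  # {'file_path': 'sample_docx.docx'}
```

Calling such a helper between the converter and `write_documents` would drop the author, dates, and revision information, so it only fits cases where that metadata is not needed downstream.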

julian-risch added the "P2 Medium priority, add to the next sprint if no P1 available" label on Jan 16, 2025
anakin87 (Member) commented

Reproducible example

pip install --upgrade pymilvus milvus-haystack

from haystack.components.converters.docx import DOCXToDocument
from milvus_haystack import MilvusDocumentStore

document_store = MilvusDocumentStore(
    connection_args={"uri": "./milvus.db"},
    drop_old=True,
)

converter = DOCXToDocument()

path = "sample_docx.docx"

docs = converter.run(sources=[path])["documents"]

print(docs)

document_store.write_documents(docs)

print(document_store.count_documents())
print(document_store.filter_documents())

[Document(id=841f2916f4d4fe3612dac9490fc3d4ceb78ba76a2f78627413e0f5bcded1a206, content: 'Sample Docx File

The US has "passed the peak" on new coronavirus cases, President Donald Trump said...', meta: {'file_path': 'sample_docx.docx', 'docx': DOCXMetadata(author='Saha, Anirban', category='', comments='', content_status='', created='2020-07-14T08:14:00+00:00', identifier='', keywords='', language='', last_modified_by='Saha, Anirban', last_printed=None, modified='2020-07-14T08:16:00+00:00', revision=1, subject='', title='', version='')})]


Traceback (most recent call last):
  File "/home/anakin87/apps/experiments/milvusdocx/try.py", line 18, in <module>
    document_store.write_documents(docs)
  File "/home/anakin87/apps/experiments/milvusdocx/.venv/lib/python3.10/site-packages/milvus_haystack/document_store.py", line 336, in write_documents
    documents_cp = [MilvusDocumentStore._discard_invalid_meta(doc) for doc in deepcopy(documents)]
  File "/home/anakin87/apps/experiments/milvusdocx/.venv/lib/python3.10/site-packages/milvus_haystack/document_store.py", line 336, in <listcomp>
    documents_cp = [MilvusDocumentStore._discard_invalid_meta(doc) for doc in deepcopy(documents)]
  File "/home/anakin87/apps/experiments/milvusdocx/.venv/lib/python3.10/site-packages/milvus_haystack/document_store.py", line 952, in _discard_invalid_meta
    dtype = infer_dtype_bydata(value)
  File "/home/anakin87/apps/experiments/milvusdocx/.venv/lib/python3.10/site-packages/pymilvus/orm/types.py", line 130, in infer_dtype_bydata
    elem = data[0]
TypeError: 'DOCXMetadata' object is not subscriptable
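
The last frame of the traceback can be reproduced in isolation. The sketch below does not call pymilvus: `first_element` is a hypothetical helper that mirrors only the `elem = data[0]` line shown above, and `Meta` is a hypothetical stand-in for `DOCXMetadata` with two of its fields.

```python
from dataclasses import dataclass


@dataclass
class Meta:
    """Hypothetical stand-in for DOCXMetadata."""
    author: str = ""
    revision: int = 1


def first_element(data):
    # Mirrors the failing line from the traceback: elem = data[0]
    return data[0]


# Lists support indexing, so this succeeds
print(first_element(["a", "b"]))

error_message = ""
try:
    # A plain dataclass defines no __getitem__, so indexing raises TypeError
    first_element(Meta(author="Saha, Anirban"))
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

A dataclass instance is neither a scalar nor a sequence, which is why indexing it fails in the same way as the `DOCXMetadata` object does in the traceback.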

As @saikanov suggested, the issue is related to the DOCXMetadata dataclass instance being included in meta.
I want to investigate how this affects other document stores.
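
One possible direction, sketched under assumptions rather than taken from Haystack: convert the dataclass to plain values before writing, so that a document store only sees strings, numbers, and `None`. `dataclasses.asdict` and `dataclasses.is_dataclass` are standard library; `Meta` and `flatten_meta` are hypothetical names, and whether a given document store prefers flattened keys or a nested dict is not established in this thread.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Dict, Optional


@dataclass
class Meta:
    """Hypothetical stand-in for DOCXMetadata with three of its fields."""
    author: str = ""
    revision: int = 1
    last_printed: Optional[str] = None


def flatten_meta(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Expand dataclass instances in meta into prefixed primitive entries."""
    flat: Dict[str, Any] = {}
    for key, value in meta.items():
        # is_dataclass() is also True for dataclass types, so exclude classes
        if is_dataclass(value) and not isinstance(value, type):
            for name, field_value in asdict(value).items():
                flat[f"{key}_{name}"] = field_value
        else:
            flat[key] = value
    return flat


meta = {"file_path": "sample_docx.docx", "docx": Meta(author="Saha, Anirban")}
flat = flatten_meta(meta)
print(flat)
```

Unlike popping the `docx` key, this keeps the author and revision information, at the cost of changing the key names that downstream filters would need to use.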
