Cannot Upload docx document to Milvus Database because of DOCXMetadata #8727

saikanov · 2025-01-16T09:50:04Z

Describe the bug
Cannot upload docx file to Milvus database because of DOCXMetadata

Error message
TypeError: 'DOCXMetadata' object is not subscriptable

To Reproduce
Use a DOCX pipeline with Milvus as the vector database.

I had already fixed this by the time I posted. The problem is that the DOCXMetadata object cannot be indexed; once I understood that, I popped (deleted) the metadata entry and the upload worked fine.

After that, I went to [haystack/components/converters/docx.py](https://github.com/deepset-ai/haystack/blob/main/haystack/components/converters/docx.py)

and edited the merged_metadata variable so that it no longer includes the DOCXMetadata:
merged_metadata = {**bytestream.meta, **metadata}

Now it works with the Pipeline.
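For readers who would rather not patch the installed converter, the "pop the metadata" workaround described above can be sketched roughly as follows. This is only an illustration: `FakeDocument` and `drop_meta_key` are hypothetical names introduced here, and the stand-in class assumes nothing about a Haystack Document beyond it having a `meta` dict.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FakeDocument:
    """Hypothetical stand-in for a Haystack Document; only `meta` matters here."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def drop_meta_key(documents: List[Any], key: str = "docx") -> List[Any]:
    """Remove `key` from each document's meta in place and return the same list."""
    for doc in documents:
        doc.meta.pop(key, None)  # no error if the key is absent
    return documents


# object() plays the role of the DOCXMetadata instance stored under "docx"
docs = [FakeDocument("Sample Docx File", {"file_path": "sample_docx.docx", "docx": object()})]
cleaned = drop_meta_key(docs)
print(cleaned[0].meta)
```

With real documents, the same loop would run on the converter output before calling `write_documents`. The trade-off is that the author, dates, and other DOCX properties are discarded rather than stored.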

What I want to ask is: what does DOCXMetadata do? Does this error only occur with Milvus? And is it fine to leave it out to resolve my issue?

Thanks!

anakin87 · 2025-01-29T16:16:09Z

Reproducible example

pip install --upgrade pymilvus milvus-haystack

from haystack.components.converters.docx import DOCXToDocument
from milvus_haystack import MilvusDocumentStore

document_store = MilvusDocumentStore(
    connection_args={"uri": "./milvus.db"},
    drop_old=True,
)

converter = DOCXToDocument()

path = "sample_docx.docx"

docs = converter.run(sources=[path])["documents"]

print(docs)

document_store.write_documents(docs)

print(document_store.count_documents())
print(document_store.filter_documents())

[Document(id=841f2916f4d4fe3612dac9490fc3d4ceb78ba76a2f78627413e0f5bcded1a206, content: 'Sample Docx File

The US has "passed the peak" on new coronavirus cases, President Donald Trump said...', meta: {'file_path': 'sample_docx.docx', 'docx': DOCXMetadata(author='Saha, Anirban', category='', comments='', content_status='', created='2020-07-14T08:14:00+00:00', identifier='', keywords='', language='', last_modified_by='Saha, Anirban', last_printed=None, modified='2020-07-14T08:16:00+00:00', revision=1, subject='', title='', version='')})]


Traceback (most recent call last):
  File "/home/anakin87/apps/experiments/milvusdocx/try.py", line 18, in <module>
    document_store.write_documents(docs)
  File "/home/anakin87/apps/experiments/milvusdocx/.venv/lib/python3.10/site-packages/milvus_haystack/document_store.py", line 336, in write_documents
    documents_cp = [MilvusDocumentStore._discard_invalid_meta(doc) for doc in deepcopy(documents)]
  File "/home/anakin87/apps/experiments/milvusdocx/.venv/lib/python3.10/site-packages/milvus_haystack/document_store.py", line 336, in <listcomp>
    documents_cp = [MilvusDocumentStore._discard_invalid_meta(doc) for doc in deepcopy(documents)]
  File "/home/anakin87/apps/experiments/milvusdocx/.venv/lib/python3.10/site-packages/milvus_haystack/document_store.py", line 952, in _discard_invalid_meta
    dtype = infer_dtype_bydata(value)
  File "/home/anakin87/apps/experiments/milvusdocx/.venv/lib/python3.10/site-packages/pymilvus/orm/types.py", line 130, in infer_dtype_bydata
    elem = data[0]
TypeError: 'DOCXMetadata' object is not subscriptable

As @saikanov suggested, the issue is caused by the DOCXMetadata dataclass being included in meta.
I want to investigate how this affects other document stores.
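To make the failure mode concrete, here is a minimal sketch of why a dataclass instance in meta trips the `data[0]` access seen in the traceback, along with one possible way to keep the information by flattening it into a plain dict. `FakeDOCXMetadata` and `flatten_dataclass_meta` are hypothetical names used for illustration, and whether a given document store accepts the resulting nested dict is an assumption that has not been verified here.

```python
from dataclasses import asdict, dataclass, is_dataclass
from typing import Any, Dict, Optional


@dataclass
class FakeDOCXMetadata:
    """Hypothetical stand-in with a few of the fields shown in the output above."""
    author: str
    revision: int
    last_printed: Optional[str] = None


def flatten_dataclass_meta(meta: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of meta with dataclass instances replaced by plain dicts."""
    return {
        key: asdict(value) if is_dataclass(value) and not isinstance(value, type) else value
        for key, value in meta.items()
    }


meta = {
    "file_path": "sample_docx.docx",
    "docx": FakeDOCXMetadata(author="Saha, Anirban", revision=1),
}

error_message = ""
try:
    meta["docx"][0]  # mirrors the `data[0]` access in the traceback
except TypeError as exc:
    error_message = str(exc)

flat = flatten_dataclass_meta(meta)
print(error_message)
print(flat)
```

Compared with dropping the key entirely, this keeps the DOCX properties available for filtering, at the cost of relying on the store to handle nested values.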

julian-risch added the P2 Medium priority, add to the next sprint if no P1 available label Jan 16, 2025

julian-risch assigned anakin87 Jan 27, 2025
