unstructured: metadata get mixed up #331

lambda-science · 2024-02-02T14:27:26Z

Describe the bug
When indexing files with the Unstructured component, metadata gets mixed up (?).
For example, with this data:

    files = ["test/samples/sample1.pdf", "test/samples/sample2.pdf", "test/samples/sample3.pdf"]
    meta = [
        {"meta1": "value1", "source_test": "pytest_api"},
        {"meta2": "value2", "source_test": "pytest_api"},
        {"meta3": "value3", "source_test": "pytest_api"},
    ]

This data is then converted to JSON to make a call to my API, and I get these results:

    {
        "id": "b89ce3cc7ed9839459d1606018cf6beb720df0424515cd4cc9442b51970f72b8",
        "content": "blablablalba",
        "meta": {
            "meta2": "value2",
            "source_test": "pytest_api",
            "filename": "5f27_sample1.pdf",
            "s3_key": "bc6c_sample2.pdf",
            "file_path": "C:\\Users\\cmeyer\\code-project\\llm-ale-chatbot\\haystack_api\\rest_api\\file-upload\\5f27_sample1.pdf",
            "languages": [
                "fra"
            ],
            "page_number": 1,
            "filetype": "application/pdf",
            "category": "UncategorizedText"
        },
        "score": 0.0
    },

Here the content corresponds to the right file_path and filename (generated by Unstructured), BUT my CUSTOM metadata, which is not generated by Unstructured processing, is mixed up (meta2 and s3_key are wrong)!
Sorry for the not very reproducible example. I'm just writing to find out whether anyone has already had a similar issue. I will write a detailed report this weekend. This doesn't happen with PyPDF, so it's weird.

Describe your environment (please complete the following information):

OS: Windows running a Linux Docker container
Haystack version: 2.0.0 beta5
Integration version: 0.3.0

lambda-science · 2024-02-02T14:28:18Z

Might be related to an error in my implementation of the metadata field here: #242
Where could it come from...

anakin87 · 2024-02-02T14:32:44Z

Waiting for your detailed report to have a proper look.
Thanks!

lambda-science · 2024-02-04T10:41:32Z

> Waiting for your detailed report to have a proper look. Thanks!

Coming back with news, @anakin87.
I identified the bug in:

    @component.output_types(documents=List[Document])
    def run(
        self,
        paths: Union[List[str], List[os.PathLike]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):
        """
        Convert files to Haystack Documents using the Unstructured API (hosted or running locally).

        :param paths: List of paths to convert. Paths can be files or directories.
            If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
        :param meta: Optional metadata to attach to the Documents.
          This value can be either a list of dictionaries or a single dictionary.
          If it's a single dictionary, its content is added to the metadata of all produced Documents.
          If it's a list, the length of the list must match the number of paths, because the two lists will be zipped.
          Please note that if the paths contain directories, the length of the meta list must match
          the actual number of files contained.
          Defaults to `None`.
        """
        unique_paths = {Path(path) for path in paths}
        filepaths = {path for path in unique_paths if path.is_file()}
        filepaths_in_directories = {
            filepath for path in unique_paths if path.is_dir() for filepath in path.glob("*.*") if filepath.is_file()
        }

        all_filepaths = filepaths.union(filepaths_in_directories)
        # currently, the files are converted sequentially to gently handle API failures
        documents = []
        meta_list = normalize_metadata(meta, sources_count=len(all_filepaths))

We use a set here, unique_paths = {Path(path) for path in paths}, and in Python sets are not ordered. After converting our file paths to a set, the metadata order no longer corresponds to the file path order. This leads to metadata being attributed to the wrong file paths.
Maybe we should modify the logic here to not use a set? I will try to think of a solution.
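
To illustrate the ordering problem described above, here is a minimal sketch of one way to drop duplicate paths while keeping the order in which they were given, so that zipping with the metadata list stays aligned. The helper name dedupe_preserving_order is hypothetical and is not part of the integration; this is not necessarily how the component was eventually fixed.

```python
from pathlib import Path
from typing import Iterable, List


def dedupe_preserving_order(paths: Iterable[str]) -> List[Path]:
    """Hypothetical helper: remove duplicate paths but keep first-seen order.

    dict keys keep insertion order (guaranteed since Python 3.7),
    unlike a set, whose iteration order is arbitrary.
    """
    return list(dict.fromkeys(Path(p) for p in paths))


paths = ["test/samples/sample1.pdf", "test/samples/sample2.pdf", "test/samples/sample3.pdf"]
meta = [
    {"meta1": "value1", "source_test": "pytest_api"},
    {"meta2": "value2", "source_test": "pytest_api"},
    {"meta3": "value3", "source_test": "pytest_api"},
]

ordered = dedupe_preserving_order(paths)
# Each path is paired with the metadata given at the same position.
pairs = list(zip(ordered, meta))
```

Note that if duplicates are actually removed, the number of paths no longer matches the length of the metadata list, so a length check is still needed before zipping.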

lambda-science · 2024-02-04T10:52:04Z

And actually, I'm not sure why we need set logic here to make the file paths unique. I feel like it's up to the user to provide unique paths. For example, what happens if a user provides 10 paths and 10 metadata entries, but some paths are duplicated, so we end up with 8 file paths and 10 metadata entries? It will raise an error from normalize_metadata, I guess.

What I think is that:

We can support directories as paths, BUT then the metadata should be at most a single dictionary (the same metadata for all files in the directory), because I'm not sure it's clear how path.glob() orders files (leading to confusion in metadata attribution).
If direct paths to files are provided: don't make them unique with sets.
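
The two rules proposed above could be sketched roughly as follows. The function name collect_files and its exact behavior are assumptions made for illustration only; this is not the code merged in the fix. Directory contents are sorted here purely to make the output deterministic, which is an extra choice beyond what the proposal states.

```python
import tempfile
from pathlib import Path
from typing import Any, Dict, List, Optional, Union


def collect_files(
    paths: List[Union[str, Path]],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
) -> List[Path]:
    """Hypothetical sketch: resolve paths to files without using a set."""
    resolved = [Path(p) for p in paths]
    # Rule 1: with directories, only a single metadata dict (or None) is accepted.
    if isinstance(meta, list) and any(p.is_dir() for p in resolved):
        raise ValueError("A list of metadata is not supported when paths contain directories.")
    files: List[Path] = []
    for p in resolved:
        if p.is_file():
            # Rule 2: direct file paths keep the order given by the user.
            files.append(p)
        elif p.is_dir():
            # Sorted for determinism (an assumption, not part of the proposal).
            files.extend(sorted(f for f in p.glob("*.*") if f.is_file()))
    return files


# Small demonstration using a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("b.pdf", "a.pdf"):
        (Path(tmp) / name).touch()
    direct = [Path(tmp) / "b.pdf", Path(tmp) / "a.pdf"]
    kept_order = collect_files(direct, meta=[{"k": 1}, {"k": 2}])
    from_dir = collect_files([tmp], meta={"source": "demo"})
    try:
        collect_files([tmp], meta=[{"k": 1}, {"k": 2}])
        rejected = False
    except ValueError:
        rejected = True
```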

anakin87 · 2024-02-05T09:44:14Z

Released a new version with the bugfix: https://pypi.org/project/unstructured-fileconverter-haystack/0.3.1/

lambda-science added the bug Something isn't working label Feb 2, 2024

lambda-science mentioned this issue Feb 4, 2024

unstructured: fix metadata order mixed up #336

Merged

masci assigned anakin87 Feb 5, 2024

anakin87 closed this as completed in #336 Feb 5, 2024
