unstructured: metadata get mixed up #331

Closed
lambda-science opened this issue Feb 2, 2024 · 5 comments · Fixed by #336
Labels: bug (Something isn't working)

Comments

@lambda-science
Contributor

Describe the bug
When indexing files with the Unstructured component, metadata gets mixed up between files.
For example, with this data:

    files = ["test/samples/sample1.pdf", "test/samples/sample2.pdf", "test/samples/sample3.pdf" ]
    meta = [
        {"meta1": "value1", "source_test": "pytest_api"},
        {"meta2": "value2", "source_test": "pytest_api"},
        {"meta3": "value3", "source_test": "pytest_api"},
    ]

This data is then converted to JSON to make a call to my API, and I get these results:

    {
        "id": "b89ce3cc7ed9839459d1606018cf6beb720df0424515cd4cc9442b51970f72b8",
        "content": "blablablalba",
        "meta": {
            "meta2": "value2",
            "source_test": "pytest_api",
            "filename": "5f27_sample1.pdf",
            "s3_key": "bc6c_sample2.pdf",
            "file_path": "C:\\Users\\cmeyer\\code-project\\llm-ale-chatbot\\haystack_api\\rest_api\\file-upload\\5f27_sample1.pdf",
            "languages": [
                "fra"
            ],
            "page_number": 1,
            "filetype": "application/pdf",
            "category": "UncategorizedText"
        },
        "score": 0.0
    },

Here the content corresponds to the right file_path and filename (generated by Unstructured), BUT my CUSTOM metadata, which is not generated by Unstructured processing, is mixed up (meta2 and s3_key are wrong)!
Sorry for the not very reproducible example. I'm just writing to find out whether someone has already had a similar issue. I will write a detailed report this weekend. This doesn't happen with PyPDF, so it's weird.

Describe your environment (please complete the following information):

  • OS: Windows running a Linux Docker container
  • Haystack version: 2.0.0 beta5
  • Integration version: 0.3.0
@lambda-science lambda-science added the bug Something isn't working label Feb 2, 2024
@lambda-science
Contributor Author

This might be related to an error in my implementation of the metadata field here: #242
Where could it come from...

@anakin87
Member

anakin87 commented Feb 2, 2024

Waiting for your detailed report to have a proper look.
Thanks!

@lambda-science
Contributor Author

> Waiting for your detailed report to have a proper look. Thanks!

Coming back with news, @anakin87.
I identified the bug, in:

    @component.output_types(documents=List[Document])
    def run(
        self,
        paths: Union[List[str], List[os.PathLike]],
        meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
    ):
        """
        Convert files to Haystack Documents using the Unstructured API (hosted or running locally).

        :param paths: List of paths to convert. Paths can be files or directories.
            If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
        :param meta: Optional metadata to attach to the Documents.
          This value can be either a list of dictionaries or a single dictionary.
          If it's a single dictionary, its content is added to the metadata of all produced Documents.
          If it's a list, the length of the list must match the number of paths, because the two lists will be zipped.
          Please note that if the paths contain directories, the length of the meta list must match
          the actual number of files contained.
          Defaults to `None`.
        """
        unique_paths = {Path(path) for path in paths}
        filepaths = {path for path in unique_paths if path.is_file()}
        filepaths_in_directories = {
            filepath for path in unique_paths if path.is_dir() for filepath in path.glob("*.*") if filepath.is_file()
        }

        all_filepaths = filepaths.union(filepaths_in_directories)
        # currently, the files are converted sequentially to gently handle API failures
        documents = []
        meta_list = normalize_metadata(meta, sources_count=len(all_filepaths))

We use a set here (unique_paths = {Path(path) for path in paths}), and Python sets are unordered. Once the file paths are converted to a set, their iteration order no longer matches the order of the metadata list, so metadata gets attached to the wrong file paths.
Maybe we should change the logic here to avoid the set? I will try to think of a solution.
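To illustrate the ordering problem, here is a minimal sketch in plain Python. It does not call the integration itself; the file names are placeholders taken from the example above, and the variable names are only illustrative.

```python
from pathlib import Path

paths = ["sample1.pdf", "sample2.pdf", "sample3.pdf"]
meta = [{"meta1": "value1"}, {"meta2": "value2"}, {"meta3": "value3"}]

# A set has no defined iteration order, so zipping it with `meta` may pair
# a path with metadata that was meant for a different file.
unordered = {Path(p) for p in paths}
possibly_misaligned = list(zip(unordered, meta))

# A list keeps the caller's order, so position i in `paths` always lines up
# with position i in `meta`.
ordered = [Path(p) for p in paths]
aligned = list(zip(ordered, meta))

# If deduplication is still wanted, dict.fromkeys keeps first-seen order
# (dicts preserve insertion order since Python 3.7).
deduplicated = list(dict.fromkeys(ordered))
```

Whether the set version actually misaligns on a given run depends on hash ordering, which is why the bug can look intermittent.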

@lambda-science
Contributor Author

And actually, I'm not sure why we need set logic here to make file paths unique. I feel like it's up to the user to provide unique paths. For example, what happens if a user provides 10 paths and 10 metadata entries, but some file paths are duplicated, so we end up with 8 file paths and 10 metadata entries? It will raise an error from normalize_metadata, I guess.

What I think is:

  1. We can support directories as paths, BUT then the metadata should have a length of at most 1 (the same metadata for all files in the directory), because it's not clear how path.glob() orders files (which leads to confusion in metadata attribution).
  2. If direct paths to files are provided, don't make them unique with sets.
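The two points above could be sketched roughly as follows. This is only an illustration of the proposal, not the fix that was eventually merged: pair_paths_with_meta is a hypothetical helper, and sorting the directory contents is an extra assumption added here to make the order deterministic.

```python
from pathlib import Path
from typing import Any, Dict, List, Optional, Tuple, Union


def pair_paths_with_meta(
    paths: List[Union[str, Path]],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
) -> List[Tuple[Path, Dict[str, Any]]]:
    """Hypothetical helper: pair each file with its metadata, keeping caller order."""
    given = [Path(p) for p in paths]  # a list, not a set: order and duplicates are kept
    has_directory = any(p.is_dir() for p in given)

    if isinstance(meta, list):
        # Point 1: per-file metadata is ambiguous once directories are expanded.
        if has_directory:
            raise ValueError("With directories in `paths`, `meta` must be a single dict.")
        if len(meta) != len(given):
            raise ValueError("The `meta` list length must match the number of paths.")
        # Point 2: zip by position, without deduplicating the paths.
        return [(p, dict(m)) for p, m in zip(given, meta)]

    shared = meta or {}
    files: List[Path] = []
    for p in given:
        if p.is_dir():
            # Assumption: sort so the expansion order is deterministic.
            files.extend(sorted(f for f in p.glob("*.*") if f.is_file()))
        elif p.is_file():
            files.append(p)
    # The same metadata (copied) is attached to every file.
    return [(f, dict(shared)) for f in files]
```

With this shape, a metadata list is only accepted when every entry in paths is a direct path, so the positional pairing cannot drift.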

@anakin87
Member

anakin87 commented Feb 5, 2024

Released a new version with the bugfix: https://pypi.org/project/unstructured-fileconverter-haystack/0.3.1/
