unstructured: metadata get mixed up #331
Comments
This might be related to an error in my implementation of the metadata field here: #242
Waiting for your detailed report to have a proper look.
Coming back with news @anakin87

@component.output_types(documents=List[Document])
def run(
    self,
    paths: Union[List[str], List[os.PathLike]],
    meta: Optional[Union[Dict[str, Any], List[Dict[str, Any]]]] = None,
):
    """
    Convert files to Haystack Documents using the Unstructured API (hosted or running locally).

    :param paths: List of paths to convert. Paths can be files or directories.
        If a path is a directory, all files in the directory are converted. Subdirectories are ignored.
    :param meta: Optional metadata to attach to the Documents.
        This value can be either a list of dictionaries or a single dictionary.
        If it's a single dictionary, its content is added to the metadata of all produced Documents.
        If it's a list, the length of the list must match the number of paths, because the two lists will be zipped.
        Please note that if the paths contain directories, the length of the meta list must match
        the actual number of files contained.
        Defaults to `None`.
    """
    unique_paths = {Path(path) for path in paths}
    filepaths = {path for path in unique_paths if path.is_file()}
    filepaths_in_directories = {
        filepath for path in unique_paths if path.is_dir() for filepath in path.glob("*.*") if filepath.is_file()
    }
    all_filepaths = filepaths.union(filepaths_in_directories)
    # currently, the files are converted sequentially to gently handle API failures
    documents = []
    meta_list = normalize_metadata(meta, sources_count=len(all_filepaths))

We use a set here, and a set does not preserve the order of the input paths.
Actually, I'm not sure why we need set logic here to make the file paths unique. I feel it's up to the user to provide unique paths. For example, what happens if a user provides 10 paths and 10 metadata entries, but some file paths are duplicated, so we end up with 8 file paths and 10 metadata entries? It will raise an error.
What I think is that:
Released a new version with the bugfix: https://pypi.org/project/unstructured-fileconverter-haystack/0.3.1/
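For readers following the set discussion above, here is a minimal sketch of one way to deduplicate paths while keeping each path tied to its own metadata. It is not the code from the 0.3.1 fix. `pair_paths_with_meta` is a hypothetical helper, the file names and `s3_key` values are made up for illustration, and the "first entry wins" rule for duplicates is an assumption.

```python
from pathlib import Path
from typing import Any, Dict, List, Tuple


def pair_paths_with_meta(
    paths: List[str], meta: List[Dict[str, Any]]
) -> List[Tuple[Path, Dict[str, Any]]]:
    """Hypothetical helper: pair each path with its metadata, keeping input order."""
    if len(paths) != len(meta):
        raise ValueError(f"Got {len(paths)} paths but {len(meta)} metadata entries")
    paired: Dict[Path, Dict[str, Any]] = {}
    for path, path_meta in zip(paths, meta):
        # A dict preserves insertion order (Python 3.7+). On a duplicate path,
        # the first metadata entry wins (an assumption made for this sketch).
        paired.setdefault(Path(path), path_meta)
    return list(paired.items())


# Made-up inputs: "b.pdf" appears twice.
paths = ["b.pdf", "a.pdf", "b.pdf"]
meta = [{"s3_key": "key-b"}, {"s3_key": "key-a"}, {"s3_key": "key-b-again"}]

# Deduplicating with a set shrinks 3 paths to 2 while meta still has 3 entries,
# and the set's iteration order is not guaranteed to match the input order.
unique_with_set = {Path(p) for p in paths}

# Pairing before deduplicating keeps every path next to its own metadata.
pairs = pair_paths_with_meta(paths, meta)
```

Pairing before deduplicating means the path count and the metadata count cannot drift apart, which is the 8 versus 10 situation described above.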
Describe the bug
When indexing files with the Unstructured component, metadata gets mixed up (?)
For example, with these data:
Then these data are converted to JSON to make a call to my API.
I get these results:
Here the content corresponds to the right file_path and filename (generated by Unstructured), BUT my CUSTOM metadata, which is not generated by Unstructured processing, is mixed up (meta2 and s3_key are wrong)!
Sorry for the not very reproducible example. I'm just writing to find out whether someone has already had a similar issue. I will make a detailed report this weekend. This doesn't happen with PyPDF, so it's weird.
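Until a full reproduction is available, a small check like the one below could help confirm which files received the wrong custom metadata. This is only a sketch. `find_meta_mismatches` is a hypothetical helper, the sample values are made up, and it assumes each produced document's metadata is available as a plain dict containing a `file_path` key.

```python
from typing import Any, Dict, List


def find_meta_mismatches(
    expected: Dict[str, Dict[str, Any]],
    documents_meta: List[Dict[str, Any]],
    path_key: str = "file_path",
) -> List[str]:
    """Return file paths whose custom metadata differs from what was supplied."""
    mismatched = []
    for doc_meta in documents_meta:
        path = doc_meta[path_key]
        wanted = expected.get(path, {})
        # Compare only the custom keys that were supplied for this path;
        # keys added by the converter itself are ignored.
        if any(doc_meta.get(key) != value for key, value in wanted.items()):
            mismatched.append(path)
    return mismatched


# Made-up example: the custom s3_key for a.pdf was swapped with the one for b.pdf.
expected = {
    "a.pdf": {"s3_key": "key-a"},
    "b.pdf": {"s3_key": "key-b"},
}
documents_meta = [
    {"file_path": "a.pdf", "filename": "a.pdf", "s3_key": "key-b"},
    {"file_path": "b.pdf", "filename": "b.pdf", "s3_key": "key-b"},
]

wrong = find_meta_mismatches(expected, documents_meta)
```

With Haystack Documents, the second argument would be something like `[doc.meta for doc in documents]`, assuming the converter stores the path under `file_path` as in the results described above.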
Describe your environment (please complete the following information):