`Element`'s `id` attribute missing in `Document`'s `metadata` field when using `UnstructuredFileLoader` #20227

rchen19 · 2024-04-09T19:30:21Z

Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

Copy the id attribute of each Element object, produced by the unstructured package's partition function when applied to PDF files, into the metadata field of the corresponding LangChain Document object.

Motivation

When using UnstructuredFileLoader with mode="elements" (for example, together with strategy="hi_res"), the loader partitions PDF files into Elements, and each Element carries an element id as an attribute of the Element object. Currently, when UnstructuredFileLoader converts those Element objects into LangChain Document objects, that id is discarded. However, this information is useful: a Document object's metadata can contain parent_id, which points to the id of the Element that unstructured considers its parent. With both values available, the ids can be used to build a hierarchical structure of all Document objects extracted from a file.
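
To illustrate the hierarchy idea, here is a minimal sketch that groups child ids under their parent id. It assumes each metadata dict already carries an `id` key (which is what this request asks for) next to `parent_id`. The helper name `build_children_index` and the sample ids are hypothetical and not part of LangChain or unstructured.

```python
from collections import defaultdict
from typing import Dict, List


def build_children_index(metadatas: List[dict]) -> Dict[str, List[str]]:
    """Map each parent element id to the ids of its direct children."""
    children: Dict[str, List[str]] = defaultdict(list)
    for meta in metadatas:
        parent_id = meta.get("parent_id")
        element_id = meta.get("id")
        # Only elements that know both their own id and their parent's id
        # can be placed in the hierarchy.
        if parent_id is not None and element_id is not None:
            children[parent_id].append(element_id)
    return dict(children)


# Hypothetical metadata, shaped like Document.metadata after the proposed change.
docs_metadata = [
    {"id": "title-1", "category": "Title"},
    {"id": "para-1", "parent_id": "title-1", "category": "NarrativeText"},
    {"id": "para-2", "parent_id": "title-1", "category": "NarrativeText"},
]

tree = build_children_index(docs_metadata)
```

Without the `id` key, the `parent_id` values have nothing to resolve against, so no such index can be built from the Document objects alone.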

Proposal (If applicable)

Modify the UnstructuredBaseLoader.lazy_load() method as follows; see the block that starts with if hasattr(element, "id"):

    def lazy_load(self) -> Iterator[Document]:
        """Load file."""
        elements = self._get_elements()
        self._post_process_elements(elements)
        if self.mode == "elements":
            for element in elements:
                metadata = self._get_metadata()
                # NOTE(MthwRobinson) - the attribute check is for backward compatibility
                # with unstructured<0.4.9. The metadata attribute was added in 0.4.9.
                if hasattr(element, "metadata"):
                    metadata.update(element.metadata.to_dict())
                if hasattr(element, "category"):
                    metadata["category"] = element.category
                if hasattr(element, "id"):
                    # Add the element id to the metadata so that a parent element
                    # can be identified via the `parent_id` field in the metadata.
                    # This is not present in
                    # `langchain_community.document_loaders.unstructured.UnstructuredBaseLoader`.
                    metadata["id"] = element.id
                yield Document(page_content=str(element), metadata=metadata)
        elif self.mode == "paged":
            text_dict: Dict[int, str] = {}
            meta_dict: Dict[int, Dict] = {}

            for idx, element in enumerate(elements):
                metadata = self._get_metadata()
                if hasattr(element, "metadata"):
                    metadata.update(element.metadata.to_dict())
                page_number = metadata.get("page_number", 1)

                # Check if this page_number already exists in text_dict
                if page_number not in text_dict:
                    # If not, create new entry with initial text and metadata
                    text_dict[page_number] = str(element) + "\n\n"
                    meta_dict[page_number] = metadata
                else:
                    # If exists, append to text and update the metadata
                    text_dict[page_number] += str(element) + "\n\n"
                    meta_dict[page_number].update(metadata)

            # Convert the dict to a list of Document objects
            for key in text_dict.keys():
                yield Document(page_content=text_dict[key], metadata=meta_dict[key])
        elif self.mode == "single":
            metadata = self._get_metadata()
            text = "\n\n".join([str(el) for el in elements])
            yield Document(page_content=text, metadata=metadata)
        else:
            raise ValueError(f"mode of {self.mode} not supported.")
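
The effect of the added check in the "elements" branch can be sketched in isolation. The snippet below does not import LangChain or unstructured: `FakeElement` is a hypothetical stand-in for an unstructured Element, and `element_metadata` is an illustrative helper that mirrors only the three `hasattr` checks from the proposal.

```python
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class FakeElement:
    """Hypothetical stand-in for an unstructured Element; not the real class."""

    id: str
    category: str
    text: str

    def __str__(self) -> str:
        return self.text


def element_metadata(element: Any, base: Dict[str, Any]) -> Dict[str, Any]:
    """Mirror the per-element metadata merging from the proposed "elements" branch."""
    metadata = dict(base)
    if hasattr(element, "metadata"):
        metadata.update(element.metadata.to_dict())
    if hasattr(element, "category"):
        metadata["category"] = element.category
    if hasattr(element, "id"):
        # The proposed addition: keep the element id alongside parent_id.
        metadata["id"] = element.id
    return metadata


element = FakeElement(id="abc123", category="Title", text="Introduction")
result = element_metadata(element, {"source": "example.pdf"})
```

Here `result` contains the base `source` entry plus `category` and the new `id` key; the stand-in has no `metadata` attribute, so that check is skipped.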

Would be happy to open a PR if this looks acceptable.
