I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it
Feature request
Copy the id attribute of each Element object, as produced by the unstructured package's partition function for PDF files, into the metadata field of the corresponding LangChain Document object.
Motivation
When using UnstructuredFileLoader with mode="elements" (for example together with strategy="hi_res"), the loader partitions PDF files into Elements, and each element carries an element id as an attribute of the Element object. Currently, when UnstructuredFileLoader converts those Element objects into LangChain Document objects, that id is discarded. This information is useful because a Document's metadata can contain parent_id, which points to the id of the Element that unstructured considers its parent. With both id and parent_id available, the Documents extracted from a file can be arranged into a hierarchical structure.
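To illustrate the kind of hierarchy this would enable, here is a minimal sketch, assuming each document's metadata carries an id and, optionally, a parent_id. The Doc class is a hypothetical stand-in for LangChain's Document, and group_by_parent and outline are illustrative helpers, not part of either library.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Doc:
    """Hypothetical stand-in for a Document: page_content plus metadata."""

    page_content: str
    metadata: dict = field(default_factory=dict)


def group_by_parent(docs: List[Doc]) -> Dict[Optional[str], List[Doc]]:
    """Map each parent id to its children; root documents go under None."""
    known_ids = {d.metadata.get("id") for d in docs}
    children: Dict[Optional[str], List[Doc]] = defaultdict(list)
    for doc in docs:
        parent_id = doc.metadata.get("parent_id")
        # A missing or unknown parent_id makes the document a root.
        key = parent_id if parent_id in known_ids else None
        children[key].append(doc)
    return dict(children)


def outline(
    children: Dict[Optional[str], List[Doc]],
    parent: Optional[str] = None,
    depth: int = 0,
) -> List[str]:
    """Return the documents as indented lines, one nesting level per parent."""
    lines: List[str] = []
    for doc in children.get(parent, []):
        lines.append("  " * depth + doc.page_content)
        doc_id = doc.metadata.get("id")
        if doc_id is not None:  # documents without an id cannot have children
            lines.extend(outline(children, doc_id, depth + 1))
    return lines


docs = [
    Doc("Title", {"id": "a"}),
    Doc("Para 1", {"id": "b", "parent_id": "a"}),
    Doc("Para 2", {"id": "c", "parent_id": "a"}),
    Doc("Orphan", {"id": "d", "parent_id": "missing"}),
]
tree_lines = outline(group_by_parent(docs))
```

Without the id in metadata, the lookup from parent_id back to the parent document is not possible, which is the gap this request is about.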
Proposal (If applicable)
Modify the UnstructuredBaseLoader.lazy_load() method as follows; see the line that starts with if hasattr(element, "id"):
def lazy_load(self) -> Iterator[Document]:
    """Load file."""
    elements = self._get_elements()
    self._post_process_elements(elements)
    if self.mode == "elements":
        for element in elements:
            metadata = self._get_metadata()
            # NOTE(MthwRobinson) - the attribute check is for backward compatibility
            # with unstructured<0.4.9. The metadata attribute was added in 0.4.9.
            if hasattr(element, "metadata"):
                metadata.update(element.metadata.to_dict())
            if hasattr(element, "category"):
                metadata["category"] = element.category
            if hasattr(element, "id"):
                # add document/element id to metadata so that a parent document
                # can be identified using `parent_id` field in metadata
                # this is not present in
                # `langchain_community.document_loaders.unstructured.UnstructuredBaseLoader`
                metadata["id"] = element.id
            yield Document(page_content=str(element), metadata=metadata)
    elif self.mode == "paged":
        text_dict: Dict[int, str] = {}
        meta_dict: Dict[int, Dict] = {}
        for idx, element in enumerate(elements):
            metadata = self._get_metadata()
            if hasattr(element, "metadata"):
                metadata.update(element.metadata.to_dict())
            page_number = metadata.get("page_number", 1)
            # Check if this page_number already exists in docs_dict
            if page_number not in text_dict:
                # If not, create new entry with initial text and metadata
                text_dict[page_number] = str(element) + "\n\n"
                meta_dict[page_number] = metadata
            else:
                # If exists, append to text and update the metadata
                text_dict[page_number] += str(element) + "\n\n"
                meta_dict[page_number].update(metadata)
        # Convert the dict to a list of Document objects
        for key in text_dict.keys():
            yield Document(page_content=text_dict[key], metadata=meta_dict[key])
    elif self.mode == "single":
        metadata = self._get_metadata()
        text = "\n\n".join([str(el) for el in elements])
        yield Document(page_content=text, metadata=metadata)
    else:
        raise ValueError(f"mode of {self.mode} not supported.")
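The id-copying step from the elements branch can also be tried in isolation. The following is a rough sketch, not the loader's actual code: element_metadata is a hypothetical helper that mirrors the three hasattr checks, and the fake element built from SimpleNamespace only imitates the attributes the checks look for; real elements come from unstructured.

```python
from types import SimpleNamespace
from typing import Any, Dict


def element_metadata(element: Any, base: Dict[str, Any]) -> Dict[str, Any]:
    """Build per-element metadata, keeping the element id when it exists."""
    metadata = dict(base)  # copy so the shared base metadata is not mutated
    if hasattr(element, "metadata"):
        metadata.update(element.metadata.to_dict())
    if hasattr(element, "category"):
        metadata["category"] = element.category
    if hasattr(element, "id"):
        # The proposed addition: expose the id so parent_id can be resolved.
        metadata["id"] = element.id
    return metadata


# Fake element for illustration only (assumed attribute values).
fake_meta = SimpleNamespace(
    to_dict=lambda: {"page_number": 1, "parent_id": "title-1"}
)
fake_element = SimpleNamespace(
    metadata=fake_meta, category="NarrativeText", id="para-7"
)

result = element_metadata(fake_element, {"source": "example.pdf"})
```

Because the check uses hasattr, an element without an id attribute simply produces metadata without an "id" key, so the change should not affect older element types.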
Would be happy to open a PR if this looks acceptable.