community: add init for unstructured file loader #29101

Merged
merged 3 commits into from
Jan 13, 2025

Conversation

Marsman1996
Contributor

Description

Add __init__ to the Unstructured loaders for epub/image/markdown/pdf/ppt/word to restrict the input type to str or Path.
The inherited signature declares file_path: str | List[str] | Path | List[Path], but in practice these loaders only accept a single str or Path.
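
The idea can be sketched roughly as follows. This is only an illustration under assumptions: BaseLoaderSketch and SingleFileLoaderSketch are hypothetical stand-ins, not the classes touched by this PR, and the real loaders take additional arguments.

```python
from pathlib import Path
from typing import Any, List, Union


class BaseLoaderSketch:
    """Hypothetical stand-in for a base loader with the broad signature."""

    def __init__(
        self,
        file_path: Union[str, List[str], Path, List[Path]],
        mode: str = "single",
        **unstructured_kwargs: Any,
    ) -> None:
        self.file_path = file_path
        self.mode = mode
        self.unstructured_kwargs = unstructured_kwargs


class SingleFileLoaderSketch(BaseLoaderSketch):
    """Hypothetical file-type loader that accepts exactly one str or Path."""

    def __init__(
        self,
        file_path: Union[str, Path],
        mode: str = "single",
        **unstructured_kwargs: Any,
    ) -> None:
        # Narrow the annotation and normalize to str, so that downstream
        # code and metadata always see the same type.
        super().__init__(str(file_path), mode=mode, **unstructured_kwargs)


loader = SingleFileLoaderSketch(Path("example.md"), mode="elements")
print(type(loader.file_path).__name__)
```

With the narrower annotation, a type checker can flag a list argument at the call site instead of the loader failing later at load time.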

Issue

None

Dependencies

No changes.

@dosubot dosubot bot added the size:L This PR changes 100-499 lines, ignoring generated files. label Jan 8, 2025
@dosubot dosubot bot added community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Jan 8, 2025
@Marsman1996
Contributor Author

I added a conversion to str for the unstructured file loaders, since the type of source in the metadata depends on the type of file_path the user passes in.

For example,

from pathlib import Path
from langchain_community.document_loaders import UnstructuredMarkdownLoader

docs = UnstructuredMarkdownLoader(
    Path(
        "../langchain/docs/docs/integrations/document_loaders/example_data/example.md"
    ),
    mode="elements",
    strategy="fast",
).load()

for doc in docs:
    print(doc.metadata)

The metadata is:

{'source': PosixPath('../langchain/docs/docs/integrations/document_loaders/example_data/example.md'), 'languages': ['eng'], 'file_directory': '../langchain/docs/docs/integrations/document_loaders/example_data', 'filename': 'example.md', 'filetype': 'text/markdown', 'last_modified': '2025-01-08T17:28:16', 'parent_id': 'af32054c80c84c0b93ef2dae509ac64b', 'category': 'UncategorizedText', 'element_id': '038864d7e3bfa2181807d629b7bf7327'}

As we can see, source here is a PosixPath rather than one of str, bool, int, or float, so it could be filtered out by filter_complex_metadata.
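
To illustrate why the type matters, here is a minimal sketch that does not call LangChain at all. keep_simple_values is a hypothetical stand-in that mimics the kind of check described above (keep only str, bool, int, and float values); the real filter_complex_metadata may differ in detail.

```python
from pathlib import Path
from typing import Any, Dict

# Hypothetical stand-in for the check described above; this is an
# assumption for illustration, not the LangChain implementation.
SIMPLE_TYPES = (str, bool, int, float)


def keep_simple_values(metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Drop every entry whose value is not a str, bool, int, or float."""
    return {k: v for k, v in metadata.items() if isinstance(v, SIMPLE_TYPES)}


path = Path("example_data/example.md")

# With a Path object as the source, the entry is dropped.
with_path = keep_simple_values({"source": path, "filename": "example.md"})

# After casting to str, the source entry survives the filter.
with_str = keep_simple_values({"source": str(path), "filename": "example.md"})

print(with_path)
print(with_str)
```

Casting file_path to str in the loader keeps source in the metadata regardless of which type the caller passed.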

"multi", or "all". Default is "single".
**unstructured_kwargs: Any kwargs to pass to the unstructured.
"""
file_path = str(file_path)
Collaborator

Are these casts necessary? Some loaders here appear to support Path:

file_path = Path(__file__).parent.parent / "examples/hello.pdf"
loader = UnstructuredPDFLoader(file_path, mode="elements")

(Those integration tests are not run in CI but are intended to be run locally by developers.)

Collaborator

ah just saw your comment.

Collaborator

@ccurme ccurme left a comment

Thanks for this. Have you verified that all of the loaders updated here are incompatible with list[str / Path]?


As an aside, I recommend checking out langchain-unstructured as we intend for that package to absorb any Unstructured functionality in langchain-community.

It has the advantage that the unstructured dependency is managed explicitly, so we don't need to do the checks you cleaned up here. The package can also be versioned, whereas this is difficult to do with langchain-community.

@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Jan 10, 2025
@Marsman1996
Contributor Author

Thanks for this. Have you verified that all of the loaders updated here are incompatible with list[str / Path]?

I double-checked all of the Unstructured file-type-specific loaders. Unlike the base UnstructuredFileLoader, all 17 of them are incompatible with list[str / Path]. If we want these loaders to be compatible with list[str / Path], we need to modify their _get_elements to follow the one in UnstructuredFileLoader:

def _get_elements(self) -> List[Element]:
    from unstructured.partition.auto import partition

    if isinstance(self.file_path, list):
        elements: List[Element] = []
        for file in self.file_path:
            if isinstance(file, Path):
                file = str(file)
            elements.extend(partition(filename=file, **self.unstructured_kwargs))
        return elements
    else:
        if isinstance(self.file_path, Path):
            self.file_path = str(self.file_path)
        return partition(filename=self.file_path, **self.unstructured_kwargs)

UnstructuredCHMLoader is also incompatible with the Path type; I forgot to add an __init__ for it and will add a commit later.

Traceback (most recent call last):
  File "/home/marsman1996/afgen/test/./hello.py", line 21, in <module>
    UnstructuredCHMLoader(Path("./STEM_2015_12_08.chm")).load()
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^
  File "/home/marsman1996/afgen/test/.venv/lib/python3.13/site-packages/langchain_core/document_loaders/base.py", line 31, in load
    return list(self.lazy_load())
  File "/home/marsman1996/afgen/test/.venv/lib/python3.13/site-packages/langchain_community/document_loaders/unstructured.py", line 107, in lazy_load
    elements = self._get_elements()
  File "/home/marsman1996/afgen/test/.venv/lib/python3.13/site-packages/langchain_community/document_loaders/chm.py", line 30, in _get_elements
    with CHMParser(self.file_path) as f:  # type: ignore[arg-type]
         ~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/home/marsman1996/afgen/test/.venv/lib/python3.13/site-packages/langchain_community/document_loaders/chm.py", line 48, in __init__
    self.file.LoadCHM(path)
    ~~~~~~~~~~~~~~~~~^^^^^^
  File "/home/marsman1996/afgen/test/.venv/lib/python3.13/site-packages/chm/chm.py", line 216, in LoadCHM
    self.file = chmlib.chm_open(archiveName)
                ~~~~~~~~~~~~~~~^^^^^^^^^^^^^
  File "/home/marsman1996/afgen/test/.venv/lib/python3.13/site-packages/chm/chmlib.py", line 20, in chm_open
    return _chmlib.chm_open(filename)
           ~~~~~~~~~~~~~~~~^^^^^^^^^^
TypeError: a bytes-like object is required, not 'PosixPath'
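
The failure mode in the traceback can be reproduced in isolation. In the sketch below, native_open is a hypothetical stand-in for a C-extension call that only accepts str or bytes (it is not the chmlib API), and open_chm is an illustrative wrapper name; only os.fspath and os.fsdecode are real standard-library calls.

```python
import os
from pathlib import Path
from typing import Union


def native_open(filename: Union[str, bytes]) -> str:
    """Hypothetical stand-in for a C-extension call that rejects Path objects."""
    if not isinstance(filename, (str, bytes)):
        raise TypeError(
            f"a bytes-like object is required, not {type(filename).__name__!r}"
        )
    return f"opened {os.fsdecode(filename)}"


def open_chm(file_path: Union[str, Path]) -> str:
    # os.fspath returns str and bytes unchanged and converts os.PathLike
    # objects such as pathlib.Path to their string form. It raises
    # TypeError for anything else, including a list.
    return native_open(os.fspath(file_path))


result = open_chm(Path("STEM_2015_12_08.chm"))
print(result)
```

Converting once at the boundary, whether with str() in __init__ or os.fspath at the call site, keeps the Path-unaware library from ever seeing a Path object.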

@Marsman1996
Contributor Author

As an aside, I recommend checking out langchain-unstructured as we intend for that package to absorb any Unstructured functionality in langchain-community.

It has the advantage that the unstructured dependency is managed explicitly, so we don't need to do the checks you cleaned up here. The package can also be versioned, whereas this is difficult to do with langchain-community.

Thank you for your advice! I will check that out!

@Marsman1996 Marsman1996 changed the title community: add init for unstructured epub/image/markdown/pdf/ppt/word community: add init for unstructured file loader Jan 11, 2025
@Marsman1996
Contributor Author

langchain-unstructured works perfectly fine!

I'm wondering why langchain_community uses multiple Unstructured file-type loaders that each call partition_xx, while langchain-unstructured uses a single loader that only calls partition.

@ccurme
Collaborator

ccurme commented Jan 13, 2025

langchain-unstructured works perfectly fine!

I'm wondering why langchain_community uses multiple Unstructured file-type loaders that each call partition_xx, while langchain-unstructured uses a single loader that only calls partition.

Good question. I don't know! The community integrations pre-date langchain-unstructured, so it's possible that use of the generic partition brick became more popular after. langchain-unstructured can also use the Unstructured API, and maybe that likes to do mime type detection. All speculation!

@ccurme ccurme merged commit f980144 into langchain-ai:master Jan 13, 2025
19 checks passed
Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) lgtm PR looks good. Use to confirm that a PR is ready for merging. size:L This PR changes 100-499 lines, ignoring generated files.
2 participants