community: add init for unstructured file loader #29101
I added the type conversion to
For example:
from pathlib import Path
from langchain_community.document_loaders import UnstructuredMarkdownLoader

docs = UnstructuredMarkdownLoader(
    Path(
        "../langchain/docs/docs/integrations/document_loaders/example_data/example.md"
    ),
    mode="elements",
    strategy="fast",
).load()
for doc in docs:
    print(doc.metadata)
The metadata is:
As we can see, the |
"multi", or "all". Default is "single".
**unstructured_kwargs: Any kwargs to pass to the unstructured.
"""
file_path = str(file_path)
Are these casts necessary? Some loaders here appear to support Path:
langchain/libs/community/tests/integration_tests/document_loaders/test_pdf.py
Lines 20 to 21 in 0a54aed
file_path = Path(__file__).parent.parent / "examples/hello.pdf"
loader = UnstructuredPDFLoader(file_path, mode="elements")
(Those integration tests are not run in CI but are intended to be run locally by developers.)
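For context on what the cast under discussion does, here is a minimal standard-library sketch. `normalize_path` is a hypothetical helper used only for illustration, not a function in langchain-community. It shows that `str()` leaves a `str` unchanged and turns a `Path` into the same value that `os.fspath` returns.

```python
import os
from pathlib import Path
from typing import Union


def normalize_path(file_path: Union[str, Path]) -> str:
    """Hypothetical helper: coerce a str or Path to a plain str."""
    return str(file_path)


as_str = normalize_path("examples/hello.pdf")
as_path = normalize_path(Path("examples") / "hello.pdf")

# A str passes through unchanged.
print(as_str == "examples/hello.pdf")
# For a Path, str() agrees with os.fspath(), so path-aware code sees the same value.
print(as_path == os.fspath(Path("examples") / "hello.pdf"))
```

Whether the cast is needed therefore depends on whether the code downstream of the loader requires a plain `str` or already accepts path-like objects.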
Ah, just saw your comment.
Thanks for this. Have you verified that all of the loaders updated here are incompatible with list[str / Path]?
As an aside, I recommend checking out langchain-unstructured, as we intend for that package to absorb any Unstructured functionality in langchain-community.
It has the advantage that the unstructured dependency is managed explicitly, so we don't need to do the checks you cleaned up here. The package can also be versioned, whereas this is difficult to do with langchain-community.
I double-checked all the unstructured-specific file type loaders, unlike the base
langchain/libs/community/langchain_community/document_loaders/unstructured.py
Lines 215 to 228 in bbc3e3b
And
Thank you for your advice! I will check that out!
langchain-unstructured works perfectly fine! I'm wondering why in langchain_community we use multiple unstructured file type loaders which call
Good question. I don't know! The community integrations pre-date |
Description
Add __init__ for the unstructured loaders of epub/image/markdown/pdf/ppt/word to restrict the input type to str or Path.
In the signature these unstructured loaders receive file_path: str | List[str] | Path | List[Path], but actually they only receive str or Path.
Issue
None
Dependencies
No changes.
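A minimal sketch of the idea in the description, assuming a simplified stand-in for the loaders. `BaseFileLoader` and `MarkdownLoader` are hypothetical names for illustration only, not the classes in langchain-community. The base class advertises the wide `file_path` type, while the subclass `__init__` narrows it to `str` or `Path`. The runtime `isinstance` check is an addition for the sketch; the diff excerpt in this PR shows a `str(file_path)` cast instead.

```python
from pathlib import Path
from typing import Any, List, Union


class BaseFileLoader:
    """Hypothetical base class whose signature accepts single paths or lists."""

    def __init__(
        self,
        file_path: Union[str, List[str], Path, List[Path]],
        mode: str = "single",
        **kwargs: Any,
    ) -> None:
        self.file_path = file_path
        self.mode = mode
        self.kwargs = kwargs


class MarkdownLoader(BaseFileLoader):
    """Hypothetical file-type loader that only handles a single file."""

    def __init__(
        self,
        file_path: Union[str, Path],
        mode: str = "single",
        **kwargs: Any,
    ) -> None:
        # Annotations are not enforced at runtime, so reject lists explicitly.
        if not isinstance(file_path, (str, Path)):
            raise TypeError(
                f"file_path must be str or Path, got {type(file_path).__name__}"
            )
        super().__init__(file_path=file_path, mode=mode, **kwargs)


loader = MarkdownLoader(Path("example.md"), mode="elements", strategy="fast")
print(type(loader.file_path).__name__, loader.mode, loader.kwargs)
```

With the narrowed signature, a type checker flags `MarkdownLoader(["a.md", "b.md"])` at the call site, and the explicit check raises `TypeError` if such a call is made anyway.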