community: Added propagation of document metadata from O365BaseLoader #20663
…temBlobLoader (that O365BaseLoader uses under the hood). Also modified SharePointLoader to propagate `web_url` in metadata to the output of the parser.
…N/langchain into triska/O365_loader_update
@eyurtsev @baskaryan The PR is ready for review and merging! :)
@eyurtsev @baskaryan Can you please give me some feedback and a way to move forward with this PR? Thanks!
This PR is required for an internal deployment in a large enterprise environment. We'd really like to see it merged. Given that all checks pass, and the changes are relatively minor, what would help get it merged into an official version? Any thoughts @hwchase17 ? Happy to discuss the use case in more detail if that helps. Thanks.
!ping @eyurtsev @baskaryan @hwchase17
@@ -58,6 +68,7 @@ def __init__(
glob: str = "**/[!.]*",
exclude: Sequence[str] = (),
suffixes: Optional[Sequence[str]] = None,
metadata_dict: Optional[Dict[str, Dict[str, Any]]] = None,
This logic should be handled outside the file system blob loader. The code that uses the BlobLoader can inspect the blobs and add metadata to them based on their paths.
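A minimal sketch of what the reviewer describes, assuming a stand-in blob type: `SimpleBlob`, `attach_metadata`, and the `metadata_by_name` mapping are hypothetical names used for illustration only and are not LangChain APIs. The idea is that the calling code, not the blob loader, looks up extra metadata by file name and attaches it to each blob.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator, Optional


@dataclass
class SimpleBlob:
    """Hypothetical stand-in for a blob; NOT LangChain's Blob class."""

    path: Path
    mimetype: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def attach_metadata(
    blobs: Iterable[SimpleBlob],
    metadata_by_name: Dict[str, Dict[str, Any]],
) -> Iterator[SimpleBlob]:
    """Yield each blob enriched with metadata looked up by its file name."""
    for blob in blobs:
        extra = metadata_by_name.get(blob.path.name, {})
        # Prefer an explicitly known mimetype over whatever the loader guessed.
        if "mimetype" in extra:
            blob.mimetype = extra["mimetype"]
        # Copy the remaining keys (for example a source URL) into the metadata.
        blob.metadata.update({k: v for k, v in extra.items() if k != "mimetype"})
        yield blob


# Illustrative values only; the path and URL are made up.
blobs = [SimpleBlob(path=Path("/tmp/example/report.docx"))]
known = {
    "report.docx": {
        "mimetype": "application/octet-stream",
        "web_url": "https://example.invalid/report.docx",
    }
}
enriched = list(attach_metadata(blobs, known))
```

Keeping the lookup in the caller leaves the blob loader unaware of where the files came from, which appears to be the separation the review comment asks for.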
…N/langchain into triska/O365_loader_update
@eyurtsev I've modified the PR to only work within the base_o365 and sharepoint classes. Please take a look.
@eyurtsev please review. We'd need to get this finally off the table. Thanks!
libs/community/langchain_community/document_loaders/base_o365.py
@MacanPN going to push a few changes to your PR in a bit
@MacanPN sorry for the delay, pushed some aesthetic changes through. Let me know if it looks good; if so, I'll merge.
@eyurtsev Changes look good. Thanks! I've now also resolved a conflict with
Is this change included in the 0.2.2 releases? "ValueError: data=None mimetype=None encoding='utf-8' path=PosixPath('/tmp/tmprxucxkc7/filename.docx') metadata={} does not have a mimetype" |
…eLoader (#20663)

**Description:**
- Added propagation of document metadata from O365BaseLoader to FileSystemBlobLoader (O365BaseLoader uses FileSystemBlobLoader under the hood).
- This is done by passing dictionary `metadata_dict`: key=filename and value=dictionary containing document's metadata
- Modified `FileSystemBlobLoader` to accept the `metadata_dict`, use `mimetype` from it (if available) and pass metadata further into blob loader.

**Issue:**
- `O365BaseLoader` under the hood downloads documents to temp folder and then uses `FileSystemBlobLoader` on it.
- However metadata about the document in question is lost in this process. In particular:
  - `mime_type`: `FileSystemBlobLoader` guesses `mime_type` from the file extension, but that does not work 100% of the time.
  - `web_url`: this is useful to keep around since in RAG LLM we might want to provide link to the source document. In order to work well with document parsers, we pass the `web_url` as `source` (`web_url` is ignored by parsers, `source` is preserved)

**Dependencies:** None

**Twitter handle:** @martintriska1

Please review @baskaryan

---------

Co-authored-by: Bagatur <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>
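The two metadata problems named in the commit message can be illustrated with the standard library alone. This is a sketch under stated assumptions, not the code merged in the PR: `resolve_mimetype` and `build_metadata` are hypothetical helpers, and the file names and URL are invented. It shows that guessing from the extension returns nothing for a file without one, and that storing the web URL under the `source` key is a plain dictionary assignment.

```python
import mimetypes
from typing import Any, Dict, Optional


def resolve_mimetype(file_name: str, known_mime_type: Optional[str]) -> Optional[str]:
    """Prefer a mimetype reported by the remote service; otherwise guess."""
    if known_mime_type:
        return known_mime_type
    # Extension-based guess; yields None when the extension is missing or unknown.
    guessed, _encoding = mimetypes.guess_type(file_name)
    return guessed


def build_metadata(
    file_name: str, known_mime_type: Optional[str], web_url: str
) -> Dict[str, Any]:
    """Hypothetical helper: keep the web URL under the `source` key."""
    return {
        "source": web_url,
        "mimetype": resolve_mimetype(file_name, known_mime_type),
    }


# Illustrative values only.
url = "https://example.invalid/quarterly_report"
guessed_only = build_metadata("quarterly_report", None, url)
with_known = build_metadata("quarterly_report", "application/pdf", url)
```

Here `guessed_only["mimetype"]` ends up as `None` because the file name has no extension, while `with_known` carries the mimetype supplied by the caller.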