community: Added propagation of document metadata from O365BaseLoader #20663

MacanPN · 2024-04-19T16:51:54Z

Description:

Added propagation of document metadata from O365BaseLoader to FileSystemBlobLoader (O365BaseLoader uses FileSystemBlobLoader under the hood).
- This is done by passing a dictionary, metadata_dict, whose keys are filenames and whose values are dictionaries containing each document's metadata.
- Modified FileSystemBlobLoader to accept metadata_dict, use the mimetype from it (if available), and pass the metadata further into the blob loader.
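
To make the proposed lookup concrete, here is a minimal standalone sketch of the "use the supplied mimetype if available, otherwise guess" behaviour described above. It does not use langchain; the helper name `resolve_mimetype` and the `"mimetype"` key are assumptions made for illustration, and only the `Dict[str, Dict[str, Any]]` shape comes from the PR diff.

```python
import mimetypes
from pathlib import Path
from typing import Any, Dict, Optional

# Shape taken from the PR diff: file name -> metadata for that file.
MetadataDict = Dict[str, Dict[str, Any]]


def resolve_mimetype(
    path: Path, metadata_dict: Optional[MetadataDict] = None
) -> Optional[str]:
    """Hypothetical helper: prefer a caller-supplied mimetype, else guess.

    The "mimetype" key is an assumption for this sketch.
    """
    supplied = (metadata_dict or {}).get(path.name, {}).get("mimetype")
    if supplied:
        return supplied
    # Fall back to guessing from the file extension; may return None.
    guessed, _encoding = mimetypes.guess_type(path.name)
    return guessed
```

With this shape, a file downloaded without an extension still gets a mimetype as long as the caller recorded one for that file name.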

Issue:

O365BaseLoader downloads documents to a temp folder under the hood and then runs FileSystemBlobLoader on that folder.
However, metadata about each document is lost in this process. In particular:
- mime_type: FileSystemBlobLoader guesses mime_type from the file extension, but that does not always work.
- web_url: this is useful to keep around, since in a RAG LLM application we might want to provide a link to the source document. To work well with document parsers, we pass the web_url as source (parsers ignore web_url but preserve source).
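
As a rough illustration of the second point, the sketch below copies a metadata dictionary and exposes `web_url` under the `source` key. The function name `to_parser_metadata` is hypothetical and not part of the PR; the key names `web_url` and `source` are the ones mentioned above.

```python
from typing import Any, Dict


def to_parser_metadata(o365_metadata: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical helper: return a copy with web_url also stored as source."""
    metadata = dict(o365_metadata)  # shallow copy; the input is left untouched
    web_url = metadata.get("web_url")
    if web_url is not None:
        metadata["source"] = web_url
    return metadata
```

Keeping the original `web_url` key alongside `source` is a choice made for this sketch only.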

Dependencies:
None

Twitter handle:
@martintriska1




MacanPN · 2024-04-30T16:51:54Z

@eyurtsev @baskaryan The PR is ready for review and merging! :)

MacanPN · 2024-05-03T09:01:41Z

@eyurtsev @baskaryan Can you please give me feedback and a way to move forward with this PR? Thanks!

petergoldstein · 2024-05-09T15:36:26Z

This PR is required for an internal deployment in a large enterprise environment. We'd really like to see it merged.

Given that all checks pass, and the changes are relatively minor, what would help get it merged into an official version? Any thoughts @hwchase17 ?

Happy to discuss the use case in more detail if that helps. Thanks.

MacanPN · 2024-05-14T07:15:17Z

!ping @eyurtsev @baskaryan @hwchase17

eyurtsev · 2024-05-15T13:01:09Z

libs/community/langchain_community/document_loaders/blob_loaders/file_system.py

@@ -58,6 +68,7 @@ def __init__(
        glob: str = "**/[!.]*",
        exclude: Sequence[str] = (),
        suffixes: Optional[Sequence[str]] = None,
+        metadata_dict: Optional[Dict[str, Dict[str, Any]]] = None,


This logic should be handled outside the file system blob loader. The code that is using the BlobLoader can inspect the blobs and add metadata to them based on the path associated with them.
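
One way to read this suggestion is sketched below. It is not the code that was merged: `SimpleBlob` is a hypothetical stand-in for a blob object with `path`, `mimetype`, and `metadata` attributes, and `attach_metadata` and the `"mimetype"` key are names invented for this example.

```python
from dataclasses import dataclass, field
from pathlib import Path
from typing import Any, Dict, Iterable, Iterator, Optional, Union


@dataclass
class SimpleBlob:
    """Hypothetical stand-in for a blob yielded by a blob loader."""

    path: Union[str, Path]
    mimetype: Optional[str] = None
    metadata: Dict[str, Any] = field(default_factory=dict)


def attach_metadata(
    blobs: Iterable[SimpleBlob],
    metadata_by_name: Dict[str, Dict[str, Any]],
) -> Iterator[SimpleBlob]:
    """Add per-file metadata to blobs, keyed by the blob's file name."""
    for blob in blobs:
        # Path(...) accepts both str and Path, so either kind of path works.
        extra = metadata_by_name.get(Path(blob.path).name, {})
        blob.metadata.update(extra)
        if blob.mimetype is None:
            # Only fill in a mimetype the loader could not determine itself.
            blob.mimetype = extra.get("mimetype")
        yield blob
```

The blob loader stays unchanged in this arrangement; the calling code wraps its output and enriches each blob after the fact.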


MacanPN · 2024-05-20T13:12:22Z

@eyurtsev I've modified the PR to only work within base_o365 and sharepoint classes. Please take a look

MacanPN · 2024-05-22T09:08:52Z

@eyurtsev please review. We'd need to get this finally off the table. Thanks!

libs/community/langchain_community/document_loaders/base_o365.py

eyurtsev · 2024-05-22T21:28:44Z

@MacanPN going to push a few changes to your PR in a bit

eyurtsev · 2024-05-22T21:36:35Z

@MacanPN sorry for the delay, pushed some aesthetic changes through, let me know if looks good, if so i'll merge.

MacanPN · 2024-05-23T08:57:59Z

@eyurtsev Changes look good. Thanks! I've now also resolved a conflict with master. Feel free to merge!
Full speed ahead Mr. Spock :)

radvanyimome · 2024-06-05T12:37:09Z

Is this change included in the 0.2.2 releases?
I'm still getting

"ValueError: data=None mimetype=None encoding='utf-8' path=PosixPath('/tmp/tmprxucxkc7/filename.docx') metadata={} does not have a mimetype"

@baskaryan

…eLoader (#20663)

**Description:**
- Added propagation of document metadata from O365BaseLoader to FileSystemBlobLoader (O365BaseLoader uses FileSystemBlobLoader under the hood).
- This is done by passing dictionary `metadata_dict`: key=filename and value=dictionary containing document's metadata
- Modified `FileSystemBlobLoader` to accept the `metadata_dict`, use `mimetype` from it (if available) and pass metadata further into blob loader.

**Issue:**
- `O365BaseLoader` under the hood downloads documents to temp folder and then uses `FileSystemBlobLoader` on it.
- However metadata about the document in question is lost in this process. In particular:
  - `mime_type`: `FileSystemBlobLoader` guesses `mime_type` from the file extension, but that does not work 100% of the time.
  - `web_url`: this is useful to keep around since in RAG LLM we might want to provide link to the source document. In order to work well with document parsers, we pass the `web_url` as `source` (`web_url` is ignored by parsers, `source` is preserved)

**Dependencies:** None

**Twitter handle:** @martintriska1

Please review @baskaryan

---------

Co-authored-by: Bagatur <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>

Added propagation of document metadata from O365BaseLoader to FileSys…

df9ca00

…temBlobLoader (that O365BaseLoader uses under the hood). Also modified SharePointLoader to propagate `web_url` in metadata to the output of the parser.

in sharepoint propagating all metadata, not just web_url

4f113b8

MacanPN marked this pull request as ready for review April 19, 2024 16:57

dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Apr 19, 2024

baskaryan requested a review from eyurtsev April 24, 2024 23:41

baskaryan assigned eyurtsev Apr 24, 2024

baskaryan and others added 8 commits April 24, 2024 16:41

Merge branch 'master' into triska/O365_loader_update

19d654a

Merge branch 'master' into triska/O365_loader_update

92e1f8c

stricter typing on metadata_dict

b47576f

Merge branch 'triska/O365_loader_update' of https://github.com/MacanP…

351d886

…N/langchain into triska/O365_loader_update

importing Any from typing for backwards compatibility with python…

b3da8f0

… 3.8.

fixed initiation of metadata_dict

edb6bc5

Importing Dict from typing for compatibility with python 3.8

ab0c142

sorted imports

7e83458

dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Apr 30, 2024

MacanPN added 3 commits April 30, 2024 18:16

add typing for metadata_dict in base_o365.py

3b3fffa

passing the web_url as source since that is preserved by parsers

d72d8b4

Merge branch 'master' into triska/O365_loader_update

0ae2b14

MacanPN added 2 commits May 1, 2024 11:06

Merge branch 'master' into triska/O365_loader_update

4b1941d

Merge branch 'master' into triska/O365_loader_update

d74eca2

MacanPN added 3 commits May 6, 2024 10:31

Merge branch 'master' into triska/O365_loader_update

8e080bc

Merge branch 'master' into triska/O365_loader_update

cb73e8f

Merge branch 'master' into triska/O365_loader_update

c270c8b

eyurtsev requested changes May 15, 2024


MacanPN added 2 commits May 20, 2024 11:50

Reverting changes to fily_system loader

13dd518

Merge branch 'triska/O365_loader_update' of https://github.com/MacanP…

4e5aec2

…N/langchain into triska/O365_loader_update

dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 20, 2024

modified where/how metadata are preserved

022ca77

dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels May 20, 2024

MacanPN added 4 commits May 20, 2024 14:41

linting

fc9269c

ensuring metadata is never "None"

9d4c430

handling of a case when document.path is not pathlib object

b2191eb

formatting

07aef47

MacanPN requested a review from eyurtsev May 20, 2024 13:11

eyurtsev reviewed May 22, 2024


eyurtsev added 2 commits May 22, 2024 17:34

x

fc3884a

x

5bcbd74

eyurtsev added the waiting-on-author PR Status: Confirmation from author is required label May 22, 2024

eyurtsev approved these changes May 22, 2024


dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label May 22, 2024

Merge branch 'master' into triska/O365_loader_update

a2be6a4

eyurtsev merged commit 2df8ac4 into langchain-ai:master May 23, 2024
42 checks passed

radvanyimome mentioned this pull request Jun 7, 2024

SharepointLoader not working as intended despite latest merge 'propagation of document metadata from O365BaseLoader' #22663

Closed

5 tasks
