Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

community: Added propagation of document metadata from O365BaseLoader #20663

Merged
merged 28 commits into from
May 23, 2024

Conversation

MacanPN
Copy link
Contributor

@MacanPN MacanPN commented Apr 19, 2024

Description:

  • Added propagation of document metadata from O365BaseLoader to FileSystemBlobLoader (O365BaseLoader uses FileSystemBlobLoader under the hood).
    • This is done by passing dictionary metadata_dict: key=filename and value=dictionary containing document's metadata
  • Modified FileSystemBlobLoader to accept the metadata_dict, use mimetype from it (if available) and pass metadata further into blob loader.

Issue:

  • O365BaseLoader under the hood downloads documents to temp folder and then uses FileSystemBlobLoader on it.
  • However metadata about the document in question is lost in this process. In particular:
    • mime_type: FileSystemBlobLoader guesses mime_type from the file extension, but that does not work 100% of the time.
    • web_url: this is useful to keep around since in RAG LLM we might want to provide link to the source document. In order to work well with document parsers, we pass the web_url as source (web_url is ignored by parsers, source is preserved)

Dependencies:
None

Twitter handle:
@martintriska1

Please review @baskaryan

…temBlobLoader (that O365BaseLoader uses under the hood). Also modified SharePointLoader to propagate `web_url` in metadata to the output of the parser.
Copy link

vercel bot commented Apr 19, 2024

The latest updates on your projects. Learn more about Vercel for Git ↗︎

1 Ignored Deployment
Name Status Preview Comments Updated (UTC)
langchain ⬜️ Ignored (Inspect) Visit Preview May 23, 2024 8:52am

@MacanPN MacanPN marked this pull request as ready for review April 19, 2024 16:57
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Apr 19, 2024
@baskaryan baskaryan requested a review from eyurtsev April 24, 2024 23:41
@dosubot dosubot bot added size:M This PR changes 30-99 lines, ignoring generated files. and removed size:S This PR changes 10-29 lines, ignoring generated files. labels Apr 30, 2024
@MacanPN
Copy link
Contributor Author

MacanPN commented Apr 30, 2024

@eyurtsev @baskaryan The PR is ready for review and merging! :)

@MacanPN
Copy link
Contributor Author

MacanPN commented May 3, 2024

@eyurtsev @baskaryan Can you please provide me with a feedback and way to move forward with this PR? Thanks!

@petergoldstein
Copy link

This PR is required for an internal deployment in a large enterprise environment. We'd really like to see it merged.

Given that all checks pass, and the changes are relatively minor, what would help get it merged into an official version? Any thoughts @hwchase17 ?

Happy to discuss the use case in more detail if that helps. Thanks.

@MacanPN
Copy link
Contributor Author

MacanPN commented May 14, 2024

!ping @eyurtsev @baskaryan @hwchase17

@@ -58,6 +68,7 @@ def __init__(
glob: str = "**/[!.]*",
exclude: Sequence[str] = (),
suffixes: Optional[Sequence[str]] = None,
metadata_dict: Optional[Dict[str, Dict[str, Any]]] = None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic should be handled outside the file system blob loader. The code that is using the BlobLoader can inspect the the blobs and add metadata to them based on the path associated with them.

@dosubot dosubot bot added size:XS This PR changes 0-9 lines, ignoring generated files. and removed size:M This PR changes 30-99 lines, ignoring generated files. labels May 20, 2024
@dosubot dosubot bot added size:S This PR changes 10-29 lines, ignoring generated files. and removed size:XS This PR changes 0-9 lines, ignoring generated files. labels May 20, 2024
@MacanPN MacanPN requested a review from eyurtsev May 20, 2024 13:11
@MacanPN
Copy link
Contributor Author

MacanPN commented May 20, 2024

@eyurtsev I've modified the PR to only work within base_o365 and sharepoint classes. Please take a look

@MacanPN
Copy link
Contributor Author

MacanPN commented May 22, 2024

@eyurtsev please review. We'd need to get this finally off the table. Thanks!

@eyurtsev
Copy link
Collaborator

@MacanPN going to push a few changes to your PR in a bit

@eyurtsev eyurtsev added the waiting-on-author PR Status: Confirmation from author is required label May 22, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label May 22, 2024
@eyurtsev
Copy link
Collaborator

@MacanPN sorry for the delay, pushed some aesthetic changes through, let me know if looks good, if so i'll merge.

@MacanPN
Copy link
Contributor Author

MacanPN commented May 23, 2024

@eyurtsev Changes look good. Thanks! I've now also resolved a conflict with master. Feel free to merge!
Full speed ahead Mr. Spock :)

@eyurtsev eyurtsev merged commit 2df8ac4 into langchain-ai:master May 23, 2024
42 checks passed
@radvanyimome
Copy link

Is this change included in the 0.2.2 releases?
I'm still getting

"ValueError: data=None mimetype=None encoding='utf-8' path=PosixPath('/tmp/tmprxucxkc7/filename.docx') metadata={} does not have a mimetype"

hinthornw pushed a commit that referenced this pull request Jun 20, 2024
…eLoader (#20663)

**Description:**
- Added propagation of document metadata from O365BaseLoader to
FileSystemBlobLoader (O365BaseLoader uses FileSystemBlobLoader under the
hood).
- This is done by passing dictionary `metadata_dict`: key=filename and
value=dictionary containing document's metadata
- Modified `FileSystemBlobLoader` to accept the `metadata_dict`, use
`mimetype` from it (if available) and pass metadata further into blob
loader.

**Issue:**
- `O365BaseLoader` under the hood downloads documents to temp folder and
then uses `FileSystemBlobLoader` on it.
- However metadata about the document in question is lost in this
process. In particular:
- `mime_type`: `FileSystemBlobLoader` guesses `mime_type` from the file
extension, but that does not work 100% of the time.
- `web_url`: this is useful to keep around since in RAG LLM we might
want to provide link to the source document. In order to work well with
document parsers, we pass the `web_url` as `source` (`web_url` is
ignored by parsers, `source` is preserved)

**Dependencies:**
None

**Twitter handle:**
@martintriska1

Please review @baskaryan

---------

Co-authored-by: Bagatur <[email protected]>
Co-authored-by: Eugene Yurtsev <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases lgtm PR looks good. Use to confirm that a PR is ready for merging. size:S This PR changes 10-29 lines, ignoring generated files. waiting-on-author PR Status: Confirmation from author is required
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants