Refactoring PDF loaders: 02 PyMuPDF #29063
- Add file_path with PurePath
- Add CloudBlobLoader in __init__
- Replace Dict/List with dict/list
@eyurtsev I rebased the code onto master ;-)
Great, will take a look in the AM.
Left two major comments, a few stylistic comments, and some nits.
Let's tackle the two major comments:
- Define the standardized structure of metadata
- Create a dedicated ImageParser which is a blob parser
libs/community/langchain_community/document_loaders/parsers/images.py
Abstract base class for parsing image blobs into text.

Attributes:
    format (Literal["text", "markdown", "html"]):
Users will not understand what this means from looking at the API reference docs. Could we improve this until it is self-explanatory from the API reference alone?
Perhaps a type update to Optional[Literal['markdown-link', 'html-img']], with more specific literal names?
For example:
- None = return the content as is
- markdown-link = wrap the content into a markdown link, w/ link pointing to...
- html-img = wrap the content as the alt text of an <img> tag and link to ...
Also consider a clash with a potential future implementation: the new implementation accepts a screenshot of an entire PDF page and creates a markdown representation of it (i.e., the output is markdown here, but it has a very different meaning now, and we don't want to rewrap it all as a markdown link).
Perhaps renaming format to wrap_as would be helpful for communicating what this does?
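To make the suggestion concrete, here is a minimal sketch of what the three modes could look like. The function name `wrap_image_text`, the `WrapAs` alias, and the `target` parameter are hypothetical; in particular, `target` is only a placeholder, since the comment above leaves open what the link should point to. This is not the implementation in the PR.

```python
from html import escape
from typing import Literal, Optional

# Hypothetical alias; the literal names are the ones proposed in the comment.
WrapAs = Optional[Literal["markdown-link", "html-img"]]


def wrap_image_text(text: str, target: str, wrap_as: WrapAs = None) -> str:
    """Wrap text extracted from an image according to `wrap_as`.

    `target` is a placeholder for the link destination (an assumption).
    """
    if wrap_as is None:
        # Return the content as is.
        return text
    if wrap_as == "markdown-link":
        # Escape closing brackets so the link label stays well-formed.
        label = text.replace("]", r"\]")
        return f"[{label}]({target})"
    if wrap_as == "html-img":
        # The extracted text becomes the alt text of an <img> tag.
        alt = escape(text, quote=True)
        src = escape(target, quote=True)
        return f'<img alt="{alt}" src="{src}" />'
    raise ValueError(f"Unsupported wrap_as value: {wrap_as!r}")
```

With explicit names like these, a parser that already returns full markdown for a page would simply pass `None`, which avoids the clash described above.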
I'm not sure wrap_as is clearer than format.
return all_text


class ImagesPdfParser(BaseBlobParser):
- Could we remove the intermediate abstraction? It doesn't serve any user-facing purpose in this case -- our unit tests can check that parsers conform to an interface?
- Why are extract_images and images_parser definable independently? Why not use just images_parser: Optional[BaseImageBlobParser] = None for any new implementations?
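One way the first point could work, sketched under assumptions: a structural interface checked only in tests, so parsers need no shared intermediate base class. `SupportsLazyParse` and `FakeOcrParser` are hypothetical names, and the signature is simplified to `bytes` and `str` rather than the real blob and document types.

```python
from typing import Iterator, Protocol, runtime_checkable


@runtime_checkable
class SupportsLazyParse(Protocol):
    """Hypothetical structural interface; not part of langchain."""

    def lazy_parse(self, blob: bytes) -> Iterator[str]: ...


class FakeOcrParser:
    """Stand-in parser that does not inherit from any base class."""

    def lazy_parse(self, blob: bytes) -> Iterator[str]:
        yield blob.decode("utf-8", errors="replace")


def test_parser_conforms() -> None:
    parser = FakeOcrParser()
    # isinstance() on a runtime_checkable Protocol only checks that the
    # method exists, not that its signature matches.
    assert isinstance(parser, SupportsLazyParse)
    assert list(parser.lazy_parse(b"text")) == ["text"]
```

The trade-off is that a protocol check is weaker than inheritance: it verifies the method is present but shares no behavior between parsers.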
- This is because, when 10 other PRs are published with other parsers, this class will become meaningful: it will be shared by all parsers that support image handling.
- It's for compatibility reasons. I would prefer to have only images_parser: Optional[BaseImageBlobParser] = None. It is possible to declare extract_images deprecated and, with a warning, implicitly use RapidOCRBlobParser() if, and only if, extract_images=True.
What do you think?
@eyurtsev
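The compatibility path proposed in this reply could look roughly like the sketch below. The two parser classes are empty stubs standing in for the real ones, `ExamplePdfParser` is a hypothetical name, and letting an explicit `images_parser` take precedence over `extract_images` is an assumption, not something decided in this thread.

```python
import warnings
from typing import Optional


class BaseImageBlobParser:
    """Stub standing in for the class discussed in this PR."""


class RapidOCRBlobParser(BaseImageBlobParser):
    """Stub standing in for the OCR-based image parser."""


class ExamplePdfParser:
    """Hypothetical parser showing only the constructor logic."""

    def __init__(
        self,
        extract_images: bool = False,
        images_parser: Optional[BaseImageBlobParser] = None,
    ) -> None:
        if extract_images and images_parser is None:
            # Legacy flag: warn, then fall back to the OCR parser.
            warnings.warn(
                "extract_images is deprecated; pass images_parser instead.",
                DeprecationWarning,
                stacklevel=2,
            )
            images_parser = RapidOCRBlobParser()
        # None means images are not parsed at all.
        self.images_parser = images_parser
```

Existing callers that pass `extract_images=True` keep their behavior and see a warning, while new code only needs `images_parser`.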
Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"
Description: Update PyMuPDFParser/Loader
Twitter handle: pprados
This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This part focuses on preparing the update of all parsers.
For more details, see PR 28970.
@eyurtsev, this is the continuation of the PDFLoader modifications.