Refactoring PDF loaders: 02 PyMuPDF #29063

Open: pprados wants to merge 18 commits into master from pprados/02-pymupdf

pprados
Contributor

@pprados pprados commented Jan 7, 2025

  • Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"

  • Description: Update PyMuPDFParser/Loader

  • Twitter handle: pprados

This is one part of a larger Pull Request (PR) that is too large to be submitted all at once.
This specific part focuses on preparing the update of all parsers.

For more details, see PR 28970.

@eyurtsev this is the continuation of the PDFLoader modifications.


@pprados pprados marked this pull request as ready for review January 7, 2025 09:16
@dosubot dosubot bot added size:XXL This PR changes 1000+ lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) labels Jan 7, 2025
@pprados
Contributor Author

pprados commented Jan 7, 2025

@eyurtsev I rebased the code onto master ;-)

Collaborator

@eyurtsev eyurtsev left a comment

Great, will take a look in the AM.

@pprados pprados mentioned this pull request Jan 8, 2025
2 tasks
Collaborator

@eyurtsev eyurtsev left a comment

Left two major comments, a few stylistic comments, and some nits.

Let's tackle the two major comments:

  1. Define the standardized structure of metadata
  2. Create a dedicated ImageParser which is a blob parser

@pprados pprados force-pushed the pprados/02-pymupdf branch from df1d4d5 to 91234f0 Compare January 10, 2025 15:37
@pprados pprados marked this pull request as ready for review January 10, 2025 15:46
@dosubot dosubot bot added the 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder label Jan 10, 2025
@pprados pprados marked this pull request as draft January 13, 2025 08:20
@pprados pprados marked this pull request as ready for review January 13, 2025 12:53
Abstract base class for parsing image blobs into text.

Attributes:
format (Literal["text", "markdown", "html"]):
Collaborator

Users will not understand what this means from looking at the API reference docs. Could we improve it until it is self-explanatory from the API Reference alone?

Perhaps a type update to: Optional[Literal['markdown-link', 'html-img']] and make the literal names more specific?

For example,

  • None = return the content as is
  • markdown-link = wrap the content into a markdown link, w/ link pointing to...
  • html-img = wrap the content as the alt text of an <img> tag and link to ...

Also consider a clash with a potential future implementation that accepts a screenshot of an entire PDF page and creates a markdown representation of it (i.e., the output is markdown here, but the format has a very different meaning, and we don't want to rewrap it all as a markdown link).

Perhaps renaming format to wrap_as would be helpful for communicating what this does?
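
To make the proposed semantics concrete, here is a minimal sketch of what the three values could do. It is only an illustration, not the code in this PR: `wrap_image_text` and its `target` parameter are hypothetical names, and since the comment above leaves open what the link should point to, `target` is just a placeholder.

```python
import html
from typing import Literal, Optional

# Hypothetical alias for the type suggested in the review comment.
WrapAs = Optional[Literal["markdown-link", "html-img"]]


def wrap_image_text(text: str, target: str, wrap_as: WrapAs = None) -> str:
    """Wrap text extracted from an image according to `wrap_as`.

    `target` is a placeholder for whatever the link should point to.
    """
    if wrap_as is None:
        # None = return the content as is.
        return text
    if wrap_as == "markdown-link":
        # Escape closing brackets so the text cannot end the link label early.
        label = text.replace("]", r"\]")
        return f"[{label}]({target})"
    if wrap_as == "html-img":
        # The extracted text becomes the alt text of an <img> tag.
        alt = html.escape(text, quote=True)
        src = html.escape(target, quote=True)
        return f'<img alt="{alt}" src="{src}" />'
    raise ValueError(f"Unsupported wrap_as value: {wrap_as!r}")


print(wrap_image_text("Invoice total", "#", "markdown-link"))
```

With names this specific, a future parser that produces markdown for a whole page would simply use `wrap_as=None`, so its output is never rewrapped.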

Contributor Author

I'm not sure wrap_as is clearer than format.

return all_text


class ImagesPdfParser(BaseBlobParser):
Collaborator

  • Could we remove the intermediate abstraction? It doesn't serve any user facing purpose in this case -- our unit tests can check that parsers conform to an interface?
  • Why are extract_images and images_parser definable independently? Why not use just images_parser: Optional[BaseImageBlobParser] = None for any new implementations?

Contributor Author

@pprados pprados Jan 14, 2025

  • The abstraction is there because, once the 10 other PRs with other parsers are published, this class will become meaningful and will be shared by all parsers that support image handling.
  • It's for compatibility reasons. I would prefer to have only images_parser: Optional[BaseImageBlobParser] = None

It is possible to declare extract_images deprecated: if, and only if, extract_images=True, RapidOCRBlobParser() would be used implicitly and a deprecation warning emitted.
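
A rough sketch of that compatibility path, under stated assumptions: the two classes below are stand-ins so the example runs on its own (they are not the real `BaseImageBlobParser` or `RapidOCRBlobParser`), and `resolve_images_parser` is a hypothetical helper name, not something in this PR.

```python
import warnings
from typing import Optional


class BaseImageBlobParser:
    """Stand-in for the image parser interface discussed in this PR."""

    def parse(self, data: bytes) -> str:
        raise NotImplementedError


class RapidOCRBlobParser(BaseImageBlobParser):
    """Stand-in; the real parser would run OCR on the image bytes."""

    def parse(self, data: bytes) -> str:
        return "<ocr text>"


def resolve_images_parser(
    extract_images: bool = False,
    images_parser: Optional[BaseImageBlobParser] = None,
) -> Optional[BaseImageBlobParser]:
    """Map the legacy flag onto the single `images_parser` argument."""
    if extract_images:
        # Alert the caller that the legacy flag is on its way out.
        warnings.warn(
            "`extract_images` is deprecated; pass `images_parser` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if images_parser is None:
            # Implicit default only when the legacy flag is set.
            return RapidOCRBlobParser()
    # An explicit parser always wins; None means no image extraction.
    return images_parser
```

Existing callers that pass `extract_images=True` keep their behavior and see a warning, while new code only needs `images_parser`.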

What do you think?

@pprados
Contributor Author

pprados commented Jan 14, 2025

@eyurtsev
Currently, core sources generate lint errors. There's nothing I can do about it.

[ "langchain_core" = "" ] || poetry run ruff format langchain_core --diff
1 file would be reformatted, 162 files already formatted
--- langchain_core/documents/base.py
+++ langchain_core/documents/base.py

Labels
community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) 🤖:docs Changes to documentation and examples, like .md, .rst, .ipynb files. Changes to the docs/ folder size:XXL This PR changes 1000+ lines, ignoring generated files.
Projects
Status: In review

2 participants