Refactoring PDF loaders: 02 PyMuPDF #29063

pprados · 2025-01-07T08:46:24Z

Refactoring PDF loaders step 2: "community: Refactoring PDF loaders to standardize approaches"
Description: Update PyMuPDFParser/Loader
Twitter handle: pprados

This is one part of a larger pull request (PR) that is too large to submit all at once.
This part prepares the update of all parsers.

For more details, see PR 28970.

@eyurtsev this is the continuation of the PDFLoader modifications.

pprados · 2025-01-07T16:19:05Z

@eyurtsev I rebased the code onto master ;-)

eyurtsev

Great, will take a look in the AM.

eyurtsev

Left two major comments, a few stylistic comments, and some nits.

Let's tackle the two major comments:

Define the standardized structure of metadata
Create a dedicated ImageParser which is a blob parser
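
As an illustration of the second point, here is a minimal sketch of what an image parser shaped like a blob parser could look like. `Blob`, `Document`, `BaseImageParser`, and `ByteCountImageParser` are simplified, hypothetical stand-ins defined only for this example; they are not the classes from this PR or from langchain_core.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Blob:
    """Hypothetical stand-in for a raw binary payload plus its origin."""

    data: bytes
    source: str


@dataclass
class Document:
    """Hypothetical stand-in for a parsed document."""

    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseImageParser(ABC):
    """Turns an image blob into text; subclasses supply the analysis step."""

    @abstractmethod
    def analyze_image(self, data: bytes) -> str:
        """Return a textual description (OCR, caption, ...) of the image."""

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # One document per image blob, carrying the blob's source as metadata.
        yield Document(
            page_content=self.analyze_image(blob.data),
            metadata={"source": blob.source},
        )


class ByteCountImageParser(BaseImageParser):
    """Toy implementation used only to exercise the interface."""

    def analyze_image(self, data: bytes) -> str:
        return f"image of {len(data)} bytes"


docs = list(ByteCountImageParser().lazy_parse(Blob(b"abc", "page1.png")))
```

With this shape, a PDF parser could accept any such image parser as a parameter instead of hard-coding one OCR backend.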

libs/community/langchain_community/document_loaders/parsers/pdf.py

libs/community/langchain_community/document_loaders/parsers/images.py

libs/community/langchain_community/document_loaders/pdf.py

eyurtsev · 2025-01-13T23:19:49Z

libs/community/langchain_community/document_loaders/parsers/images.py

+    Abstract base class for parsing image blobs into text.
+
+    Attributes:
+        format (Literal["text", "markdown", "html"]):


Users will not understand what this means from looking at the API reference docs. Could we improve this until it is self-explanatory from the API reference alone?

Perhaps a type update to: Optional[Literal['markdown-link', 'html-img']] and make the literal names more specific?

For example,

None = return the content as is

markdown-link = wrap the content into a markdown link, w/ link pointing to...

html-img = wrap the content as the alt text of an <img> tag and link to ...

Also consider a clash with a potential future implementation: the new implementation accepts a screenshot of an entire PDF page and creates a markdown representation of it (i.e., the output is markdown here, but it has a very different meaning now, and we don't want to rewrap it all as a markdown link).

Perhaps renaming format to wrap_as would be helpful for communicating what this does?

I'm not sure wrap_as is clearer than format.
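
To make the three proposed values concrete, here is a minimal sketch of the wrapping behavior discussed above. The function name `wrap_image_text` and the `target` parameter are hypothetical; the review comment leaves the link target unspecified, so `"#"` is used here purely as a placeholder.

```python
from html import escape
from typing import Literal, Optional

# The type suggested in the review comment.
WrapFormat = Optional[Literal["markdown-link", "html-img"]]


def wrap_image_text(content: str, target: str = "#", format: WrapFormat = None) -> str:
    """Wrap text extracted from an image according to `format`.

    `target` is a placeholder assumption: the thread does not say where
    the link should point.
    """
    if format is None:
        # Return the content as is.
        return content
    if format == "markdown-link":
        # Escape square brackets so the extracted text cannot break the link label.
        label = content.replace("[", "\\[").replace("]", "\\]")
        return f"[{label}]({target})"
    if format == "html-img":
        # The extracted text becomes the alt text of an <img> tag.
        alt = escape(content, quote=True)
        return f'<img alt="{alt}" src="{escape(target, quote=True)}" />'
    raise ValueError(f"Unsupported format: {format!r}")


plain = wrap_image_text("a cat")
md = wrap_image_text("a cat", format="markdown-link")
html_img = wrap_image_text('say "hi"', format="html-img")
```

Whether the parameter is called `format` or `wrap_as`, the point of the narrower literals is that each value names the wrapper, not the content type.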

eyurtsev · 2025-01-13T23:21:59Z

libs/community/langchain_community/document_loaders/parsers/pdf.py

+    return all_text
+
+
+class ImagesPdfParser(BaseBlobParser):


Could we remove the intermediate abstraction? It doesn't serve any user-facing purpose in this case; our unit tests can check that parsers conform to an interface.

Why are extract_images and images_parser definable independently? Why not use just images_parser: Optional[BaseImageBlobParser] = None for any new implementations?

This is because, when 10 other PRs are published with other parsers, this class will become meaningful and will be shared with all image management-compatible parsers.

It's for compatibility reasons. I prefer to have only images_parser: Optional[BaseImageBlobParser] = None

It is possible to mark extract_images as deprecated and, if and only if extract_images=True, implicitly use RapidOCRBlobParser() and emit a deprecation warning.

What do you think?
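
A minimal sketch of the deprecation path proposed above. The two classes are empty, hypothetical stand-ins for the ones named in the thread, and `resolve_images_parser` is an illustrative helper, not code from this PR. It also assumes that an explicitly passed `images_parser` takes precedence over the implicit default, which the thread does not settle.

```python
import warnings
from typing import Optional


class BaseImageBlobParser:
    """Empty stand-in for the image parser base class discussed in the thread."""


class RapidOCRBlobParser(BaseImageBlobParser):
    """Empty stand-in for the default OCR-based image parser."""


def resolve_images_parser(
    extract_images: bool = False,
    images_parser: Optional[BaseImageBlobParser] = None,
) -> Optional[BaseImageBlobParser]:
    """Map the legacy `extract_images` flag onto `images_parser`."""
    if extract_images:
        warnings.warn(
            "`extract_images` is deprecated; pass `images_parser` instead.",
            DeprecationWarning,
            stacklevel=2,
        )
        if images_parser is None:
            # Legacy behavior: the flag alone implies the OCR parser.
            images_parser = RapidOCRBlobParser()
    return images_parser


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    legacy = resolve_images_parser(extract_images=True)

default = resolve_images_parser()
```

New code would then pass only `images_parser`, and the flag could be removed after a deprecation period.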

libs/community/langchain_community/document_loaders/pdf.py

pprados · 2025-01-14T09:33:56Z

@eyurtsev
Currently, core sources generate lint errors. There's nothing I can do about it.

[ "langchain_core" = "" ] || poetry run ruff format langchain_core --diff
1 file would be reformatted, 162 files already formatted
--- langchain_core/documents/base.py
+++ langchain_core/documents/base.py

pprados marked this pull request as ready for review January 7, 2025 09:16

dosubot bot added the size:XXL, community, and Ɑ: doc loader labels Jan 7, 2025

ccurme assigned eyurtsev Jan 7, 2025

pprados added 7 commits January 7, 2025 17:08

Prepare the integration of new versions of PDFLoader.

21759e2

Add file_path with PurePath Add CloudBlobLoader in __init__ Replace Dict/List to dict/list

Fix Line too long

4607354

Fix Line too long

668dc9c

Fix Line too long

7a5b5c5

Fix Line too long

6340ded

Update PyMuPDF

4845781

Fix tu

3beda82

pprados force-pushed the pprados/02-pymupdf branch from 039819c to 3beda82 Compare January 7, 2025 16:09

eyurtsev reviewed Jan 8, 2025

pprados mentioned this pull request Jan 8, 2025

Refactoring PDF loaders: all #28970

eyurtsev reviewed Jan 9, 2025

pprados added 3 commits January 9, 2025 16:48

Fix review - step 1

743a83e

Fix all remarques

b623750

Merge remote-tracking branch 'upstream/master' into pprados/02-pymupdf

20f5a41

pprados marked this pull request as draft January 10, 2025 12:45

pprados force-pushed the pprados/02-pymupdf branch from 0d99673 to 3fe4ec5 Compare January 10, 2025 13:40

pprados force-pushed the pprados/02-pymupdf branch 2 times, most recently from 4342991 to 760267b Compare January 10, 2025 14:05

pprados force-pushed the pprados/02-pymupdf branch 3 times, most recently from 9fc89e0 to d30b26d Compare January 10, 2025 14:47

pprados force-pushed the pprados/02-pymupdf branch 2 times, most recently from 6765dbf to df1d4d5 Compare January 10, 2025 15:09

Fix remarques

91234f0

pprados force-pushed the pprados/02-pymupdf branch from df1d4d5 to 91234f0 Compare January 10, 2025 15:37

pprados marked this pull request as ready for review January 10, 2025 15:46

dosubot bot added the 🤖:docs label Jan 10, 2025

eyurtsev reviewed Jan 11, 2025

pprados marked this pull request as draft January 13, 2025 08:20

pprados added 2 commits January 13, 2025 09:31

Fix Images

80ee3f7

Merge remote-tracking branch 'upstream/master' into pprados/02-pymupdf

66f97cf

Fix Images

0e6c904

pprados marked this pull request as ready for review January 13, 2025 12:53

Merge branch 'master' into pprados/02-pymupdf

9b45bd8

eyurtsev reviewed Jan 13, 2025

pprados added 2 commits January 14, 2025 10:23

Fix deprecated load() with kwargs

acf4358

Merge branch 'master' into pprados/02-pymupdf

d7d3021

Change the format for images parser

4762fab
