Refactoring PDF loaders: 02 PyMuPDF #29063
Open: pprados wants to merge 20 commits into langchain-ai:master from pprados:pprados/02-pymupdf
+2,354
−191
Changes from 7 commits
Commits
21759e2 Prepare the integration of new versions of PDFLoader. (pprados)
4607354 Fix Line too long (pprados)
668dc9c Fix Line too long (pprados)
7a5b5c5 Fix Line too long (pprados)
6340ded Fix Line too long (pprados)
4845781 Update PyMuPDF (pprados)
3beda82 Fix tu (pprados)
743a83e Fix review - step 1 (pprados)
b623750 Fix all remarques (pprados)
20f5a41 Merge remote-tracking branch 'upstream/master' into pprados/02-pymupdf (pprados)
91234f0 Fix remarques (pprados)
80ee3f7 Fix Images (pprados)
66f97cf Merge remote-tracking branch 'upstream/master' into pprados/02-pymupdf (pprados)
0e6c904 Fix Images (pprados)
9b45bd8 Merge branch 'master' into pprados/02-pymupdf (pprados)
acf4358 Fix deprecated load() with kwargs (pprados)
d7d3021 Merge branch 'master' into pprados/02-pymupdf (pprados)
4762fab Change the format for images parser (pprados)
6121005 Merge branch 'master' into pprados/02-pymupdf (pprados)
5910f99 Merge branch 'master' into pprados/02-pymupdf (pprados)
docs/docs/integrations/document_loaders/pymupdf.ipynb
1,140 changes: 1,099 additions & 41 deletions
Large diffs are not rendered by default.
libs/community/langchain_community/document_loaders/parsers/pdf.py
662 changes: 609 additions & 53 deletions
Large diffs are not rendered by default.
@@ -12,6 +12,7 @@
     Any,
     BinaryIO,
     Iterator,
+    Literal,
     Mapping,
     Optional,
     Sequence,
@@ -28,13 +29,15 @@
 from langchain_community.document_loaders.blob_loaders import Blob
 from langchain_community.document_loaders.dedoc import DedocBaseLoader
 from langchain_community.document_loaders.parsers.pdf import (
+    CONVERT_IMAGE_TO_TEXT,
     AmazonTextractPDFParser,
     DocumentIntelligenceParser,
     PDFMinerParser,
     PDFPlumberParser,
     PyMuPDFParser,
     PyPDFium2Parser,
     PyPDFParser,
+    _default_page_delimitor,
 )
 from langchain_community.document_loaders.unstructured import UnstructuredFileLoader
 
@@ -96,7 +99,8 @@ def __init__(
         if "~" in self.file_path:
             self.file_path = os.path.expanduser(self.file_path)
 
-        # If the file is a web path or S3, download it to a temporary file, and use that
+        # If the file is a web path or S3, download it to a temporary file,
+        # and use that. It's better to use a BlobLoader.
         if not os.path.isfile(self.file_path) and self._is_valid_url(self.file_path):
             self.temp_dir = tempfile.TemporaryDirectory()
             _, suffix = os.path.splitext(self.file_path)
@@ -412,51 +416,129 @@ def lazy_load(self) -> Iterator[Document]:
 
 
 class PyMuPDFLoader(BasePDFLoader):
-    """Load `PDF` files using `PyMuPDF`."""
+    """Load and parse a PDF file using 'PyMuPDF' library.
+
+    This class provides methods to load and parse PDF documents, supporting various
+    configurations such as handling password-protected files, extracting tables,
+    extracting images, and defining extraction mode. It integrates the `PyMuPDF`
+    library for PDF processing and offers both synchronous and asynchronous document
+    loading.
+
+    Examples:
+        Setup:
+
+        .. code-block:: bash
+
+            pip install -U langchain-community pymupdf
+
+        Instantiate the loader:
+
+        .. code-block:: python
+
+            from langchain_community.document_loaders import PyMuPDFLoader
+
+            loader = PyMuPDFLoader(
+                file_path = "./example_data/layout-parser-paper.pdf",
+                # headers = None
+                # password = None,
+                mode = "single",
+                pages_delimitor = "\n\f",
+                # extract_images = True,
+                # images_to_text = convert_images_to_text_with_tesseract(),
+                # extract_tables = "markdown",
+                # extract_tables_settings = None,
+            )
+
+        Lazy load documents:
+
+        .. code-block:: python
+
+            docs = []
+            docs_lazy = loader.lazy_load()
+
+            for doc in docs_lazy:
+                docs.append(doc)
+            print(docs[0].page_content[:100])
+            print(docs[0].metadata)
+
+        Load documents asynchronously:
+
+        .. code-block:: python
+
+            docs = await loader.aload()
+            print(docs[0].page_content[:100])
+            print(docs[0].metadata)
+    """
 
     def __init__(
         self,
         file_path: Union[str, PurePath],
         *,
-        headers: Optional[dict] = None,
+        password: Optional[str] = None,
+        mode: Literal["single", "page"] = "page",
+        pages_delimitor: str = _default_page_delimitor,
         extract_images: bool = False,
+        images_to_text: CONVERT_IMAGE_TO_TEXT = None,
+        extract_tables: Union[Literal["csv", "markdown", "html"], None] = None,
+        headers: Optional[dict] = None,
+        extract_tables_settings: Optional[dict[str, Any]] = None,
         **kwargs: Any,
     ) -> None:
-        """Initialize with a file path."""
-        try:
-            import fitz  # noqa:F401
-        except ImportError:
-            raise ImportError(
-                "`PyMuPDF` package not found, please install it with "
-                "`pip install pymupdf`"
-            )
-        super().__init__(file_path, headers=headers)
-        self.extract_images = extract_images
-        self.text_kwargs = kwargs
+        """Initialize with a file path.
 
-    def _lazy_load(self, **kwargs: Any) -> Iterator[Document]:
-        if kwargs:
-            logger.warning(
-                f"Received runtime arguments {kwargs}. Passing runtime args to `load`"
-                f" is deprecated. Please pass arguments during initialization instead."
-            )
+        Args:
+            file_path: The path to the PDF file to be loaded.
+            headers: Optional headers to use for GET request to download a file from a
+                web path.
+            password: Optional password for opening encrypted PDFs.
+            mode: The extraction mode, either "single" for the entire document or "page"
+                for page-wise extraction.
+            pages_delimitor: A string delimiter to separate pages in single-mode
+                extraction.
+            extract_images: Whether to extract images from the PDF.
+            images_to_text: Optional function or callable to convert images to text
+                during extraction.
+            extract_tables: Whether to extract tables in a specific format, such as
+                "csv", "markdown", or "html".
+            extract_tables_settings: Optional dictionary of settings for customizing
+                table extraction.
+            **kwargs: Additional keyword arguments for customizing text extraction
+                behavior.
 
+        Returns:
+            This method does not directly return data. Use the `load`, `lazy_load`, or
+            `aload` methods to retrieve parsed documents with content and metadata.
+
-        text_kwargs = {**self.text_kwargs, **kwargs}
-        parser = PyMuPDFParser(
-            text_kwargs=text_kwargs, extract_images=self.extract_images
+        Raises:
+            ValueError: If the `mode` argument is not one of "single" or "page".
+        """
+        if mode not in ["single", "page"]:
+            raise ValueError("mode must be single or page")
+        super().__init__(file_path, headers=headers)
+        self.parser = PyMuPDFParser(
+            password=password,
+            mode=mode,
+            pages_delimitor=pages_delimitor,
+            text_kwargs=kwargs,
+            extract_images=extract_images,
+            images_to_text=images_to_text,
+            extract_tables=extract_tables,
+            extract_tables_settings=extract_tables_settings,
         )
+
+    def lazy_load(self) -> Iterator[Document]:
+        """
+        Lazy load given path as pages.
+        Insert image, if possible, between two paragraphs.
+        In this way, a paragraph can be continued on the next page.
+        """
+        parser = self.parser
pprados marked this conversation as resolved.
         if self.web_path:
             blob = Blob.from_data(open(self.file_path, "rb").read(), path=self.web_path)  # type: ignore[attr-defined]
         else:
             blob = Blob.from_path(self.file_path)  # type: ignore[attr-defined]
         yield from parser.lazy_parse(blob)
 
-    def load(self, **kwargs: Any) -> list[Document]:
Review comment: Would love to make this change, but it's a breaking change due to
-        return list(self._lazy_load(**kwargs))
-
-    def lazy_load(self) -> Iterator[Document]:
-        yield from self._lazy_load()
-
 
 # MathpixPDFLoader implementation taken largely from Daniel Gross's:
 # https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21
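To make the `mode` and `pages_delimitor` arguments in the diff concrete, here is a minimal standalone sketch of the behavior the docstring describes. It is not the PR's parser code: `iter_documents` and `ASSUMED_PAGE_DELIMITER` are hypothetical names, the plain dicts stand in for `Document` objects, and the `"\n\f"` default is only assumed from the docstring example (the real value of `_default_page_delimitor` is not shown in this diff).

```python
from typing import Dict, Iterator, List

# Assumption: taken from the docstring example, not from the parser source.
ASSUMED_PAGE_DELIMITER = "\n\f"


def iter_documents(
    pages: List[str],
    mode: str = "page",
    pages_delimiter: str = ASSUMED_PAGE_DELIMITER,
) -> Iterator[Dict[str, object]]:
    """Yield one record per page ("page") or one for the whole file ("single").

    Hypothetical helper: plain dicts stand in for Document objects.
    Because this is a generator, the mode check runs on first iteration,
    whereas the loader in the diff validates eagerly in __init__.
    """
    if mode not in ("single", "page"):
        raise ValueError("mode must be single or page")
    if mode == "page":
        # Page-wise extraction: one record per page, with its page index.
        for number, text in enumerate(pages):
            yield {"page_content": text, "metadata": {"page": number}}
    else:
        # Single-document extraction: pages joined with the delimiter.
        yield {
            "page_content": pages_delimiter.join(pages),
            "metadata": {"total_pages": len(pages)},
        }


if __name__ == "__main__":
    for record in iter_documents(["first page", "second page"], mode="single"):
        print(repr(record["page_content"]))
```

With a form-feed delimiter, a consumer of the single-mode output can still split the text back into pages if it needs to.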
Review comment:
Perhaps remove this doc-string or improve it?
This doc-string is better at class level or at init level if semantics are controlled by parameterization in the initializer (e.g., mode).
Reply:
The important message here is “Insert image, if possible, between two paragraphs.” This is not always the case, and therefore cannot be stated in BasePDFLoader. That's why I've added this information specifically in this implementation. It will be found in all the other loaders that work like this. But DocumentIntelligence, for example, doesn't respect this.
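As an illustration of the behavior discussed in this thread, the sketch below places image-derived text before the last paragraph of a page rather than after it, so that a paragraph running over to the next page is not interrupted. It is a simplified assumption of how such a merge could work, not the code from this PR: `merge_text_and_images` is a hypothetical helper, and treating `"\n\n"` as the paragraph break is also an assumption.

```python
from typing import List


def merge_text_and_images(page_text: str, image_texts: List[str]) -> str:
    """Insert image text before the last paragraph of a page.

    Hypothetical helper for illustration only. The last paragraph stays
    at the end of the page so it can continue on the following page.
    """
    if not image_texts:
        return page_text
    extras = "\n\n".join(image_texts)
    # Assumption: paragraphs are separated by a blank line.
    pos = page_text.rfind("\n\n")
    if pos == -1:
        # No paragraph break found: fall back to appending at the end.
        return page_text + "\n\n" + extras
    return page_text[:pos] + "\n\n" + extras + page_text[pos:]


if __name__ == "__main__":
    page = "First paragraph.\n\nSecond paragraph continues"
    print(merge_text_and_images(page, ["[image: chart]"]))
```

The fallback branch is the "if possible" part of the docstring: when a page has no paragraph break, the image text simply goes at the end.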