Add feature for extracting images from pdf and recognizing text from images. (langchain-ai#10653)

**Description**

This addresses langchain-ai#10423: it would be useful to extract images
from PDFs and recognize the text in them. I have implemented this for
`PyPDFLoader`, `PyPDFium2Loader`, `PyPDFDirectoryLoader`,
`PyMuPDFLoader`, `PDFMinerLoader`, and `PDFPlumberLoader`.
[RapidOCR](https://github.com/RapidAI/RapidOCR.git) is used to recognize
text in the extracted images. Because OCR is time-consuming, a boolean
parameter `extract_images` controls whether images are extracted and
recognized. I measured the time each parser takes in a unit test on my
own laptop (ThinkBook 14+ with an AMD R7-6800H); the results are:

| extract_images | PyPDFParser | PDFMinerParser | PyMuPDFParser | PyPDFium2Parser | PDFPlumberParser |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| False | 0.27s | 0.39s | 0.06s | 0.08s | 1.01s |
| True | 17.01s | 20.67s | 20.32s | 19.75s | 20.55s |
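
To put the table in perspective, the slowdown factor for each parser can be computed directly from the reported timings. The sketch below uses only the numbers from the table; the dictionary and variable names are illustrative.

```python
# Timings in seconds, copied from the table:
# (extract_images=False, extract_images=True).
timings = {
    "PyPDFParser": (0.27, 17.01),
    "PDFMinerParser": (0.39, 20.67),
    "PyMuPDFParser": (0.06, 20.32),
    "PyPDFium2Parser": (0.08, 19.75),
    "PDFPlumberParser": (1.01, 20.55),
}

# Slowdown factor = time with OCR / time without OCR.
slowdown = {
    name: with_ocr / without_ocr
    for name, (without_ocr, with_ocr) in timings.items()
}

for name, factor in slowdown.items():
    print(f"{name}: about {factor:.0f}x slower with extract_images=True")
```

With OCR enabled, all five parsers land between roughly 17 and 21 seconds, which suggests that the OCR step, not the PDF parsing, dominates the run time on this machine.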

**Issue**

langchain-ai#10423 

**Dependencies**

rapidocr_onnxruntime in
[RapidOCR](https://github.com/RapidAI/RapidOCR/tree/main)

---------

Co-authored-by: Bagatur <[email protected]>
therontau0054 and baskaryan authored Oct 6, 2023
1 parent 8e3fbc9 commit 35297ca
Showing 6 changed files with 487 additions and 30 deletions.
@@ -1,4 +1,4 @@
# Using PyPDF
## Using PyPDF

Load PDF using `pypdf` into array of documents, where each document contains the page content and metadata with `page` number.

@@ -74,6 +74,30 @@ for doc in docs:

</CodeOutputBlock>


### Extracting images

Using the `rapidocr-onnxruntime` package we can extract images as text as well:

```bash
pip install rapidocr-onnxruntime
```
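
Since `rapidocr-onnxruntime` is an optional dependency, it can help to check for it before turning on `extract_images`. The following is a minimal sketch using only the standard library; `ocr_available` is a hypothetical helper name, not part of LangChain.

```python
import importlib.util


def ocr_available(module_name: str = "rapidocr_onnxruntime") -> bool:
    """Return True if the given top-level module can be imported."""
    return importlib.util.find_spec(module_name) is not None


# Only request image extraction when the OCR package is installed.
extract_images = ocr_available()
print(f"extract_images={extract_images}")
```

The resulting flag could then be passed as the `extract_images` argument of a loader.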

```python
loader = PyPDFLoader("https://arxiv.org/pdf/2103.15348.pdf", extract_images=True)
pages = loader.load()
pages[4].page_content
```

<CodeOutputBlock lang="python">

```
'LayoutParser : A Unified Toolkit for DL-Based DIA 5\nTable 1: Current layout detection models in the LayoutParser model zoo\nDataset Base Model1Large Model Notes\nPubLayNet [38] F / M M Layouts of modern scientific documents\nPRImA [3] M - Layouts of scanned modern magazines and scientific reports\nNewspaper [17] F - Layouts of scanned US newspapers from the 20th century\nTableBank [18] F F Table region on modern scientific and business document\nHJDataset [31] F / M - Layouts of history Japanese documents\n1For each dataset, we train several models of different sizes for different needs (the trade-off between accuracy\nvs. computational cost). For “base model” and “large model”, we refer to using the ResNet 50 or ResNet 101\nbackbones [ 13], respectively. One can train models of different architectures, like Faster R-CNN [ 28] (F) and Mask\nR-CNN [ 12] (M). For example, an F in the Large Model column indicates it has a Faster R-CNN model trained\nusing the ResNet 101 backbone. The platform is maintained and a number of additions will be made to the model\nzoo in coming months.\nlayout data structures , which are optimized for efficiency and versatility. 3) When\nnecessary, users can employ existing or customized OCR models via the unified\nAPI provided in the OCR module . 4)LayoutParser comes with a set of utility\nfunctions for the visualization and storage of the layout data. 5) LayoutParser\nis also highly customizable, via its integration with functions for layout data\nannotation and model training . We now provide detailed descriptions for each\ncomponent.\n3.1 Layout Detection Models\nInLayoutParser , a layout model takes a document image as an input and\ngenerates a list of rectangular boxes for the target content regions. Different\nfrom traditional methods, it relies on deep convolutional neural networks rather\nthan manually curated rules to identify content regions. 
It is formulated as an\nobject detection problem and state-of-the-art models like Faster R-CNN [ 28] and\nMask R-CNN [ 12] are used. This yields prediction results of high accuracy and\nmakes it possible to build a concise, generalized interface for layout detection.\nLayoutParser , built upon Detectron2 [ 35], provides a minimal API that can\nperform layout detection with only four lines of code in Python:\n1import layoutparser as lp\n2image = cv2. imread (" image_file ") # load images\n3model = lp. Detectron2LayoutModel (\n4 "lp :// PubLayNet / faster_rcnn_R_50_FPN_3x / config ")\n5layout = model . detect ( image )\nLayoutParser provides a wealth of pre-trained model weights using various\ndatasets covering different languages, time periods, and document types. Due to\ndomain shift [ 7], the prediction performance can notably drop when models are ap-\nplied to target samples that are significantly different from the training dataset. As\ndocument structures and layouts vary greatly in different domains, it is important\nto select models trained on a dataset similar to the test samples. A semantic syntax\nis used for initializing the model weights in LayoutParser , using both the dataset\nname and model name lp://<dataset-name>/<model-architecture-name> .'
```

</CodeOutputBlock>


## Using MathPix

Inspired by Daniel Gross's [https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21](https://gist.github.com/danielgross/3ab4104e14faccc12b49200843adab21)
233 changes: 219 additions & 14 deletions libs/langchain/langchain/document_loaders/parsers/pdf.py
@@ -1,22 +1,79 @@
"""Module contains common parsers for PDFs."""
from __future__ import annotations

from typing import TYPE_CHECKING, Any, Iterator, Mapping, Optional, Sequence, Union
import warnings
from typing import (
TYPE_CHECKING,
Any,
Iterable,
Iterator,
Mapping,
Optional,
Sequence,
Union,
)
from urllib.parse import urlparse

import numpy as np

from langchain.document_loaders.base import BaseBlobParser
from langchain.document_loaders.blob_loaders import Blob
from langchain.schema import Document

if TYPE_CHECKING:
import fitz.fitz
import pdfminer.layout
import pdfplumber.page
import pypdf._page
import pypdfium2._helpers.page


_PDF_FILTER_WITH_LOSS = ["DCTDecode", "DCT", "JPXDecode"]
_PDF_FILTER_WITHOUT_LOSS = [
"LZWDecode",
"LZW",
"FlateDecode",
"Fl",
"ASCII85Decode",
"A85",
"ASCIIHexDecode",
"AHx",
"RunLengthDecode",
"RL",
"CCITTFaxDecode",
"CCF",
"JBIG2Decode",
]


def extract_from_images_with_rapidocr(
    images: Sequence[Union[Iterable[np.ndarray], bytes]]
) -> str:
    try:
        from rapidocr_onnxruntime import RapidOCR
    except ImportError:
        raise ImportError(
            "`rapidocr-onnxruntime` package not found, please install it with "
            "`pip install rapidocr-onnxruntime`"
        )
    ocr = RapidOCR()
    text = ""
    for img in images:
        result, _ = ocr(img)
        if result:
            result = [text[1] for text in result]
            text += "\n".join(result)
    return text
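
The helper above appends each image's recognized lines to a single string. A minimal sketch of that aggregation step follows, with a stand-in `fake_ocr` callable in place of `RapidOCR`. The stub, its canned data, and the `join_ocr_text` name are assumptions for illustration; only the `(result, _)` return tuple and the use of index `1` for the text come from the code in this commit. Unlike the helper above, this variant also puts a newline between images, so the last line of one image does not run into the first line of the next.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Each OCR hit is assumed to look like (box, text, score); only index 1 is used.
OcrHit = Tuple[Any, str, float]
OcrFn = Callable[[Any], Tuple[Optional[List[OcrHit]], Any]]


def join_ocr_text(images: Sequence[Any], ocr: OcrFn) -> str:
    """Run `ocr` on every image and join all recognized lines with newlines."""
    chunks: List[str] = []
    for img in images:
        result, _ = ocr(img)
        if result:  # skip images where nothing was recognized
            chunks.append("\n".join(hit[1] for hit in result))
    return "\n".join(chunks)


def fake_ocr(img: Any) -> Tuple[Optional[List[OcrHit]], Any]:
    """Hypothetical stand-in for RapidOCR: maps an image key to canned hits."""
    canned = {
        "img-a": [(None, "Table 1", 0.99), (None, "Dataset", 0.98)],
        "img-b": [(None, "Figure 2", 0.97)],
        "blank": None,
    }
    return canned[img], None


print(join_ocr_text(["img-a", "blank", "img-b"], fake_ocr))
```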


class PyPDFParser(BaseBlobParser):
"""Load `PDF` using `pypdf`"""

def __init__(self, password: Optional[Union[str, bytes]] = None):
def __init__(
self, password: Optional[Union[str, bytes]] = None, extract_images: bool = False
):
self.password = password
self.extract_images = extract_images

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
"""Lazily parse the blob."""
@@ -26,36 +83,123 @@ def lazy_parse(self, blob: Blob) -> Iterator[Document]:
pdf_reader = pypdf.PdfReader(pdf_file_obj, password=self.password)
yield from [
Document(
page_content=page.extract_text(),
page_content=page.extract_text()
+ self._extract_images_from_page(page),
metadata={"source": blob.source, "page": page_number},
)
for page_number, page in enumerate(pdf_reader.pages)
]

    def _extract_images_from_page(self, page: pypdf._page.PageObject) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images or "/XObject" not in page["/Resources"].keys():
            return ""

        xObject = page["/Resources"]["/XObject"].get_object()
        images = []
        for obj in xObject:
            if xObject[obj]["/Subtype"] == "/Image":
                if xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITHOUT_LOSS:
                    height, width = xObject[obj]["/Height"], xObject[obj]["/Width"]

                    images.append(
                        np.frombuffer(xObject[obj].get_data(), dtype=np.uint8).reshape(
                            height, width, -1
                        )
                    )
                elif xObject[obj]["/Filter"][1:] in _PDF_FILTER_WITH_LOSS:
                    images.append(xObject[obj].get_data())
                else:
                    warnings.warn("Unknown PDF Filter!")
        return extract_from_images_with_rapidocr(images)
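
The branch on `/Filter` above can be isolated into a small classifier. The sketch below is an illustration under assumptions, not the code in this commit: `classify_filter` is a hypothetical helper, and it additionally accepts a list of names, because the PDF format allows `/Filter` to be an array of filters, a case the string slice `[1:]` above does not handle. Treating a chain as lossy when any of its filters is lossy is likewise an assumption made for this example.

```python
from typing import List, Union

# Same filter names as the module-level lists in this commit.
LOSSY = {"DCTDecode", "DCT", "JPXDecode"}
LOSSLESS = {
    "LZWDecode", "LZW", "FlateDecode", "Fl", "ASCII85Decode", "A85",
    "ASCIIHexDecode", "AHx", "RunLengthDecode", "RL", "CCITTFaxDecode",
    "CCF", "JBIG2Decode",
}


def classify_filter(pdf_filter: Union[str, List[str], None]) -> str:
    """Return "lossy", "lossless", or "unknown" for a PDF /Filter value."""
    if pdf_filter is None:  # stands for a missing /Filter entry
        return "unknown"
    names = [pdf_filter] if isinstance(pdf_filter, str) else list(pdf_filter)
    # Strip the leading slash that PDF name objects carry, e.g. "/FlateDecode".
    names = [n[1:] if n.startswith("/") else n for n in names]
    if not names:
        return "unknown"
    if any(n in LOSSY for n in names):
        return "lossy"
    if all(n in LOSSLESS for n in names):
        return "lossless"
    return "unknown"


for value in ("/FlateDecode", "/DCTDecode", ["/ASCII85Decode", "/DCTDecode"], "/Crypt"):
    print(value, "->", classify_filter(value))
```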


class PDFMinerParser(BaseBlobParser):
"""Parse `PDF` using `PDFMiner`."""

def __init__(self, extract_images: bool = False):
self.extract_images = extract_images

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
"""Lazily parse the blob."""
from pdfminer.high_level import extract_text
if not self.extract_images:
from pdfminer.high_level import extract_text

with blob.as_bytes_io() as pdf_file_obj:
text = extract_text(pdf_file_obj)
metadata = {"source": blob.source}
yield Document(page_content=text, metadata=metadata)
with blob.as_bytes_io() as pdf_file_obj:
text = extract_text(pdf_file_obj)
metadata = {"source": blob.source}
yield Document(page_content=text, metadata=metadata)
else:
import io

from pdfminer.converter import PDFPageAggregator, TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

text_io = io.StringIO()
with blob.as_bytes_io() as pdf_file_obj:
pages = PDFPage.get_pages(pdf_file_obj)
rsrcmgr = PDFResourceManager()
device_for_text = TextConverter(rsrcmgr, text_io, laparams=LAParams())
device_for_image = PDFPageAggregator(rsrcmgr, laparams=LAParams())
interpreter_for_text = PDFPageInterpreter(rsrcmgr, device_for_text)
interpreter_for_image = PDFPageInterpreter(rsrcmgr, device_for_image)
for i, page in enumerate(pages):
interpreter_for_text.process_page(page)
interpreter_for_image.process_page(page)
content = text_io.getvalue() + self._extract_images_from_page(
device_for_image.get_result()
)
text_io.truncate(0)
text_io.seek(0)
metadata = {"source": blob.source, "page": str(i)}
yield Document(page_content=content, metadata=metadata)

    def _extract_images_from_page(self, page: pdfminer.layout.LTPage) -> str:
        """Extract images from page and get the text with RapidOCR."""
        import pdfminer

        def get_image(layout_object: Any) -> Any:
            if isinstance(layout_object, pdfminer.layout.LTImage):
                return layout_object
            if isinstance(layout_object, pdfminer.layout.LTContainer):
                for child in layout_object:
                    return get_image(child)
            else:
                return None

        images = []

        for img in list(filter(bool, map(get_image, page))):
            if img.stream["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
                images.append(
                    np.frombuffer(img.stream.get_data(), dtype=np.uint8).reshape(
                        img.stream["Height"], img.stream["Width"], -1
                    )
                )
            elif img.stream["Filter"].name in _PDF_FILTER_WITH_LOSS:
                images.append(img.stream.get_data())
            else:
                warnings.warn("Unknown PDF Filter!")
        return extract_from_images_with_rapidocr(images)
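
Note that the nested `get_image` helper above returns from inside its `for` loop, so for a container it only ever inspects the first child. A minimal sketch of a traversal that collects every image in a layout tree is shown below. `Image` and `Container` are hypothetical stand-ins for `pdfminer.layout.LTImage` and `pdfminer.layout.LTContainer`, used so the example runs without pdfminer installed; it is not the code in this commit.

```python
from typing import Any, Iterator, List


class Image:
    """Hypothetical stand-in for pdfminer.layout.LTImage."""

    def __init__(self, name: str) -> None:
        self.name = name


class Container(list):
    """Hypothetical stand-in for pdfminer.layout.LTContainer (iterable of children)."""


def iter_images(layout_object: Any) -> Iterator[Image]:
    """Yield every Image found anywhere below layout_object, depth first."""
    if isinstance(layout_object, Image):
        yield layout_object
    elif isinstance(layout_object, Container):
        for child in layout_object:
            yield from iter_images(child)
    # Any other object (text boxes, lines, ...) is ignored.


page = Container([
    "some text",
    Container([Image("a"), Container([Image("b")]), "caption"]),
    Image("c"),
])
names: List[str] = [img.name for img in iter_images(page)]
print(names)
```

A generator keeps the traversal lazy and avoids the `filter(bool, map(...))` step, since non-image objects simply yield nothing.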


class PyMuPDFParser(BaseBlobParser):
"""Parse `PDF` using `PyMuPDF`."""

def __init__(self, text_kwargs: Optional[Mapping[str, Any]] = None) -> None:
def __init__(
self,
text_kwargs: Optional[Mapping[str, Any]] = None,
extract_images: bool = False,
) -> None:
"""Initialize the parser.
Args:
text_kwargs: Keyword arguments to pass to ``fitz.Page.get_text()``.
"""
self.text_kwargs = text_kwargs or {}
self.extract_images = extract_images

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
"""Lazily parse the blob."""
@@ -66,7 +210,8 @@ def lazy_parse(self, blob: Blob) -> Iterator[Document]:

yield from [
Document(
page_content=page.get_text(**self.text_kwargs),
page_content=page.get_text(**self.text_kwargs)
+ self._extract_images_from_page(doc, page),
metadata=dict(
{
"source": blob.source,
@@ -84,11 +229,31 @@ def lazy_parse(self, blob: Blob) -> Iterator[Document]:
for page in doc
]

    def _extract_images_from_page(
        self, doc: fitz.fitz.Document, page: fitz.fitz.Page
    ) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images:
            return ""
        import fitz

        img_list = page.get_images()
        imgs = []
        for img in img_list:
            xref = img[0]
            pix = fitz.Pixmap(doc, xref)
            imgs.append(
                np.frombuffer(pix.samples, dtype=np.uint8).reshape(
                    pix.height, pix.width, -1
                )
            )
        return extract_from_images_with_rapidocr(imgs)
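
Several parsers above reshape a flat byte buffer with `reshape(height, width, -1)`, letting NumPy infer the number of channels. The sketch below shows the arithmetic behind that `-1` in plain Python; `infer_channels` is a hypothetical helper for illustration only. The reshape succeeds only when the buffer length is an exact multiple of `height * width`, which holds for one byte per sample but not, for example, for 1-bit-per-pixel image data.

```python
def infer_channels(buffer_len: int, height: int, width: int) -> int:
    """Return the channel count that reshape(height, width, -1) would infer."""
    pixels = height * width
    if pixels <= 0 or buffer_len % pixels != 0:
        raise ValueError(
            f"cannot reshape buffer of size {buffer_len} into ({height}, {width}, -1)"
        )
    return buffer_len // pixels


print(infer_channels(2 * 3 * 3, height=2, width=3))  # RGB: 3 bytes per pixel
print(infer_channels(2 * 3 * 1, height=2, width=3))  # grayscale: 1 byte per pixel
```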


class PyPDFium2Parser(BaseBlobParser):
"""Parse `PDF` with `PyPDFium2`."""

def __init__(self) -> None:
def __init__(self, extract_images: bool = False) -> None:
"""Initialize the parser."""
try:
import pypdfium2 # noqa:F401
@@ -97,6 +262,7 @@ def __init__(self) -> None:
"pypdfium2 package not found, please install it with"
" `pip install pypdfium2`"
)
self.extract_images = extract_images

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
"""Lazily parse the blob."""
@@ -111,18 +277,34 @@ def lazy_parse(self, blob: Blob) -> Iterator[Document]:
text_page = page.get_textpage()
content = text_page.get_text_range()
text_page.close()
content += "\n" + self._extract_images_from_page(page)
page.close()
metadata = {"source": blob.source, "page": page_number}
yield Document(page_content=content, metadata=metadata)
finally:
pdf_reader.close()

    def _extract_images_from_page(self, page: pypdfium2._helpers.page.PdfPage) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images:
            return ""

        import pypdfium2.raw as pdfium_c

        images = list(page.get_objects(filter=(pdfium_c.FPDF_PAGEOBJ_IMAGE,)))

        images = list(map(lambda x: x.get_bitmap().to_numpy(), images))
        return extract_from_images_with_rapidocr(images)


class PDFPlumberParser(BaseBlobParser):
"""Parse `PDF` with `PDFPlumber`."""

def __init__(
self, text_kwargs: Optional[Mapping[str, Any]] = None, dedupe: bool = False
self,
text_kwargs: Optional[Mapping[str, Any]] = None,
dedupe: bool = False,
extract_images: bool = False,
) -> None:
"""Initialize the parser.
@@ -132,6 +314,7 @@ def __init__(
"""
self.text_kwargs = text_kwargs or {}
self.dedupe = dedupe
self.extract_images = extract_images

def lazy_parse(self, blob: Blob) -> Iterator[Document]:
"""Lazily parse the blob."""
@@ -142,12 +325,14 @@ def lazy_parse(self, blob: Blob) -> Iterator[Document]:

yield from [
Document(
page_content=self._process_page_content(page),
page_content=self._process_page_content(page)
+ "\n"
+ self._extract_images_from_page(page),
metadata=dict(
{
"source": blob.source,
"file_path": blob.source,
"page": page.page_number,
"page": page.page_number - 1,
"total_pages": len(doc.pages),
},
**{
@@ -166,6 +351,26 @@ def _process_page_content(self, page: pdfplumber.page.Page) -> str:
return page.dedupe_chars().extract_text(**self.text_kwargs)
return page.extract_text(**self.text_kwargs)

    def _extract_images_from_page(self, page: pdfplumber.page.Page) -> str:
        """Extract images from page and get the text with RapidOCR."""
        if not self.extract_images:
            return ""

        images = []
        for img in page.images:
            if img["stream"]["Filter"].name in _PDF_FILTER_WITHOUT_LOSS:
                images.append(
                    np.frombuffer(img["stream"].get_data(), dtype=np.uint8).reshape(
                        img["stream"]["Height"], img["stream"]["Width"], -1
                    )
                )
            elif img["stream"]["Filter"].name in _PDF_FILTER_WITH_LOSS:
                images.append(img["stream"].get_data())
            else:
                warnings.warn("Unknown PDF Filter!")

        return extract_from_images_with_rapidocr(images)


class AmazonTextractPDFParser(BaseBlobParser):
"""Send `PDF` files to `Amazon Textract` and parse them.