forked from langchain-ai/langchain
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Add feature for extracting images from pdf and recognizing text from images. (langchain-ai#10653)

**Description**

This addresses langchain-ai#10423: it would be a useful feature to extract images from a PDF and recognize the text in them. I have implemented it for `PyPDFLoader`, `PyPDFium2Loader`, `PyPDFDirectoryLoader`, `PyMuPDFLoader`, `PDFMinerLoader`, and `PDFPlumberLoader`. [RapidOCR](https://github.com/RapidAI/RapidOCR.git) is used to recognize text in the extracted images. Because OCR is time-consuming, a boolean parameter `extract_images` controls whether images are extracted and recognized.

I measured the run time of each parser with the unit tests on my own laptop (ThinkBook 14+ with an AMD R7-6800H). The results are:

| extract_images | PyPDFParser | PDFMinerParser | PyMuPDFParser | PyPDFium2Parser | PDFPlumberParser |
| ------------- | ------------- | ------------- | ------------- | ------------- | ------------- |
| False | 0.27s | 0.39s | 0.06s | 0.08s | 1.01s |
| True | 17.01s | 20.67s | 20.32s | 19.75s | 20.55s |

**Issue**

langchain-ai#10423

**Dependencies**

`rapidocr_onnxruntime` from [RapidOCR](https://github.com/RapidAI/RapidOCR/tree/main)

---------

Co-authored-by: Bagatur <[email protected]>
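The opt-in behaviour of `extract_images` described above can be illustrated with a minimal, standard-library sketch. This is not the code from this commit: `Page`, `fake_ocr`, and `page_content` are hypothetical names, the OCR step is a stub standing in for an engine such as RapidOCR, and the default of `False` is an assumption made for the illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Page:
    """Hypothetical stand-in for one parsed PDF page."""
    text: str
    images: List[bytes] = field(default_factory=list)


def fake_ocr(image: bytes) -> str:
    """Stub for a real OCR engine; it only decodes the bytes as UTF-8."""
    return image.decode("utf-8")


def page_content(
    page: Page,
    extract_images: bool = False,  # assumed default, chosen for this sketch
    ocr: Callable[[bytes], str] = fake_ocr,
) -> str:
    """Return the page text, appending OCR output only when requested."""
    if not extract_images:
        # Fast path: images are never touched, so no OCR cost is paid.
        return page.text
    recognized = [ocr(image) for image in page.images]
    return "\n".join([page.text, *recognized])


page = Page(text="Figure 1 shows the results.", images=[b"accuracy: 0.93"])
fast = page_content(page)                       # text layer only
full = page_content(page, extract_images=True)  # text layer plus OCR output
```

Keeping the flag off unless the caller asks for it means the slow OCR path shown in the timing table is only taken on request.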
1 parent 8e3fbc9, commit 35297ca
Showing 6 changed files with 487 additions and 30 deletions.