Feature Request: Enable PDF parser to extract layouts, especially tables, from documents #19735
bricefotzo
started this conversation in
Ideas
Replies: 1 comment
I agree, the PDF loading has been really improved in v4.
Feature request
Enhance the PyPDFParser and PyPDFLoader classes in the langchain-community library to support the extraction_mode parameter of PageObject.extract_text() in pypdf. The recent pypdf version 4.0.0 introduces different extraction modes, including "plain" and "layout". This feature will allow langchain users to specify the extraction mode, which is particularly beneficial for processing PDFs with complex structures like tables. The upgrade requires raising the pypdf dependency in the langchain libraries from version 3.4.0 to at least 4.0.0.
Motivation
As a langchain user, I often encounter the challenge of accurately extracting text from PDFs with complex layouts, particularly when working with our company's AI product that utilizes langchain. The current implementation defaults to a "plain" text extraction, which can mishandle layouts, especially those involving tables or multicolumn text. Enabling the selection of an extraction mode during the document loading process would significantly enhance the flexibility and robustness of the langchain toolchain, benefiting not just my own workflows but also those of many other users handling similar document types.
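To make the difference concrete, here is a minimal sketch of selecting the mode when calling pypdf directly, assuming pypdf 4.0.0 or later is installed. The helper names page_text and pdf_text are made up for illustration and are not part of pypdf or langchain.

```python
from typing import List

# The two modes named in this request.
VALID_MODES = ("plain", "layout")


def page_text(page, extraction_mode: str = "plain") -> str:
    """Extract text from one pypdf page object using the given mode."""
    if extraction_mode not in VALID_MODES:
        raise ValueError(f"extraction_mode must be one of {VALID_MODES}")
    # extraction_mode is a keyword argument of PageObject.extract_text().
    return page.extract_text(extraction_mode=extraction_mode)


def pdf_text(path: str, extraction_mode: str = "plain") -> List[str]:
    """Return the text of every page of the PDF at `path`."""
    # Imported lazily so page_text stays usable without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(path)
    return [page_text(page, extraction_mode) for page in reader.pages]
```

Calling pdf_text("report.pdf", extraction_mode="layout") on a PDF containing a table (the file name is only a placeholder) should keep columns roughly aligned with whitespace, whereas the default "plain" mode tends to run the cells together.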
Proposal (If applicable)
The solution I propose involves refactoring the constructors of the PyPDFParser and PyPDFLoader classes to accept extraction_mode and additional kwargs. A new test, test_pypdf_loader_with_layout, will be added along with an example text file to ensure the proper functionality of these enhancements. The implementation will be designed to be backward-compatible, maintaining the existing interface while extending its capabilities.
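A rough sketch of what the constructor change could look like, under the assumption that the parser simply forwards the mode and extra keyword arguments to pypdf. SketchPyPDFParser, PageDocument and the extraction_kwargs parameter name are hypothetical stand-ins, not the actual langchain-community classes or signature.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Iterator, Optional


@dataclass
class PageDocument:
    """Hypothetical stand-in for langchain's Document class."""
    page_content: str
    metadata: Dict[str, Any] = field(default_factory=dict)


class SketchPyPDFParser:
    """Illustrative parser; not the real PyPDFParser."""

    def __init__(
        self,
        password: Optional[str] = None,
        extraction_mode: str = "plain",  # default keeps the current behaviour
        extraction_kwargs: Optional[Dict[str, Any]] = None,
    ) -> None:
        self.password = password
        self.extraction_mode = extraction_mode
        self.extraction_kwargs = extraction_kwargs or {}

    def parse_pages(self, pages: Iterable[Any], source: str) -> Iterator[PageDocument]:
        """Turn pypdf page objects into one document per page."""
        for number, page in enumerate(pages):
            text = page.extract_text(
                extraction_mode=self.extraction_mode,
                **self.extraction_kwargs,
            )
            yield PageDocument(text, {"source": source, "page": number})

    def parse(self, path: str) -> Iterator[PageDocument]:
        """Open the PDF at `path` with pypdf and parse every page."""
        from pypdf import PdfReader  # assumes pypdf >= 4.0.0

        reader = PdfReader(path, password=self.password)
        yield from self.parse_pages(reader.pages, path)
```

Because extraction_mode defaults to "plain", existing callers that construct the parser with no arguments would see no change, which is the backward compatibility the proposal asks for.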