feat: add converter based on pdfminer #7607

medsriha · 2024-04-27T02:32:19Z

Related Issues

feat: Add converter based on pdfminer #6763

Proposed Changes:

The default PDF converter may not extract text correctly from PDFs with complex layouts, such as those containing multiple text columns. To address this, this PR introduces PDFMinerToDocument, which lets users customize text extraction from PDF files through pdfminer's native arguments. Users can then configure the converter to retain the reading order, among other options.
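
To illustrate what customizing extraction through pdfminer's native arguments can look like, here is a minimal sketch that calls pdfminer.six directly. It is not the PDFMinerToDocument implementation, which this thread does not show. The helper names (layout_options, extract_text) and the form-feed page separator are assumptions made for the example; the option names and default values follow pdfminer.six's LAParams.

```python
from typing import Any, Dict, List

# Layout-analysis options and defaults, mirroring pdfminer.six's LAParams.
PDFMINER_LAYOUT_DEFAULTS: Dict[str, Any] = {
    "line_overlap": 0.5,
    "char_margin": 2.0,
    "line_margin": 0.5,
    "word_margin": 0.1,
    "boxes_flow": 0.5,  # -1.0..1.0, or None to disable advanced layout analysis
    "detect_vertical": False,
    "all_texts": False,
}


def layout_options(**overrides: Any) -> Dict[str, Any]:
    """Merge user overrides into the defaults, rejecting unknown option names."""
    unknown = set(overrides) - set(PDFMINER_LAYOUT_DEFAULTS)
    if unknown:
        raise ValueError(f"Unknown layout options: {sorted(unknown)}")
    return {**PDFMINER_LAYOUT_DEFAULTS, **overrides}


def extract_text(pdf_path: str, **overrides: Any) -> str:
    """Hypothetical helper: extract text from a PDF using custom layout options."""
    # Imported here so this module still loads when pdfminer.six is not installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    laparams = LAParams(**layout_options(**overrides))
    pages: List[str] = []
    for page_layout in extract_pages(pdf_path, laparams=laparams):
        # Keep only text containers, in the order the layout analysis yields them.
        blocks = [el.get_text() for el in page_layout if isinstance(el, LTTextContainer)]
        pages.append("".join(blocks))
    # Assumption: pages are joined with a form feed character.
    return "\f".join(pages)
```

A call such as extract_text("report.pdf", boxes_flow=None) would change how text boxes are ordered on a multi-column page; which value gives the best reading order depends on the document. The file name here is a placeholder.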

How did you test it?

Tested using several unit tests

Notes for the reviewer

I have read the contributors guidelines and the code of conduct
I have updated the related issue with new insights and changes
I added unit tests and updated the docstrings
I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test:.
I documented my code
I ran pre-commit hooks and fixed any issue

coveralls · 2024-04-27T22:15:50Z

Pull Request Test Coverage Report for Build 8901937116

Details

0 of 0 changed or added relevant lines in 0 files are covered.
No unchanged relevant lines lost coverage.
Overall coverage increased (+0.07%) to 90.195%

Totals
Change from base Build 8896610003:	0.07%
Covered Lines:	6384
Relevant Lines:	7078

💛 - Coveralls

silvanocerza · 2024-04-29T14:40:03Z

Nice! The PR looks great, but there are some linting issues that need to be fixed. I see pylint, mypy and black failing; these should be easy fixes. You can run the checks locally with hatch run test:lint and hatch run test:types, and some lint failures can be fixed automatically with hatch run lint-fix.

Also, when adding new lazy imports, remember to add the packages they need to the test dependencies; otherwise the tests will always fail.
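
For context on why a missing test dependency makes the tests fail, here is a generic sketch of the lazy (optional) import pattern using only the standard library. It is not Haystack's own helper; the class name OptionalImport and its methods are hypothetical. The import error is captured at module load and only raised when the component actually needs the package, so any test that constructs the component fails in an environment where the package is absent.

```python
import importlib
from types import ModuleType
from typing import Optional


class OptionalImport:
    """Hypothetical helper: try an import now, report a failure only on use."""

    def __init__(self, module_name: str, install_hint: str) -> None:
        self.module_name = module_name
        self.install_hint = install_hint
        self.module: Optional[ModuleType] = None
        self.error: Optional[ImportError] = None
        try:
            self.module = importlib.import_module(module_name)
        except ImportError as exc:
            # Remember the failure instead of raising at import time.
            self.error = exc

    def check(self) -> ModuleType:
        """Return the module, or raise with an install hint if it is missing."""
        if self.module is None:
            raise ImportError(
                f"Could not import '{self.module_name}'. {self.install_hint}"
            ) from self.error
        return self.module


# A component would call check() when it is constructed or first used.
pdfminer_import = OptionalImport("pdfminer", "Run 'pip install pdfminer.six'")
```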

silvanocerza · 2024-04-29T15:30:00Z

Also, I suggest rebasing on main or merging main into your branch to bring in PR #7215, as I recently changed which checks are required to merge.

silvanocerza · 2024-04-30T07:35:35Z

I didn't notice at all that we were missing tests; this causes coverage to go down a bit. Could you add some? 👀

silvanocerza

Nice!

medsriha and others added 5 commits April 26, 2024 19:15

Initial commit pdfminer converter

24b4e45

Merge branch 'deepset-ai:main' into add_pdfminer

19b7f15

Revert back naming of argument all_text per pdfminer documentation

0f6e652

Add the component decorator

b6d731a

Add release notes

8246392

medsriha requested review from a team as code owners April 27, 2024 02:32

medsriha requested review from dfokina and silvanocerza and removed request for a team April 27, 2024 02:32

github-actions bot added 2.x Related to Haystack v2.0 type:documentation Improvements on the docs labels Apr 27, 2024

medsriha added type:feature New feature or request topic:file_converter and removed type:feature New feature or request labels Apr 27, 2024

Reformat code with black

7f2e099

Remove LTPage and comments

fb6a012

silvanocerza self-assigned this Apr 29, 2024

Update dependencies in pyproject.toml

1f89af6

github-actions bot added the topic:build/distribution label Apr 29, 2024

Merge branch 'main' into add_pdfminer

0c5c54a

medsriha and others added 3 commits April 30, 2024 16:23

Merge branch 'main' into add_pdfminer

3cc5102

Added some tests and incorporated reference doc in docstring

1be43d9

Added some tests and incorporated reference doc in docstring

2a6609b

github-actions bot added the topic:tests label Apr 30, 2024

silvanocerza approved these changes May 2, 2024

silvanocerza merged commit 2e35f13 into deepset-ai:main May 2, 2024
24 checks passed
