Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: add converter based on pdfminer #7607

Merged
merged 12 commits into from
May 2, 2024
Merged

feat: add converter based on pdfminer #7607

merged 12 commits into from
May 2, 2024

Conversation

medsriha
Copy link
Member

@medsriha medsriha commented Apr 27, 2024

Related Issues

Proposed Changes:

The default PDF converter may not extract text correctly for PDFs with complex layouts, such as those containing multiple text columns. To address this issue, PDFMinerToDocument is being introduced to enable users to customize text extraction from PDF files through pdfminer native arguments. Users can then configure the object to retain the reading order, among other options.

How did you test it?

Tested using several unit tests

Notes for the reviewer

@medsriha medsriha requested review from a team as code owners April 27, 2024 02:32
@medsriha medsriha requested review from dfokina and silvanocerza and removed request for a team April 27, 2024 02:32
@github-actions github-actions bot added 2.x Related to Haystack v2.0 type:documentation Improvements on the docs labels Apr 27, 2024
@medsriha medsriha added type:feature New feature or request topic:file_converter and removed type:feature New feature or request labels Apr 27, 2024
@coveralls
Copy link
Collaborator

coveralls commented Apr 27, 2024

Pull Request Test Coverage Report for Build 8901937116

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.07%) to 90.195%

Totals Coverage Status
Change from base Build 8896610003: 0.07%
Covered Lines: 6384
Relevant Lines: 7078

💛 - Coveralls

@silvanocerza
Copy link
Contributor

Nice! PR looks great, there's some linting issues that need to be fixed though. I see pylint, mypy and black failing, should be easy fixes. You can run those quite easily locally with hatch run test:lint and hatch run test:types, some lint failures can be automatically fixed with hatch run lint-fix.

Also when adding new lazy imports remember to update the test dependencies with the necessary dependencies otherwise tests will always fail.

@silvanocerza silvanocerza self-assigned this Apr 29, 2024
@silvanocerza
Copy link
Contributor

Also I suggest rebasing or merging main in your branch to bring PR #7215 in as I recently changed which checks are required to merge.

@silvanocerza
Copy link
Contributor

Didn't notice at all we were missing tests, this causes coverage to go down a bit. Could you add some? 👀

Copy link
Contributor

@silvanocerza silvanocerza left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@silvanocerza silvanocerza merged commit 2e35f13 into deepset-ai:main May 2, 2024
24 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants