community: Added ADOBE PDF EXTRACT #23686

Closed
wants to merge 21 commits into from



@DavidMoserAI DavidMoserAI commented Jun 30, 2024

Description: Adobe PDF Extract is a service that provides superior performance over other document intelligence services, both in its accuracy and in its variety of features. Parsing documents based on their layout information is crucial for retrieval-augmented generation, especially when trying to achieve production-grade performance. I have used this service successfully myself and would like to contribute my code to the world.
Issue: Someone raised an issue about this a while ago: #8163
Dependencies: A user would have to install the Adobe PDF Services library like so: pip install pdfservices-sdk

Twitter handle: @DavidMoserAI


vercel bot commented Jun 30, 2024

The latest updates on your projects.

Name Status Preview Comments Updated (UTC)
langchain 🛑 Canceled (Inspect) Dec 14, 2024 2:02am

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. community Related to langchain-community Ɑ: doc loader Related to document loader module (not documentation) 🤖:improvement Medium size change to existing code to handle new use-cases labels Jun 30, 2024
@DavidMoserAI
Author

@hwchase17

@DavidMoserAI
Author

@baskaryan @efriis @eyurtsev


vercel bot commented Sep 2, 2024

Deployment failed with the following error:

The provided GitHub repository does not contain the requested branch or commit reference. Please ensure the repository is not empty.

class AdobePDFExtractParser(BaseBlobParser):
    """Loads a document using the Adobe PDF Services API.

    Args:
Collaborator


could we expand the arg descriptions here
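
As a rough illustration of what fuller argument descriptions could look like, here is a minimal sketch. It is not the code from this PR: the class name `SketchPDFExtractParser`, the parameters `client_id`, `client_secret`, and `mode`, and the `Blob`, `Document`, and `BaseBlobParser` stand-ins are all hypothetical and defined locally so the example runs without LangChain or the Adobe SDK. The body does not call any remote service; it only decodes the blob as text.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Iterator


@dataclass
class Blob:
    """Local stand-in for a blob: raw bytes plus an optional source path."""
    data: bytes
    path: str = ""


@dataclass
class Document:
    """Local stand-in for a document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class BaseBlobParser(ABC):
    """Local stand-in for the base class named in the snippet above."""

    @abstractmethod
    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        ...


class SketchPDFExtractParser(BaseBlobParser):
    """Illustrative parser showing expanded argument descriptions.

    Args:
        client_id: Hypothetical credential identifying the caller to the
            extraction service. Must be a non-empty string.
        client_secret: Hypothetical secret paired with ``client_id``.
            Must be a non-empty string.
        mode: How results are grouped. ``"chunks"`` yields one Document per
            blank-line-separated block; ``"single"`` yields one Document
            containing the whole blob. Defaults to ``"chunks"``.

    Raises:
        ValueError: If a credential is empty or ``mode`` is not recognised.
    """

    def __init__(self, client_id: str, client_secret: str, mode: str = "chunks"):
        if not client_id or not client_secret:
            raise ValueError("client_id and client_secret must be non-empty")
        if mode not in ("chunks", "single"):
            raise ValueError(f"unsupported mode: {mode!r}")
        self.client_id = client_id
        self.client_secret = client_secret
        self.mode = mode

    def lazy_parse(self, blob: Blob) -> Iterator[Document]:
        # Placeholder for the remote extraction call: treat the bytes as UTF-8 text.
        text = blob.data.decode("utf-8")
        if self.mode == "single":
            yield Document(text, {"source": blob.path})
            return
        # Skip empty blocks so chunk indices stay consecutive.
        blocks = (b.strip() for b in text.split("\n\n") if b.strip())
        for index, block in enumerate(blocks):
            yield Document(block, {"source": blob.path, "chunk": index})
```

Each argument states its type expectation, its allowed values, and its default, and the docstring lists the error raised on bad input, which is roughly the level of detail the review comment asks for.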

@davemaguire

Hello! I would love to see the adobe pdf api added to langchain. What needs to be done to get this into main? Just address the following comment?

could we expand the arg descriptions here

I am happy to help get this over the line. We use adobe pdf extract extensively and would love to have this integrated in langchain.
@baskaryan @DavidMoserAI

@DavidMoserAI
Author

@davemaguire I have updated the arg descriptions and am waiting for a response from the moderators.

@davemaguire

Awesome, great work! I'm eager to see this feature in langchain
@efriis @eyurtsev @hwchase17

efriis
efriis previously approved these changes Dec 14, 2024
@dosubot dosubot bot added the lgtm PR looks good. Use to confirm that a PR is ready for merging. label Dec 14, 2024
@efriis
Member

efriis commented Dec 14, 2024

@DavidMoserAI this PR had a lot of issues to fix (wrong sdk used in docs, tests for an old Loader instead of Parser implementation), so I'm hesitant to merge it without someone testing it. if you could take a look at the docs and also screenshot using it to parse an actual PDF using the service, that would be great! Otherwise will probably close without an actual test.

If you're interested in maintaining this integration without us in the loop and publishing a higher-quality integration, we'd love to get an integration package out! Future PRs against langchain would just be {docs updates, as well as registering your package in libs/packages.yml, deprecating this community integration in favor of your integration package}

Here's the guide, and if you have questions, feel free to leave them in the comments on those pages so others can see them! https://python.langchain.com/docs/contributing/how_to/integrations/

@efriis efriis dismissed their stale review December 14, 2024 01:49

ci failing still

@efriis efriis self-assigned this Dec 14, 2024
@efriis
Member

efriis commented Dec 16, 2024

closing for now, and if you decide to pick it up again would recommend publishing externally!

@efriis efriis closed this Dec 16, 2024