community: add Docling document loader #27987

Closed

@vagenas vagenas commented Nov 8, 2024

Description

This adds a document loader for Docling, a document parsing package from IBM that parses PDF, DOCX, PPTX, HTML, and other formats into a rich, unified representation (including document layout, tables, etc.), making them ready for generative AI workflows like RAG.

Some references:

The introduced DoclingLoader enables users to:

  • use various document types in their LLM applications with ease and speed, and
  • leverage Docling's rich representation for advanced, document-native grounding.
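
For readers unfamiliar with the loader pattern, here is a minimal sketch of the general shape such a loader could take. It is not the code from this PR, which is not shown in the thread. `SketchDoclingLoader`, the stand-in `Document` dataclass, and the injected `convert` callable are all hypothetical names used for illustration. The `lazy_load`/`load` pair follows the usual LangChain loader convention, and `convert` stands in for whatever call into Docling returns the parsed text of one source.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, Iterator, List


@dataclass
class Document:
    """Stand-in for a LangChain-style document: text plus metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)


class SketchDoclingLoader:
    """Hypothetical loader: converts each source and wraps the result in a Document."""

    def __init__(self, sources: Iterable[str], convert: Callable[[str], str]):
        # `convert` is an assumption: a placeholder for the real parsing call.
        self._sources = list(sources)
        self._convert = convert

    def lazy_load(self) -> Iterator[Document]:
        # Yield one Document per source so large batches are not materialized at once.
        for source in self._sources:
            yield Document(
                page_content=self._convert(source),
                metadata={"source": source},
            )

    def load(self) -> List[Document]:
        # Eager variant: collect everything lazy_load yields.
        return list(self.lazy_load())


def fake_convert(source: str) -> str:
    # Dummy converter used only to exercise the sketch.
    return f"# Parsed content of {source}"


loader = SketchDoclingLoader(["report.pdf", "slides.pptx"], convert=fake_convert)
docs = loader.load()
```

Injecting the converter keeps the sketch runnable without Docling installed; a real loader would call the parsing library directly.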

Issue

No issue, but a discussion was initiated a couple of weeks ago: #27641

Dependencies

docling

Author

vagenas commented Nov 8, 2024

Most CI findings fixed — just two remaining points, for which some feedback from the maintainers would be appreciated:

  • Check make extended_tests #3.9: How should this best be handled given that Docling does not support Python 3.9?
  • Check extended_tests #3.11: This fails on a transitive dependency conflict between docling and another package from extended_testing_deps.txt. Do you have any suggestions for how to address this (besides attempting to relax the constraints for that transitive dependency)?
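
For context on the first point, one common way to handle a dependency that does not support an older interpreter is to skip the affected tests on that interpreter. The following is a minimal sketch using only the standard library; it is not necessarily how LangChain's CI is set up or what the maintainers would prefer. The `(3, 10)` floor, the `is_supported` helper, and the test class name are assumptions made for illustration.

```python
import sys
import unittest

# Assumption: 3.10 is the lowest Python version the optional dependency supports.
MIN_SUPPORTED = (3, 10)


def is_supported(version_info=sys.version_info) -> bool:
    # Compare only (major, minor) against the assumed floor.
    return tuple(version_info[:2]) >= MIN_SUPPORTED


@unittest.skipUnless(is_supported(), "optional dependency requires Python >= 3.10")
class DoclingLoaderTests(unittest.TestCase):
    # Hypothetical test class; it is skipped entirely on unsupported interpreters.
    def test_runs_only_on_supported_python(self):
        self.assertTrue(is_supported())
```

With this guard, a run on Python 3.9 reports the class as skipped rather than failing at import or install time.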

Signed-off-by: Panos Vagenas <[email protected]>
@vagenas vagenas force-pushed the add-docling-document-loader branch from 7b8b0e3 to 10ad0c3 on November 13, 2024 15:06
Author

vagenas commented Nov 13, 2024

Hey @ccurme, can you have a look at this Docling PR and my two questions above?

FYI I have updated the description with some further references for more context 😉

Happy to help in case of any questions!

Member

efriis commented Nov 20, 2024

Hey! This adds a community integration, which is no longer recommended. Would you be interested in publishing your own integration package, and contributing docs via this guide? https://python.langchain.com/docs/contributing/how_to/integrations/

Author

vagenas commented Nov 22, 2024

> Hey! This adds a community integration, which is no longer recommended. Would you be interested in publishing your own integration package, and contributing docs via this guide? https://python.langchain.com/docs/contributing/how_to/integrations/

Hi @efriis, I see, then we can do it like that, yes.
Will come back with a docs PR once our package is ready! 🔜

@vagenas vagenas closed this Nov 22, 2024
Labels: community (Related to langchain-community), size:XL (This PR changes 500-999 lines, ignoring generated files.)