Replies: 1 comment
Answer generated by a 🤖

Answer

I understand that you're having trouble with the OnlinePDFLoader in LangChain. Specifically, it seems to read some online PDF files but not others. You've provided two links as examples: the first works as expected, while the second returns an empty string.

The issue could be due to the structure or encoding of the specific PDF file that's causing trouble. LangChain's OnlinePDFLoader uses the UnstructuredPDFLoader to load PDF files, which in turn uses the partition_pdf function from the unstructured library.

Here's a look at the relevant code from the OnlinePDFLoader:

    class OnlinePDFLoader(BasePDFLoader):
        """Loader that loads online PDFs."""

        def load(self) -> List[Document]:
            """Load documents."""
            loader = UnstructuredPDFLoader(str(self.file_path))
            return loader.load()

And from the UnstructuredPDFLoader:

    class UnstructuredPDFLoader(UnstructuredFileLoader):
        """Loader that uses unstructured to load PDF files."""

        def _get_elements(self) -> List:
            from unstructured.partition.pdf import partition_pdf

            return partition_pdf(filename=self.file_path, **self.unstructured_kwargs)

It's also possible that the specific version of the PDF file is not supported by the PDF parsing library used by LangChain, or that there is an issue with the encoding of the PDF file.

This is just one potential explanation. Without more information or the ability to test the specific PDF files, it's difficult to provide a more specific answer. If you believe this is a bug that could impact other users, you're encouraged to make a pull request with a proposed fix.

If my explanation doesn't align with your experience, it might be worth checking whether you're using the latest version of LangChain. An update might resolve the issue you're facing.

This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
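To narrow down which of these explanations applies, it may help to separate "the URL did not return a PDF at all" from "the PDF was downloaded but no text could be extracted". The snippet below is only a rough sketch under some assumptions: `classify_download`, `all_text_empty`, and `diagnose` are hypothetical helper names (not part of LangChain), the `%PDF-` check is a heuristic, and the import path `langchain.document_loaders` matches the LangChain release discussed here and may differ in newer versions.

```python
import urllib.request
from typing import Iterable


def classify_download(data: bytes) -> str:
    """Heuristically classify raw bytes fetched from a PDF URL."""
    if not data:
        return "empty-download"
    # PDF files carry a "%PDF-" marker at (or very near) the start of the file.
    if b"%PDF-" not in data[:1024]:
        return "not-a-pdf"  # e.g. an HTML error or redirect page
    return "pdf"


def all_text_empty(page_texts: Iterable[str]) -> bool:
    """Return True when every extracted text is empty or whitespace only."""
    return all(not text.strip() for text in page_texts)


def diagnose(url: str) -> str:
    """Hypothetical helper: fetch the URL, then try OnlinePDFLoader on it."""
    with urllib.request.urlopen(url) as response:
        kind = classify_download(response.read())
    if kind != "pdf":
        return kind
    # Imported lazily so the pure helpers above work without LangChain installed.
    # Assumption: this import path is valid for the installed LangChain version.
    from langchain.document_loaders import OnlinePDFLoader

    docs = OnlinePDFLoader(url).load()
    if all_text_empty(doc.page_content for doc in docs):
        return "pdf-without-extractable-text"
    return "ok"
```

If `diagnose` reports `pdf-without-extractable-text`, one common cause (though not the only one) is a PDF made of scanned page images with no text layer, which would need OCR rather than plain text extraction.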
Here are two online PDF download links:
https://raw.githubusercontent.com/aza1200/langchain_pdf_practice/main/%EA%B5%AC%EA%B8%80_%EC%95%8C%EB%A0%89%EC%8A%A4%EB%84%B7.pdf
I use OnlinePDFLoader to read PDF files, following this documentation page: https://python.langchain.com/docs/modules/data_connection/document_loaders/how_to/pdf
When I use OnlinePDFLoader to read the PDF files, the first link can be read by this code, but the second link cannot: the result is an empty string.
What is the difference between the two links?
And why can't OnlinePDFLoader read every online PDF link?