Using WebResearchRetriever for PDFs as well as HTML #8630
Replies: 1 comment
-
Answer generated by a 🤖

Hi,

Based on the context provided, it seems like you want to modify the [...]

Here's an example of how you might do this:

    from langchain.document_loaders import AsyncPdfLoader  # Assuming this exists
    from langchain.document_transformers import Pdf2TextTransformer  # Assuming this exists

    class WebResearchRetriever(BaseRetriever):
        # ... existing code ...

        def _get_relevant_documents(
            self,
            query: str,
            *,
            run_manager: CallbackManagerForRetrieverRun,
        ) -> List[Document]:
            # ... existing code ...

            # Load, split, and add new urls to vectorstore
            if new_urls:
                # Route PDFs to the PDF loader and everything else to the HTML
                # loader, so pages without a ".html" suffix are not dropped.
                pdf_urls = [url for url in new_urls if url.lower().endswith(".pdf")]
                html_urls = [url for url in new_urls if not url.lower().endswith(".pdf")]
                html_loader = AsyncHtmlLoader(html_urls)
                pdf_loader = AsyncPdfLoader(pdf_urls)
                html2text = Html2TextTransformer()
                pdf2text = Pdf2TextTransformer()
                logger.info("Indexing new urls...")
                html_docs = html_loader.load()
                pdf_docs = pdf_loader.load()
                html_docs = list(html2text.transform_documents(html_docs))
                pdf_docs = list(pdf2text.transform_documents(pdf_docs))
                docs = html_docs + pdf_docs
                docs = self.text_splitter.split_documents(docs)
                self.vectorstore.add_documents(docs)
                self.url_database.extend(new_urls)

            # ... existing code ...

In this example, we're assuming that [...]

Please note that this is a simplified example and might need adjustments based on the actual implementation of the [...]

If these classes ([...]

I hope this helps! Let me know if you have any other questions.
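
The routing step on its own can be sketched with only the standard library. This is a minimal sketch, not LangChain code: the helper names `split_urls_by_type` and `looks_like_pdf` are hypothetical, and it assumes a PDF can be recognized either by a `.pdf` suffix on the URL path or by the `%PDF-` marker at the start of the downloaded bytes.

```python
from typing import Iterable, List, Tuple
from urllib.parse import urlparse


def split_urls_by_type(urls: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Split URLs into (pdf_urls, other_urls) by the suffix of the URL path.

    Hypothetical helper, not part of LangChain.
    """
    pdf_urls: List[str] = []
    other_urls: List[str] = []
    for url in urls:
        # Inspect only the path, so query strings and fragments are ignored.
        path = urlparse(url).path.lower()
        if path.endswith(".pdf"):
            pdf_urls.append(url)
        else:
            other_urls.append(url)
    return pdf_urls, other_urls


def looks_like_pdf(data: bytes) -> bool:
    """Return True if the payload starts with the PDF header marker.

    Hypothetical helper for URLs that serve PDFs without a ".pdf" suffix.
    """
    return data.startswith(b"%PDF-")
```

The first list could then be handed to whichever PDF-capable loader is available and the second to AsyncHtmlLoader, which would avoid feeding PDF bytes to the HTML decoder.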
This response is meant to be useful, save you time, and share context. It is not meant to be a precise solution, but rather a starting point for your own research.
-
Hi, I'm using WebResearchRetriever to do materials research. Unfortunately, the top results are often PDFs, so I get the error:
langchain.document_loaders.async_html:Failed to decode content from xyz.pdf
I've seen there is a PDF Loader, but I'm not quite sure how to get it to work with my script.
Here is what I have so far: