Example Code

```python
"""
This module contains functions to extract text from PDF files using LangChain's PyPDFLoader and PyPDFDirectoryLoader.
"""
from typing import List

from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain_core.documents import Document


def get_pdf(path_to_pdf_file: str) -> List[Document]:
    """ Get text from a PDF file using LangChain's PyPDFLoader.

    Args:
        path_to_pdf_file (str): Path to the PDF file.

    Returns:
        List[Document]: List of LangChain Document objects containing the text from the PDF file and other metadata.
            For example: If a PDF file has 5 pages, then this function will return a list of 5 Document objects.
    """
    loader = PyPDFLoader(path_to_pdf_file)
    doc = loader.load()
    return doc


def get_all_pdfs(path_to_directory: str) -> List[Document]:
    """ Get text from all PDF files in all subdirectories using LangChain's PyPDFDirectoryLoader.

    Args:
        path_to_directory (str): Path to the directory containing the PDF files.

    Returns:
        List[Document]: List of LangChain Document objects, each Document contains one page of a PDF file.
            For example: If one PDF file has 5 pages and another PDF file has 3 pages, then this function will
            return a list of 8 Document objects.
    """
    loader = PyPDFDirectoryLoader(path_to_directory)
    docs = loader.load()
    return docs


def get_unique_number_of_documents(documents: List[Document]) -> tuple:
    """ Get the paths and the number of unique documents from a list of LangChain Document objects.

    Args:
        documents (List[Document]): List of LangChain Document objects.

    Returns:
        tuple: Set of paths, number of unique documents.
    """
    unique_documents = set()
    for doc in documents:
        # Every page of a PDF carries the same "source" path, so the set collapses pages into files.
        unique_documents.add(doc.metadata["source"])
    return unique_documents, len(unique_documents)


if __name__ == '__main__':
    entry = "main"
    docs = get_all_pdfs(entry)
    unique_docs, num_unique_docs = get_unique_number_of_documents(docs)

    print(f"Number of unique documents: {num_unique_docs}")
    print(f"Entry path: {entry}", end="\n\n")
    print("Unique Documents:")
    for doc in unique_docs:
        print(doc)
```
```bash
Number of unique documents: 5
Entry path: main
Unique Documents:
main\level1\level2\level2_1.pdf
main\level1\level1_1.pdf
main\level1\level2\level3_2\level3_2_1.pdf
main\level1\level2\level3_2\level4_2\level4_2_1.pdf
main\level1\level2\level3_1\level3_1_1.pdf
```

System Info

Windows 10
Replies: 1 comment
Hello, @LarsKutz! I'm here to assist with any bugs or questions you have while we wait for a human maintainer. I can help you solve issues, answer questions, or even become a contributor. Let me know how I can assist you today!

No, it is not intended that the `PyPDFDirectoryLoader` always searches all subfolders for PDF files regardless of the `recursive` parameter. The code differentiates between recursive and non-recursive searches based on the value of the `recursive` parameter:

```python
items = p.rglob(self.glob) if self.recursive else p.glob(self.glob)
```

If `recursive` is set to `True`, it uses `p.rglob(self.glob)` to search recursively. If `recursive` is set to `False`, it uses `p.glob(self.glob)` to search no…
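
The `glob` versus `rglob` distinction can be checked with `pathlib` alone, without LangChain. The sketch below is only an illustration: `find_pdfs` and `build_demo_tree` are hypothetical helpers, the file names are made up, and the pattern `**/[!.]*.pdf` is an assumption about what the loader uses when no `glob` argument is passed (worth confirming against your installed version). What it shows is that a pattern starting with `**/` already makes `Path.glob` descend into subfolders, which would explain the output in the question even with `recursive=False`.

```python
import tempfile
from pathlib import Path
from typing import List


def find_pdfs(root: Path, pattern: str, recursive: bool) -> List[str]:
    """Mirror the selection line quoted above: rglob if recursive, else glob."""
    items = root.rglob(pattern) if recursive else root.glob(pattern)
    # De-duplicated, POSIX-style relative paths keep the result stable across platforms.
    return sorted({p.relative_to(root).as_posix() for p in items if p.is_file()})


def build_demo_tree(root: Path) -> None:
    """Create a hypothetical tree: one PDF at the top level, two in nested folders."""
    for rel in ("top.pdf", "level1/level1_1.pdf", "level1/level2/level2_1.pdf"):
        target = root / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.touch()  # empty placeholder file; only the name matters for globbing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        build_demo_tree(root)
        # Pattern starting with "**/" (assumed loader default): glob still descends.
        print(find_pdfs(root, "**/[!.]*.pdf", recursive=False))
        # Pattern without "**/": glob stays in the top-level folder.
        print(find_pdfs(root, "[!.]*.pdf", recursive=False))
        # Same flat pattern with recursive=True: rglob walks every subfolder.
        print(find_pdfs(root, "[!.]*.pdf", recursive=True))
```

On this demo tree the first and third calls list all three PDFs, while the second lists only `top.pdf`. If the assumption about the default pattern holds for your version, passing a pattern without the `**/` prefix, for example `PyPDFDirectoryLoader(path, glob="[!.]*.pdf", recursive=False)`, should limit loading to the top-level folder.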