PyPDFDirectoryLoader always searches in subfolders, although recursive=False #24759

LarsKutz · 2024-07-29T00:11:02Z


### Checked other resources

- I added a very descriptive title to this question.
- I searched the LangChain documentation with the integrated search.
- I used the GitHub search to find a similar question and didn't find it.

### Commit to Help

- I commit to help with one of those options 👆

### Example Code

```py
""" This Module contains functions to extract text from PDF files using LangChain's PyPDFLoader and PyPDFDirectoryLoader.
"""

from langchain_community.document_loaders import PyPDFLoader, PyPDFDirectoryLoader
from langchain_core.documents import Document
from typing import List


def get_pdf(path_to_pdf_file: str) -> List[Document]:
    """ Get text from a PDF file using LangChain's PyPDFLoader.

    Args:
        path_to_pdf_file (str): Path to the PDF file.

    Returns:
        List[Document]: List with LangChain Document object containing the text from the PDF file and other metadata.
        For Example: If a PDF file has 5 pages, then this function will return a list of 5 Document objects.
    """
    loader = PyPDFLoader(path_to_pdf_file)
    doc = loader.load()
    return doc


def get_all_pdfs(path_to_directory: str) -> List[Document]:
    """ Get text from all PDF files in all subdirectories using LangChain's PyPDFDirectoryLoader.

    Args:
        path_to_directory (str): Path to the directory containing the PDF files.

    Returns:
        List[Document]: List of LangChain Document objects, each Document contains one page of a PDF file.
        For Example: If one PDF file has 5 pages and another PDF file has 3 pages, then this function will
        return a list of 8 Document objects.
    """
    loader = PyPDFDirectoryLoader(path_to_directory)
    docs = loader.load()
    return docs


def get_unique_number_of_documents(documents: List[Document]) -> tuple:
    """ Get the Path and Number of unique documents from a list of LangChain Document objects.

    Args:
        documents (List[Document]): List of LangChain Document objects.

    Returns:
        tuple: Set of Paths, Number of unique documents.
    """
    unique_documents = set()
    for doc in documents:
        unique_documents.add(doc.metadata["source"])
    return unique_documents, len(unique_documents)


if __name__ == '__main__':
    entry = "main"
    docs = get_all_pdfs(entry)
    unique_docs, num_unique_docs = get_unique_number_of_documents(docs)
    print(f"Number of unique documents: {num_unique_docs}")
    print(f"Entry path: {entry}", end="\n\n")
    print("Unique Documents:")
    for doc in unique_docs:
        print(doc)
```

```bash
Number of unique documents: 5
Entry path: main

Unique Documents:
main\level1\level2\level2_1.pdf
main\level1\level1_1.pdf
main\level1\level2\level3_2\level3_2_1.pdf
main\level1\level2\level3_2\level4_2\level4_2_1.pdf
main\level1\level2\level3_1\level3_1_1.pdf
```

### Description

Is it actually intended that `PyPDFDirectoryLoader` always searches all subfolders for PDF files, regardless of whether `recursive` is set to `True` or `False`?

[documentation from PyPDFDirectoryLoader](https://api.python.langchain.com/en/latest/_modules/langchain_community/document_loaders/pdf.html#PyPDFDirectoryLoader.load)

Relevant part:

```py
...
class PyPDFDirectoryLoader(BaseLoader):
    ...
    def load(self) -> List[Document]:
        p = Path(self.path)
        docs = []
        items = p.rglob(self.glob) if self.recursive else p.glob(self.glob)
        ...
```

### System Info

- Windows 10
- Python 3.11.1
- langchain==0.2.11
- langchain-community==0.2.10
- langchain-core==0.2.24

Answered by dosubot[bot]

Jul 29, 2024


@LarsKutz · 2024-07-29T01:09:57Z

dosubot[bot]
bot Jul 29, 2024

Hello, @LarsKutz! I'm here to assist with any bugs or questions you have while we wait for a human maintainer. I can help you solve issues, answer questions, or even become a contributor. Let me know how I can assist you today!

No, it is not intended that the `PyPDFDirectoryLoader` always searches all subfolders for PDF files regardless of the `recursive` parameter. The code differentiates between recursive and non-recursive searches based on the value of the `recursive` parameter:

`items = p.rglob(self.glob) if self.recursive else p.glob(self.glob)`

If `recursive` is set to `True`, it uses `p.rglob(self.glob)` to search recursively. If `recursive` is set to `False`, it uses `p.glob(self.glob)` to search non-recursively. This behavior is consistent with the `DirectoryLoader` class in the LangChain framework, which also uses the `recursive` parameter to control whether the search is recursive or not ^[1]^[2].
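
The selection line quoted above can be tried in isolation with plain `pathlib`, which may help explain the output reported in the question. The sketch below is not the loader's code: `select_paths` and `build_demo_tree` are hypothetical helpers, and the pattern `**/[!.]*.pdf` is only an assumption about the loader's default `glob` value that should be checked against the installed `langchain-community` version. What it shows is standard `pathlib` behavior: a pattern that starts with `**/` descends into subfolders even when `Path.glob` is used instead of `Path.rglob`.

```python
import tempfile
from pathlib import Path
from typing import List


def select_paths(root: Path, pattern: str, recursive: bool) -> List[Path]:
    """Hypothetical helper mirroring the quoted line: rglob if recursive, else glob."""
    items = root.rglob(pattern) if recursive else root.glob(pattern)
    return [p for p in items if p.is_file()]


def build_demo_tree(root: Path) -> None:
    """Create empty placeholder files; only the file names matter for globbing."""
    (root / "level1" / "level2").mkdir(parents=True)
    (root / "top.pdf").touch()
    (root / "level1" / "level1_1.pdf").touch()
    (root / "level1" / "level2" / "level2_1.pdf").touch()


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    build_demo_tree(root)

    # ASSUMPTION: "**/[!.]*.pdf" stands in for the loader's default glob.
    # The leading "**/" already walks subfolders, even with recursive=False.
    with_double_star = select_paths(root, "**/[!.]*.pdf", recursive=False)

    # A pattern without "**/" stays in the top-level directory.
    top_level_only = select_paths(root, "*.pdf", recursive=False)

    # Convert to relative POSIX-style strings before the temp dir is removed.
    found_with_double_star = sorted(
        p.relative_to(root).as_posix() for p in with_double_star
    )
    found_top_level_only = sorted(
        p.relative_to(root).as_posix() for p in top_level_only
    )

print("glob('**/[!.]*.pdf'):", found_with_double_star)
print("glob('*.pdf'):       ", found_top_level_only)
```

If the installed loader does use such a default pattern, passing `glob="*.pdf"` together with `recursive=False` may restrict the search to the top-level folder; this is a suggestion to verify, not a confirmed fix.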

To continue talking to Dosu, mention @dosu.
