Allow conditional URL loading in RecursiveUrlLoader api #27006

sofeikov · 2024-09-30T19:54:08Z

Checked

I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it

Feature request

What do people think of the idea of having conditional URL loading in langchain_community.document_loaders.RecursiveUrlLoader?

For example, before recursively loading a link in this code

        for link in sub_links:
            # Check all unvisited links
            if link not in visited:
                yield from self._get_child_links_recursive(
                    link, visited, depth=depth + 1
                )

one could add a method that returns a boolean indicating whether a link should be scraped. For example, something like this:

class RecursiveUrlLoader(BaseLoader):
    def check_url(self, url: str) -> bool:
        # Default: every URL is allowed; subclasses can override this.
        return True

Motivation

This would make the class more flexible. Consider the following scenario: I use this class in a large scraping job and want it to integrate well with the LangChain ecosystem. I could subclass the loader, override the method, and do something like this:

class CustomRecursiveUrlLoader(RecursiveUrlLoader):
    def check_url(self, url: str) -> bool:
        # Load the URL only if it was not scraped recently.
        return not check_if_recently_scrapped(url)

which could reach out to a cache and see whether a URL was recently scraped in another process or during an earlier run.
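
To make the cache idea a bit more concrete, here is a minimal sketch of what could sit behind check_if_recently_scrapped. Everything in it is an assumption for illustration: the RecentlyScrapedCache class, its method names, and the TTL parameter are hypothetical and not part of LangChain, and an in-process dictionary stands in for whatever shared store (Redis, a database, a file) a real multi-process setup would need.

```python
import time
from typing import Dict, Optional


class RecentlyScrapedCache:
    """Hypothetical helper: remembers when each URL was last scraped."""

    def __init__(self, ttl_seconds: float) -> None:
        # A URL counts as "recent" for ttl_seconds after it was marked.
        self.ttl_seconds = ttl_seconds
        self._last_scraped: Dict[str, float] = {}

    def mark_scraped(self, url: str, now: Optional[float] = None) -> None:
        # `now` can be injected for testing; defaults to the wall clock.
        self._last_scraped[url] = time.time() if now is None else now

    def was_recently_scraped(self, url: str, now: Optional[float] = None) -> bool:
        scraped_at = self._last_scraped.get(url)
        if scraped_at is None:
            return False  # never seen before
        current = time.time() if now is None else now
        return (current - scraped_at) < self.ttl_seconds
```

A loader subclass could hold one of these and consult was_recently_scraped(url) inside its URL check, calling mark_scraped(url) after each successful load.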

Proposal (If applicable)

The proposal is to add a default method that always returns True. Developers can then override it for their own use cases.

The check would then require that a link is both unvisited and allowed for scraping, so the code above would become:

        for link in sub_links:
            # Follow only links that are unvisited and pass the check
            if link not in visited and self.check_url(link):
                yield from self._get_child_links_recursive(
                    link, visited, depth=depth + 1
                )
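
To show the control flow end to end without touching the network, here is a rough, self-contained sketch. It is not the real RecursiveUrlLoader: ToyRecursiveLoader, SkipArchiveLoader, and the SITE link graph are all made up for illustration, and the loader yields URLs instead of documents. Only the shape of the recursion and the placement of the proposed check mirror the snippet above.

```python
from typing import Dict, Iterator, List, Set

# Hypothetical in-memory "site": each URL maps to the links found on that page.
SITE: Dict[str, List[str]] = {
    "/": ["/docs", "/archive/2019"],
    "/docs": ["/docs/api", "/archive/2020"],
    "/archive/2019": ["/archive/2019/jan"],
}


class ToyRecursiveLoader:
    """Toy stand-in for the loader; walks a link graph instead of fetching pages."""

    def __init__(self, root: str, links: Dict[str, List[str]], max_depth: int = 3) -> None:
        self.root = root
        self.links = links
        self.max_depth = max_depth

    def check_url(self, url: str) -> bool:
        # Proposed hook: allow every URL unless a subclass says otherwise.
        return True

    def _get_child_links_recursive(
        self, url: str, visited: Set[str], depth: int = 0
    ) -> Iterator[str]:
        if depth >= self.max_depth:
            return
        visited.add(url)
        yield url
        for link in self.links.get(url, []):
            # Follow only links that are unvisited and pass the check.
            if link not in visited and self.check_url(link):
                yield from self._get_child_links_recursive(
                    link, visited, depth=depth + 1
                )

    def lazy_load(self) -> Iterator[str]:
        return self._get_child_links_recursive(self.root, set())


class SkipArchiveLoader(ToyRecursiveLoader):
    def check_url(self, url: str) -> bool:
        # Example override: never descend into archive pages.
        return "/archive/" not in url
```

With this graph, list(ToyRecursiveLoader("/", SITE).lazy_load()) visits all six pages, while list(SkipArchiveLoader("/", SITE).lazy_load()) returns only "/", "/docs", and "/docs/api". Note that in this sketch the root URL itself is never passed to check_url, which matches the proposal: only child links are filtered.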

If this is of interest and seems useful to others, I can implement a PR for that.


Replies: 0 comments
