I searched existing ideas and did not find a similar one
I added a very descriptive title
I've clearly described the feature request and motivation for it
Feature request
What do people think of the idea of having conditional URL loading in langchain_community.document_loaders.RecursiveUrlLoader?
For example, before recursively loading a link in this code
for link in sub_links:
    # Check all unvisited links
    if link not in visited:
        yield from self._get_child_links_recursive(
            link, visited, depth=depth + 1
        )
one could add a method that returns a boolean signalling whether a link should be scraped. For example, something like this:
class RecursiveUrlLoader(BaseLoader):
    def check_url(self, url):
        return True
Motivation
This would increase the flexibility of this class. Think of the following scenario: I use this class in a large scraping exercise and want a nice integration with the langchain ecosystem. I could then inherit from the class, override the method, and do something like
class CustomRecursiveUrlLoader(RecursiveUrlLoader):
    def check_url(self, url):
        # Scrape only URLs that were NOT recently scraped
        return not check_if_recently_scrapped(url)
where check_if_recently_scrapped could reach out to a cache and see whether a URL was recently scraped in another process, or during an earlier run.
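To make the cache idea a bit more concrete, here is a minimal sketch of what such a lookup might look like. Everything in it is an assumption for illustration: the `RecentScrapeCache` class, its method names, and the `ttl_seconds` parameter are hypothetical and not part of langchain. A real setup would probably use a shared store (Redis, a database) so that other processes can see the entries.

```python
import time
from typing import Dict, Optional


class RecentScrapeCache:
    """Hypothetical in-memory record of when each URL was last scraped."""

    def __init__(self, ttl_seconds: float = 3600.0) -> None:
        # URLs scraped less than ttl_seconds ago count as "recent".
        self.ttl_seconds = ttl_seconds
        self._last_scraped: Dict[str, float] = {}

    def mark_scraped(self, url: str, now: Optional[float] = None) -> None:
        # `now` can be injected for testing; defaults to the wall clock.
        self._last_scraped[url] = time.time() if now is None else now

    def was_recently_scraped(self, url: str, now: Optional[float] = None) -> bool:
        last = self._last_scraped.get(url)
        if last is None:
            # Never seen this URL, so it is not "recently scraped".
            return False
        current = time.time() if now is None else now
        return (current - last) < self.ttl_seconds
```

An overridden `check_url` could then return `not cache.was_recently_scraped(url)`, so that only stale or unseen URLs get followed.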
Proposal (If applicable)
The proposal is to implement a dummy method that always returns True by default. Developers can then override it for their custom cases.
The existing check would then require that the link is both unvisited and allowed for scraping, so the code above would look like
for link in sub_links:
    # Check all unvisited links that are also allowed by check_url
    if link not in visited and self.check_url(link):
        yield from self._get_child_links_recursive(
            link, visited, depth=depth + 1
        )
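For anyone who wants to try the pattern without touching the real loader, here is a small self-contained sketch. It is an assumption-laden toy, not the langchain implementation: `ToyRecursiveLoader`, `SkipPrivateLoader`, and the in-memory `graph` of links are all hypothetical stand-ins, and no network requests are made. Only the `check_url` name comes from the proposal above.

```python
from typing import Dict, Iterator, List, Set


class ToyRecursiveLoader:
    """Toy stand-in for a recursive loader; walks an in-memory link graph."""

    def __init__(self, graph: Dict[str, List[str]], max_depth: int = 2) -> None:
        self.graph = graph          # url -> list of links found on that page
        self.max_depth = max_depth  # pages at depth >= max_depth are skipped

    def check_url(self, url: str) -> bool:
        # Default hook: allow every URL, so behaviour is unchanged.
        return True

    def crawl(self, url: str, visited: Set[str], depth: int = 0) -> Iterator[str]:
        if depth >= self.max_depth:
            return
        visited.add(url)
        yield url
        for link in self.graph.get(url, []):
            # Follow only links that are unvisited AND allowed by the hook.
            if link not in visited and self.check_url(link):
                yield from self.crawl(link, visited, depth=depth + 1)


class SkipPrivateLoader(ToyRecursiveLoader):
    def check_url(self, url: str) -> bool:
        # Example override: never descend into /private/ pages.
        return "/private/" not in url
```

Calling `list(SkipPrivateLoader(graph).crawl(start_url, set()))` returns the reachable URLs minus anything the override rejects, while the base class keeps the current behaviour.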
If this is of interest and seems useful to others, I can implement a PR for that.