-
Hi, is there a way to prevent crawling domains when a URL redirects to a different URL? Right now, if the crawler hits a URL that redirects to a different domain, it proceeds to crawl that domain as well, even when using an enqueue strategy. For example, www.somelink.com/github redirects to their GitHub profile, which then leads to crawling every URL on the page, which leads to endless crawling of GitHub.

My code:

```python
import asyncio

from crawlee import EnqueueStrategy
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()
    urls_found = []

    # Define a request handler and attach it to the crawler using the decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        url = context.request.url
        context.log.info(f"On URL: {url}")
        urls_found.append(url)
        await context.enqueue_links(strategy=EnqueueStrategy.SAME_ORIGIN)

    await crawler.run(["some url here"])
    print(f"Found {len(urls_found)} URLs")

asyncio.run(main())
```

Thanks in advance
Replies: 2 comments
-
I made a workaround by putting enqueue_links in a conditional block so that it only starts crawling if the domain is the same. It still redirects to URLs outside of the original domain, but it doesn't start crawling any URLs under different domains, which is sufficient.
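For reference, here is a minimal sketch of that conditional. It assumes a hypothetical start domain of `www.somelink.com` and that `context.request.loaded_url` holds the final post-redirect URL (falling back to the enqueued URL if it is unset); the domain comparison uses the standard-library `urllib.parse`:

```python
from urllib.parse import urlparse

from crawlee import EnqueueStrategy
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

# Hypothetical start domain, for illustration only.
START_DOMAIN = "www.somelink.com"

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # loaded_url reflects the URL after any redirects, while
    # request.url is the URL that was originally enqueued.
    final_url = context.request.loaded_url or context.request.url
    # Only enqueue links if the page we actually landed on is still
    # under the original domain; redirected-away pages enqueue nothing.
    if urlparse(final_url).netloc == START_DOMAIN:
        await context.enqueue_links(strategy=EnqueueStrategy.SAME_ORIGIN)
```

This way the redirect target itself is still visited once, but none of its links are enqueued, which matches the behavior described above.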
-
Hello @jkumz, I added a test that should reproduce the bug you're reporting, but it passes without any changes to the code (#873). Could you share the URL that triggers the behavior you're observing?