-
Hi, is there a way to prevent crawling domains when a URL redirects to a different URL? Right now, if the crawler hits a URL that redirects to a different domain, it proceeds to crawl that domain as well, even when using an enqueue strategy. For example, www.somelink.com/github redirects to their GitHub profile, which then leads to crawling every URL on the page, which leads to endless crawling of GitHub.

My code:

```python
import asyncio

from crawlee import EnqueueStrategy
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

async def main() -> None:
    crawler = PlaywrightCrawler()
    urls_found = []

    # Define a request handler and attach it to the crawler using the decorator.
    @crawler.router.default_handler
    async def request_handler(context: PlaywrightCrawlingContext) -> None:
        url = context.request.url
        context.log.info(f"On URL: {url}")
        urls_found.append(url)
        await context.enqueue_links(strategy=EnqueueStrategy.SAME_ORIGIN)

    await crawler.run(["some url here"])
    print(f"Found {len(urls_found)} URLs")

asyncio.run(main())
```

Thanks in advance
Replies: 2 comments
-
I made a workaround by putting enqueue_links in a conditional block so that it only starts crawling if the domain is the same. It still redirects to URLs outside of the original domain, but it doesn't start crawling any URLs under different domains, which is sufficient.
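For reference, here is a minimal sketch of that conditional. It assumes a hypothetical start domain of `www.somelink.com` and that `context.request.loaded_url` holds the final post-redirect URL (falling back to the enqueued URL if it is unset); the domain comparison uses the standard-library `urllib.parse`:

```python
from urllib.parse import urlparse

from crawlee import EnqueueStrategy
from crawlee.playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext

# Hypothetical start domain, for illustration only.
START_DOMAIN = "www.somelink.com"

crawler = PlaywrightCrawler()

@crawler.router.default_handler
async def request_handler(context: PlaywrightCrawlingContext) -> None:
    # loaded_url reflects the URL after any redirects, while
    # request.url is the URL that was originally enqueued.
    final_url = context.request.loaded_url or context.request.url
    # Only enqueue links if the page we actually landed on is still
    # under the original domain; redirected-away pages enqueue nothing.
    if urlparse(final_url).netloc == START_DOMAIN:
        await context.enqueue_links(strategy=EnqueueStrategy.SAME_ORIGIN)
```

This way the redirect target itself is still visited once, but none of its links are enqueued, which matches the behavior described above.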
-
Hello @jkumz, I added a test that should reproduce the bug you're reporting, but it passes without any changes to the code (#873). Could you share the URL that triggers the behavior you're observing?