
docs: Fix timeouts and retry limit exceeded in interim introduction example #2791

Merged · 2 commits merged into apify:master on Jan 7, 2025

Conversation

webchick (Contributor) commented Jan 6, 2025

I'm not sure if this is the correct fix, but it does fix the problem, and it uses similar logic to the final script on this page, which is working properly.

(It also works if you simply comment out the `await page.waitForSelector('.collection-block-item');` line, but I assume that's there for a reason.)
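For illustration, the shape of the fix is roughly the following sketch: wait for the category-listing selector only on the start page, where it actually exists, instead of on every request. The selectors and start URL are the ones used in the Crawlee introduction tutorial; this is not the exact diff in this PR.

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        if (request.label === 'CATEGORY') {
            // Category listing page: the product links exist here,
            // so this wait can succeed.
            await page.waitForSelector('.product-item > a');
            await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL' });
        } else if (!request.label) {
            // Start page: '.collection-block-item' exists only here. The old
            // snippet waited for it on every request, which is what produced
            // the 31 timeouts shown in the old output below.
            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY' });
        }
        // DETAIL requests: nothing to do yet at this point in the tutorial.
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```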

Old output:

INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":31,"retryHistogram":[1,null,null,31],"requestAvgFailedDurationMillis":30359,"requestAvgFinishedDurationMillis":1056,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":6,"requestTotalDurationMillis":942186,"requestsTotal":32,"crawlerRuntimeMillis":291681}
INFO  PlaywrightCrawler: Error analysis: {"totalErrors":31,"uniqueErrors":1,"mostCommonErrors":["31x: page.waitForSelector: Timeout 30000ms exceeded. (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20)"]}
INFO  PlaywrightCrawler: Finished! Total 32 requests: 1 succeeded, 31 failed. {"terminal":true}

New output:

INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":32,"requestsFailed":0,"retryHistogram":[32],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":302,"requestsFinishedPerMinute":340,"requestsFailedPerMinute":0,"requestTotalDurationMillis":9677,"requestsTotal":32,"crawlerRuntimeMillis":5644}
INFO  PlaywrightCrawler: Finished! Total 32 requests: 32 succeeded, 0 failed. {"terminal":true}

Closes #2790

webchick changed the title from "#2790: Fix timeouts and retry limit exceeded in interim introduction example" to "docs: Fix timeouts and retry limit exceeded in interim introduction example" on Jan 6, 2025
janbuchar requested a review from B4nan on January 6, 2025
barjin (Contributor) commented Jan 7, 2025

Thanks for the PR!

The code example is indeed incorrect and fails with timeouts. Regarding the changes - they seem fine to me, but this also looks like a job for the Router. With the Router class, Crawlee provides an idiomatic way of splitting the crawler behavior based on request labels. Given the current structure of the code (simple conditionals with predicates depending solely on the request labels), the switch should be quite painless.
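For illustration, a Router-based version of the same logic might look roughly like this (a sketch reusing the tutorial's selectors and start URL; not code from this PR):

```js
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// The default handler runs for requests without a label, i.e. the start page.
router.addDefaultHandler(async ({ page, enqueueLinks }) => {
    await page.waitForSelector('.collection-block-item');
    await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY' });
});

// This handler runs only for requests enqueued with the CATEGORY label.
router.addHandler('CATEGORY', async ({ page, enqueueLinks }) => {
    await page.waitForSelector('.product-item > a');
    await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL' });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```

Each labeled handler replaces one branch of the conditional, so the per-page logic stays separated without manual if/else chains.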

Wdyt @B4nan - is there a reason against using Router in the examples? Perhaps beginner friendliness? Or are the examples so ancient we didn't have Routers back then? :)

B4nan (Member) commented Jan 7, 2025

The guide itself is indeed older than the router (older than crawlee actually). Note that we cover the router in a later section: https://crawlee.dev/docs/introduction/refactoring

Let's just fix the code snippet now and leave a bigger rewrite of this for later (together with @TC-MO).

Note that the changes you made should also be applied at least to the latest stable docs snapshot. We also have an e2e test for the final code of the tutorial, but I guess that one still passes, since we use the router there.

webchick (Contributor, Author) commented Jan 7, 2025

Ah, missed the versioned_docs folder. Thanks, I believe this is now fixed.

FWIW from an "end user" POV... I really enjoyed this tutorial a lot, and appreciated that it starts with simple code and gradually builds more functionality into it vs. introducing concepts like Router before I even know what I am doing yet. :D

OTOH, the resulting code ends up pretty clunky in this interim step, so it might indeed be best to introduce it sooner when you are in the process of updating these docs.

Thanks for your consideration!

B4nan (Member) commented Jan 7, 2025

Yeah, I would still describe things without the router first, but we could introduce it right after that, instead of having a separate refactoring section about it.

B4nan (Member) left a review comment:

thanks!

B4nan merged commit 1ea95f7 into apify:master on Jan 7, 2025. 10 checks passed.
Linked issue: "Crawling the listing pages" example results in numerous timeouts / max retries exceeded (#2790)