
docs: Fix timeouts and retry limit exceeded in interim introduction example #2791

Merged · 2 commits merged into apify:master on Jan 7, 2025

Conversation

webchick (Contributor) commented Jan 6, 2025

I'm not sure if this is the correct fix, but it does fix the problem, and it uses similar logic to the final script on this page, which is working properly.

(It also works if you simply comment out the `await page.waitForSelector('.collection-block-item');` line, but I assume that's there for a reason.)
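For illustration, the shape of the fix is roughly the following sketch: wait for the category-listing selector only on the start page, where it actually exists, instead of on every request. The selectors and start URL are the ones used in the Crawlee introduction tutorial; this is not the exact diff in this PR.

```js
import { PlaywrightCrawler } from 'crawlee';

const crawler = new PlaywrightCrawler({
    requestHandler: async ({ page, request, enqueueLinks }) => {
        if (request.label === 'CATEGORY') {
            // Category listing page: the product links exist here,
            // so this wait can succeed.
            await page.waitForSelector('.product-item > a');
            await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL' });
        } else if (!request.label) {
            // Start page: '.collection-block-item' exists only here. The old
            // snippet waited for it on every request, which is what produced
            // the 31 timeouts shown in the old output below.
            await page.waitForSelector('.collection-block-item');
            await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY' });
        }
        // DETAIL requests: nothing to do yet at this point in the tutorial.
    },
});

await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```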

Old output:

INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":1,"requestsFailed":31,"retryHistogram":[1,null,null,31],"requestAvgFailedDurationMillis":30359,"requestAvgFinishedDurationMillis":1056,"requestsFinishedPerMinute":0,"requestsFailedPerMinute":6,"requestTotalDurationMillis":942186,"requestsTotal":32,"crawlerRuntimeMillis":291681}
INFO  PlaywrightCrawler: Error analysis: {"totalErrors":31,"uniqueErrors":1,"mostCommonErrors":["31x: page.waitForSelector: Timeout 30000ms exceeded. (/Users/webchick/TechAround/fun-with-scraping/src/main.js:8:20)"]}
INFO  PlaywrightCrawler: Finished! Total 32 requests: 1 succeeded, 31 failed. {"terminal":true}

New output:

INFO  PlaywrightCrawler: All requests from the queue have been processed, the crawler will shut down.
INFO  PlaywrightCrawler: Final request statistics: {"requestsFinished":32,"requestsFailed":0,"retryHistogram":[32],"requestAvgFailedDurationMillis":null,"requestAvgFinishedDurationMillis":302,"requestsFinishedPerMinute":340,"requestsFailedPerMinute":0,"requestTotalDurationMillis":9677,"requestsTotal":32,"crawlerRuntimeMillis":5644}
INFO  PlaywrightCrawler: Finished! Total 32 requests: 32 succeeded, 0 failed. {"terminal":true}

Closes #2790

webchick changed the title from "#2790: Fix timeouts and retry limit exceeded in interim introduction example" to "docs: Fix timeouts and retry limit exceeded in interim introduction example" on Jan 6, 2025
janbuchar requested a review from B4nan on January 6, 2025
barjin (Contributor) commented Jan 7, 2025

Thanks for the PR!

The code example is indeed incorrect and fails with timeouts. Regarding the changes - they seem fine to me, but this also looks like a job for the Router. With the Router class, Crawlee provides an idiomatic way of splitting the crawler behavior based on request labels. Given the current structure of the code (simple conditionals with predicates depending solely on the request labels), the switch should be quite painless.
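For illustration, a Router-based version of the same logic might look roughly like this (a sketch reusing the tutorial's selectors and start URL; not code from this PR):

```js
import { PlaywrightCrawler, createPlaywrightRouter } from 'crawlee';

const router = createPlaywrightRouter();

// The default handler runs for requests without a label, i.e. the start page.
router.addDefaultHandler(async ({ page, enqueueLinks }) => {
    await page.waitForSelector('.collection-block-item');
    await enqueueLinks({ selector: '.collection-block-item', label: 'CATEGORY' });
});

// This handler runs only for requests enqueued with the CATEGORY label.
router.addHandler('CATEGORY', async ({ page, enqueueLinks }) => {
    await page.waitForSelector('.product-item > a');
    await enqueueLinks({ selector: '.product-item > a', label: 'DETAIL' });
});

const crawler = new PlaywrightCrawler({ requestHandler: router });
await crawler.run(['https://warehouse-theme-metal.myshopify.com/collections']);
```

Each labeled handler replaces one branch of the conditional, so the per-page logic stays separated without manual if/else chains.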

Wdyt @B4nan - is there a reason against using Router in the examples? Perhaps beginner friendliness? Or are the examples so ancient we didn't have Routers back then? :)

B4nan (Member) commented Jan 7, 2025

The guide itself is indeed older than the router (older than crawlee actually). Note that we cover the router in a later section: https://crawlee.dev/docs/introduction/refactoring

Let's just fix the code snippet now and leave a bigger rewrite of this for later (together with @TC-MO).

Note that the changes you made should also be applied at least to the latest stable docs snapshot. We also have an e2e test for the final code of the tutorial, but I guess that one still passes, since we use the router there.

webchick (Contributor, Author) commented Jan 7, 2025

Ah, missed the versioned_docs folder. Thanks, I believe this is now fixed.

FWIW from an "end user" POV... I really enjoyed this tutorial a lot, and appreciated that it starts with simple code and gradually builds more functionality into it vs. introducing concepts like Router before I even know what I am doing yet. :D

OTOH, the resulting code ends up pretty clunky in this interim step, so it might indeed be best to introduce it sooner when you are in the process of updating these docs.

Thanks for your consideration!

B4nan (Member) commented Jan 7, 2025

Yeah, I would still describe things without the router first, but we could introduce it right after that, instead of having a separate refactoring section about it.

B4nan (Member) left a review comment:

thanks!

B4nan merged commit 1ea95f7 into apify:master on Jan 7, 2025. 10 checks passed.
Linked issue: "Crawling the listing pages" example results in numerous timeouts / max retries exceeded (#2790)