Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[Webcrawler] Handle websites taking long to crawl (#5757)
* [Webcrawler] Handle websites taking long to crawl

Description
---
Some websites take a long time to crawl. This becomes a problem when a crawl exceeds 2 hours, which is the activity limit for our crawl: the crawl is silently aborted and retried an hour later, and this repeats 15 times (about 2 days) before we get an 'activity timeout' monitor. See the example in issue dust-tt/tasks#883.

This PR fixes dust-tt/tasks#883. It:
1. clarifies the situation by raising a panic flag when the issue is clearly that the website takes too long to crawl, so we don't uselessly crawl the same pages for 2 days before seeing an activity timeout (which is, incidentally, less clear than "website takes too long to crawl");
2. moves the timeout to 4 hours, which seems acceptable for slow websites with big pages (roughly 2 pages per minute is the slowest tolerated rate for a 512-page crawl);
3. decreases max-requests-per-minute.

Risk
---
na

Deploy
---
- deploy connectors
- Update eng runner runbook

* loglevel off
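The time budget and the explicit "too long to crawl" signal described in the commit message can be illustrated with a minimal sketch. This is not the code from the diff: `CrawlTooSlowError`, `minPagesPerMinute`, and `assertWithinBudget` are hypothetical names, and the only values taken from the message are the 4-hour limit and the 512-page crawl size. With those numbers the slowest tolerated average rate works out to about 2.13 pages per minute, consistent with the "2 pages per minute" figure above.

```typescript
// Hypothetical sketch; names are illustrative and not taken from the commit's diff.
const MAX_PAGES = 512; // crawl size quoted in the commit message
const CRAWL_BUDGET_MS = 4 * 60 * 60 * 1000; // 4-hour limit quoted in the commit message

// Dedicated error so a slow website is reported explicitly
// instead of surfacing later as a generic activity timeout.
class CrawlTooSlowError extends Error {
  pagesCrawled: number;
  elapsedMs: number;

  constructor(pagesCrawled: number, elapsedMs: number) {
    const minutes = Math.round(elapsedMs / 60000);
    super(`Website takes too long to crawl: ${pagesCrawled} pages in ${minutes} minutes`);
    this.name = "CrawlTooSlowError";
    this.pagesCrawled = pagesCrawled;
    this.elapsedMs = elapsedMs;
  }
}

// Slowest average rate (pages per minute) that still finishes
// `maxPages` pages within `budgetMs` milliseconds.
function minPagesPerMinute(maxPages: number, budgetMs: number): number {
  return maxPages / (budgetMs / 60000);
}

// Throws once the elapsed time reaches the budget; returns normally otherwise.
function assertWithinBudget(
  startedAtMs: number,
  nowMs: number,
  pagesCrawled: number,
  budgetMs: number = CRAWL_BUDGET_MS
): void {
  const elapsedMs = nowMs - startedAtMs;
  if (elapsedMs >= budgetMs) {
    throw new CrawlTooSlowError(pagesCrawled, elapsedMs);
  }
}

// 512 pages over 240 minutes is roughly 2.13 pages per minute.
console.log(minPagesPerMinute(MAX_PAGES, CRAWL_BUDGET_MS).toFixed(2));
```

A crawler loop could call `assertWithinBudget` after each page so the failure is raised from inside the crawl, with a message that names the actual cause.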
1 parent e8581a9, commit d7591b9
Showing 2 changed files with 31 additions and 5 deletions.