Crawlee timing out only when deployed via docker #2763
Hello @klvs and thank you for your interest in Crawlee! Please provide the code and the Dockerfile so that we can investigate further. Also, I understand that you run the Docker image in AWS. Which AWS product do you use for that? An EC2 instance running docker? ECS? EKS? Lambda?
@janbuchar Sure, here's crawler.js and the Dockerfile. I moved everything into crawler.js for simplicity's sake and had to remove some company-relevant information. The crawler is kicked off from an SQS queue listener, but I didn't include that; instead I just put in a long …. It's based off of …. The AWS product used is EKS. Thanks a bunch for looking into it. I'm really at my wit's end. It works fine locally on my macbook or locally in docker.
Okay, I figured this out. Oh boy, was this a doozy. First of all, this was not an issue with Crawlee (sorry guys, it was never your fault); it is an issue with puppeteer. To reiterate, the issue was that sometimes, for some websites, the crawler would fail with the navigation timeout from the issue description ("Navigation timed out after 60 seconds"). After being unable to figure this out, I gave up and rewrote my own crawler with pure puppeteer. It seemed to be working, but eventually I ran into a very similar issue: puppeteer would also time out, just with a slightly different error message. So if you're running into this, one thing you can do is just set the timeout to infinity (see the sketch just below).
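For reference, a minimal sketch of what disabling the navigation timeout can look like in plain puppeteer; the URL and wait condition are placeholders, not taken from the original crawler:

```js
import puppeteer from 'puppeteer';

const browser = await puppeteer.launch();
const page = await browser.newPage();

// 0 disables puppeteer's navigation timeout entirely (the default is 30 seconds).
page.setDefaultNavigationTimeout(0);

// 'https://example.com' stands in for whatever page the crawler visits.
await page.goto('https://example.com', { waitUntil: 'networkidle2' });

await browser.close();
```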
I don't fully have the answers to this, but I do have a fix. So, debugging starts: what I was trying to fix was that each page on a particular site would consistently take a couple of minutes, which resulted in multi-hour crawls. I started debugging my puppeteer (infinite-timeout) crawler with ….
For some reason, a request to Google Fonts was timing out and taking a couple of minutes. I think the reason it's failing is that I'm setting cookies for the site I'm crawling and maybe they're also being set on the googleapis request? Not sure. So I decided to filter out any requests that weren't on the domain I was crawling. Put this somewhere it runs before navigation:

```js
page.setDefaultNavigationTimeout(0); // maybe don't need this

await page.setRequestInterception(true);
page.on('request', (req) => {
    if (!req.isInterceptResolutionHandled()) {
        const url = new URL(req.url());
        // Abort anything that isn't on the crawled domain.
        if (url.hostname !== domain) {
            return req.abort();
        }
        return req.continue();
    }
});
```

This also works for plain puppeteer, by the way. It might not work for everyone: you could definitely break things by filtering out all non-same-domain requests, but it does work for my use case, so it might not fix it for you.

I spent a lot of time googling and saw a LOT of people trying to debug a navigation timeout or figure out "why does puppeteer fail when deployed on a server but succeed locally?" That question I'm not 100% sure on. Here's what I suspect: the default TCP keepalive time for a lot of Linux distros is really high; I think 2 hours is a common one. If the browser has a request that's in a going-to-time-out state, it can take well over a minute (or is 30s the default for pptr?) before it gives up. Locally, that default is probably set to something more sane, and I think even docker will inherit it. Obviously chrome/chromium has its own timeouts, but at least in my case the effective default timeout differed between my local environment and the deployed one.

Anyways, I am overjoyed to be unstuck on this. I just had to post this comment in case anyone else is suffering as I was. Hope it helps. Good luck!
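If the same filtering needs to live inside a Crawlee crawler, one place it could go is a pre-navigation hook; a sketch under that assumption, with `domain` standing in for whatever site is being crawled:

```js
import { PuppeteerCrawler } from 'crawlee';

const domain = 'docs.example.com'; // placeholder for the crawled site

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            page.setDefaultNavigationTimeout(0);
            await page.setRequestInterception(true);
            page.on('request', (req) => {
                if (req.isInterceptResolutionHandled()) return;
                const url = new URL(req.url());
                // Drop every request that is not on the crawled domain.
                if (url.hostname !== domain) {
                    req.abort();
                } else {
                    req.continue();
                }
            });
        },
    ],
    async requestHandler({ request, page }) {
        // ... existing page handling goes here
    },
});

await crawler.run([`https://${domain}/`]);
```

As noted above, same-domain filtering can break pages that rely on third-party assets, so treat it as a workaround rather than a general fix.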
@klvs wow, I'm glad you were able to solve it and thanks for the detailed summary!
Which package is this bug report for? If unsure which one to select, leave blank
None
Issue description
I'm using crawlee 3.9.1 on the apify/actor-node-puppeteer-chrome:20 docker image and running into a very bizarre bug. I don't expect to get very far here because there's no way I can provide a reproduction, but this is my last-ditch effort to fix this before I have to abandon using crawlee altogether.
This crawler is intended for crawling internal documentation (a cookie is set). It's then deployed in a docker container.
It generally works fine, except that for specific pages the request will time out, failing first with this warning:
WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds.
it will then re-queue the request and eventually fail with:
ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds.
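For context, the 60-second limit and the retry behaviour in those messages map to crawler options; a sketch of raising them while investigating (the values shown are illustrative, not the reporter's actual settings):

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    // Per-navigation timeout behind the "timed out after 60 seconds" message.
    navigationTimeoutSecs: 120,
    // How many times a failed request is reclaimed before the final ERROR.
    maxRequestRetries: 5,
    async requestHandler({ page, request }) {
        // ... page handling
    },
});
```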
I can provide code, but I don't think it will help since this works fine when run locally in docker. Locally, I can use the exact same cookie that it fails with when deployed in docker, and it has no problem crawling sites. There's nothing particularly unique about the pages it fails on and, again, it runs just fine when run locally in docker (or not in docker).
This seemed a lot like the issues described here, except that puppeteer does not fail altogether; it just fails on certain pages.
I've tried passing the --disable-gpu flag as well, to no avail.
Any tips on how to debug this? I cannot reproduce it locally. I wrote a quick and dirty crawler using just puppeteer which doesn't suffer from the same issue. I'd like to use crawlee, but I've spent a LOT of time trying to figure out why some pages time out only when deployed.
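For reference, a sketch of how a flag like --disable-gpu is typically passed to the browser in a PuppeteerCrawler; the actual setup in the failing crawler may differ:

```js
import { PuppeteerCrawler } from 'crawlee';

const crawler = new PuppeteerCrawler({
    launchContext: {
        launchOptions: {
            // Extra Chromium flags; --disable-gpu is the one mentioned above.
            args: ['--disable-gpu'],
        },
    },
    async requestHandler({ page, request }) {
        // ... page handling
    },
});
```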
I've tried different images, puppeteer versions, crawlee versions, browser versions, you name it. It only fails for certain pages, when deployed, in docker, on AWS.
Code sample
No response
Package version
3.9.1
Node.js version
20.18.1
Operating system
Linux
Apify platform
I have tested this on the next release (3.12.1-beta.19)
Other context
No response