Crawlee timing out only when deployed via docker #2763

Closed
klvs opened this issue Dec 4, 2024 · 4 comments

Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

klvs commented Dec 4, 2024

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

I'm using crawlee 3.9.1 on the apify/actor-node-puppeteer-chrome:20 docker image and running into a very bizarre bug. I don't expect to get very far here because I have no way to provide a reproduction, but this is my last-ditch effort before I have to abandon Crawlee altogether:

This crawler is intended for crawling internal documentation (a cookie is set) and is deployed in a Docker container.

It generally works fine, but on specific pages the request times out, failing first with this warning:

WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds.

It then re-queues the request and eventually fails with:

ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds.

I can provide code, but I don't think it will help, since the same code works fine when run locally in Docker. Using the exact same cookie that fails when deployed in Docker, the crawler has no problem crawling sites locally. There's nothing particularly unique about the pages it fails on and, again, it runs just fine locally (in Docker or not).

This seemed a lot like the issues described here, except Puppeteer does not fail altogether; it just fails on certain pages.

I've tried passing the --disable-gpu flag as well to no avail.

Any tips on how to debug this? I cannot reproduce it locally. I wrote a quick-and-dirty crawler using just Puppeteer, which doesn't suffer from the same issue. I'd like to use Crawlee, but I've spent a LOT of time trying to figure out why some pages time out only when deployed.

I've tried different images, puppeteer versions, crawlee versions, browser versions, you name it. It only fails for certain pages, when deployed, in docker, on AWS.

Code sample

No response

Package version

3.9.1

Node.js version

20.18.1

Operating system

Linux

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

3.12.1-beta.19

Other context

No response

@klvs klvs added the bug Something isn't working. label Dec 4, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Dec 4, 2024
@janbuchar
Contributor

Hello @klvs, and thank you for your interest in Crawlee! Please provide the code and the Dockerfile so that we can investigate further.

Also, I understand that you run the Docker image in AWS. Which AWS product do you use for that? An EC2 instance running Docker? ECS? EKS? Lambda?


klvs commented Dec 7, 2024

@janbuchar Sure, here's crawler.js and the Dockerfile

I moved everything into crawler.js for simplicity's sake and had to remove some company-relevant information. The crawler is kicked off from an SQS queue listener, but I didn't include that; instead I just put in a long sleep so that the Docker container doesn't shut down. You can start the container, docker exec in, cd /src, and then run node crawler.js <url>.

The DEV env variable is just for testing it locally.

It's based on meilisearch/scrapix.

The AWS product used is EKS.

Thanks a bunch for looking into it. I'm really at my wit's end. It works fine locally on my MacBook, or locally in Docker.


klvs commented Dec 12, 2024

Okay, I figured this out. Oh boy was this a doozy.

First of all, this was not an issue with Crawlee (sorry guys, it was never your fault). It is an issue with puppeteer.

To reiterate: sometimes, for some websites, the crawler would fail with ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. This would ONLY occur when I deployed my crawler in Docker, in production; never locally, not even locally in Docker.

After failing to figure this out, I gave up and rewrote my crawler with pure Puppeteer. It seemed to be working, but eventually I ran into a very similar issue: Puppeteer would also time out, just with a slightly different error: TimeoutError: Navigation timeout of 30000 ms exceeded.

So if you're running into this, one thing you can do is disable the timeout entirely with page.setDefaultNavigationTimeout(0) (a quick sketch follows the list below). But wait! A request should really never take a full minute to complete. Something is definitely very wrong there. Plus, there are still some unanswered questions here:

  1. Why is a request taking over 30 seconds to complete?
  2. Why does this only happen in Docker when deployed, and not locally?
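
For reference, here's what the stopgap looks like: a minimal sketch, assuming a Puppeteer page object is already in scope.

// Stopgap only: 0 disables the navigation timeout entirely,
// so a hung request will wait forever instead of erroring out.
page.setDefaultNavigationTimeout(0);

// Or per navigation:
await page.goto(url, { timeout: 0 });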

I don't fully have the answers to this, but I do have a fix.

So debugging starts. What I was trying to fix: each page on one particular site consistently took a couple of minutes, which resulted in multi-hour crawls. I started debugging my Puppeteer (infinite-timeout) crawler with env DEBUG="puppeteer:*" node crawler/index.js, and noticed that around the time Puppeteer would resolve (a couple of minutes in), this debug message appeared:

puppeteer:protocol:RECV ◀ '{"method":"Log.entryAdded","params":{"entry":{"source":"network","level":"error","text":"Failed to load resource: net::ERR_TIMED_OUT","timestamp":1.733975591266968e+12,"url":"https://fonts.googleapis.com/css2?family=Lato:wght@100;300;400;700;900&family=Raleway:wght@100;200;300;400;500;600;700;800;900&display=swap","networkRequestId":"14306.186"}},"sessionId":"4C8B0C990FAB3A60FBE0C43D87827537"}'

For some reason, a request to Google Fonts was timing out and taking a couple of minutes. I think it fails because I'm setting cookies for the site I'm crawling, and maybe they're also being sent on the googleapis request? Not sure.

So I decided to filter out any requests that weren't on the domain I was crawling. I put this in Crawlee's preNavigationHooks:

// domain is assumed to be defined elsewhere (the hostname being crawled)
page.setDefaultNavigationTimeout(0); // maybe don't need this
await page.setRequestInterception(true);
page.on('request', (req) => {
    // Skip requests another handler has already resolved
    if (req.isInterceptResolutionHandled()) return;

    // Abort anything that isn't on the domain being crawled
    const url = new URL(req.url());
    if (url.hostname !== domain) {
        return req.abort();
    }
    return req.continue();
});
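
For context, here's roughly how that hook slots into a PuppeteerCrawler. This is a minimal sketch, not my exact production setup; domain and the URLs are stand-ins:

import { PuppeteerCrawler } from 'crawlee';

const domain = 'docs.example.com'; // stand-in for the internal docs host

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            await page.setRequestInterception(true);
            page.on('request', (req) => {
                if (req.isInterceptResolutionHandled()) return;
                // Drop third-party requests (fonts, analytics, ...) that can hang
                const url = new URL(req.url());
                if (url.hostname !== domain) return req.abort();
                return req.continue();
            });
        },
    ],
    requestHandler: async ({ page, request }) => {
        // ...extract whatever you need from the page here
    },
});

await crawler.run([`https://${domain}/`]);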

This also works with just Puppeteer, by the way (sketch below). It might not work for everyone: you can definitely break pages by filtering out all non-same-domain requests, but it does work for my use case. I spent a lot of time googling and saw a LOT of people trying to debug a navigation timeout or figure out "why does puppeteer fail when deployed on a server but succeed locally?"
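
Here's the plain-Puppeteer equivalent as a minimal end-to-end sketch; example.com and the waitUntil choice are placeholders, not from my actual crawler:

import puppeteer from 'puppeteer';

const domain = 'example.com'; // stand-in for the site being crawled

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.setRequestInterception(true);
page.on('request', (req) => {
    if (req.isInterceptResolutionHandled()) return;
    const url = new URL(req.url());
    if (url.hostname !== domain) return req.abort();
    return req.continue();
});

await page.goto(`https://${domain}/`, { waitUntil: 'networkidle2' });
// ...scrape...
await browser.close();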

As for why it only fails when deployed, I'm not 100% sure. Here's what I suspect: the default TCP keepalive time on a lot of Linux distros is really high; I think 2 hours is a common one. If the browser has a request that is headed for a timeout, it can hang for well over a minute (30 s is the Puppeteer default, I believe). That keepalive is probably set to something saner locally, and I think even Docker inherits the host's setting. Obviously Chrome/Chromium has its own timeouts, but at least in my case, the effective timeout differed between my local environment and the deployed one.

Anyway, I am overjoyed to be unstuck on this. I just had to post this comment in case anyone else is suffering as I was. Hope it helps. Good luck!

@klvs klvs closed this as completed Dec 12, 2024
@janbuchar
Contributor

@klvs wow, I'm glad you were able to solve it and thanks for the detailed summary!
