Crawlee timing out only when deployed via docker #2763

Closed
klvs opened this issue Dec 4, 2024 · 4 comments

Labels
bug Something isn't working. t-tooling Issues with this label are in the ownership of the tooling team.

klvs commented Dec 4, 2024

Which package is this bug report for? If unsure which one to select, leave blank

None

Issue description

I'm using crawlee 3.9.1 on the apify/actor-node-puppeteer-chrome:20 docker image and running into a very bizarre bug. I don't expect to get very far here because I have no way to provide a reproduction, but this is my last-ditch effort before I have to abandon Crawlee altogether:

This crawler is intended for crawling internal documentation (a cookie is set) and is deployed in a Docker container.

It generally works fine, but on specific pages the request times out, failing first with this warning:

WARN PuppeteerCrawler: Reclaiming failed request back to the list or queue. Navigation timed out after 60 seconds.

It then re-queues the request and eventually fails with:

ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds.

I can provide code, but I don't think it will help, since the same code works fine when run locally in Docker. Using the exact same cookie that fails when deployed in Docker, the crawler has no problem crawling sites locally. There's nothing particularly unique about the pages it fails on and, again, it runs just fine locally (in Docker or not).

This seemed a lot like the issues described here, except Puppeteer does not fail altogether; it just fails on certain pages.

I've tried passing the --disable-gpu flag as well to no avail.

Any tips on how to debug this? I cannot reproduce it locally. I wrote a quick-and-dirty crawler using just Puppeteer, which doesn't suffer from the same issue. I'd like to use Crawlee, but I've spent a LOT of time trying to figure out why some pages time out only when deployed.

I've tried different images, puppeteer versions, crawlee versions, browser versions, you name it. It only fails for certain pages, when deployed, in docker, on AWS.

Code sample

No response

Package version

3.9.1

Node.js version

20.18.1

Operating system

Linux

Apify platform

  • Tick me if you encountered this issue on the Apify platform

I have tested this on the next release

3.12.1-beta.19

Other context

No response

@klvs klvs added the bug Something isn't working. label Dec 4, 2024
@github-actions github-actions bot added the t-tooling Issues with this label are in the ownership of the tooling team. label Dec 4, 2024
@janbuchar
Contributor

Hello @klvs, and thank you for your interest in Crawlee! Please provide the code and the Dockerfile so that we can investigate further.

Also, I understand that you run the Docker image in AWS. Which AWS product do you use for that? An EC2 instance running Docker? ECS? EKS? Lambda?


klvs commented Dec 7, 2024

@janbuchar Sure, here's crawler.js and the Dockerfile

I moved everything into crawler.js for simplicity's sake and had to remove some company-relevant information. The crawler is kicked off from an SQS queue listener, but I didn't include that; instead I just put in a long sleep so that the Docker container doesn't shut down. You can start the container, docker exec in, cd /src, and then run node crawler.js <url>.

The DEV env variable is just for testing it locally.

It's based on meilisearch/scrapix.

The AWS product used is EKS.

Thanks a bunch for looking into it. I'm really at my wit's end. It works fine locally on my MacBook, or locally in Docker.


klvs commented Dec 12, 2024

Okay, I figured this out. Oh boy was this a doozy.

First of all, this was not an issue with Crawlee (sorry guys, it was never your fault). It is an issue with puppeteer.

To reiterate: sometimes, for some websites, the crawler would fail with ERROR PuppeteerCrawler: Request failed and reached maximum retries. Navigation timed out after 60 seconds. This would ONLY occur when I deployed my crawler in Docker, in production; never locally, not even locally in Docker.

After failing to figure this out, I gave up and rewrote my crawler with pure Puppeteer. It seemed to be working, but eventually I ran into a very similar issue: Puppeteer would also time out, just with a slightly different error: TimeoutError: Navigation timeout of 30000 ms exceeded.

So if you're running into this, one thing you can do is disable the timeout entirely with page.setDefaultNavigationTimeout(0) (a quick sketch follows the list below). But wait! A request should really never take a full minute to complete. Something is definitely very wrong there. Plus, there are still some unanswered questions here:

  1. Why is a request taking over 30 seconds to complete?
  2. Why does this only happen in Docker when deployed, and not locally?
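
For reference, here's what the stopgap looks like: a minimal sketch, assuming a Puppeteer page object is already in scope.

// Stopgap only: 0 disables the navigation timeout entirely,
// so a hung request will wait forever instead of erroring out.
page.setDefaultNavigationTimeout(0);

// Or per navigation:
await page.goto(url, { timeout: 0 });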

I don't fully have the answers to this, but I do have a fix.

So debugging starts. What I was trying to fix: each page on one particular site consistently took a couple of minutes, which resulted in multi-hour crawls. I started debugging my Puppeteer (infinite-timeout) crawler with env DEBUG="puppeteer:*" node crawler/index.js, and noticed that around the time Puppeteer would resolve (a couple of minutes in), this debug message appeared:

puppeteer:protocol:RECV ◀ '{"method":"Log.entryAdded","params":{"entry":{"source":"network","level":"error","text":"Failed to load resource: net::ERR_TIMED_OUT","timestamp":1.733975591266968e+12,"url":"https://fonts.googleapis.com/css2?family=Lato:wght@100;300;400;700;900&family=Raleway:wght@100;200;300;400;500;600;700;800;900&display=swap","networkRequestId":"14306.186"}},"sessionId":"4C8B0C990FAB3A60FBE0C43D87827537"}'

For some reason, a request to Google Fonts was timing out and taking a couple of minutes. I think it fails because I'm setting cookies for the site I'm crawling, and maybe they're also being sent on the googleapis request? Not sure.

So I decided to filter out any requests that weren't on the domain I was crawling. I put this in Crawlee's preNavigationHooks:

// domain is assumed to be defined elsewhere (the hostname being crawled)
page.setDefaultNavigationTimeout(0); // maybe don't need this
await page.setRequestInterception(true);
page.on('request', (req) => {
    // Skip requests another handler has already resolved
    if (req.isInterceptResolutionHandled()) return;

    // Abort anything that isn't on the domain being crawled
    const url = new URL(req.url());
    if (url.hostname !== domain) {
        return req.abort();
    }
    return req.continue();
});
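
For context, here's roughly how that hook slots into a PuppeteerCrawler. This is a minimal sketch, not my exact production setup; domain and the URLs are stand-ins:

import { PuppeteerCrawler } from 'crawlee';

const domain = 'docs.example.com'; // stand-in for the internal docs host

const crawler = new PuppeteerCrawler({
    preNavigationHooks: [
        async ({ page }) => {
            await page.setRequestInterception(true);
            page.on('request', (req) => {
                if (req.isInterceptResolutionHandled()) return;
                // Drop third-party requests (fonts, analytics, ...) that can hang
                const url = new URL(req.url());
                if (url.hostname !== domain) return req.abort();
                return req.continue();
            });
        },
    ],
    requestHandler: async ({ page, request }) => {
        // ...extract whatever you need from the page here
    },
});

await crawler.run([`https://${domain}/`]);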

This also works with just Puppeteer, by the way (sketch below). It might not work for everyone: you can definitely break pages by filtering out all non-same-domain requests, but it does work for my use case. I spent a lot of time googling and saw a LOT of people trying to debug a navigation timeout or figure out "why does puppeteer fail when deployed on a server but succeed locally?"
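
Here's the plain-Puppeteer equivalent as a minimal end-to-end sketch; example.com and the waitUntil choice are placeholders, not from my actual crawler:

import puppeteer from 'puppeteer';

const domain = 'example.com'; // stand-in for the site being crawled

const browser = await puppeteer.launch();
const page = await browser.newPage();

await page.setRequestInterception(true);
page.on('request', (req) => {
    if (req.isInterceptResolutionHandled()) return;
    const url = new URL(req.url());
    if (url.hostname !== domain) return req.abort();
    return req.continue();
});

await page.goto(`https://${domain}/`, { waitUntil: 'networkidle2' });
// ...scrape...
await browser.close();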

As for why it only fails when deployed, I'm not 100% sure. Here's what I suspect: the default TCP keepalive time on a lot of Linux distros is really high; I think 2 hours is a common one. If the browser has a request that is headed for a timeout, it can hang for well over a minute (30 s is the Puppeteer default, I believe). That keepalive is probably set to something saner locally, and I think even Docker inherits the host's setting. Obviously Chrome/Chromium has its own timeouts, but at least in my case, the effective timeout differed between my local environment and the deployed one.

Anyway, I am overjoyed to be unstuck on this. I just had to post this comment in case anyone else is suffering as I was. Hope it helps. Good luck!

@klvs klvs closed this as completed Dec 12, 2024
@janbuchar
Contributor

@klvs wow, I'm glad you were able to solve it and thanks for the detailed summary!
