
[Bug] Self-Hosted: /scrape and /crawl endpoints don't respond #713

Closed
twilwa opened this issue Sep 28, 2024 · 7 comments · Fixed by #733
Labels: bug, self-host

twilwa commented Sep 28, 2024

Describe the Bug
Note: I'll mention that I've deployed via Coolify & docker-compose, so my setup might be a little wonky. That said, if there's anything to check, I'd love some direction. When calling /scrape:
[2024-09-28T20:57:32.113Z]DEBUG - Fetching sitemap links from https://mendable.ai
[2024-09-28T21:01:07.403Z]WARN - You're bypassing authentication
[2024-09-28T21:01:07.403Z]WARN - You're bypassing authentication
[2024-09-28T21:01:07.525Z]DEBUG - [Crawl] Failed to get robots.txt (this is probably fine!): {"message":"Request failed with status code 404","name":"AxiosError","stack":"AxiosError: Request failed with status code 404\n at settle (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:1983:12)\n at BrotliDecompress.handleStreamEnd (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:3085:11)\n at BrotliDecompress.emit (node:events:531:35)\n at endReadableNT (node:internal/streams/readable:1696:12)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n at Axios.request (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:4224:41)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async WebCrawler.getRobotsTxt (/app/dist/src/scraper/WebScraper/crawler.js:120:26)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:52:21)","config":{"transitional":{"silentJSONParsing":true,"forcedJSONParsing":true,"clarifyTimeoutError":false},"adapter":["xhr","http","fetch"],"transformRequest":[null],"transformResponse":[null],"timeout":3000,"xsrfCookieName":"XSRF-TOKEN","xsrfHeaderName":"X-XSRF-TOKEN","maxContentLength":-1,"maxBodyLength":-1,"env":{},"headers":{"Accept":"application/json, text/plain, /","User-Agent":"axios/1.7.2","Accept-Encoding":"gzip, compress, deflate, br"},"method":"get","url":"https://mendable.ai/robots.txt","axios-retry":{"retries":3,"shouldResetTimeout":false,"validateResponse":null,"retryCount":0,"lastRequestTime":1727557267405}},"code":"ERR_BAD_REQUEST","status":404}

As far as I can tell, it just hangs forever. That said, the requests that do return seem to be succeeding:

anon@pop-os:~$ curl -X POST http://api-firecrawl.x-ware.online:3002/v1/crawl -H 'Content-Type: application/json' -d '{
"url": "https://mendable.ai"
}'
{"success":true,"id":"35d7987d-e160-4a07-836f-0c776c3736ae","url":"https://api-firecrawl.x-ware.online:3002/v1/crawl/35d7987d-e160-4a07-836f-0c776c3736ae}

And I can visit the corresponding job page:

{"success":true,"status":"scraping","completed":0,"total":1,"creditsUsed":1,"expiresAt":"2024-09-29T21:01:07.000Z","next":"https://api-firecrawl.x-ware.online:3002/v1/crawl/9f34da99-1022-490b-988b-65c4f2d9c8d2?skip=0","data":[]}

(different job; I just had the tab open. All the mendable.ai attempts return like that; I haven't tested much else.)
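For anyone reproducing: the status page is just a GET against the job URL returned by the crawl request. The job ID below is from my earlier response; yours will differ.

curl http://api-firecrawl.x-ware.online:3002/v1/crawl/35d7987d-e160-4a07-836f-0c776c3736ae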

When calling /scrape, I get a timeout. When I visit api-firecrawl.x-ware.online (the domain I'm directing API traffic to) on port 3000, I do see the following simple HTML page:
SCRAPERS-JS: Hello, world! Fly.io
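
For reference, the /scrape call that times out is shaped like the crawl request above (same host and payload; mendable.ai is just the example target):

curl -X POST http://api-firecrawl.x-ware.online:3002/v1/scrape -H 'Content-Type: application/json' -d '{
"url": "https://mendable.ai"
}'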

To Reproduce
Steps to reproduce the issue:
1. Deploy via Coolify through the 'repo' option with 'docker-compose' as the build utility. Set the following env vars:
BLOCK_MEDIA=
BULL_AUTH_KEY=
HOST=0.0.0.0
LLAMAPARSE_API_KEY=
LOGGING_LEVEL=
LOGTAIL_KEY=
MODEL_NAME=gpt-4o
NUM_WORKERS_PER_QUEUE=
OPENAI_API_KEY=
OPENAI_BASE_URL=
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000
PORT=3002
POSTHOG_API_KEY=
POSTHOG_HOST=
PROXY_PASSWORD=
PROXY_SERVER=
PROXY_USERNAME=
REDIS_URL=redis://redis:6379
SCRAPING_BEE_API_KEY=
SELF_HOSTED_WEBHOOK_URL=
SLACK_WEBHOOK_URL=
SUPABASE_ANON_TOKEN=[redacted]
SUPABASE_SERVICE_TOKEN=[redacted]
SUPABASE_URL=https://supabasekong.x-ware.online/
TEST_API_KEY=
USE_DB_AUTHENTICATION=false
2. Run the command '...'

3. Run the API calls described above; the resulting error messages and container logs are shown earlier in this report. (A worker sanity check is sketched below.)
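
While reproducing, it may also help to confirm the queue consumer actually came up. A minimal sketch, assuming the repo's compose file names that service "worker" (consistent with the service names implied by REDIS_URL and PLAYWRIGHT_MICROSERVICE_URL above):

docker compose ps
docker compose logs -f worker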

Expected Behavior
Crawl and scrape function normally.

Logs
Logs found above.

Additional Context
Networking is handled by Traefik via Coolify.

twilwa added the bug label Sep 28, 2024
@nickscamara
Member

Thanks for the report @twilwa! That's quite odd. cc'ing @rafaelsideguide here to take a look.

@chenjinjun

I have also encountered this situation. How can we solve it?

@lawtj

lawtj commented Oct 1, 2024

Can confirm. /scrape, /crawl, /map endpoints all do not function even with a simple curl request.

@rafaelsideguide
Collaborator

@twilwa, it seems like the workers aren’t running, which is causing the jobs to get stuck in the "active" queue indefinitely. Could you configure a BULL_AUTH_KEY=@ and check the Bull dashboard at http://api-firecrawl.x-ware.online:3002/admin/@/queues to see if there are active jobs when you send the requests?
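
For illustration, with BULL_AUTH_KEY=mysecret (a placeholder value; the path segment after /admin/ must match whatever key you set), the dashboard check would look like:

curl http://api-firecrawl.x-ware.online:3002/admin/mysecret/queues

Jobs piling up under "active" without ever completing would confirm that no worker is consuming the queue.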

@nickscamara
Member

Hey all, just pushed a PR that fixes this: #733

@rafaelsideguide
Collaborator

@twilwa let us know if the updates fix this issue.

@rothnic

rothnic commented Feb 1, 2025

Has this issue been confirmed to be resolved? I was trying to set up Firecrawl to work with Dify, using the docker-compose file provided in the Firecrawl repo. I have it deployed as a stack in Portainer. I can successfully use the v0 API endpoints, but the v1 endpoints always seem to return a 404. I also can't get much out of the log data at any log level. I'll attach my logs and my env variables; otherwise my setup is identical to the docker-compose provided in the repo, following the self-host guide.

When using it with Dify, it seems to add the crawls to the queue, but nothing ever gets picked up. Via the API it

(three screenshots attached)

Firecrawl Worker 1 Logs.txt
Firecrawl Redis Logs.txt
Firecrawl Playwright Service Logs.txt
Firecrawl API Logs.txt

Maybe related to #1082
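
For what it's worth, a quick way to check the v0/v1 split (hypothetical localhost setup; adjust host and port for your stack):

curl -s -o /dev/null -w "%{http_code}\n" -X POST http://localhost:3002/v0/scrape -H 'Content-Type: application/json' -d '{"url": "https://example.com"}'
curl -s -o /dev/null -w "%{http_code}\n" -X POST http://localhost:3002/v1/scrape -H 'Content-Type: application/json' -d '{"url": "https://example.com"}'

v0 returning 200 while v1 returns 404 would point at routing or image version rather than the queue itself.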
