
[Bug] Self-Hosted: /scrape and /crawl endpoints don't respond #713

Closed
twilwa opened this issue Sep 28, 2024 · 7 comments · Fixed by #733
Labels: bug, self-host

twilwa commented Sep 28, 2024

Describe the Bug
Note: I'll mention that I've deployed via Coolify & docker-compose, so my setup might be a little wonky. That said, if there's anything to check, I'd love some direction. When calling /scrape:
[2024-09-28T20:57:32.113Z]DEBUG - Fetching sitemap links from https://mendable.ai
[2024-09-28T21:01:07.403Z]WARN - You're bypassing authentication
[2024-09-28T21:01:07.403Z]WARN - You're bypassing authentication
[2024-09-28T21:01:07.525Z]DEBUG - [Crawl] Failed to get robots.txt (this is probably fine!): {"message":"Request failed with status code 404","name":"AxiosError","stack":"AxiosError: Request failed with status code 404\n at settle (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:1983:12)\n at BrotliDecompress.handleStreamEnd (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:3085:11)\n at BrotliDecompress.emit (node:events:531:35)\n at endReadableNT (node:internal/streams/readable:1696:12)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n at Axios.request (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:4224:41)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async WebCrawler.getRobotsTxt (/app/dist/src/scraper/WebScraper/crawler.js:120:26)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:52:21)","config":{"transitional":{"silentJSONParsing":true,"forcedJSONParsing":true,"clarifyTimeoutError":false},"adapter":["xhr","http","fetch"],"transformRequest":[null],"transformResponse":[null],"timeout":3000,"xsrfCookieName":"XSRF-TOKEN","xsrfHeaderName":"X-XSRF-TOKEN","maxContentLength":-1,"maxBodyLength":-1,"env":{},"headers":{"Accept":"application/json, text/plain, /","User-Agent":"axios/1.7.2","Accept-Encoding":"gzip, compress, deflate, br"},"method":"get","url":"https://mendable.ai/robots.txt","axios-retry":{"retries":3,"shouldResetTimeout":false,"validateResponse":null,"retryCount":0,"lastRequestTime":1727557267405}},"code":"ERR_BAD_REQUEST","status":404}

As far as I can tell, it just hangs forever. That said, the requests that do return seem to be succeeding:

anon@pop-os:~$ curl -X POST http://api-firecrawl.x-ware.online:3002/v1/crawl -H 'Content-Type: application/json' -d '{
"url": "https://mendable.ai"
}'
{"success":true,"id":"35d7987d-e160-4a07-836f-0c776c3736ae","url":"https://api-firecrawl.x-ware.online:3002/v1/crawl/35d7987d-e160-4a07-836f-0c776c3736ae}

And I can visit the corresponding job page:

{"success":true,"status":"scraping","completed":0,"total":1,"creditsUsed":1,"expiresAt":"2024-09-29T21:01:07.000Z","next":"https://api-firecrawl.x-ware.online:3002/v1/crawl/9f34da99-1022-490b-988b-65c4f2d9c8d2?skip=0","data":[]}

(different job; I just had the tab open. All the mendable.ai attempts return like that; I haven't tested much else.)
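For anyone reproducing: the status page is just a GET against the job URL returned by the crawl request. The job ID below is from my earlier response; yours will differ.

curl http://api-firecrawl.x-ware.online:3002/v1/crawl/35d7987d-e160-4a07-836f-0c776c3736ae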

When calling /scrape, I get a timeout. When I visit api-firecrawl.x-ware.online (the domain I'm directing API traffic to) on port 3000, I do see the following simple HTML page:
SCRAPERS-JS: Hello, world! Fly.io
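
For reference, the /scrape call that times out is shaped like the crawl request above (same host and payload; mendable.ai is just the example target):

curl -X POST http://api-firecrawl.x-ware.online:3002/v1/scrape -H 'Content-Type: application/json' -d '{
"url": "https://mendable.ai"
}'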

To Reproduce
Steps to reproduce the issue:
1. Deploy via Coolify through the 'repo' option with 'docker-compose' as the build utility. Set the following env vars:
BLOCK_MEDIA=
BULL_AUTH_KEY=
HOST=0.0.0.0
LLAMAPARSE_API_KEY=
LOGGING_LEVEL=
LOGTAIL_KEY=
MODEL_NAME=gpt-4o
NUM_WORKERS_PER_QUEUE=
OPENAI_API_KEY=
OPENAI_BASE_URL=
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000
PORT=3002
POSTHOG_API_KEY=
POSTHOG_HOST=
PROXY_PASSWORD=
PROXY_SERVER=
PROXY_USERNAME=
REDIS_URL=redis://redis:6379
SCRAPING_BEE_API_KEY=
SELF_HOSTED_WEBHOOK_URL=
SLACK_WEBHOOK_URL=
SUPABASE_ANON_TOKEN=[redacted]
SUPABASE_SERVICE_TOKEN=[redacted]
SUPABASE_URL=https://supabasekong.x-ware.online/
TEST_API_KEY=
USE_DB_AUTHENTICATION=false
2. Run the command '...'

3. Run the API calls described above; the resulting error messages and container logs are shown earlier in this report. (A worker sanity check is sketched below.)
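
While reproducing, it may also help to confirm the queue consumer actually came up. A minimal sketch, assuming the repo's compose file names that service "worker" (consistent with the service names implied by REDIS_URL and PLAYWRIGHT_MICROSERVICE_URL above):

docker compose ps
docker compose logs -f worker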

Expected Behavior
Crawl and scrape function normally.

Logs
Logs found above.

Additional Context
Networking is handled by Traefik via Coolify.

twilwa added the bug label Sep 28, 2024
@nickscamara
Member

Thanks for the report @twilwa! That's quite odd. cc'ing @rafaelsideguide here to take a look.

@chenjinjun

I have also encountered this situation. How can we solve it?

@lawtj

lawtj commented Oct 1, 2024

Can confirm. /scrape, /crawl, /map endpoints all do not function even with a simple curl request.

@rafaelsideguide
Collaborator

@twilwa, it seems like the workers aren’t running, which is causing the jobs to get stuck in the "active" queue indefinitely. Could you configure a BULL_AUTH_KEY=@ and check the Bull dashboard at http://api-firecrawl.x-ware.online:3002/admin/@/queues to see if there are active jobs when you send the requests?
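
For illustration, with BULL_AUTH_KEY=mysecret (a placeholder value; the path segment after /admin/ must match whatever key you set), the dashboard check would look like:

curl http://api-firecrawl.x-ware.online:3002/admin/mysecret/queues

Jobs piling up under "active" without ever completing would confirm that no worker is consuming the queue.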

@nickscamara
Member

Hey all, just pushed a PR that fixes this: #733

@rafaelsideguide
Collaborator

@twilwa let us know if the updates fix this issue.

@rothnic

rothnic commented Feb 1, 2025

Has this issue been confirmed to be resolved? I was trying to set up Firecrawl to work with Dify, using the docker-compose file provided in the Firecrawl repo. I have it deployed as a stack in Portainer. I can successfully use the v0 API endpoints, but the v1 endpoints always seem to return a 404. I also can't get much out of the log data at any log level. I'll attach my logs and my env variables; otherwise my setup is identical to the docker-compose provided in the repo, following the self-host guide.

When using it with Dify, it seems to add the crawls to the queue, but nothing ever gets picked up. Via the API it

(three screenshots attached)

Firecrawl Worker 1 Logs.txt
Firecrawl Redis Logs.txt
Firecrawl Playwright Service Logs.txt
Firecrawl API Logs.txt

Maybe related to #1082
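
For what it's worth, a quick way to check the v0/v1 split (hypothetical localhost setup; adjust host and port for your stack):

curl -s -o /dev/null -w "%{http_code}\n" -X POST http://localhost:3002/v0/scrape -H 'Content-Type: application/json' -d '{"url": "https://example.com"}'
curl -s -o /dev/null -w "%{http_code}\n" -X POST http://localhost:3002/v1/scrape -H 'Content-Type: application/json' -d '{"url": "https://example.com"}'

v0 returning 200 while v1 returns 404 would point at routing or image version rather than the queue itself.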
