[Bug] Self-Hosted: /scrape and /crawl endpoints don't respond #713
Comments
Thanks for the report @twilwa! That's quite odd. cc'ing @rafaelsideguide here to take a look.
I have also encountered this situation. How can we solve it?
Can confirm. The /scrape, /crawl, and /map endpoints all fail to respond, even with a simple curl request.
@twilwa, it seems like the workers aren't running, which is causing the jobs to get stuck in the "active" queue indefinitely. Could you configure a
Hey all, just pushed a PR that fixes this. #733
@twilwa let us know if the updates fix this issue.
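The "workers aren't running" diagnosis above can be sanity-checked by looking at the queue counts. As a hedged sketch (the count names mirror Bull-style job states; the heuristic itself is an illustrative assumption, not Firecrawl internals), the symptom described is jobs accumulating in "active" while nothing ever completes:

```python
# Hypothetical helper: interpret Bull-style queue counts to spot the symptom
# described above (jobs stuck in "active" with no worker draining them).
# The keys (active/completed/waiting) follow Bull's job-state names; the
# decision rule is an illustrative assumption.

def workers_look_stuck(counts: dict) -> bool:
    """Return True when jobs sit in 'active' but nothing ever completes."""
    return counts.get("active", 0) > 0 and counts.get("completed", 0) == 0
```

If this pattern shows up, it usually means no worker process is attached to the queue at all, rather than the jobs themselves failing.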
Has this issue been confirmed to be resolved? I was trying to set up Firecrawl with Dify using the docker-compose file provided in the firecrawl repo, deployed as a stack in Portainer. I can successfully use the v0 API endpoints, but the v1 endpoints always seem to return a 404. I also can't get much out of the log data at any log level. I'll attach my logs and my env variables; otherwise this is identical to the docker-compose provided in the repo, following the self-host guide. When used with Dify it seems to add the crawls to the queue, but nothing ever gets picked up. Via the API it
Attachment: Firecrawl Worker 1 Logs.txt
Maybe related to #1082
Describe the Bug
Note: I'll mention that I've deployed via Coolify & docker-compose, so my setup might be a little wonky. That said, if there's anything to check, I'd love some direction. When calling /scrape:
[2024-09-28T20:57:32.113Z]DEBUG - Fetching sitemap links from https://mendable.ai
[2024-09-28T21:01:07.403Z]WARN - You're bypassing authentication
[2024-09-28T21:01:07.403Z]WARN - You're bypassing authentication
[2024-09-28T21:01:07.525Z]DEBUG - [Crawl] Failed to get robots.txt (this is probably fine!): {"message":"Request failed with status code 404","name":"AxiosError","stack":"AxiosError: Request failed with status code 404\n at settle (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:1983:12)\n at BrotliDecompress.handleStreamEnd (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:3085:11)\n at BrotliDecompress.emit (node:events:531:35)\n at endReadableNT (node:internal/streams/readable:1696:12)\n at process.processTicksAndRejections (node:internal/process/task_queues:82:21)\n at Axios.request (/app/node_modules/.pnpm/[email protected]/node_modules/axios/dist/node/axios.cjs:4224:41)\n at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n at async WebCrawler.getRobotsTxt (/app/dist/src/scraper/WebScraper/crawler.js:120:26)\n at async crawlController (/app/dist/src/controllers/v1/crawl.js:52:21)","config":{"transitional":{"silentJSONParsing":true,"forcedJSONParsing":true,"clarifyTimeoutError":false},"adapter":["xhr","http","fetch"],"transformRequest":[null],"transformResponse":[null],"timeout":3000,"xsrfCookieName":"XSRF-TOKEN","xsrfHeaderName":"X-XSRF-TOKEN","maxContentLength":-1,"maxBodyLength":-1,"env":{},"headers":{"Accept":"application/json, text/plain, /","User-Agent":"axios/1.7.2","Accept-Encoding":"gzip, compress, deflate, br"},"method":"get","url":"https://mendable.ai/robots.txt","axios-retry":{"retries":3,"shouldResetTimeout":false,"validateResponse":null,"retryCount":0,"lastRequestTime":1727557267405}},"code":"ERR_BAD_REQUEST","status":404}
As far as I can tell it just hangs forever. That said, the requests that are getting returned seem to be succeeding:
anon@pop-os:~$ curl -X POST http://api-firecrawl.x-ware.online:3002/v1/crawl -H 'Content-Type: application/json' -d '{
"url": "https://mendable.ai"
}'
{"success":true,"id":"35d7987d-e160-4a07-836f-0c776c3736ae","url":"https://api-firecrawl.x-ware.online:3002/v1/crawl/35d7987d-e160-4a07-836f-0c776c3736ae"}
And I can visit the corresponding job page:
{"success":true,"status":"scraping","completed":0,"total":1,"creditsUsed":1,"expiresAt":"2024-09-29T21:01:07.000Z","next":"https://api-firecrawl.x-ware.online:3002/v1/crawl/9f34da99-1022-490b-988b-65c4f2d9c8d2?skip=0","data":[]}
(different job, just had the tab open, all the mendable attempts return like that, haven't tested much else.)
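The status response above keeps reporting "scraping" with completed=0, which is exactly the hang being described. A minimal sketch of how a client could classify such a response (field names taken from the JSON above; the "stalled" heuristic is an assumption, not Firecrawl's documented semantics):

```python
import json

# Hedged sketch: classify a /v1/crawl/<id> status response. Field names
# ("status", "completed") match the JSON shown above; treating
# completed == 0 as "stalled" is an illustrative heuristic.

def crawl_state(body: str) -> str:
    """Classify a crawl-status response as 'done', 'progress', or 'stalled'."""
    data = json.loads(body)
    if data.get("status") == "completed":
        return "done"
    return "progress" if data.get("completed", 0) > 0 else "stalled"
```

In practice a client would poll the status URL and give up after some deadline if the state never leaves "stalled".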
When calling /scrape, I get a timeout. When I visit api-firecrawl.x-ware.online (the domain I'm directing API traffic to) on port 3000, I do see the following simple HTML page:
SCRAPERS-JS: Hello, world! Fly.io
To Reproduce
Steps to reproduce the issue:
Deploy via Coolify through the 'repo' option with 'docker-compose' as the build utility. Set the following env vars:
BLOCK_MEDIA=
BULL_AUTH_KEY=
HOST=0.0.0.0
LLAMAPARSE_API_KEY=
LOGGING_LEVEL=
LOGTAIL_KEY=
MODEL_NAME=gpt-4o
NUM_WORKERS_PER_QUEUE=
OPENAI_API_KEY=
OPENAI_BASE_URL=
PLAYWRIGHT_MICROSERVICE_URL=http://playwright-service:3000
PORT=3002
POSTHOG_API_KEY=
POSTHOG_HOST=
PROXY_PASSWORD=
PROXY_SERVER=
PROXY_USERNAME=
REDIS_URL=redis://redis:6379
SCRAPING_BEE_API_KEY=
SELF_HOSTED_WEBHOOK_URL=
SLACK_WEBHOOK_URL=
SUPABASE_ANON_TOKEN=[redacted]
SUPABASE_SERVICE_TOKEN=[redacted]
SUPABASE_URL=https://supabasekong.x-ware.online/
TEST_API_KEY=
USE_DB_AUTHENTICATION=false
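One thing worth checking in the env list above: with USE_DB_AUTHENTICATION=false, the Supabase variables shouldn't matter, and several keys are left blank. A hedged sketch of a sanity check (which variables are actually required is an assumption inferred from the values filled in above, not from Firecrawl's docs):

```python
# Hedged sketch: flag env-var combinations that commonly bite self-hosted
# setups. Treating these three as required is an assumption based on the
# values the reporter filled in above.

REQUIRED = ("REDIS_URL", "PORT", "PLAYWRIGHT_MICROSERVICE_URL")

def env_warnings(env: dict) -> list:
    warnings = [f"{key} is empty" for key in REQUIRED if not env.get(key)]
    if env.get("USE_DB_AUTHENTICATION", "").lower() == "false" and env.get("SUPABASE_URL"):
        warnings.append(
            "SUPABASE_URL is set but USE_DB_AUTHENTICATION=false; "
            "it is likely ignored"
        )
    return warnings
```

Running this against the list above would flag the Supabase URL as likely unused, which at least rules out one source of confusion.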
2. Run the command '...'
Run the API calls and observe the error messages and container logs described above.
Expected Behavior
Crawl and scrape function normally.
Screenshots
If applicable, add screenshots or copies of the command line output to help explain the issue.
Environment (please complete the following information):
NAME="Amazon Linux"
VERSION="2023"
ID="amzn"
ID_LIKE="fedora"
VERSION_ID="2023"
PLATFORM_ID="platform:al2023"
PRETTY_NAME="Amazon Linux 2023.5.20240624"
ANSI_COLOR="0;33"
CPE_NAME="cpe:2.3:o:amazon:amazon_linux:2023"
HOME_URL="https://aws.amazon.com/linux/amazon-linux-2023/"
DOCUMENTATION_URL="https://docs.aws.amazon.com/linux/"
SUPPORT_URL="https://aws.amazon.com/premiumsupport/"
BUG_REPORT_URL="https://github.com/amazonlinux/amazon-linux-2023"
VENDOR_NAME="AWS"
VENDOR_URL="https://aws.amazon.com/"
SUPPORT_END="2028-03-15"
Logs
Logs found above.
Additional Context
Networking handled by traefik via coolify