Add network health check to both worker and data fetcher services #368

Open
vasyl-ivanchuk opened this issue Jan 6, 2025 · 0 comments
Labels: backend (Task requires changes to the backend implementation), bug (Something isn't working)

🐛 Bug Report

📝 Description

When a data fetcher pod loses its connection to the network, it is still considered healthy and continues receiving traffic. As a result, it keeps retrying network requests until the timeout is reached. Example logs in such a case:

{"code":"Unknown system error -116","context":"BlockchainService","level":"error","message":"getaddrinfo Unknown system error -116 <BLOCKCHAIN URL>","ms":"+30s","stack":["Error: getaddrinfo Unknown system error -116 <BLOCKCHAIN URL>
at GetAddrInfoReqWrap.onlookup [as oncomplete] (node:dns:108:26)"],"timestamp":"2025-01-06T09:05:25.035Z"}
{"context":"BlockchainService","functionName":"getBlockDetails","level":"error","message":"Exceeded retries total timeout, failing the request","ms":"+0ms","stack":[null],"timestamp":"2025-01-06T09:05:25.035Z"}

The same problem can happen with the worker service.

🤔 Expected Behavior

The health check for the pod should fail, so the pod is considered unhealthy and gets replaced.

😯 Current Behavior

The pod is considered healthy and keeps receiving traffic.

📋 Additional Context

Suggested solution: Both the worker and data fetcher services already have a JsonRpcHealthIndicator. I suggest customizing this indicator so that it pings the blockchain at a configured interval (e.g., every 20 seconds) with a large timeout (e.g., 10 seconds) and updates an internal state variable. Then, when the isHealthy function is called, the value of the internal state is returned. The motivation behind this approach is to avoid spamming the network too frequently, which could be harmful when the network is under heavy load.
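A minimal sketch of the suggested approach, in TypeScript. The issue does not show the existing JsonRpcHealthIndicator, so the class name `CachedRpcHealthIndicator`, the `ping` callback, the option names, and the optimistic initial state are all assumptions made for illustration, not the project's actual API. The sketch only shows the idea: a background timer pings the blockchain with a timeout and stores the result, and `isHealthy` returns the stored value without touching the network.

```typescript
// Hypothetical sketch: none of these names come from the project's codebase.
type Ping = () => Promise<unknown>;

interface CachedHealthOptions {
  intervalMs: number; // how often to ping, e.g. 20_000
  timeoutMs: number; // how long a single ping may take, e.g. 10_000
}

class CachedRpcHealthIndicator {
  private readonly ping: Ping;
  private readonly options: CachedHealthOptions;
  // Assumption: report healthy until the first ping completes,
  // so a freshly started pod is not killed before it has been checked.
  private healthy = true;
  private timer: ReturnType<typeof setInterval> | undefined;

  constructor(ping: Ping, options: CachedHealthOptions) {
    this.ping = ping;
    this.options = options;
  }

  // Starts the background refresh loop; calling it twice is a no-op.
  start(): void {
    if (this.timer !== undefined) return;
    void this.refresh();
    this.timer = setInterval(() => void this.refresh(), this.options.intervalMs);
  }

  // Stops the background refresh loop (e.g. on application shutdown).
  stop(): void {
    if (this.timer !== undefined) {
      clearInterval(this.timer);
      this.timer = undefined;
    }
  }

  // Returns the cached state; never performs a network call.
  isHealthy(): boolean {
    return this.healthy;
  }

  // Pings once, racing the ping against the timeout, and stores the outcome.
  async refresh(): Promise<void> {
    let timeoutHandle: ReturnType<typeof setTimeout> | undefined;
    const timeout = new Promise<never>((_, reject) => {
      timeoutHandle = setTimeout(
        () => reject(new Error("ping timed out")),
        this.options.timeoutMs,
      );
    });
    try {
      await Promise.race([this.ping(), timeout]);
      this.healthy = true;
    } catch {
      // Covers both a failed ping (e.g. a DNS error) and a timeout.
      this.healthy = false;
    } finally {
      clearTimeout(timeoutHandle);
    }
  }
}
```

In a service, one would presumably construct the indicator with a callback that performs a cheap JSON-RPC call, invoke `start()` once at startup with `intervalMs: 20_000` and `timeoutMs: 10_000`, and have the health endpoint read `isHealthy()`. Keeping the interval longer than the timeout means two pings never overlap.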
