Investigate: NATS Streaming crash/lock-up #1031
Comments
The reason NATS crashes is still unknown to me. For testing purposes, you can reproduce it by killing NATS:
$ docker kill $(docker ps -qf "name=func_nats")
After that you get the same results as when it crashes "in the wild". You can then get the stack working again with:
$ docker kill $(docker ps -qf "name=func_nats") $(docker ps -qf "name=func_queue-worker") $(docker ps -qf "name=func_gateway")
|
I have had some success. I'm testing the reconnect logic right now, and if everything goes well I should have a PR with a solution in around 2 hours. |
Summary of what I have done so far:
I also did some research on Prometheus exporters for NATS Streaming. Until the official NATS exporter supports NATS Streaming metrics (there is an open PR for that: nats-io/prometheus-nats-exporter#54), our best (and actually only) option is this exporter: https://gitlab.com/civist/nats-streaming-exporter
If you want to test that exporter too, feel free to use my image: |
Thank you for the update Bart. Which metrics could @padiazg make use of to alert on or observe the health of his NATS Streaming instance/cluster? |
I think:
|
Resolved through patches to gateway and queue-worker |
Expected Behaviour
The connection to the NATS Streaming Server in the gateway should stay up and available to serve asynchronous requests.
Current Behaviour
@padiazg observed with Swarm on two occasions that NATS Streaming appeared to stop accepting new asynchronous requests. I have also noticed this with Kubernetes in OpenFaaS Cloud Community Cluster on one occasion.
Possible Solution
Tasks:
I don't believe you can restart a connection / subscription if the NATS Streaming Server is running in in-memory mode. See also: openfaas/nats-queue-worker#33
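As an illustration of the reconnect idea discussed in this issue, here is a minimal retry-with-backoff sketch in Python. It is not the patch that went into the gateway or queue-worker; `connect` is a hypothetical callable standing in for whatever opens the NATS Streaming connection and re-creates subscriptions, and the attempt count and delays are assumed values.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def connect_with_backoff(
    connect: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    max_delay: float = 8.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call connect() until it succeeds, doubling the wait after each failure.

    `connect` is a hypothetical stand-in: it should open the connection and
    re-create any subscriptions, raising ConnectionError when the server is
    unreachable.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # give up and surface the last error to the caller
            sleep(delay)
            delay = min(delay * 2, max_delay)  # cap the wait between attempts
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter is injectable so that the loop can be exercised without real waiting. Whether a re-created subscription picks up earlier messages depends on the store and subscription settings, which this sketch does not address.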
Steps to Reproduce (for bugs)
Context
If this crashes then manual action is required and it is currently not easy to know whether it has crashed from a dashboard/alert. This could affect people relying on NATS Streaming in production like @padiazg / Vision.
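One way to make a crash visible is a small probe that an alerting job could run. The sketch below is only an assumption-laden example: it presumes the NATS Streaming monitoring port is enabled and that a URL such as `http://nats:8222/streaming/serverz` is reachable from wherever the probe runs; the host name and port are hypothetical for this stack.

```python
import json
import urllib.request
from typing import Callable


def fetch_http(url: str, timeout: float = 2.0) -> str:
    """Fetch the body of `url` as text, raising OSError on network failure."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def nats_streaming_is_up(url: str, fetch: Callable[[str], str] = fetch_http) -> bool:
    """Return True if the monitoring endpoint answers with a JSON object.

    `url` is an assumed monitoring endpoint; any network error or a
    non-JSON body is treated as "down".
    """
    try:
        body = json.loads(fetch(url))
    except (OSError, ValueError):
        return False
    return isinstance(body, dict)
```

A probe like this only shows that the server process answers. It would not catch the case where the server is up but the gateway's connection to it has gone stale.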
The configuration of NATS Streaming is "memory" by default:
Patricio has done some experimentation with MySQL as a backing store
Your Environment
FaaS-CLI version (full output from: faas-cli version):
Docker version (docker version, e.g. Docker 17.0.05):
Are you using Docker Swarm or Kubernetes (FaaS-netes)?
Operating System and version (e.g. Linux, Windows, MacOS):
Link to your project or a code example to reproduce issue:
Please also follow the troubleshooting guide and paste in any other diagnostic information you have: