Investigate: NATS Streaming crash/lock-up #1031

alexellis · 2019-01-14T11:52:18Z

Expected Behaviour

The connection to the NATS Streaming Server in the gateway should stay up and available to serve asynchronous requests.

Current Behaviour

@padiazg observed with Swarm on two occasions that NATS Streaming appeared to stop accepting new asynchronous requests. I have also noticed this with Kubernetes in OpenFaaS Cloud Community Cluster on one occasion.

Possible Solution

Tasks:

Run NATS Streaming in HA mode or with persistence so that it can be restarted if an issue is detected without losing data (see "Document how to deploy with HA NATS deployments", docs#101).
Document the three HA NATS instructions sent over from the NATS team on the docs site via docs#101 ("Document how to deploy with HA NATS deployments"). In addition, we need to give users clear instructions for Kubernetes and Swarm.
Investigate whether the gateway can reconnect if the NATS Streaming TCP connection is severed (this may need to be simulated by patching the gateway code) - this may be related to whether NATS is running with an in-memory filesystem or with persistence
Add re-connect to the gateway handler.go code (publisher)
Add re-connect handler to the nats-queue-worker code (subscriber)
Evaluate current set of NATS Streaming Prometheus exporters and whether their metrics can be used to create alerts in AlertManager for HipChat/PagerDuty etc.

I don't believe you can restart a connection / subscription if the NATS Streaming Server is running in in-memory mode. See also: openfaas/nats-queue-worker#33

Steps to Reproduce (for bugs)

Unclear

Context

If this crashes then manual action is required and it is currently not easy to know whether it has crashed from a dashboard/alert. This could affect people relying on NATS Streaming in production like @padiazg / Vision.

The configuration of NATS Streaming is "memory" by default.

Patricio has done some experimentation with MySQL as a backing store.

Your Environment

FaaS-CLI version ( Full output from: faas-cli version ):
Docker version docker version (e.g. Docker 17.0.05 ):
Are you using Docker Swarm or Kubernetes (FaaS-netes)?
Operating System and version (e.g. Linux, Windows, MacOS):
Link to your project or a code example to reproduce issue:
Please also follow the troubleshooting guide and paste in any other diagnostic information you have:


padiazg · 2019-01-14T12:57:41Z

The reason NATS crashes is still unknown to me. But you can reproduce it for testing purposes by killing NATS like this:

$ docker kill $(docker ps -qf "name=func_nats")

After that you will get the same results as when it crashes "in the wild".

Then you can get the stack back to work with this:

docker kill $(docker ps -qf "name=func_nats") $(docker ps -qf "name=func_queue-worker") $(docker ps -qf "name=func_gateway")

bartsmykla · 2019-01-14T14:34:04Z

I have had some success. I'm testing the reconnection logic right now, and if everything goes well I should have a PR with a solution in around 2h.

bartsmykla · 2019-01-14T16:10:24Z

openfaas/nats-queue-worker#49

bartsmykla · 2019-01-21T11:57:02Z

Summary: what have I done so far?

Opened Pull Request nats-queue-worker#49 and addressed all comments, which resolves the task:

Add re-connect to the gateway handler.go code (publisher)

Opened Pull Request nats-queue-worker#52 and addressed all comments, which resolves the task:

Add re-connect handler to the nats-queue-worker code (subscriber)

I also did some research on Prometheus exporters for NATS Streaming. Until the official NATS exporter supports NATS Streaming metrics (there is an open PR for that: nats-io/prometheus-nats-exporter#54), our best (and actually only) option is this exporter: https://gitlab.com/civist/nats-streaming-exporter
I couldn't find anything else, and I tested that it exports the metrics. I built and pushed my own Docker image and deployed it to my local Swarm cluster running OpenFaaS by adding the following to docker-compose.yml:

    nats-streaming-prometheus-exporter:
        image: bartsmykla/nats-streaming-exporter:0.0.1
        networks:
            - functions
        command: "/nats-streaming-exporter -nats-uri http://nats:8222"
        deploy:
            resources:
                limits:
                    memory: 125M
                reservations:
                    memory: 50M
            placement:
                constraints:
                    - 'node.platform.os == linux'
        ports:
            - 9275:9275

So if you want to test that exporter too, feel free to use my image: bartsmykla/nats-streaming-exporter:0.0.1

alexellis · 2019-01-25T11:28:34Z

Thank you for the update Bart.

Which metrics could @padiazg make use of to alert on or observe the health of his NATS Streaming instance/cluster?

bartsmykla · 2019-01-28T14:16:27Z

I think: natsstreaming_up, natsstreaming_exporter_json_parse_failures, natsstreaming_subscriptions_pending. Here is the full list from the exporter:

&Exporter{
		URI:     u,
		Timeout: timeout,
		up: prometheus.NewGauge(prometheus.GaugeOpts{
			Namespace: namespace,
			Name:      "up",
			Help:      "Was the last scrape of nats-streaming successful.",
		}),
		totalScrapes: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "exporter_total_scrapes",
			Help:      "Current total nats-streaming scrapes.",
		}),
		jsonParseFailures: prometheus.NewCounter(prometheus.CounterOpts{
			Namespace: namespace,
			Name:      "exporter_json_parse_failures",
			Help:      "Number of errors while parsing JSON.",
		}),
		clientsTotal:              newDesc("clients", "Number of currently connected clients.", nil),
		channelsTotal:             newDesc("channels", "Current number of channels.", nil),
		storeMessagesTotal:        newDesc("store_messages", "Current number of messages in the store.", nil),
		storeMessagesBytes:        newDesc("store_messages_bytes", "Total size of the messages in the store.", nil),
		subscriptionsTotal:        newDesc("subscriptions", "Number of subscriptions.", []string{"channel", "client"}),
		subscriptionsPendingTotal: newDesc("subscriptions_pending", "Number of pending messages.", []string{"channel", "client"}),
		subscriptionsStalledTotal: newDesc("subscriptions_stalled", "Number of stalled subscriptions.", []string{"channel", "client"}),
		messagesTotal:             newDesc("messages", "Number of messages.", []string{"channel"}),
		messagesBytes:             newDesc("messages_bytes", "Size of the messages.", []string{"channel"}),
	}

alexellis · 2019-04-14T15:48:05Z

Resolved through patches to gateway and queue-worker

bartsmykla mentioned this issue Jan 14, 2019: Added reconnection logic when NATS is disconnected openfaas/nats-queue-worker#49 (Merged)

alexellis mentioned this issue Jan 15, 2019: Add reconnection logic for NATS queue worker and gateway handler openfaas/nats-queue-worker#50 (Open)

bartsmykla mentioned this issue Jan 15, 2019: Implemented reconnection logic in queue-worker openfaas/nats-queue-worker#52 (Closed)

alexellis mentioned this issue Jan 29, 2019: Re-vendor queue-worker publisher for reconnect #1065 (Merged)

alexellis closed this as completed Apr 14, 2019
