-
Hi @SirDave0, what version of the server are you using? Is this with JetStream enabled? Do you mean that you often see routes becoming slow consumers after 10s?
-
Hi @wallyqs, the image I used is
-
From looking at https://github.com/nats-io/nack#getting-started, it seems that to integrate JetStream we would need to specify every subject and manage a Stream CRD in the namespace. Is there a ready-to-use JetStream server configuration that enables it?
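
For illustration, a Stream CRD in the nack getting-started style looks roughly like the sketch below; the `apiVersion`, field names, and the `orders.*` subject follow the nack README and should be checked against the CRD version you actually deploy. Note that enabling JetStream on the server itself (a `jetstream {}` block or the `-js` flag) is a separate step from nack, which only manages streams/consumers on top of an already-enabled server.

```yaml
# Sketch of a Stream CRD in the style of the nack getting-started guide
# (names here are illustrative, not from this thread).
# Every subject the stream should capture has to be listed explicitly,
# and the resource is managed per namespace.
apiVersion: jetstream.nats.io/v1beta2
kind: Stream
metadata:
  name: orders-stream
spec:
  name: orders                 # stream name inside JetStream
  subjects: ["orders.*"]       # subjects must be spelled out per stream
  storage: file                # memory or file
  maxAge: 1h                   # retention window, adjust as needed
```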
-
Is there a way to measure the impact of occasional server slow consumers? I don't see in
-
The slow consumer metric being reported here is `nats_varz_slow_consumer_stats_routes` (to distinguish it from `nats_varz_slow_consumer_stats_client`), and it happens during a rolling restart of the server cluster.

For example, I have a NATS server k8s StatefulSet with 3 replicas/nodes: nats0, nats1, and nats2. For our needs, we sometimes want to refresh and redistribute external client connections to this NATS cluster, so we do a rolling restart. We always ensure that at least one NATS pod is available (no downtime from the client's perspective).

It is during this process that nats1 and nats2 can report a restarting nats0 as a slow consumer, since it takes time for nats0 to become available again and digest the gossip from nats1 and nats2. Raising the `write_deadline` in the NATS server configuration (from 15s to 20s, for example) avoids this phenomenon, but it also means we detect real client slow consumers 5 seconds late.

I wonder what the impact of these "routes" (internal server) slow consumers is on the application, and how to address it. Could there be message loss due to this?
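
For reference, a minimal sketch of where this setting lives, assuming a plain nats-server.conf; the 20s value mirrors the change described above, and the cluster route URLs are placeholders for a 3-node StatefulSet, not taken from this thread:

```
# Sketch of the relevant nats-server.conf fragment.
# write_deadline bounds how long the server will block writing to a
# connection (including routes) before flagging it as a slow consumer.
write_deadline: "20s"

cluster {
  name: nats
  port: 6222
  routes: [
    nats://nats0.nats.default.svc:6222
    nats://nats1.nats.default.svc:6222
    nats://nats2.nats.default.svc:6222
  ]
}
```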