
ProtoActor cluster members stop talking to each other #2143

Open

philipp-durrer-jarowa opened this issue Nov 19, 2024 · 5 comments

@philipp-durrer-jarowa

We've used the ProtoActor library in one of our systems, and despite setting CPU requests and memory requests/limits, we see clusters of 4-8 ProtoActor members suddenly stop talking to each other. It seems to happen during periods of little to no traffic.
We're using AKS (k8s v1.29.4). What concerns me is that our devs hardcoded the ClusterName, and we have multiple instances of this service running in different namespaces (is there namespace isolation done in the protoactor code?).
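
For illustration, deriving the ClusterName from the pod's namespace rather than hardcoding it would look roughly like this; a sketch assuming the Proto.Cluster .NET API (constructor details vary by version) and a POD_NAMESPACE env var injected via the Kubernetes downward API:

```csharp
// Sketch: derive a per-namespace ClusterName so two deployments in different
// namespaces can never be mistaken for one cluster, even if provider-level
// isolation were ever bypassed. POD_NAMESPACE is assumed to be injected
// via the Kubernetes downward API.
using Proto.Cluster;
using Proto.Cluster.Kubernetes;
using Proto.Cluster.Partition;

var podNamespace = Environment.GetEnvironmentVariable("POD_NAMESPACE") ?? "default";

var clusterConfig = ClusterConfig.Setup(
    clusterName: $"my-service-{podNamespace}",      // hypothetical service name
    clusterProvider: new KubernetesProvider(),
    identityLookup: new PartitionIdentityLookup());
```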

@rogeralsing
Contributor

is there namespace isolation done in the protoactor code?

Yes.

Do you see anything else, e.g. gossip starting to time out?
Any logs from when it happens?

@philipp-durrer-jarowa
Author

philipp-durrer-jarowa commented Nov 22, 2024

https://gist.github.com/philipp-durrer-jarowa/78f0a7c46c9c8812bda3d3ca51deb2ce

According to the logs, within about 2 seconds after startup it tried to find other cluster members, found some but blocked them, and then gave up without ever retrying to join the cluster.
It could be that the other 3 pods were restarting around that time too.

Which makes me wonder: is it generally recommended to have an odd number of members, or is there no quorum involved?

So our setup is that we have 4 pods in the ProtoActor cluster (now with a dedicated Kind and ClusterName value), and we see situations where some of the members simply stop being part of the cluster. Our application, however, still sends requests to those now-isolated pods, and obviously they can't forward requests or determine which member is responsible.
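
A readiness probe tied to cluster membership could at least keep traffic away from isolated pods; a sketch, assuming ASP.NET Core health checks and that MemberList.GetAllMembers() reflects the currently known membership:

```csharp
// Sketch: fail readiness when this member no longer sees any cluster
// members, so Kubernetes stops routing traffic to an isolated pod.
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Proto;
using Proto.Cluster;

public class ClusterMembershipHealthCheck : IHealthCheck
{
    private readonly ActorSystem _system;

    public ClusterMembershipHealthCheck(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Assumption: GetAllMembers() returns the members this node currently knows.
        var members = _system.Cluster().MemberList.GetAllMembers();

        return Task.FromResult(members.Length > 0
            ? HealthCheckResult.Healthy($"{members.Length} member(s) visible")
            : HealthCheckResult.Unhealthy("no cluster members visible"));
    }
}
```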

@rogeralsing
Contributor

There is no quorum mechanism; we fully rely on Kubernetes in this case to know what is running and what is healthy.
Do you by any chance have some form of service mesh, e.g. Traefik, Istio, or something similar, running?

Have you tried booting up an empty dummy Proto cluster in the same environment, i.e. just a cluster with no actors, to see if it manages to form?

What happens if you reboot all nodes at the same time, does it still disconnect after some time?
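
Such a dummy member could be as small as this; a sketch assuming the Proto.Cluster .NET API with the Kubernetes provider, names illustrative:

```csharp
// Sketch of an "empty" member: remoting + cluster membership only, no actor
// kinds registered, just to verify that the cluster forms in this environment.
using Proto;
using Proto.Cluster;
using Proto.Cluster.Kubernetes;
using Proto.Cluster.Partition;
using Proto.Remote;
using Proto.Remote.GrpcNet;

var system = new ActorSystem(ActorSystemConfig.Setup())
    .WithRemote(GrpcNetRemoteConfig
        .BindToAllInterfaces(advertisedHost: Environment.GetEnvironmentVariable("POD_IP")))
    .WithCluster(ClusterConfig.Setup(
        clusterName: "dummy-cluster",
        clusterProvider: new KubernetesProvider(),
        identityLookup: new PartitionIdentityLookup()));

await system.Cluster().StartMemberAsync();
Console.WriteLine("member up; press enter to leave the cluster");
Console.ReadLine();
await system.Cluster().ShutdownAsync();
```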

@philipp-durrer-jarowa
Author

philipp-durrer-jarowa commented Nov 24, 2024

Yes, restarting all pods at the same time seems to be more successful and usually fixes the issue. Do you have an idea what in the logic would prevent a regular Kubernetes deployment rollout with 4-8 replicas, where each member leaves/joins within 30-60 seconds, from working?
I guess my question is: how often do running cluster nodes refresh their discovery of the metadata labels?
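
For reference, a graceful leave during rollouts would look roughly like this; a sketch assuming Cluster.ShutdownAsync performs a graceful departure and that `system` is the running ActorSystem:

```csharp
// Sketch: leave the cluster gracefully on SIGTERM during a rollout, so the
// member announces its departure instead of silently disappearing.
// Assumes `system` is the running ActorSystem.
AppDomain.CurrentDomain.ProcessExit += (_, _) =>
{
    system.Cluster()
          .ShutdownAsync(graceful: true)
          .GetAwaiter()
          .GetResult();
};
```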

@rogeralsing
Contributor

how often do running cluster nodes refresh their discovery of the metadata labels?

It uses the Kubernetes "watch" feature, so changes are picked up as soon as Kubernetes announces them.

Do you by any chance have custom settings applied, e.g. to the gossip interval, or anything else special in your Proto.Cluster configuration?
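
Meaning settings like these; a sketch of the fluent ClusterConfig API (method names assumed and may vary by version), with values shown only as placeholders, not recommendations:

```csharp
// Sketch: gossip-related knobs on ClusterConfig. If none of these are set,
// the library defaults apply; unusual custom values here could explain
// members being declared gone too eagerly or too slowly.
var clusterConfig = ClusterConfig
    .Setup("my-cluster", new KubernetesProvider(), new PartitionIdentityLookup())
    .WithGossipInterval(TimeSpan.FromMilliseconds(300))       // how often state is gossiped
    .WithGossipRequestTimeout(TimeSpan.FromMilliseconds(500)); // per-request gossip timeout
```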
