
ProtoActor cluster members stop talking to each other #2143

Open

philipp-durrer-jarowa opened this issue Nov 19, 2024 · 5 comments

@philipp-durrer-jarowa

We've used the ProtoActor library in one of our systems, and despite setting CPU requests and memory requests/limits, we see clusters of 4-8 ProtoActor members suddenly stop talking to each other. It seems to happen during periods of little to no traffic.
We're using AKS (k8s v1.29.4). What concerns me is that our devs hardcoded the ClusterName, and we have multiple instances of this service running in different namespaces (is there namespace isolation done in the protoactor code?).
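
For illustration, deriving the ClusterName from the pod's namespace rather than hardcoding it would look roughly like this; a sketch assuming the Proto.Cluster .NET API (constructor details vary by version) and a POD_NAMESPACE env var injected via the Kubernetes downward API:

```csharp
// Sketch: derive a per-namespace ClusterName so two deployments in different
// namespaces can never be mistaken for one cluster, even if provider-level
// isolation were ever bypassed. POD_NAMESPACE is assumed to be injected
// via the Kubernetes downward API.
using Proto.Cluster;
using Proto.Cluster.Kubernetes;
using Proto.Cluster.Partition;

var podNamespace = Environment.GetEnvironmentVariable("POD_NAMESPACE") ?? "default";

var clusterConfig = ClusterConfig.Setup(
    clusterName: $"my-service-{podNamespace}",      // hypothetical service name
    clusterProvider: new KubernetesProvider(),
    identityLookup: new PartitionIdentityLookup());
```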

@rogeralsing
Contributor

is there namespace isolation done in the protoactor code?

Yes.

Do you see anything else, e.g. gossip starting to time out?
Any logs from when it happens?

@philipp-durrer-jarowa
Author

philipp-durrer-jarowa commented Nov 22, 2024

https://gist.github.com/philipp-durrer-jarowa/78f0a7c46c9c8812bda3d3ca51deb2ce

According to the logs, within about 2 seconds after startup it tried to find other cluster members, found some but blocked them, and then gave up without ever retrying to join the cluster.
It could be that the other 3 pods were restarting around that time too.

Which makes me wonder: is it generally recommended to have an odd number of members, or is there no quorum involved?

So our setup is that we have 4 pods in the ProtoActor cluster (now with a dedicated Kind and ClusterName value), and we see situations where some of the members simply stop being part of the cluster. Our application, however, still sends requests to those now-isolated pods, and obviously they can't forward requests or determine which member is responsible.
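
A readiness probe tied to cluster membership could at least keep traffic away from isolated pods; a sketch, assuming ASP.NET Core health checks and that MemberList.GetAllMembers() reflects the currently known membership:

```csharp
// Sketch: fail readiness when this member no longer sees any cluster
// members, so Kubernetes stops routing traffic to an isolated pod.
using Microsoft.Extensions.Diagnostics.HealthChecks;
using Proto;
using Proto.Cluster;

public class ClusterMembershipHealthCheck : IHealthCheck
{
    private readonly ActorSystem _system;

    public ClusterMembershipHealthCheck(ActorSystem system) => _system = system;

    public Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        // Assumption: GetAllMembers() returns the members this node currently knows.
        var members = _system.Cluster().MemberList.GetAllMembers();

        return Task.FromResult(members.Length > 0
            ? HealthCheckResult.Healthy($"{members.Length} member(s) visible")
            : HealthCheckResult.Unhealthy("no cluster members visible"));
    }
}
```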

@rogeralsing
Contributor

There is no quorum mechanism; we fully rely on Kubernetes in this case to know what is running and what is healthy.
Do you by any chance have some form of service mesh, e.g. Traefik, Istio, or something similar, running?

Have you tried booting up an empty dummy Proto cluster in the same environment, i.e. just a cluster with no actors, to see if it manages to form?

What happens if you reboot all nodes at the same time, does it still disconnect after some time?
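
Such a dummy member could be as small as this; a sketch assuming the Proto.Cluster .NET API with the Kubernetes provider, names illustrative:

```csharp
// Sketch of an "empty" member: remoting + cluster membership only, no actor
// kinds registered, just to verify that the cluster forms in this environment.
using Proto;
using Proto.Cluster;
using Proto.Cluster.Kubernetes;
using Proto.Cluster.Partition;
using Proto.Remote;
using Proto.Remote.GrpcNet;

var system = new ActorSystem(ActorSystemConfig.Setup())
    .WithRemote(GrpcNetRemoteConfig
        .BindToAllInterfaces(advertisedHost: Environment.GetEnvironmentVariable("POD_IP")))
    .WithCluster(ClusterConfig.Setup(
        clusterName: "dummy-cluster",
        clusterProvider: new KubernetesProvider(),
        identityLookup: new PartitionIdentityLookup()));

await system.Cluster().StartMemberAsync();
Console.WriteLine("member up; press enter to leave the cluster");
Console.ReadLine();
await system.Cluster().ShutdownAsync();
```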

@philipp-durrer-jarowa
Author

philipp-durrer-jarowa commented Nov 24, 2024

Yes, restarting all pods at the same time seems to be more successful and usually fixes the issue. Do you have an idea what in the logic would prevent a regular Kubernetes deployment rollout with 4-8 replicas, where each member leaves/joins within 30-60 seconds, from working?
I guess my question is: how often do running cluster nodes refresh their discovery of the metadata labels?
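
For reference, a graceful leave during rollouts would look roughly like this; a sketch assuming Cluster.ShutdownAsync performs a graceful departure and that `system` is the running ActorSystem:

```csharp
// Sketch: leave the cluster gracefully on SIGTERM during a rollout, so the
// member announces its departure instead of silently disappearing.
// Assumes `system` is the running ActorSystem.
AppDomain.CurrentDomain.ProcessExit += (_, _) =>
{
    system.Cluster()
          .ShutdownAsync(graceful: true)
          .GetAwaiter()
          .GetResult();
};
```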

@rogeralsing
Contributor

how often do running cluster nodes refresh their discovery of the metadata labels?

It uses the Kubernetes "watch" feature, so changes are picked up as soon as Kubernetes announces them.

Do you by any chance have custom settings applied, e.g. to the gossip interval, or anything else special in your Proto.Cluster configuration?
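
Meaning settings like these; a sketch of the fluent ClusterConfig API (method names assumed and may vary by version), with values shown only as placeholders, not recommendations:

```csharp
// Sketch: gossip-related knobs on ClusterConfig. If none of these are set,
// the library defaults apply; unusual custom values here could explain
// members being declared gone too eagerly or too slowly.
var clusterConfig = ClusterConfig
    .Setup("my-cluster", new KubernetesProvider(), new PartitionIdentityLookup())
    .WithGossipInterval(TimeSpan.FromMilliseconds(300))       // how often state is gossiped
    .WithGossipRequestTimeout(TimeSpan.FromMilliseconds(500)); // per-request gossip timeout
```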
