ProtoActor cluster members stop talking to each other #2143
Comments
Yes. Do you see anything else, e.g. gossip starting to time out?
https://gist.github.com/philipp-durrer-jarowa/78f0a7c46c9c8812bda3d3ca51deb2ce According to the logs, within 2 seconds of startup the member tried to find other cluster members, found some but blocked them, and then gave up without ever retrying to join the cluster. That makes me wonder whether it is generally recommended to run an odd number of members, or is there no quorum mechanism at all? Our setup has 4 pods in the ProtoActor cluster (now with a dedicated Kind and ClusterName value), and we do see situations where some members simply stop being part of the cluster. Our application still sends requests to those now-isolated pods, however, and obviously they can't forward requests or check which member is responsible.
There is no quorum mechanism, as we rely fully on Kubernetes in this case to know what is running and what is healthy. Have you tried booting up an empty dummy Proto cluster in the same environment, i.e. just a cluster with no actors, to see whether it manages to form? What happens if you reboot all nodes at the same time? Does it still disconnect after some time?
Yes, restarting all actors at the same time seems to be more successful and usually fixes the issue. Do you have an idea what in the logic would prevent a regular Kubernetes deployment rollout with 4-8 replicas, where each member leaves/joins within 30-60 seconds, from staying functional?
It uses the "watch" feature, so it reacts as soon as Kubernetes announces a change. Do you by any chance have custom settings applied, e.g. to the gossip interval, or anything special in the Proto.Cluster configuration?
We've used this ProtoActor library in one of our systems and, despite setting CPU requests and memory requests/limits, we see a cluster of 4-8 ProtoActor members suddenly stop talking to each other. It seems to happen during times of little to no traffic.
We're using AKS (k8s v1.29.4). What concerns me is that our devs hardcoded the ClusterName and we have multiple instances of this service running in different namespaces (does the ProtoActor code isolate clusters by namespace?).
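Whether or not the library isolates by namespace, one defensive option would be to stop sharing a single hardcoded name. The following is only a minimal sketch of that idea, not ProtoActor API: it reads the pod's namespace from the file Kubernetes normally mounts with the service account and appends it to a base name. The helper names and the base name `my-service` are hypothetical, and the resulting string would still have to be passed to whatever cluster configuration the service actually uses.

```python
from pathlib import Path

# Path where Kubernetes normally mounts the pod's namespace alongside the
# service account token (assumes token automounting is not disabled).
NAMESPACE_FILE = Path("/var/run/secrets/kubernetes.io/serviceaccount/namespace")


def read_namespace(path: Path = NAMESPACE_FILE, default: str = "default") -> str:
    """Return the pod's namespace, or `default` if the file is missing or empty."""
    try:
        namespace = path.read_text(encoding="utf-8").strip()
    except OSError:
        return default
    return namespace or default


def scoped_cluster_name(base_name: str, namespace: str) -> str:
    """Build a per-namespace cluster name (hypothetical naming scheme)."""
    return f"{base_name}-{namespace}"


if __name__ == "__main__":
    # "my-service" is a placeholder for the currently hardcoded ClusterName.
    print(scoped_cluster_name("my-service", read_namespace()))
```

With this scheme, instances in different namespaces would end up with different cluster names even if discovery were not restricted to a single namespace.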