Running units on part of the cluster stopped and started after master disconnect #1690
Comments
I'd want to know which version of fleet it is, CoreOS version too, etc.
@dongsupark Sorry for the missing info, I've added it to the initial message. We've already had to tune
The exact same thing happened yesterday morning, and again units got stopped. Our main question is this: once the machines in the cluster are connected to the masters again, fleet on the machines where units were originally scheduled stops these units before they have been (fully) started on the other machine the unit was moved to. See below for an example:
Looking into the log and code, my guess is as follows.

First, you could tune

Second, I could not understand at first how unscheduling tasks followed the lease renewal failure. Having read the code from v0.11.5, I think I understand now. Maybe this issue could be fixed via #1496, which was already merged and has been available since v0.12. Before that PR, a monitor failure resulted in a complete shutdown and restart of the entire server. After that PR, the shutdown procedure is handled gracefully.

Of course, I'm not sure whether the PR really fixes this issue, as I'm not familiar with the code base of 0.11.x. In any case, please try upgrading to v0.12 or newer.
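To make the lease-renewal reasoning above more concrete, here is a minimal, illustrative sketch of a TTL-based presence lease. It is not fleet's actual code: the `Lease` class, the `simulate` function, and all the numbers are hypothetical and only show the general idea that an outage shorter than the TTL goes unnoticed, while a longer one makes an observer (for example, a scheduler) treat the machine as gone until the next successful renewal.

```python
from dataclasses import dataclass


@dataclass
class Lease:
    """Hypothetical TTL-based presence lease (not fleet's real type)."""
    ttl: int          # seconds the lease stays valid after a renewal
    renewed_at: int   # time of the last successful renewal

    def renew(self, now: int) -> None:
        self.renewed_at = now

    def expired(self, now: int) -> bool:
        return now - self.renewed_at > self.ttl


def simulate(ttl, heartbeat_interval, outage_start, outage_end, horizon):
    """Return the seconds at which an observer sees the lease as expired.

    Heartbeats attempted during [outage_start, outage_end) fail and
    therefore do not renew the lease.
    """
    lease = Lease(ttl=ttl, renewed_at=0)
    expired_at = []
    for t in range(horizon + 1):
        in_outage = outage_start <= t < outage_end
        if t % heartbeat_interval == 0 and not in_outage:
            lease.renew(t)
        if lease.expired(t):
            expired_at.append(t)
    return expired_at


# Outage shorter than the TTL: the lease never lapses.
short_outage = simulate(ttl=30, heartbeat_interval=10,
                        outage_start=15, outage_end=40, horizon=60)

# Outage longer than the TTL: the lease lapses until the next renewal.
long_outage = simulate(ttl=30, heartbeat_interval=10,
                       outage_start=15, outage_end=65, horizon=80)
```

In this toy model, raising the TTL relative to the expected outage length is what keeps the lease alive. Whether that maps onto the behaviour reported in this issue is an assumption, not something the logs confirm.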
I believe that fleet should no longer stop units if it loses its connection to the cluster, but that's what seems to have happened to us.
We run a cluster of 17 machines, of which we've dedicated 3 to master duty.
We're running CoreOS 899.13.0 (because we had stability issues with the 1000 series).
It's using the following versions for fleetd and etcd2
It started with a single non-master node having etcd2 connectivity issues
This is something that eventually all regular nodes showed.
We then got a new etcd2 leader election
Then there's the following that repeats 300+ times from the other 2 master nodes that weren't disconnected
And finally we see the following in fleet
It seems like the reconciler was triggered, though IMHO it shouldn't be. What could be the cause of this?