Running units on part of the cluster stopped and started after master disconnect #1690

simonvanderveldt · 2016-10-19T14:41:45Z

I believe that fleet should no longer stop units if it loses it's connection to the cluster, but that's what's seemed to have happened to us.
We run a cluster of 17 machines, of which we've dedicated 3 to master duty.

We're running CoreOS 899.13.0 (because we had stability issues with the 1000 series).
It's using the following versions for fleetd and etcd2

$ fleetd --version
fleetd version 0.11.5
$ etcd2 --version
etcd2 --version
etcd Version: 2.2.3
Git SHA: 05b564a
Go Version: go1.4.3
Go OS/Arch: linux/amd64

It started with a single non-master node having etcd2 connectivity issues

proxy: client 127.0.0.1:58758 closed request prematurely
proxy: client 127.0.0.1:58878 closed request prematurely
proxy: client 127.0.0.1:58876 closed request prematurely
etc.

This is something that eventually all regular nodes showed.

We then got a new etcd2 leader election

2016-10-19 10:34:07.697 infra-prod-master-10.9.8.200    e109f17bc0f23c3f is starting a new election at term 28
2016-10-19 10:34:07.701 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 28] received a MsgVote message with higher term from e109f17bc0f23c3f [term: 29]
2016-10-19 10:34:07.701 infra-prod-master-10.9.8.200    e109f17bc0f23c3f became candidate at term 29
2016-10-19 10:34:07.705 infra-prod-master-10.9.8.200    e109f17bc0f23c3f received vote from e109f17bc0f23c3f at term 29
2016-10-19 10:34:07.705 infra-prod-master-10.9.8.150    fc9295b9360ddd69 became follower at term 29
2016-10-19 10:34:07.710 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [logterm: 28, index: 1760268102, vote: 0] voted for e109f17bc0f23c3f [logterm: 28, index: 1760268102] at term 29
2016-10-19 10:34:07.711 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [logterm: 28, index: 1760268102] sent vote request to d2703a25587493c9 at term 29
2016-10-19 10:34:07.715 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [logterm: 28, index: 1760268102] sent vote request to fc9295b9360ddd69 at term 29
2016-10-19 10:34:07.716 infra-prod-master-10.9.8.150    raft.node: fc9295b9360ddd69 lost leader d2703a25587493c9 at term 29
2016-10-19 10:34:07.720 infra-prod-master-10.9.8.150    raft.node: fc9295b9360ddd69 elected leader e109f17bc0f23c3f at term 29
2016-10-19 10:34:07.720 infra-prod-master-10.9.8.200    raft.node: e109f17bc0f23c3f lost leader d2703a25587493c9 at term 29
2016-10-19 10:34:07.726 infra-prod-master-10.9.8.200    e109f17bc0f23c3f received vote from fc9295b9360ddd69 at term 29
2016-10-19 10:34:07.730 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [q:2] has received 2 votes and 0 vote rejections
2016-10-19 10:34:07.734 infra-prod-master-10.9.8.200    e109f17bc0f23c3f became leader at term 29
2016-10-19 10:34:07.738 infra-prod-master-10.9.8.200    raft.node: e109f17bc0f23c3f elected leader e109f17bc0f23c3f at term 29

Then there's the following that repeats 300+ times from the other 2 master nodes that weren't disconnected

2016-10-19 10:34:12.601 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.602 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.607 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.612 infra-prod-master-10.9.8.150    fc9295b9360ddd69 [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]
2016-10-19 10:34:12.613 infra-prod-master-10.9.8.200    e109f17bc0f23c3f [term: 29] ignored a MsgApp message with lower term from d2703a25587493c9 [term: 28]

And finally we see the following in fleet

2016-10-19 10:34:08.884 infra-prod-dev-10.9.8.131   ERROR engine.go:217: Engine leadership lost, renewal failed: context deadline exceeded
2016-10-19 10:34:12.213 infra-prod-dev-10.9.8.103   ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:12.217 infra-prod-dev-10.9.8.103   INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:12.250 infra-prod-dev-10.9.8.103   INFO server.go:168: Starting server components
2016-10-19 10:34:40.498 infra-prod-tst-10.9.8.246   ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:40.503 infra-prod-tst-10.9.8.246   INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:40.648 infra-prod-master-10.9.8.191    ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:40.652 infra-prod-master-10.9.8.191    INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:46.573 infra-prod-tst-10.9.8.246   INFO server.go:168: Starting server components
2016-10-19 10:34:47.915 infra-prod-dev-10.9.8.22    ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:47.919 infra-prod-dev-10.9.8.22    INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:50.744 infra-prod-tools-10.9.8.118 ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
2016-10-19 10:34:50.747 infra-prod-tools-10.9.8.118 INFO server.go:157: Establishing etcd connectivity
2016-10-19 10:34:55.312 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job([email protected]) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.318 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: [email protected], MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.360 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job([email protected]) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.366 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: [email protected], MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.415 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job([email protected]) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.421 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: [email protected], MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.470 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job([email protected]) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.476 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: [email protected], MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.529 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job([email protected]) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
2016-10-19 10:34:55.536 infra-prod-dev-10.9.8.131   INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: [email protected], MachineID: 0f2c840a208949cdbf2dfbc12ce35caf, Reason: "target Machine(0f2c840a208949cdbf2dfbc12ce35caf) went away"}
2016-10-19 10:34:55.623 infra-prod-dev-10.9.8.131   INFO engine.go:256: Unscheduled Job([email protected]) from Machine(0f2c840a208949cdbf2dfbc12ce35caf)
etc...

It seems like the reconciler was triggered, though IMHO it shouldn't be. What could be the cause of this?

The text was updated successfully, but these errors were encountered:

dongsupark · 2016-10-21T10:25:06Z

I'd want to know which version of fleet it is, CoreOS version too, etc.
Anyway I guess it would be probably worth tuning agent_ttl.
See also #1334.

simonvanderveldt · 2016-10-21T10:55:44Z

I'd want to know which version of fleet it is, CoreOS version too, etc.
Anyway I guess it would be probably worth tuning agent_ttl.
See also #1334.

@dongsupark sorry for the missing info, I've added it to the initial message.

We've already had to tune agent_ttl because we ran into a lot of problems when it was at it's default value. It's now at 30s.
Would fleet running out of this ttl cause it to stop units?

simonvanderveldt · 2016-11-07T11:49:53Z

The exact same thing happened yesterday morning and again units got stopped, the main question we're having is that once the machines in the cluster are connected to the masters again fleet on the machines where units were scheduled stops these units before they have been (fully) started on the other machine the unit was moved to.

See below for an example:

ovember 6th 2016, 07:20:22.626  tst-1-master-10.1.21.52 INFO engine.go:256: Unscheduled Job([email protected]) from Machine(59636889a9ed4f3ba191642fd545978e)
November 6th 2016, 07:20:22.633 tst-1-master-10.1.21.52 INFO reconciler.go:161: EngineReconciler completed task: {Type: UnscheduleUnit, JobName: [email protected], MachineID: 59636889a9ed4f3ba191642fd545978e, Reason: "target Machine(59636889a9ed4f3ba191642fd545978e) went away"}
November 6th 2016, 07:20:23.092 tst-1-master-10.1.21.52 INFO engine.go:271: Scheduled Unit([email protected]) to Machine(41a360cf46ef4a4fb674a3fae3af099e)
November 6th 2016, 07:20:23.103 tst-1-master-10.1.21.52 INFO reconciler.go:161: EngineReconciler completed task: {Type: AttemptScheduleUnit, JobName: [email protected], MachineID: 41a360cf46ef4a4fb674a3fae3af099e, Reason: "target state launched and unit not scheduled"}
November 6th 2016, 07:20:23.900 tst-1-worker-10.1.21.87 INFO manager.go:246: Writing systemd unit [email protected] (1846b)
November 6th 2016, 07:20:24.035 tst-1-worker-10.1.21.88 INFO manager.go:138: Triggered systemd unit [email protected] stop: job=1192125
November 6th 2016, 07:20:24.077 tst-1-worker-10.1.21.88 INFO manager.go:259: Removing systemd unit [email protected]
November 6th 2016, 07:20:25.426 tst-1-worker-10.1.21.87 INFO manager.go:127: Triggered systemd unit [email protected] start: job=1195770
November 6th 2016, 07:20:25.497 tst-1-worker-10.1.21.87 INFO reconcile.go:330: AgentReconciler completed task: type=LoadUnit [email protected] reason="unit scheduled here but not loaded"
November 6th 2016, 07:20:26.361 tst-1-worker-10.1.21.87 INFO reconcile.go:330: AgentReconciler completed task: type=StartUnit [email protected] reason="unit currently loaded but desired state is launched"
November 6th 2016, 07:20:32.133 tst-1-worker-10.1.21.88 ERROR unit_state.go:204: Failed to destroy UnitState([email protected]) in Registry: context deadline exceeded
November 6th 2016, 07:20:32.197 tst-1-worker-10.1.21.88 INFO reconcile.go:330: AgentReconciler completed task: type=UnloadUnit [email protected] reason="unit loaded but not scheduled here"```

dongsupark · 2016-11-07T14:41:38Z

Looking into the log and code, my guess is like so.

First, you could tune etcd_request_timeout too, to reduce errors like "Engine leadership lost, renewal failed: context deadline exceeded". Of course it's still hard to figure out the real reason of the lease renewal failure, but such a tuning could be workaround.

Second, I could not understand how unscheduling tasks followed the lease renewal failure. Reading code from v0.11.5, I think now I can understand. Maybe this issue could be fixed via #1496, which was already merged, and available since v0.12. Before that PR, a monitor failure resulted in a complete shutdown + start of the entire server. OTOH, after that PR, the shutdown procedure is gracefully handled. Of course I'm not sure if the PR really fixes this issue. I'm not familiar with the code base of 0.11.x. Anyway please try to upgrade v0.12 or newer.

dongsupark added the kind/support label Oct 19, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running units on part of the cluster stopped and started after master disconnect #1690

Running units on part of the cluster stopped and started after master disconnect #1690

simonvanderveldt commented Oct 19, 2016 •

edited

Loading

dongsupark commented Oct 21, 2016

simonvanderveldt commented Oct 21, 2016

simonvanderveldt commented Nov 7, 2016 •

edited

Loading

dongsupark commented Nov 7, 2016

Running units on part of the cluster stopped and started after master disconnect #1690

Running units on part of the cluster stopped and started after master disconnect #1690

Comments

simonvanderveldt commented Oct 19, 2016 • edited Loading

dongsupark commented Oct 21, 2016

simonvanderveldt commented Oct 21, 2016

simonvanderveldt commented Nov 7, 2016 • edited Loading

dongsupark commented Nov 7, 2016

simonvanderveldt commented Oct 19, 2016 •

edited

Loading

simonvanderveldt commented Nov 7, 2016 •

edited

Loading