CoreOS cluster restarted all containers due to fleet or etcd errors #1725

Open
ghost opened this issue Jan 18, 2017 · 1 comment

ghost commented Jan 18, 2017

Hello,
We just saw a pretty serious issue on our production CoreOS setup. Details are:

  • 3 CoreOS nodes running in AWS EC2 us-east-1
  • m3.2xlarge instance types
  • CoreOS nodes - 2 are at DISTRIB_RELEASE=1068.2.0 and 1 is at DISTRIB_RELEASE=1081.5.0
  • etcd version 0.4.9
  • we have auto-update disabled on CoreOS
  • Around 21:56 UTC on Jan 17 we saw all our containers go down, and the logs seemed to suggest an issue with etcd:

Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:179: Engine leadership acquisition failed: context deadline exceeded
Jan 17 21:59:41 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:168: Starting server components
Jan 17 21:59:42 ip-10-26-31-100.ec2.internal fleetd[999]: INFO engine.go:185: Engine leadership acquired
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(kafka-broker-1.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR reconciler.go:62: Failed resolving task: task={Type: UnscheduleUnit, JobName: kafka-broker-1.service, MachineID: 6ca65ead2f164b2682c0d941c
Jan 17 21:59:44 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(newNewApps.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded

  • We checked the CPU and disk I/O for all 3 instances; there is NO indication of any CPU spike per AWS CloudWatch
  • etcd config is as below (a rough health-check sketch against this setup follows the list)
    core@ip-10-26-33-251 ~ $ sudo systemctl cat etcd

/usr/lib64/systemd/system/etcd.service

[Unit]
Description=etcd
Conflicts=etcd2.service

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000

/run/systemd/system/etcd.service.d/10-oem.conf

[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200

/run/systemd/system/etcd.service.d/20-cloudinit.conf

[Service]
Environment="ETCD_ADDR=10.26.33.251:4001"
Environment="ETCD_CERT_FILE=/home/etcd/certs/cert.crt"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/"
Environment="ETCD_KEY_FILE=/home/etcd/certs/key.pem"
Environment="ETCD_PEER_ADDR=10.26.33.251:7001"

  • Attached are the fleet & etcd logs from all nodes:

etcd-10-26-31-100.txt
etcd-10-26-32-94.txt
etcd-10-26-33-251.txt
fleet-10-26-31-100.txt
fleet-10-26-32-94.txt
fleet-10-26-33-251.txt

  • AWS status dashboard does not show any errors or issues on their end
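
In case it is useful, here is a rough sketch of how this etcd setup can be health-checked directly through the legacy 0.4.x v2 HTTP API. The node address and plain-HTTP client port are assumptions taken from ETCD_ADDR in the config above; swap to https:// if the cert/key in the unit actually terminate TLS on the client port.

# Poke the legacy etcd 0.4.x v2 API on one node; adjust the address per node.
ETCD=http://10.26.33.251:4001

# Reported server version (should print something like "etcd 0.4.9").
curl -s $ETCD/version

# This node's own view: leader/follower state, uptime, raft send/recv counters.
curl -s $ETCD/v2/stats/self

# Per-follower latency and failure counts (meaningful when run against the leader).
curl -s $ETCD/v2/stats/leader

# Peer list the cluster currently knows about.
curl -s $ETCD/v2/machines

# fleet's view of the same cluster.
fleetctl list-machines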

We'd appreciate it if someone could take a look at the above and give us pointers on what to look at and what we can do to mitigate this.

I opened an etcd ticket (etcd-io/etcd#7177) and was redirected here.

Thx
Maulik

ghost commented Jan 18, 2017

Following up - we have done the following:

  • Restarted fleet on all 3 nodes - still not getting to a stable state
  • Restarted etcd on 1 node (10.26.32.94) - still did not help
  • When we try to bring up containers, we randomly see them being stopped or failing because of the errors below:

Jan 18 01:39:04 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:157: Establishing etcd connectivity
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:168: Starting server components
Jan 18 01:39:49 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:50 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:40:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 18 01:40:05 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:157: Establishing etcd connectivity
Jan 18 01:40:18 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:168: Starting server components
Jan 18 01:40:20 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO engine.go:79: Engine leader is 6ca65ead2f164b2682c0d941c8a75d9b
Jan 18 01:40:34 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint

  • This entire cluster has been working fine for the last 3 months with no changes or updates, and all of a sudden we are seeing these errors. We have not upgraded the nodes or containers in the last week or so. Appreciate any insight; a rough sketch of endpoint checks that could narrow down the "invalid json" error follows below.
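
For completeness, below is a rough sketch of how to confirm which etcd endpoint fleet is actually configured against and whether that endpoint answers as etcd (which is what the "response is invalid json" error is complaining about). It assumes the usual CoreOS setup where fleet gets its endpoint from a FLEET_ETCD_SERVERS drop-in; the endpoint value shown is only an example.

# See which etcd endpoint(s) fleet is configured with (FLEET_ETCD_SERVERS
# normally lives in a drop-in under /run/systemd/system/fleet.service.d/).
systemctl cat fleet

# Recent fleet logs around the failures.
journalctl -u fleet --no-pager -n 100

# Hit the same endpoint by hand; a healthy etcd member should answer both.
ENDPOINT=http://10.26.33.251:4001   # example; use the value from FLEET_ETCD_SERVERS
curl -s $ENDPOINT/version
curl -s $ENDPOINT/v2/stats/self

# fleet's own view of cluster membership and unit state.
fleetctl list-machines
fleetctl list-units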
