CoreOS cluster restarted all containers due to fleet or etcd errors #1725

Open
ghost opened this issue Jan 18, 2017 · 1 comment

ghost commented Jan 18, 2017

Hello,
We just saw a pretty serious issue on our production CoreOS setup. Details are:

  • 3 CoreOS nodes running in AWS EC2 us-east-1
  • m3.2xlarge instance types
  • CoreOS nodes - 2 are at DISTRIB_RELEASE=1068.2.0 and 1 is at DISTRIB_RELEASE=1081.5.0
  • etcd version 0.4.9
  • we have auto-update disabled on CoreOS
  • Around 21:56 UTC on Jan 17 we saw all our containers go down, and the logs seemed to suggest an issue with etcd:

Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:179: Engine leadership acquisition failed: context deadline exceeded
Jan 17 21:59:41 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:168: Starting server components
Jan 17 21:59:42 ip-10-26-31-100.ec2.internal fleetd[999]: INFO engine.go:185: Engine leadership acquired
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(kafka-broker-1.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR reconciler.go:62: Failed resolving task: task={Type: UnscheduleUnit, JobName: kafka-broker-1.service, MachineID: 6ca65ead2f164b2682c0d941c
Jan 17 21:59:44 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(newNewApps.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded

  • We checked the CPU and disk I/O for all 3 instances; there is NO indication of any CPU spike per AWS CloudWatch
  • etcd config is as below (a rough health-check sketch against this setup follows the list)
    core@ip-10-26-33-251 ~ $ sudo systemctl cat etcd

/usr/lib64/systemd/system/etcd.service

[Unit]
Description=etcd
Conflicts=etcd2.service

[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000

/run/systemd/system/etcd.service.d/10-oem.conf

[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200

/run/systemd/system/etcd.service.d/20-cloudinit.conf

[Service]
Environment="ETCD_ADDR=10.26.33.251:4001"
Environment="ETCD_CERT_FILE=/home/etcd/certs/cert.crt"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/"
Environment="ETCD_KEY_FILE=/home/etcd/certs/key.pem"
Environment="ETCD_PEER_ADDR=10.26.33.251:7001"

  • Attached are the fleet & etcd logs from all nodes:

etcd-10-26-31-100.txt
etcd-10-26-32-94.txt
etcd-10-26-33-251.txt
fleet-10-26-31-100.txt
fleet-10-26-32-94.txt
fleet-10-26-33-251.txt

  • AWS status dashboard does not show any errors or issues on their end
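
In case it is useful, here is a rough sketch of how this etcd setup can be health-checked directly through the legacy 0.4.x v2 HTTP API. The node address and plain-HTTP client port are assumptions taken from ETCD_ADDR in the config above; swap to https:// if the cert/key in the unit actually terminate TLS on the client port.

# Poke the legacy etcd 0.4.x v2 API on one node; adjust the address per node.
ETCD=http://10.26.33.251:4001

# Reported server version (should print something like "etcd 0.4.9").
curl -s $ETCD/version

# This node's own view: leader/follower state, uptime, raft send/recv counters.
curl -s $ETCD/v2/stats/self

# Per-follower latency and failure counts (meaningful when run against the leader).
curl -s $ETCD/v2/stats/leader

# Peer list the cluster currently knows about.
curl -s $ETCD/v2/machines

# fleet's view of the same cluster.
fleetctl list-machines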

We'd appreciate it if someone could take a look at the above and give us pointers on what to look at and what we can do to mitigate this.

I opened an etcd ticket (etcd-io/etcd#7177) and was redirected here.

Thx
Maulik

ghost commented Jan 18, 2017

Following up - we have done the following:

  • Restarted fleet on all 3 nodes - still not getting to a stable state
  • Restarted etcd on 1 node (10.26.32.94) - still did not help
  • When we try to bring up containers, we randomly see them being stopped or failing because of the errors below:

Jan 18 01:39:04 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:157: Establishing etcd connectivity
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:168: Starting server components
Jan 18 01:39:49 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:50 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:40:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 18 01:40:05 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:157: Establishing etcd connectivity
Jan 18 01:40:18 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:168: Starting server components
Jan 18 01:40:20 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO engine.go:79: Engine leader is 6ca65ead2f164b2682c0d941c8a75d9b
Jan 18 01:40:34 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint

  • This entire cluster has been working fine for the last 3 months with no changes or updates, and all of a sudden we are seeing these errors. We have not upgraded the nodes or containers in the last week or so. Appreciate any insight; a rough sketch of endpoint checks that could narrow down the "invalid json" error follows below.
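
For completeness, below is a rough sketch of how to confirm which etcd endpoint fleet is actually configured against and whether that endpoint answers as etcd (which is what the "response is invalid json" error is complaining about). It assumes the usual CoreOS setup where fleet gets its endpoint from a FLEET_ETCD_SERVERS drop-in; the endpoint value shown is only an example.

# See which etcd endpoint(s) fleet is configured with (FLEET_ETCD_SERVERS
# normally lives in a drop-in under /run/systemd/system/fleet.service.d/).
systemctl cat fleet

# Recent fleet logs around the failures.
journalctl -u fleet --no-pager -n 100

# Hit the same endpoint by hand; a healthy etcd member should answer both.
ENDPOINT=http://10.26.33.251:4001   # example; use the value from FLEET_ETCD_SERVERS
curl -s $ENDPOINT/version
curl -s $ENDPOINT/v2/stats/self

# fleet's own view of cluster membership and unit state.
fleetctl list-machines
fleetctl list-units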
