This repository has been archived by the owner on Jan 30, 2020. It is now read-only.
Restarted fleet on all 3 nodes - still not getting to a stable state
Restarted etcd on 1 node - 10.26.32.94 - still did not help
When we try to bring up containers, we randomly see them being stopped or failing with the errors below:
Jan 18 01:39:04 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:157: Establishing etcd connectivity
Jan 18 01:39:09 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:168: Starting server components
Jan 18 01:39:49 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps.service) in Registry: context deadline exceeded
Jan 18 01:39:50 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: context deadline exceeded
Jan 18 01:40:05 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 18 01:40:05 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:157: Establishing etcd connectivity
Jan 18 01:40:18 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO server.go:168: Starting server components
Jan 18 01:40:20 ip-10-26-31-100.ec2.internal fleetd[20619]: INFO engine.go:79: Engine leader is 6ca65ead2f164b2682c0d941c8a75d9b
Jan 18 01:40:34 ip-10-26-31-100.ec2.internal fleetd[20619]: ERROR units.go:231: Failed creating Unit(discoveryAppdiscoveryApps_syslog.service) in Registry: client: response is invalid json. The endpoint is probably not valid etcd cluster endpoint
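To see which failures dominate in a longer capture, the fleetd lines can be tallied by source location and trailing cause. The following is only a rough sketch: the function name `summarize` and the regular expression are illustrative assumptions based on the line format visible in the excerpt above, not part of fleet itself.

```python
import re
from collections import Counter

# Assumed line shape, taken from the journal excerpt above:
# "<timestamp> <host> fleetd[<pid>]: <LEVEL> <file>.go:<line>: <message>"
LINE_RE = re.compile(
    r"fleetd\[\d+\]: (?P<level>[A-Z]+) (?P<src>\w+\.go:\d+): (?P<msg>.*)$"
)


def summarize(lines):
    """Count ERROR entries per (source location, trailing cause).

    `summarize` is a hypothetical helper, not a fleet API.
    """
    counts = Counter()
    for line in lines:
        m = LINE_RE.search(line.rstrip("\n"))
        if not m or m.group("level") != "ERROR":
            continue  # skip INFO lines and anything that is not fleetd output
        # Treat the text after the last ": " as the cause,
        # e.g. "context deadline exceeded".
        cause = m.group("msg").rsplit(": ", 1)[-1]
        counts[(m.group("src"), cause)] += 1
    return counts
```

Feeding it the journal output for the fleet unit (for example by iterating over `sys.stdin`) and printing `counts.most_common()` should show whether the timeouts or the invalid JSON responses are the more frequent symptom.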
This entire cluster has been working fine for the last 3 months with no changes or updates, and all of a sudden we are seeing these errors. We have not upgraded the nodes or containers in the last week or so. We would appreciate any insight.
Hello
We just saw a pretty severe issue on our production CoreOS setup. Details are:
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR server.go:189: Server monitor triggered: Monitor timed out before successful heartbeat
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:157: Establishing etcd connectivity
Jan 17 21:56:22 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:179: Engine leadership acquisition failed: context deadline exceeded
Jan 17 21:59:41 ip-10-26-31-100.ec2.internal fleetd[999]: INFO server.go:168: Starting server components
Jan 17 21:59:42 ip-10-26-31-100.ec2.internal fleetd[999]: INFO engine.go:185: Engine leadership acquired
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(kafka-broker-1.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
Jan 17 21:59:43 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR reconciler.go:62: Failed resolving task: task={Type: UnscheduleUnit, JobName: kafka-broker-1.service, MachineID: 6ca65ead2f164b2682c0d941c
Jan 17 21:59:44 ip-10-26-31-100.ec2.internal fleetd[999]: ERROR engine.go:254: Failed unscheduling Unit(newNewApps.service) from Machine(6ca65ead2f164b2682c0d941c8a75d9b): context deadline exceeded
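The excerpt above shows fleetd logging "Establishing etcd connectivity" at 21:56:22 and "Starting server components" only at 21:59:41. To measure such gaps across a full log, something like the sketch below could help. The helper name `reconnect_gaps` is hypothetical, and the placeholder year is an assumption needed only because the journal timestamps carry no year.

```python
from datetime import datetime

# Journal timestamps such as "Jan 17 21:56:22" have no year, so a
# placeholder year is prepended purely to make parsing unambiguous.
PLACEHOLDER_YEAR = "2000"


def reconnect_gaps(lines):
    """Return seconds between each 'Establishing etcd connectivity' line
    and the next 'Starting server components' line.

    `reconnect_gaps` is a hypothetical helper, not a fleet API.
    """
    gaps = []
    started = None
    for line in lines:
        try:
            ts = datetime.strptime(
                PLACEHOLDER_YEAR + " " + line[:15], "%Y %b %d %H:%M:%S"
            )
        except ValueError:
            continue  # line does not start with a journal timestamp
        if "Establishing etcd connectivity" in line:
            started = ts
        elif "Starting server components" in line and started is not None:
            gaps.append((ts - started).total_seconds())
            started = None
    return gaps
```

Long gaps would suggest that fleetd is spending most of its time waiting on etcd rather than scheduling units. Note that this sketch does not handle logs that cross a year boundary.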
core@ip-10-26-33-251 ~ $ sudo systemctl cat etcd
# /usr/lib64/systemd/system/etcd.service
[Unit]
Description=etcd
Conflicts=etcd2.service
[Service]
User=etcd
PermissionsStartOnly=true
Environment=ETCD_DATA_DIR=/var/lib/etcd
Environment=ETCD_NAME=%m
ExecStart=/usr/bin/etcd
Restart=always
RestartSec=10s
LimitNOFILE=40000
# /run/systemd/system/etcd.service.d/10-oem.conf
[Service]
Environment=ETCD_PEER_ELECTION_TIMEOUT=1200
# /run/systemd/system/etcd.service.d/20-cloudinit.conf
[Service]
Environment="ETCD_ADDR=10.26.33.251:4001"
Environment="ETCD_CERT_FILE=/home/etcd/certs/cert.crt"
Environment="ETCD_DISCOVERY=https://discovery.etcd.io/"
Environment="ETCD_KEY_FILE=/home/etcd/certs/key.pem"
Environment="ETCD_PEER_ADDR=10.26.33.251:7001"
etcd-10-26-31-100.txt
etcd-10-26-32-94.txt
etcd-10-26-33-251.txt
fleet-10-26-31-100.txt
fleet-10-26-32-94.txt
fleet-10-26-33-251.txt
We would appreciate it if someone could take a look at the above and give us pointers on what to look at and what we can do to mitigate this.
I opened a ticket, etcd-io/etcd#7177, and was redirected here.
Thx
Maulik