
Unit spontaneously fails and doesn't restart #1130

Closed
timfpark opened this issue Feb 19, 2015 · 10 comments


@timfpark

I'm using CoreOS to launch a set of containers that back an application. I launch them via:

$ fleetctl start ../common/web@{1,2,3,4,5}.service

This works great initially but suddenly down the road the units are inexplicably stopped and then don't restart per the unit but are left in the failed state. Both the inexplicable stop and the failure to restart are mysteries to me because per the docs I expect the unit to restart:

[email protected]
Loaded: not-found (Reason: No such file or directory)
Active: failed (Result: exit-code) since Thu 2015-02-19 15:11:58 UTC; 1h 35min ago
Main PID: 10747 (code=exited, status=255)

Feb 18 23:47:04 web-prod-coreos-2 docker[3199]: 168.63.129.16 - - [Wed, 18 Feb 2015 23:47:04 GMT] "GET /api/v1/ops/health HTTP/1.1" 200 125 "-" "Load Balancer Agent"
Feb 18 23:47:19 web-prod-coreos-2 docker[3199]: 168.63.129.16 - - [Wed, 18 Feb 2015 23:47:19 GMT] "GET /api/v1/ops/health HTTP/1.1" 200 125 "-" "Load Balancer Agent"
Feb 18 23:47:28 web-prod-coreos-2 systemd[1]: Stopping Web Service...
Feb 18 23:47:30 web-prod-coreos-2 docker[4706]: web
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: [email protected]: main process exited, code=exited, status=255/n/a
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: Stopped Web Service.
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: Unit [email protected] entered failed state.
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: [email protected] failed.

I'm also perplexed that status returns Loaded: not-found for the failed unit, because the status of a healthy unit reports the following, which indicates that the unit file is in fact loaded:

[email protected] - Web Service
Loaded: loaded (/run/fleet/units/[email protected]; linked-runtime; vendor preset: disabled)
Active: active (running) since Thu 2015-02-19 16:42:09 UTC; 10min ago
Process: 17719 ExecStartPre=/usr/bin/docker pull timfpark/web:latest (code=exited, status=0/SUCCESS)
Process: 17702 ExecStartPre=/usr/bin/docker rm web (code=exited, status=1/FAILURE)
Process: 17695 ExecStartPre=/usr/bin/docker kill web (code=exited, status=1/FAILURE)
Main PID: 17727 (docker)
CGroup: /system.slice/system-web.slice/[email protected]
└─17727 /usr/bin/docker run --rm --name web -p 80:8080 timfpark/web

I'm using the following unit file for this service:

[Unit]
Description=Web Service
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill web
ExecStartPre=-/usr/bin/docker rm web
ExecStartPre=/usr/bin/docker pull timfpark/web:latest
ExecStart=/usr/bin/docker run --rm --name web -p 80:8080 timfpark/web
ExecStop=/usr/bin/docker stop web

[X-Fleet]
X-Conflicts=web@*.service
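
For completeness, the unit has no Restart= directive, so systemd's default (Restart=no) applies and systemd itself will not restart the main process when it exits. A sketch of the same [Service] section with systemd-level restarts turned on might look like this (RestartSec=10 is just an illustrative value, not something from the original unit):

[Service]
TimeoutStartSec=0
# restart the container if the main process exits non-zero
Restart=on-failure
RestartSec=10
ExecStartPre=-/usr/bin/docker kill web
ExecStartPre=-/usr/bin/docker rm web
ExecStartPre=/usr/bin/docker pull timfpark/web:latest
ExecStart=/usr/bin/docker run --rm --name web -p 80:8080 timfpark/web
ExecStop=/usr/bin/docker stop web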

@guruvan

guruvan commented Feb 20, 2015

I'm seeing this and/or similar issues constantly on 557.2.
One suggestion was to add NeedsDaemonReload=yes to the unit. This is pretty crucial imo, as it is preventing services from launching automatically on new hosts (and is basically causing rolling outages of services).

@bcwaldon
Contributor

NeedsDaemonReload is not an option you can set in your unit files. This is a property provided directly by systemd over the DBus API.
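
If you want to inspect that property on a host, something like this should work (a sketch, using one of the instance names from above):

# ask systemd whether the unit file changed since it was loaded
systemctl show -p NeedsDaemonReload [email protected]
# if it reports NeedsDaemonReload=yes, reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart [email protected]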

@timfpark This could be related to #900, but I'm not ready to assume that without a little more info. Do you happen to have any logs from fleet.service around the time that your unit entered this unexpected state?
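
Something along these lines should pull the right window out of the journal (the timestamps are just an example around when your unit failed):

# fleet's view around the time of the failure
journalctl -u fleet.service --since "2015-02-19 14:00" --until "2015-02-19 16:00"
# systemd's view of the unit itself
journalctl -u [email protected] --since "2015-02-19 14:00" --until "2015-02-19 16:00"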

@timfpark
Author

timfpark commented Mar 2, 2015

@bcwaldon: Here you go...

Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO server.go:153: Establishing etcd connectivity
Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO server.go:164: Starting server components
Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO engine.go:82: Engine leadership changed from 4b148412e4374434a4b2b92655f6ef32 to 64b646305c6d46e9953486371e93cb3d
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:43 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:44 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:145: Triggered systemd unit [email protected] stop: job=1360760
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:275: Removing systemd unit [email protected]
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=UnloadUnit job=[email protected] reason="unit loaded but not scheduled here"
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:145: Triggered systemd unit [email protected] stop: job=1360841
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:275: Removing systemd unit [email protected]
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=UnloadUnit job=[email protected] reason="unit loaded but not scheduled here"
Mar 02 16:06:51 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:262: Writing systemd unit [email protected] (465b)
Mar 02 16:06:51 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=LoadUnit job=[email protected] reason="unit scheduled here but not loaded"
Mar 02 16:06:52 frontdoor-prod-coreos-3 fleetd[1657]: ERROR manager.go:136: Failed to trigger systemd unit [email protected] start: Unit [email protected] failed to load: No such file or directory.
Mar 02 16:06:52 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=StartUnit job=[email protected] reason="unit currently loaded but desired state is launched"
Mar 02 16:07:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:10 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:10 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:262: Writing systemd unit [email protected] (698b)
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=LoadUnit job=[email protected] reason="unit scheduled here but not loaded"
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:134: Triggered systemd unit [email protected] start: job=1361002
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=StartUnit job=[email protected] reason="unit currently loaded but desired state is launched"
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled

@imjacobclark

I'm seeing similar issues regularly on CoreOS 557.2.0, where units started with fleet fail without much useful output; all I see is main process exited, code=exited, status=255/n/a in the service logs.

@bcwaldon
Contributor

bcwaldon commented Mar 8, 2015

@imjacobclark the first step in debugging that is to take fleet out of the equation. Can you use systemctl commands to start the unit directly?
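
Roughly something like this (a sketch; copy the unit somewhere systemd can load it without fleet, using one of your failing instances):

sudo cp [email protected] /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start [email protected]
systemctl status [email protected]
# follow the unit's logs while it runs
journalctl -u [email protected] -f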

@guruvan

guruvan commented Mar 10, 2015

@bcwaldon - Sorry if I was short; it's been a rough couple of weeks. I haven't got the logs yet (I still need to sanitize the units), but I will push the whole lot up to GitHub asap. However, I made some significant changes to my network this past weekend to accommodate 607.0. I've updated all the workers, but the etcd cluster is still on 557.2 or 584 (which seems ok so far!).

  1. The rolling outages I described above were primarily caused by etcd/flannel/fleet lagging. I've moved load around and cut down the overall amount of DNS traffic (which operates via skydns), and I no longer see my services dying off. This behavior, however, greatly amplified the effects of:

  2. The bug described above, which seems to be caused by either:

  • mismatched versions of fleet
  • mismatched template/instance units in the fleet registry

There was enough host churn as I rebuilt workers this weekend to identify that I only hit this issue with a small handful of unit instances, specifically ones that fail to load via fleet.

The "no such file or directory" units seem very consistent across the network. Trying to identify the log snippets you need exactly. (i.e. "egrep" is better than "egrep -v everything_else" - but not sure what I'm egrepping for) - In every case, going to the host, doing a systemctl daemon-reload && systemctl restart [email protected] gets the unit running again.

While I'm sure it's not recommended, it's not clear how one would correctly edit a production unit file and submit it to fleet without destroying all the instances of the unit at once, and it's not really practical to replace all instances simultaneously in production. My best guess is that this is the primary cause in my situation (I know I've had to do this with most, if not all, of the unit instances that are seeing this issue). My service discovery depends on the unit instance name, which names the container and provides the structure for the DNS names, so it's not easy for me to simply use a new unit name to achieve the same result.
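
The closest thing I have to a procedure is rolling through instances one at a time with the standard fleetctl subcommands, roughly like this (a sketch only, with a brief per-instance outage; I'm not claiming it avoids the template/instance mismatch described above):

fleetctl destroy [email protected]
fleetctl destroy [email protected]     # drop the old template from the registry
fleetctl submit [email protected]      # submit the edited template
fleetctl start [email protected]
fleetctl destroy [email protected] && fleetctl start [email protected]
# ...and so on for web@3 through web@5, waiting for each instance to come back healthy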

I don't know how to reproduce the scenario that caused this, but as soon as I have an opportunity later this week I'm going to move some more services around so that I can destroy all the instances and templates for the units with offending instances. I think this will cure the issue; I'll report back my findings.

Apologies - I meant this to go into #1134 - with your previous comment this weekend.

@bcwaldon
Contributor

bcwaldon commented May 1, 2015

@guruvan Have your issues been addressed via fleet v0.9.2?

@bcwaldon
Contributor

bcwaldon commented May 1, 2015

@imjacobclark any more information you can share?

@alxeg

alxeg commented Jun 2, 2015

Any news on this issue?

@bcwaldon
Contributor

bcwaldon commented Jul 9, 2015

closing due to inactivity

@bcwaldon bcwaldon closed this as completed Jul 9, 2015