
Unit spontaneously fails and doesn't restart #1130

Closed
timfpark opened this issue Feb 19, 2015 · 10 comments


@timfpark

I'm using CoreOS to launch a set of containers that back an application. I launch them via:

$ fleetctl start ../common/web@{1,2,3,4,5}.service

This works great initially but suddenly down the road the units are inexplicably stopped and then don't restart per the unit but are left in the failed state. Both the inexplicable stop and the failure to restart are mysteries to me because per the docs I expect the unit to restart:

[email protected]
Loaded: not-found (Reason: No such file or directory)
Active: failed (Result: exit-code) since Thu 2015-02-19 15:11:58 UTC; 1h 35min ago
Main PID: 10747 (code=exited, status=255)

Feb 18 23:47:04 web-prod-coreos-2 docker[3199]: 168.63.129.16 - - [Wed, 18 Feb 2015 23:47:04 GMT] "GET /api/v1/ops/health HTTP/1.1" 200 125 "-" "Load Balancer Agent"
Feb 18 23:47:19 web-prod-coreos-2 docker[3199]: 168.63.129.16 - - [Wed, 18 Feb 2015 23:47:19 GMT] "GET /api/v1/ops/health HTTP/1.1" 200 125 "-" "Load Balancer Agent"
Feb 18 23:47:28 web-prod-coreos-2 systemd[1]: Stopping Web Service...
Feb 18 23:47:30 web-prod-coreos-2 docker[4706]: web
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: [email protected]: main process exited, code=exited, status=255/n/a
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: Stopped Web Service.
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: Unit [email protected] entered failed state.
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: [email protected] failed.

I'm also perplexed that status returns Loaded: not-found for the failed unit, because the status of a healthy unit reports the following, which indicates that the unit file is in fact loaded:

[email protected] - Web Service
Loaded: loaded (/run/fleet/units/[email protected]; linked-runtime; vendor preset: disabled)
Active: active (running) since Thu 2015-02-19 16:42:09 UTC; 10min ago
Process: 17719 ExecStartPre=/usr/bin/docker pull timfpark/web:latest (code=exited, status=0/SUCCESS)
Process: 17702 ExecStartPre=/usr/bin/docker rm web (code=exited, status=1/FAILURE)
Process: 17695 ExecStartPre=/usr/bin/docker kill web (code=exited, status=1/FAILURE)
Main PID: 17727 (docker)
CGroup: /system.slice/system-web.slice/[email protected]
└─17727 /usr/bin/docker run --rm --name web -p 80:8080 timfpark/web

I'm using the following unit file for this service:

[Unit]
Description=Web Service
After=docker.service
Requires=docker.service

[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill web
ExecStartPre=-/usr/bin/docker rm web
ExecStartPre=/usr/bin/docker pull timfpark/web:latest
ExecStart=/usr/bin/docker run --rm --name web -p 80:8080 timfpark/web
ExecStop=/usr/bin/docker stop web

[X-Fleet]
X-Conflicts=web@*.service
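
For completeness, the unit has no Restart= directive, so systemd's default (Restart=no) applies and systemd itself will not restart the main process when it exits. A sketch of the same [Service] section with systemd-level restarts turned on might look like this (RestartSec=10 is just an illustrative value, not something from the original unit):

[Service]
TimeoutStartSec=0
# restart the container if the main process exits non-zero
Restart=on-failure
RestartSec=10
ExecStartPre=-/usr/bin/docker kill web
ExecStartPre=-/usr/bin/docker rm web
ExecStartPre=/usr/bin/docker pull timfpark/web:latest
ExecStart=/usr/bin/docker run --rm --name web -p 80:8080 timfpark/web
ExecStop=/usr/bin/docker stop web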

@guruvan

guruvan commented Feb 20, 2015

I'm seeing this and/or similar issues constantly on 557.2.
One suggestion was to add NeedsDaemonReload=yes to the unit. This is pretty crucial imo, as it is preventing services from launching automatically on new hosts (and is basically causing rolling outages of services).

@bcwaldon
Contributor

NeedsDaemonReload is not an option you can set in your unit files. This is a property provided directly by systemd over the DBus API.
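
If you want to inspect that property on a host, something like this should work (a sketch, using one of the instance names from above):

# ask systemd whether the unit file changed since it was loaded
systemctl show -p NeedsDaemonReload [email protected]
# if it reports NeedsDaemonReload=yes, reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart [email protected]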

@timfpark This could be related to #900, but I'm not ready to assume that without a little more info. Do you happen to have any logs from fleet.service around the time that your unit entered this unexpected state?
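
Something along these lines should pull the right window out of the journal (the timestamps are just an example around when your unit failed):

# fleet's view around the time of the failure
journalctl -u fleet.service --since "2015-02-19 14:00" --until "2015-02-19 16:00"
# systemd's view of the unit itself
journalctl -u [email protected] --since "2015-02-19 14:00" --until "2015-02-19 16:00"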

@timfpark
Author

timfpark commented Mar 2, 2015

@bcwaldon: Here you go...

Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO server.go:153: Establishing etcd connectivity
Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO server.go:164: Starting server components
Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO engine.go:82: Engine leadership changed from 4b148412e4374434a4b2b92655f6ef32 to 64b646305c6d46e9953486371e93cb3d
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:40 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:43 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:44 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:54 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:03:55 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:145: Triggered systemd unit [email protected] stop: job=1360760
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:275: Removing systemd unit [email protected]
Mar 02 16:05:58 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=UnloadUnit job=[email protected] reason="unit loaded but not scheduled here"
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:145: Triggered systemd unit [email protected] stop: job=1360841
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:275: Removing systemd unit [email protected]
Mar 02 16:06:24 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=UnloadUnit job=[email protected] reason="unit loaded but not scheduled here"
Mar 02 16:06:51 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:262: Writing systemd unit [email protected] (465b)
Mar 02 16:06:51 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=LoadUnit job=[email protected] reason="unit scheduled here but not loaded"
Mar 02 16:06:52 frontdoor-prod-coreos-3 fleetd[1657]: ERROR manager.go:136: Failed to trigger systemd unit [email protected] start: Unit [email protected] failed to load: No such file or directory.
Mar 02 16:06:52 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=StartUnit job=[email protected] reason="unit currently loaded but desired state is launched"
Mar 02 16:07:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:10 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:10 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:08:39 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:262: Writing systemd unit [email protected] (698b)
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=LoadUnit job=[email protected] reason="unit scheduled here but not loaded"
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO manager.go:134: Triggered systemd unit [email protected] start: job=1361002
Mar 02 16:09:18 frontdoor-prod-coreos-3 fleetd[1657]: INFO reconcile.go:321: AgentReconciler completed task: type=StartUnit job=[email protected] reason="unit currently loaded but desired state is launched"
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:09 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled
Mar 02 16:12:25 frontdoor-prod-coreos-3 fleetd[1657]: INFO client.go:291: Failed getting response from http://localhost:4001/: cancelled

@imjacobclark

I'm seeing similar issues regularly on CoreOS 557.2.0, where units started with fleet fail without much useful output; all I see is main process exited, code=exited, status=255/n/a in the service logs.

@bcwaldon
Contributor

bcwaldon commented Mar 8, 2015

@imjacobclark the first step in debugging that is to take fleet out of the equation. Can you use systemctl commands to start the unit directly?
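
Roughly something like this (a sketch; copy the unit somewhere systemd can load it without fleet, using one of your failing instances):

sudo cp [email protected] /etc/systemd/system/
sudo systemctl daemon-reload
sudo systemctl start [email protected]
systemctl status [email protected]
# follow the unit's logs while it runs
journalctl -u [email protected] -f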

@guruvan

guruvan commented Mar 10, 2015

@bcwaldon - Sorry if I was short; it's been a rough couple of weeks. I haven't got the logs yet (I still need to sanitize the units), but I will push the whole lot up to GitHub asap. However, I made some significant changes to my network this past weekend to accommodate 607.0. I've updated all the workers, but the etcd cluster is still on 557.2 or 584 (which seems ok so far!).

  1. The rolling outages I described above were primarily caused by etcd/flannel/fleet lagging. I've moved load around and cut down the overall amount of DNS traffic (which operates via skydns), and I no longer see my services dying off. This behavior, however, greatly amplified the effects of:

  2. The bug described above, which seems to be caused by either:

  • mismatched versions of fleet
  • mismatched template/instance units in the fleet registry

There was enough host churn as I rebuilt workers this weekend to identify that I only hit this issue with a small handful of unit instances, specifically ones that fail to load via fleet.

The "no such file or directory" units seem very consistent across the network. Trying to identify the log snippets you need exactly. (i.e. "egrep" is better than "egrep -v everything_else" - but not sure what I'm egrepping for) - In every case, going to the host, doing a systemctl daemon-reload && systemctl restart [email protected] gets the unit running again.

While I'm sure it's not recommended, it's not clear how one would correctly edit a production unit file and submit it to fleet without destroying all the instances of the unit at once, and it's not really practical to replace all instances simultaneously in production. My best guess is that this is the primary cause in my situation (I know I've had to do this with most, if not all, of the unit instances that are seeing this issue). My service discovery depends on the unit instance name, which names the container and provides the structure for the DNS names, so it's not easy for me to simply use a new unit name to achieve the same result.
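
The closest thing I have to a procedure is rolling through instances one at a time with the standard fleetctl subcommands, roughly like this (a sketch only, with a brief per-instance outage; I'm not claiming it avoids the template/instance mismatch described above):

fleetctl destroy [email protected]
fleetctl destroy [email protected]     # drop the old template from the registry
fleetctl submit [email protected]      # submit the edited template
fleetctl start [email protected]
fleetctl destroy [email protected] && fleetctl start [email protected]
# ...and so on for web@3 through web@5, waiting for each instance to come back healthy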

I don't know how to reproduce the scenario that caused this, but as soon as I have an opportunity later this week I'm going to move some more services around so that I can destroy all the instances and templates for the units with offending instances. I think this will cure the issue; I'll report back my findings.

Apologies - I meant this to go into #1134 - with your previous comment this weekend.

@bcwaldon
Contributor

bcwaldon commented May 1, 2015

@guruvan Have your issues been addressed via fleet v0.9.2?

@bcwaldon
Contributor

bcwaldon commented May 1, 2015

@imjacobclark any more information you can share?

@alxeg

alxeg commented Jun 2, 2015

Any news on this issue?

@bcwaldon
Contributor

bcwaldon commented Jul 9, 2015

closing due to inactivity

@bcwaldon bcwaldon closed this as completed Jul 9, 2015