Unit spontaneously fails and doesn't restart #1130
Comments
I'm seeing this (or a very similar) issue constantly on 557.2.
NeedsDaemonReload is not an option you can set in your unit files. This is a property provided directly by systemd over the DBus API.

@timfpark This could be related to #900, but I'm not ready to assume that without a little more info. Do you happen to have any logs from fleet.service around the time that your unit entered this unexpected state?
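For reference, that property can be read back from systemd directly, and the fleet.service logs requested here can be pulled from the journal. A minimal sketch, with the unit name and time window as placeholders:

$ systemctl show -p NeedsDaemonReload [email protected]    # prints NeedsDaemonReload=yes or =no
$ journalctl -u fleet.service --since "2015-03-02 15:30" --until "2015-03-02 16:30"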
@bcwaldon: Here you go...

Mar 02 16:00:23 frontdoor-prod-coreos-3 fleetd[1657]: INFO server.go:153: Establishing etcd connectivity
I'm seeing similar issues regularly on CoreOS 557.2.0: units started with fleet fail, and none of them give much useful output. I see:
@imjacobclark the first step in debugging that is to take fleet out of the equation. Can you use systemctl commands to start the unit directly?
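A rough sketch of what that could look like on the affected host (unit name taken from the status output later in this thread; adjust as needed):

$ sudo systemctl start [email protected]
$ systemctl status [email protected]
$ journalctl -u [email protected] -e    # jump to the most recent journal entries for this unit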
@bcwaldon - Sorry if I was short. It's been rough the past couple of weeks... I haven't got the logs yet, as I still need to sanitize the units, but I will push the whole lot up to GitHub asap. However, I made some significant changes to my network this past weekend to accommodate 607.0. I've updated all the workers, but the etcd cluster is still on 557.2 or 584 (which seems OK so far!).

I had enough host churn as I rebuilt workers this weekend to identify that I only get this issue with units that don't load via fleet, and only with a small handful of unit instances.

The "no such file or directory" units seem very consistent across the network. I'm trying to identify exactly which log snippets you need (i.e. "egrep" is better than "egrep -v everything_else", but I'm not sure what I'm egrepping for). In every case, going to the host and doing a [...]

While I'm sure it's not recommended, it's not clear how one would correctly edit a production unit file and submit it to fleet without destroying all the instances of the unit together. It's not really practical to replace all instances simultaneously in production. My best guess is that this is the primary cause in my situation (I know I've had to do this with most, if not all, of the unit instances that are seeing this issue). My service discovery depends on the unit instance name, which names the container and provides the structure for the DNS names, so it's not easy for me to simply use a new unit name to achieve the result.

I don't know how to duplicate the scenario that caused this, but as soon as I have an opportunity later this week I'm going to move some services around so that I can destroy all the instances and the templates for units with offending instances - I think this will cure the issue. I'll report back my findings.

Apologies - I meant this to go into #1134, with your previous comment this weekend.
@guruvan Have your issues been addressed via fleet v0.9.2?
@imjacobclark any more information you can share?
Any news on this issue?
Closing due to inactivity.
I'm using CoreOS to launch a set of containers that back an application. I launch them via:
$ fleetctl start ../common/web@{1,2,3,4,5}.service
This works great initially, but at some point down the road the units are inexplicably stopped and then are not restarted per the unit file; they are left in the failed state. Both the unexpected stop and the failure to restart are mysteries to me, because per the docs I expect the unit to restart:
● [email protected]
Loaded: not-found (Reason: No such file or directory)
Active: failed (Result: exit-code) since Thu 2015-02-19 15:11:58 UTC; 1h 35min ago
Main PID: 10747 (code=exited, status=255)
Feb 18 23:47:04 web-prod-coreos-2 docker[3199]: 168.63.129.16 - - [Wed, 18 Feb 2015 23:47:04 GMT] "GET /api/v1/ops/health HTTP/1.1" 200 125 "-" "Load Balancer Agent"
Feb 18 23:47:19 web-prod-coreos-2 docker[3199]: 168.63.129.16 - - [Wed, 18 Feb 2015 23:47:19 GMT] "GET /api/v1/ops/health HTTP/1.1" 200 125 "-" "Load Balancer Agent"
Feb 18 23:47:28 web-prod-coreos-2 systemd[1]: Stopping Web Service...
Feb 18 23:47:30 web-prod-coreos-2 docker[4706]: web
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: [email protected]: main process exited, code=exited, status=255/n/a
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: Stopped Web Service.
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: Unit [email protected] entered failed state.
Feb 18 23:47:30 web-prod-coreos-2 systemd[1]: [email protected] failed.
I'm perplexed that status reports Loaded: not-found for the failed unit, because a healthy unit's status reports the following, which indicates that the unit file is in fact loaded:
● [email protected] - Web Service
Loaded: loaded (/run/fleet/units/[email protected]; linked-runtime; vendor preset: disabled)
Active: active (running) since Thu 2015-02-19 16:42:09 UTC; 10min ago
Process: 17719 ExecStartPre=/usr/bin/docker pull timfpark/web:latest (code=exited, status=0/SUCCESS)
Process: 17702 ExecStartPre=/usr/bin/docker rm web (code=exited, status=1/FAILURE)
Process: 17695 ExecStartPre=/usr/bin/docker kill web (code=exited, status=1/FAILURE)
Main PID: 17727 (docker)
CGroup: /system.slice/system-web.slice/[email protected]
└─17727 /usr/bin/docker run --rm --name web -p 80:8080 timfpark/web
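One way to dig into the not-found discrepancy is to compare what systemd knows about the failed instance versus a healthy one, and to check whether the linked unit file is still present on disk; a sketch, assuming the path shown in the Loaded: line above:

$ ls -l /run/fleet/units/ | grep 'web@'    # is the linked unit file still there?
$ systemctl show -p LoadState,ActiveState,FragmentPath [email protected]
$ systemctl show -p LoadState,ActiveState,FragmentPath [email protected]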
I'm using the following unit file for this service:
[Unit]
Description=Web Service
After=docker.service
Requires=docker.service
[Service]
TimeoutStartSec=0
ExecStartPre=-/usr/bin/docker kill web
ExecStartPre=-/usr/bin/docker rm web
ExecStartPre=/usr/bin/docker pull timfpark/web:latest
ExecStart=/usr/bin/docker run --rm --name web -p 80:8080 timfpark/web
ExecStop=/usr/bin/docker stop web
[X-Fleet]
X-Conflicts=web@*.service
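As an aside, the [Service] section above has no Restart= directive, so systemd on its own will not restart the container after a failure; whether that interacts with the not-found state is a separate question. A minimal sketch of what such a directive could look like (standard systemd options, nothing fleet-specific, shown purely as an illustration):

[Service]
Restart=on-failure    # illustrative addition, not part of the original unit
RestartSec=10s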