-
Notifications
You must be signed in to change notification settings - Fork 302
Conversation
Agents need to execute all unit file load operations before attemping to start anything. The taskChain approach did not provide this safetly. An ordered list of tasks gives us what we need and greatly simplifies the codebase.
return | ||
} | ||
|
||
go func() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why is it now safe to get rid of the go routine here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's sort of a long story. The short version is that it used to be safe to spin off a goroutine here because the TaskManager would reject new taskChains for units that already have work in progress. Now that the TaskManager does not track the in-flight tasks (and therefore is no longer threadsafe), the reconciler should not interact with the TaskManager concurrently (which this goroutine would allow).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Independent of the architectural change being made here, this goroutine could be removed from master with minimal impact on the operation of the Agent.
Moving forward with this and cutting a v0.9.1 release. |
fleet v0.9.1 has been released to the CoreOS Alpha channel. It will be rolled out to Beta and Stable later this week. If anyone who was affected by this bug can validate the fix on Alpha, we would appreciate it. |
We've got the update on prod. So far so good! |
@bluedragonx thanks for the verification! |
I'm still seeing the error |
running |
@rufman Would you please share the exact logs and unit files that you are using to reproduce this bug in a new GitHub issue? And are you confident you are running v0.9.1? |
yes, I double check fleet -version is 0.9.1 |
@bcwaldon just updated some hosts to 607.0 with 0.9.1 fleet - no dice, reboot hosts with running , units don't start, start units and they just load dead, units consistently need daemon-reload or no such file or directory Today, I noticed that fleetctl is no longer destroying services all the time from the command line all the time. I had to manually this issue has me essentially managing every service by hand and logging directly into the host to use systemctl. My options are narrowing - revert back 2 or 3 previous stable releases, or find another solution - my network has been in a state of rolling outages for weeks now (and it's been constant baby sitting since implementing 607.0 this is happening with any of over 50 unit files, on 15-20 host network and sanitizing and sifting the logs is a huge chore - please give me exactly what you need so I can find it, otherwise, I'm literally looking for a needle in a hayfield |
@guruvan I'm more than happy to help debug your issue - could you share any logs and the unit files themselves? Please file a separate issue regarding the fleetctl issue with explicit repro steps and debug logs from fleectl (use the --debug flag). |
I am seeing the same bug with 0.9.1 as 0.9.0. Every other destroy/start cycle causes a failure, but only if you use a template and reuse the unit id (i.e. [email protected] and [email protected]). It should be pretty easy to recreate this on your own. If you cannot let me know. |
@bcwaldon I am consistently seeing the same issue with 0.9.1. I can share any logs and unit files you would like. |
Just released https://github.com/coreos/fleet/releases/tag/v0.9.2 with another major big fixed. Please reverify unexpected behavior with those binaries. They should be available in Alpha on Thursday. |
I'm not sure if I should create a new issue for this, but is it correct one still has to manually execute fleet steps in the correct order to make dependent unit starting work?
The This might be related to #1697 |
@simonvanderveldt Thanks for the report. But this PR is already merged, and it has been inactive since 17 months. It's unlikely for anyone to get back to here and read the comment again. |
@dongsupark thanks for the response, I'll create a new issue! |
On master, a given fleet agent reconciles its current state against the desired state by generating a set of taskChain objects. A taskChain is a series of tasks that should be executed against a given unit. For example, a new unit is started on an agent with a taskChain containing a LoadUnit and a StartUnit task. A unit file is replaced with a chain of UnloadUnit, LoadUnit, StartUnit. The flaw here, however, is that there isn't any ordering between the execution of the taskChains themselves. So if you have unit foo.service that BindsTo bar.service, fleet will race to start them both, and sometimes foo.service will fail with "No such file or directory" since bar.service may not have actually been written to disk yet.
The alternative proposed here is to throw the taskChain concept out the window in favor of an ordered set of tasks, where each task is executed in serial. Before starting execution, the tasks are sorted such that all LoadUnit tasks are executed before StartUnit tasks. This fixes the "No such file or directory" problems many folks have been running into by making sure all unit files are available on disk before taking any actions on them. The loss of concurrency should not be an issue as fleet already uses the DBus APIs for job control and doesn't actually block on completion.
Fix #900
Fix #997
Fix #1003
Fix #1127