Replies: 19 comments
-
One more factor which could be important is that our virtual network has been known to glitch occasionally, visible to us as SSH sessions that hang for a few seconds. In fact, one of the failures I noted happened exactly during one of those glitches! I remember turning around to ask my colleague if he had done something to the VM because my SSH sessions had frozen, and when I turned back they had recovered and the error was being displayed.
-
I just built the RAET dependencies and tried using the RAET transport, but got the exact same problem. So it's not transport-related, at least.
-
Increasing the salt timeout in the master config fixes the problem for us. It seems like occasionally our VM, disk, or network hangs long enough for some communication to time out. Is there some kind of polling mechanism during tasks which could cause this? (It only seems to happen on either of our two long-running / intensive tasks.) And is this the expected behavior? If so, salt-run should probably fail in a way that makes it clearer what went wrong, instead of simply continuing with the next task. (The same goes for regular salt execution, I guess, where it now just returns; it should probably raise an error instead.) I'll leave the ticket open since I'm not sure if and how you will want to address this, but feel free to close it if you believe the behavior is as intended.
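For anyone looking for the concrete knob: the comment above refers to the master's global timeout setting. A minimal sketch of such a change, assuming the default config path and an example value of 300 seconds (the actual value should be tuned to your slowest state):

```yaml
# /etc/salt/master  (sketch, not the poster's actual config)
# Seconds salt/salt-run waits for minion returns before giving up on a job;
# raise this if long-running states are being cut off mid-run.
timeout: 300
```

The master needs a restart for the change to take effect.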
-
What appears in /var/log/salt/master:
It seems like one of the find_job tasks would have timed out, but I can't really see any evidence of it in the log?
-
@bernieke can you post a sanitized version of your orchestrate.sls as well as the output of:
and
-
The minion and the master are the same machine.
I'm not sure what you're looking for in my states? This is my complete orchestration.sls:
Do note that the states work fine when no timeout conditions are being reached.
-
Hey @bernieke - sorry we never came back around here. Is this still an issue on the latest release (2014.7.5) for you? Have you had a chance to try this again?
-
The issue is still there on 2014.7.5. I've just verified it by setting the timeout to 1 second and executing a salt-run against 15 nodes. One of the states run by salt-run (the restart state from the above orchestration) only went through on 5 of the 15 nodes, but salt-run happily continued with the next state (populate_frontend), without even so much as a warning printed. (In the output you see the same as you would normally, except that the "States ran successfully" comment part of the output only lists 5 minion names among the updated nodes instead of the 15 you would otherwise see.)
-
@bernieke could you post which property you are setting in the master config to fix this? Is it 'timeout'? We have not set that ourselves.
-
Yes, I set timeout to 300 to avoid this problem as much as possible. It still occurs even then, but only on rare occasions when the host is under extremely heavy load. We're currently modifying our deploy process to use salt-run over salt-ssh instead (by adding ssh: True to all blocks in the orchestration sls, and working with the roster instead of daemons and keys). Apart from simplifying our orchestration process a lot (I no longer need to separately upgrade minions and masters beforehand, check that the minions are all up, and sync modules before I can start the actual orchestration), I'm hoping this will finally remove this problem completely as well.
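For reference, a minimal sketch of what routing an orchestration step over salt-ssh can look like; the target and state names here are placeholders, not taken from the poster's setup:

```yaml
# orchestration.sls  (sketch with hypothetical target/state names)
deploy-frontend:
  salt.state:
    - tgt: 'frontend*'   # matched against entries in /etc/salt/roster
    - ssh: True          # run this step over salt-ssh, no minion daemon needed
    - sls:
      - frontend.deploy
```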
-
Our team has been having a similar issue when trying to run orchestration states. Here's what we've discovered: we have a number of scheduled states set up, and while the schedule is enabled we see this issue regularly, especially on longer orchestration states. If we disable the schedule, the orchestrations run without the problem. While this isn't ideal behavior, it's the work-around we are using until this bug is resolved.
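A sketch of that kind of work-around using the schedule execution module; the commenter didn't show the exact commands, so treat the orchestration name below as a placeholder:

```sh
# Pause all scheduled jobs on the minions before orchestrating (sketch)
salt '*' schedule.disable

# Run the orchestration while the schedule is quiet (hypothetical sls name)
salt-run state.orchestrate orch.deploy

# Re-enable the schedule afterwards
salt '*' schedule.enable
```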
-
If the issue is related to a timeout on 'saltutil.find_job', it can be worked around by increasing 'gather_job_timeout' in the master config. We saw a similar issue in a virtual environment, where traversing the salt minion cache hangs in 'find_job'. However, it also looks like a defect in the client/saltutil/orchestration code, because Salt considers the state.sls execution to have succeeded when it actually failed due to a timeout on 'find_job'. I can reproduce the issue with a simple orchestration state:

```yaml
# reproduce.sls
do-sleep-1:
  salt.state:
    - tgt: 'minion01'
    - sls:
      - sleep_120

do-sleep-2:
  salt.state:
    - tgt: 'minion01'
    - sls:
      - sleep_120
    - require:
      - salt: do-sleep-1
```

```yaml
# sleep_120.sls
do sleep 120:
  cmd.run:
    - name: sleep 120s
```

I also hacked 'find_job' in the saltutil.py module to emulate slow execution:

```diff
diff --git a/salt/modules/saltutil.py b/salt/modules/saltutil.py
index c0f0f2b..4ba4aa9 100644
--- a/salt/modules/saltutil.py
+++ b/salt/modules/saltutil.py
@@ -814,6 +814,9 @@ def find_job(jid):
     '''
     for data in running():
         if data['jid'] == jid:
+            if data['fun'] == 'state.sls':
+                import time
+                time.sleep(30)
             return data
     return {}
@@ -1004,6 +1007,8 @@ def _exec(client, tgt, fun, arg, timeout, expr_form, ret, kwarg, **kwargs):
             # do not wait for timeout when explicit list matching
             # and all results are there
             break
+
+    log.error("_exec(tgt=%s, fun=%s, arg=%s, timeout=%s) ret: %s", tgt, fun, arg, timeout, fcn_ret)
     return fcn_ret
```

Orchestration output:
As you can see, 'saltutil._exec' returns an empty (or incomplete) dictionary in case of a 'find_job' timeout. I don't know whether this is a bug in the client code or whether the orchestration handles such a result incorrectly, but it definitely does not look like expected behavior. Perhaps some of you can help me prepare a patch to fix this?
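For completeness, the work-around mentioned at the top of this comment is a master config change along these lines; the value is an arbitrary example, not a recommendation:

```yaml
# /etc/salt/master  (sketch)
# Seconds to wait for minions to answer the saltutil.find_job polls;
# slow disks or overloaded VMs can miss a low default window.
gather_job_timeout: 30
```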
-
I was having a related issue. Looks like it was fixed by increasing one of the timeout settings discussed above.
-
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
-
Reopening this case, as it can still be a problem. One thing that seems to be needed here is orchestration not assuming the state ran successfully if saltutil.find_job times out.
-
Thank you for updating this issue. It is no longer marked as stale.
-
Hello, we experience the exact same problem, which heavily impacts our processes and future plans. What I see is that job1 is still running on the minion while the master already sends the next job to the targets.
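One way to see that overlap from the master side (a sketch, not commands the commenter posted) is to compare what the master's job cache thinks is still active with what the minions report:

```sh
# Jobs the master's job cache still considers active, with jids and targets
salt-run jobs.active

# Ask the minions directly which jobs they are still executing
salt '*' saltutil.running
```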
-
I've begun encountering this issue as well. It looks like the issue is due to orchestration looking for the job before the minion has had the chance to kick off the job. Here is the jid of orchestration kicking off yum.update_all:
Here is the jid of orchestration looking for the status of the yum.update_all (20210215195352093998):
However, if I look up the jid of the yum.update_all:
The jid on the minion - Started: 12:53:58.873927 (-7 UTC). Orchestration began looking for the jid via saltutil.find_job before the minion had started it. Maybe this is as simple as changing saltutil.find_job to jobs.lookup_jid in the orchestration and adding some checking of the result?
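A sketch of the difference being suggested, as commands you can run by hand; 'minion01' is a placeholder minion name and the jid is the one quoted above, used purely for illustration:

```sh
# What orchestration effectively does today: ask the minion whether it knows
# about the jid. If the minion hasn't registered the job yet, this comes back
# empty and orchestration treats the step as finished.
salt 'minion01' saltutil.find_job 20210215195352093998

# What the comment proposes instead: query the master's job cache, which also
# shows whether an actual return has arrived for that jid.
salt-run jobs.lookup_jid 20210215195352093998
```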
-
To start off with, this problem seems somehow related to disk speed. We can reliably reproduce the problem on SATA-backed virtual machines, but not on SSD-backed ones. (Two different OpenStack flavors on otherwise identical hardware.)
Probably because it takes longer to execute the task on a sata disk. (There's a lot of deb installing going on, so this would make sense.)
This is the relevant part of the output:
The error doesn't always occur in that part of the orchestration; sometimes it happens in a later task, but always right after a task which takes a long time to run (several minutes).
When I check right after the error, I can see that the referenced job is in fact still ongoing. Issuing
salt 'awingu' saltutil.running
confirms this. The job will, in time, do its thing and finish properly.
This job runs the "common" task preceding the dns_server task (which the dns_server task requires!).
So it's pretty clear that the orchestration deems the "common" task finished, while in truth it has not.
It looks like a salt execution sometimes returns prematurely, before the task has actually finished (I've noted this on one or two occasions), but apart from that being a bug, I would expect the salt runner to be smarter and check the job queue?
For us this currently is a major issue, so any recommendations on how to handle this would be extremely welcome. Even a dirty monkeypatch to apply on top of 2014.7.0 would mean the world to us!
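Not a fix, but one crude stop-gap in the spirit of the "dirty monkeypatch" request: before letting the next step run, poll saltutil.running and only continue once the minion reports nothing in flight. The minion name and sleep interval below are placeholders:

```sh
#!/bin/sh
# Crude guard (sketch): block until the 'awingu' minion reports no running jobs.
# saltutil.running returns an empty list when nothing is executing, so the
# output contains no 'jid' field once the minion is idle.
while salt --out=txt 'awingu' saltutil.running | grep -q "jid"; do
    echo "minion still busy, waiting..."
    sleep 10
done
echo "no jobs running, safe to continue"
```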