Replies: 19 comments
-
One more factor which could be important is that our virtual network has been known to glitch occasionally, visible to us as SSH sessions that hang for a few seconds. In fact, one of the failures I noted happened exactly during one of those glitches! I remember turning around to ask my colleague if he had done something to the VM because my SSH sessions had frozen, and when I turned back they had recovered and the error was being displayed.
-
I just built the RAET dependencies and tried using the RAET transport, but got the exact same problem. So it's not transport-related, at least.
-
Increasing the salt timeout in the master config fixes the problem for us. It seems like occasionally our VM, disk, or network hangs long enough for some communication to time out. Is there some kind of polling mechanism during tasks which could cause this? (It only seems to happen on either of our two long-running / intensive tasks.) And is this the expected behavior? If so, salt-run should probably fail in a way that makes it clearer what went wrong, instead of simply continuing with the next task. (The same goes for regular salt execution, I guess, where it now just returns; it should probably raise an error instead.) I'll leave the ticket open since I'm not sure if and how you will want to address this, but feel free to close it if you believe the behavior is as intended.
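For anyone looking for the concrete knob: the comment above refers to the master's global timeout setting. A minimal sketch of such a change, assuming the default config path and an example value of 300 seconds (the actual value should be tuned to your slowest state):

```yaml
# /etc/salt/master  (sketch, not the poster's actual config)
# Seconds salt/salt-run waits for minion returns before giving up on a job;
# raise this if long-running states are being cut off mid-run.
timeout: 300
```

The master needs a restart for the change to take effect.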
-
What appears in /var/log/salt/master:
It seems like one of the find_job tasks would have timed out, but I can't really see any evidence of it in the log?
-
@bernieke can you post a sanitized version of your orchestrate.sls as well as the output of:
and
-
The minion and the master are the same machine.
I'm not sure what you're looking for in my states? This is my complete orchestration.sls:
Do note that the states work fine when no timeout conditions are being reached.
-
Hey @bernieke - sorry we never came back around here. Is this still an issue on the latest release (2014.7.5) for you? Have you had a chance to try this again?
-
The issue is still there on 2014.7.5. I've just verified it by setting the timeout to 1 second and executing a salt-run against 15 nodes. One of the states run by salt-run (the restart state from the above orchestration) only went through on 5 of the 15 nodes, but salt-run happily continued with the next state (populate_frontend), without even so much as a warning printed. (In the output you see the same as you would normally, except that the "States ran successfully" comment part of the output only lists 5 minion names among the updated nodes instead of the 15 you would otherwise see.)
-
@bernieke could you post which property you are setting in the master config to fix this? Is it 'timeout'? We have not set that ourselves.
-
Yes, I set timeout to 300 to avoid this problem as much as possible. It still occurs even then, but only on rare occasions when the host is under extremely heavy load. We're currently modifying our deploy process to use salt-run over salt-ssh instead (by adding ssh: True to all blocks in the orchestration sls, and working with the roster instead of daemons and keys). Apart from simplifying our orchestration process a lot (I no longer need to separately upgrade minions and masters beforehand, check that the minions are all up, and sync modules before I can start the actual orchestration), I'm hoping this will finally remove this problem completely as well.
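For reference, a minimal sketch of what routing an orchestration step over salt-ssh can look like; the target and state names here are placeholders, not taken from the poster's setup:

```yaml
# orchestration.sls  (sketch with hypothetical target/state names)
deploy-frontend:
  salt.state:
    - tgt: 'frontend*'   # matched against entries in /etc/salt/roster
    - ssh: True          # run this step over salt-ssh, no minion daemon needed
    - sls:
      - frontend.deploy
```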
-
Our team has been having a similar issue when trying to run orchestration states. Here's what we've discovered: we have a number of scheduled states set up, and while the schedule is enabled we see this issue regularly, especially on longer orchestration states. If we disable the schedule, the orchestrations run without the problem. While this isn't ideal behavior, it's the work-around we are using until this bug is resolved.
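A sketch of that kind of work-around using the schedule execution module; the commenter didn't show the exact commands, so treat the orchestration name below as a placeholder:

```sh
# Pause all scheduled jobs on the minions before orchestrating (sketch)
salt '*' schedule.disable

# Run the orchestration while the schedule is quiet (hypothetical sls name)
salt-run state.orchestrate orch.deploy

# Re-enable the schedule afterwards
salt '*' schedule.enable
```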
-
If the issue is related to a timeout on 'saltutil.find_job', it can be worked around by increasing 'gather_job_timeout' in the master config. We saw a similar issue in a virtual environment, where traversing the salt minion cache hangs in 'find_job'. However, it also looks like a defect in the client/saltutil/orchestration code, because Salt considers the state.sls execution to have succeeded when it actually failed due to a timeout on 'find_job'. I can reproduce the issue with a simple orchestration state:

```yaml
# reproduce.sls
do-sleep-1:
  salt.state:
    - tgt: 'minion01'
    - sls:
      - sleep_120

do-sleep-2:
  salt.state:
    - tgt: 'minion01'
    - sls:
      - sleep_120
    - require:
      - salt: do-sleep-1
```

```yaml
# sleep_120.sls
do sleep 120:
  cmd.run:
    - name: sleep 120s
```

I also hacked 'find_job' in the saltutil.py module to emulate slow execution:

```diff
diff --git a/salt/modules/saltutil.py b/salt/modules/saltutil.py
index c0f0f2b..4ba4aa9 100644
--- a/salt/modules/saltutil.py
+++ b/salt/modules/saltutil.py
@@ -814,6 +814,9 @@ def find_job(jid):
     '''
     for data in running():
         if data['jid'] == jid:
+            if data['fun'] == 'state.sls':
+                import time
+                time.sleep(30)
             return data
     return {}
@@ -1004,6 +1007,8 @@ def _exec(client, tgt, fun, arg, timeout, expr_form, ret, kwarg, **kwargs):
             # do not wait for timeout when explicit list matching
             # and all results are there
             break
+
+    log.error("_exec(tgt=%s, fun=%s, arg=%s, timeout=%s) ret: %s", tgt, fun, arg, timeout, fcn_ret)
     return fcn_ret
```

Orchestration output:
As you can see, 'saltutil._exec' returns an empty (or incomplete) dictionary in case of a 'find_job' timeout. I don't know whether this is a bug in the client code or whether the orchestration handles such a result incorrectly, but it definitely does not look like expected behavior. Perhaps some of you can help me prepare a patch to fix this?
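For completeness, the work-around mentioned at the top of this comment is a master config change along these lines; the value is an arbitrary example, not a recommendation:

```yaml
# /etc/salt/master  (sketch)
# Seconds to wait for minions to answer the saltutil.find_job polls;
# slow disks or overloaded VMs can miss a low default window.
gather_job_timeout: 30
```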
-
I was having a related issue. Looks like it was fixed by increasing one of the timeout settings discussed above.
-
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. If this issue is closed prematurely, please leave a comment and we will gladly reopen the issue.
-
Reopening this case, as it can still be a problem. One thing that seems to be needed here is orchestration not assuming the state ran successfully if saltutil.find_job times out.
-
Thank you for updating this issue. It is no longer marked as stale.
-
Hello, we experience the exact same problem, which heavily impacts our processes and future plans. What I see is that job1 is still running on the minion while the master already sends the next job to the targets.
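One way to see that overlap from the master side (a sketch, not commands the commenter posted) is to compare what the master's job cache thinks is still active with what the minions report:

```sh
# Jobs the master's job cache still considers active, with jids and targets
salt-run jobs.active

# Ask the minions directly which jobs they are still executing
salt '*' saltutil.running
```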
-
I've begun encountering this issue as well. It looks like the issue is due to orchestration looking for the job before the minion has had the chance to kick off the job. Here is the jid of orchestration kicking off yum.update_all:
Here is the jid of orchestration looking for the status of the yum.update_all (20210215195352093998):
However, if I look up the jid of the yum.update_all:
The jid on the minion - Started: 12:53:58.873927 (-7 UTC). Orchestration began looking for the jid via saltutil.find_job before the minion had started it. Maybe this is as simple as changing saltutil.find_job to jobs.lookup_jid in the orchestration and adding some checking of the result?
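A sketch of the difference being suggested, as commands you can run by hand; 'minion01' is a placeholder minion name and the jid is the one quoted above, used purely for illustration:

```sh
# What orchestration effectively does today: ask the minion whether it knows
# about the jid. If the minion hasn't registered the job yet, this comes back
# empty and orchestration treats the step as finished.
salt 'minion01' saltutil.find_job 20210215195352093998

# What the comment proposes instead: query the master's job cache, which also
# shows whether an actual return has arrived for that jid.
salt-run jobs.lookup_jid 20210215195352093998
```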
-
To start off with, this problem seems somehow related to disk speed. We can reliably reproduce the problem on SATA-backed virtual machines, but not on SSD-backed ones. (Two different OpenStack flavors on otherwise identical hardware.)
Probably because it takes longer to execute the task on a sata disk. (There's a lot of deb installing going on, so this would make sense.)
This is the relevant part of the output:
The error doesn't always occur in that part of the orchestration; sometimes it happens in a later task, but always right after a task which takes a long time to run (several minutes).
When I check right after the error, I can see that the referenced job is in fact still ongoing. Issuing
salt 'awingu' saltutil.running
confirms this. The job will, in time, do its thing and finish properly.
This job runs the "common" task preceding the dns_server task (which the dns_server task requires!).
So it's pretty clear that the orchestration deems the "common" task finished, while in truth it has not.
It looks like a salt execution sometimes returns prematurely, before the task has actually finished (I've noted this on one or two occasions), but apart from that being a bug, I would expect the salt runner to be smarter and check the job queue?
For us this currently is a major issue, so any recommendations on how to handle this would be extremely welcome. Even a dirty monkeypatch to apply on top of 2014.7.0 would mean the world to us!
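Not a fix, but one crude stop-gap in the spirit of the "dirty monkeypatch" request: before letting the next step run, poll saltutil.running and only continue once the minion reports nothing in flight. The minion name and sleep interval below are placeholders:

```sh
#!/bin/sh
# Crude guard (sketch): block until the 'awingu' minion reports no running jobs.
# saltutil.running returns an empty list when nothing is executing, so the
# output contains no 'jid' field once the minion is idle.
while salt --out=txt 'awingu' saltutil.running | grep -q "jid"; do
    echo "minion still busy, waiting..."
    sleep 10
done
echo "no jobs running, safe to continue"
```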