-
Notifications
You must be signed in to change notification settings - Fork 966
fix: Update retry logic for certain ansible tasks. #6977
Conversation
dfdb410
to
7b62b20
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure if there's a consistent syntax since you're just checking whatever the job returned. Not being able to test is super annoying :/. Would SRE know?
I see. From ansible docs:
|
@@ -30,7 +30,9 @@ | |||
- name: "Take security updates during ansible runs" | |||
command: "{{ item }}" | |||
when: SECURITY_UPGRADE_ON_ANSIBLE | |||
register: result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on the ansible docs, I wonder if result
is the default, and you don't actually need to register it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it hurts to keep it around.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we agreed that @timmc-edx will do some local testing and help decide the best approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you're looking at the example at the end of the Conditions based on registered variables section, the result
variable is explicitly registered by the first task and then used by the four following tasks (such that some subset of them will run).
There are a few things to consider here:
- Can variables be overwritten?
- Are there risks in re-using a name?
- Are there risks in using a short name in a global scope?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overwriting a variable
Here's a test that overwrites the result
variable and then prints out the stdout
field of the result
:
- shell: "echo first"
register: result
- shell: "echo second"
register: result
- debug:
msg: "Done, with result stdout = {{ result.stdout }}"
Output:
TASK [timmc : shell] *********************************************************************************************************************************************************************
changed: [edx-timmc]
TASK [timmc : shell] *********************************************************************************************************************************************************************
changed: [edx-timmc]
TASK [timmc : debug] *********************************************************************************************************************************************************************
ok: [edx-timmc] => {
"msg": "Done, with result stdout = second"
}
So yeah, we can overwrite it within a task. And we could use "result", but it would just be another variable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Risks in re-using a name
Are there risks in re-using that name across multiple tasks, including in multiple roles?
The main potential issue I see is that variable names are effectively global across the entire run. (Technically they are only set within the "play" scope, but they're global variables in the generic sense.) So if you set it in one role, you can read it in another role.
The danger I see here is that we would have to be absolutely sure to register it each time we want to use it -- otherwise we'd be retrying based on some previous task's outcome, which would of course already be a success. That could have weird results, like the later task failing but the failure being ignored because the until
condition is returning true.
Simple test
Here's a simple test, with a second task that always fails, and that does register result
properly:
- name: "First task"
shell: "echo 'runs fine'"
register: result
- name: "Second task"
shell: "exit 1"
register: result
retries: 3
until: result is success
Output:
TASK [timmc : First task] ****************************************************************************************************************************************************************
changed: [edx-timmc]
TASK [timmc : Second task] ***************************************************************************************************************************************************************
FAILED - RETRYING: Second task (3 retries left).
FAILED - RETRYING: Second task (2 retries left).
FAILED - RETRYING: Second task (1 retries left).
fatal: [edx-timmc]: FAILED! => {"attempts": 3, "changed": true, "cmd": "exit 1", "delta": "0:00:00.003823", "end": "2023-08-04 00:33:21.133466", "msg": "non-zero return code", "rc": 1, "start": "2023-08-04 00:33:21.129643", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
That's as expected. What if I drop the second register: result
, though?
TASK [timmc : First task] ****************************************************************************************************************************************************************
changed: [edx-timmc]
TASK [timmc : Second task] ***************************************************************************************************************************************************************
fatal: [edx-timmc]: FAILED! => {"attempts": 1, "changed": true, "cmd": "exit 1", "delta": "0:00:00.004023", "end": "2023-08-04 00:32:16.587208", "msg": "non-zero return code", "rc": 1, "start": "2023-08-04 00:32:16.583185", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}
Well, I'm surprised -- it seems that if you don't register a variable, then retrying isn't enabled! But I'm not sure I want to rely on this.
More complicated test
A more concerning scenario would be where the second task fails on its first attempt, but passes on its second one. Here's what I came up with for testing it:
- name: "First task"
shell: "rm -rf /tmp/counter /tmp/debug.log"
register: result
- name: "Second task"
shell:
cmd: |
if [[ ! -f /tmp/counter ]]; then
echo "Simulating initial failure" >> /tmp/debug.log
touch /tmp/counter
exit 1
else
echo "Simulating subsequent success" >> /tmp/debug.log
exit 0
fi
executable: /bin/bash
register: result
retries: 3
until: result is success
It behaves as expected -- the second task fails, but then succeeds on retry. However:
- If I use
register: different
in the second task, retries don't happen. So I think Ansible won't do retries unless all of the following are true:until
is presentregister
is present- The registered variable is _used) in the
until
clause
- Out of curiosity, I also tried to use
{{ result }}
inside the second task's shell script. If it was registered under a different name in the second task, I got an error saying the variable was unset. If I instead register the same variable, I get the output of the previous task!
It seems like Ansible has some guardrails here against variable collisions specifically for the retries case, which probably means someone else shot themselves in the foot with this in the past. I'm not entirely sure if we want to rely on that behavior, but it would be a second factor of protection if we omit a register
by mistake or make a typo.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(The "registered a variable. but under the wrong name" scenario works like this: You forget to do register: result
and retries are therefore quietly disabled and then later someone adds a register: my_task_with_specific_result_name
clause because they actually care about using the stdout or something in a later task. But it seems like Ansible is protecting against that as well.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Risks in using a short name in a global scope
The last concern is just about whether someone else is setting a variable named result
somewhere and expecting it not to get stomped on.
Honestly, I don't think this one is a huge concern; result
would be a pretty dodgy and unlikely name to use as a play variable. So at least I would hope no one is trying to use it non-locally. And anyone using it locally (one task sets it, and then another task immediately reads it) shouldn't be affected by this.
When we added retry logic, we missed that we also needed to register a variable and add an until condition to ensure that retries happen. This should ensure that this matches. edx/edx-arch-experiments#359
7b62b20
to
bee7eaa
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That works too!
When we added retry logic, we missed that we also needed to register a variable and add an until condition to ensure that retries happen. This should ensure that this matches.
edx/edx-arch-experiments#359
Ansible documentation: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_loops.html#retrying-a-task-until-a-condition-is-met
Configuration Pull Request
Make sure that the following steps are done before merging: