fix: Update retry logic for certain ansible tasks. #6977

dianakhuang · 2023-08-03T15:52:09Z

When we added retry logic, we missed that we also needed to register a variable and add an until condition to ensure that retries happen. This should ensure that this matches.

edx/edx-arch-experiments#359

Ansible documentation: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_loops.html#retrying-a-task-until-a-condition-is-met

Configuration Pull Request

Make sure that the following steps are done before merging:

A SRE team member has approved the PR if it is code shared across multiple services and you don't own all of the services.
If you are making a complicated change, have you performed the proper testing specified on the Ops Ansible Testing Checklist? Adding a new variable does not require the full list (although testing on a sandbox is a great idea to ensure it links with your downstream code changes).
Think about how this change will affect Open edX operators. Have you updated the wiki page for the next Open edX release?

playbooks/roles/insightvm_agent/tasks/main.yml

rgraber

I'm not sure if there's a consistent syntax since you're just checking whatever the job returned. Not being able to test is super annoying :/. Would SRE know?

robrap · 2023-08-03T17:57:08Z

I see. From ansible docs:

Older versions of Ansible used success and fail, but succeeded and failed use the correct tense. All of these options are now valid.

robrap · 2023-08-03T18:00:01Z

playbooks/roles/security/tasks/security-amazon.yml

@@ -30,7 +30,9 @@
 - name: "Take security updates during ansible runs"
  command: "{{ item }}"
  when: SECURITY_UPGRADE_ON_ANSIBLE
+  register: result


Based on the ansible docs, I wonder if result is the default, and you don't actually need to register it?

I don't think it hurts to keep it around.

I think we agreed that @timmc-edx will do some local testing and help decide the best approach.

If you're looking at the example at the end of the Conditions based on registered variables section, the result variable is explicitly registered by the first task and then used by the four following tasks (such that some subset of them will run).

There are a few things to consider here:

Can variables be overwritten?

Are there risks in re-using a name?

Are there risks in using a short name in a global scope?

Overwriting a variable

Here's a test that overwrites the result variable and then prints out the stdout field of the result:

- shell: "echo first" register: result - shell: "echo second" register: result - debug: msg: "Done, with result stdout = {{ result.stdout }}"

Output:

TASK [timmc : shell] ********************************************************************************************************************************************************************* changed: [edx-timmc] TASK [timmc : shell] ********************************************************************************************************************************************************************* changed: [edx-timmc] TASK [timmc : debug] ********************************************************************************************************************************************************************* ok: [edx-timmc] => { "msg": "Done, with result stdout = second" }

So yeah, we can overwrite it within a task. And we could use "result", but it would just be another variable.

Risks in re-using a name

Are there risks in re-using that name across multiple tasks, including in multiple roles?

The main potential issue I see is that variable names are effectively global across the entire run. (Technically they are only set within the "play" scope, but they're global variables in the generic sense.) So if you set it in one role, you can read it in another role.

The danger I see here is that we would have to be absolutely sure to register it each time we want to use it -- otherwise we'd be retrying based on some previous task's outcome, which would of course already be a success. That could have weird results, like the later task failing but the failure being ignored because the until condition is returning true.

Simple test

Here's a simple test, with a second task that always fails, and that does register result properly:

- name: "First task" shell: "echo 'runs fine'" register: result - name: "Second task" shell: "exit 1" register: result retries: 3 until: result is success

Output:

TASK [timmc : First task] **************************************************************************************************************************************************************** changed: [edx-timmc] TASK [timmc : Second task] *************************************************************************************************************************************************************** FAILED - RETRYING: Second task (3 retries left). FAILED - RETRYING: Second task (2 retries left). FAILED - RETRYING: Second task (1 retries left). fatal: [edx-timmc]: FAILED! => {"attempts": 3, "changed": true, "cmd": "exit 1", "delta": "0:00:00.003823", "end": "2023-08-04 00:33:21.133466", "msg": "non-zero return code", "rc": 1, "start": "2023-08-04 00:33:21.129643", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

That's as expected. What if I drop the second register: result, though?

TASK [timmc : First task] **************************************************************************************************************************************************************** changed: [edx-timmc] TASK [timmc : Second task] *************************************************************************************************************************************************************** fatal: [edx-timmc]: FAILED! => {"attempts": 1, "changed": true, "cmd": "exit 1", "delta": "0:00:00.004023", "end": "2023-08-04 00:32:16.587208", "msg": "non-zero return code", "rc": 1, "start": "2023-08-04 00:32:16.583185", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

Well, I'm surprised -- it seems that if you don't register a variable, then retrying isn't enabled! But I'm not sure I want to rely on this.

More complicated test

A more concerning scenario would be where the second task fails on its first attempt, but passes on its second one. Here's what I came up with for testing it:

- name: "First task" shell: "rm -rf /tmp/counter /tmp/debug.log" register: result - name: "Second task" shell: cmd: | if [[ ! -f /tmp/counter ]]; then echo "Simulating initial failure" >> /tmp/debug.log touch /tmp/counter exit 1 else echo "Simulating subsequent success" >> /tmp/debug.log exit 0 fi executable: /bin/bash register: result retries: 3 until: result is success

It behaves as expected -- the second task fails, but then succeeds on retry. However:

If I use register: different in the second task, retries don't happen. So I think Ansible won't do retries unless all of the following are true:

until is present

register is present

The registered variable is _used) in the until clause

Out of curiosity, I also tried to use {{ result }} inside the second task's shell script. If it was registered under a different name in the second task, I got an error saying the variable was unset. If I instead register the same variable, I get the output of the previous task!

It seems like Ansible has some guardrails here against variable collisions specifically for the retries case, which probably means someone else shot themselves in the foot with this in the past. I'm not entirely sure if we want to rely on that behavior, but it would be a second factor of protection if we omit a register by mistake or make a typo.

(The "registered a variable. but under the wrong name" scenario works like this: You forget to do register: result and retries are therefore quietly disabled and then later someone adds a register: my_task_with_specific_result_name clause because they actually care about using the stdout or something in a later task. But it seems like Ansible is protecting against that as well.)

Risks in using a short name in a global scope

The last concern is just about whether someone else is setting a variable named result somewhere and expecting it not to get stomped on.

Honestly, I don't think this one is a huge concern; result would be a pretty dodgy and unlikely name to use as a play variable. So at least I would hope no one is trying to use it non-locally. And anyone using it locally (one task sets it, and then another task immediately reads it) shouldn't be affected by this.

When we added retry logic, we missed that we also needed to register a variable and add an until condition to ensure that retries happen. This should ensure that this matches. edx/edx-arch-experiments#359

timmc-edx

That works too!

robrap reviewed Aug 3, 2023

View reviewed changes

playbooks/roles/insightvm_agent/tasks/main.yml Outdated Show resolved Hide resolved

dianakhuang force-pushed the diana/update-retry-logic branch from dfdb410 to 7b62b20 Compare August 3, 2023 16:27

rgraber reviewed Aug 3, 2023

View reviewed changes

robrap reviewed Aug 3, 2023

View reviewed changes

fix: Update retry logic for certain ansible tasks.

bee7eaa

When we added retry logic, we missed that we also needed to register a variable and add an until condition to ensure that retries happen. This should ensure that this matches. edx/edx-arch-experiments#359

dianakhuang force-pushed the diana/update-retry-logic branch from 7b62b20 to bee7eaa Compare August 4, 2023 13:40

timmc-edx approved these changes Aug 4, 2023

View reviewed changes

dianakhuang merged commit 1f6b89f into master Aug 4, 2023

dianakhuang deleted the diana/update-retry-logic branch August 4, 2023 14:10

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: Update retry logic for certain ansible tasks. #6977

fix: Update retry logic for certain ansible tasks. #6977

dianakhuang commented Aug 3, 2023 •

edited

Loading

rgraber left a comment

robrap commented Aug 3, 2023

robrap Aug 3, 2023

dianakhuang Aug 3, 2023

robrap Aug 3, 2023

timmc-edx Aug 4, 2023

timmc-edx Aug 4, 2023 •

edited

Loading

timmc-edx Aug 4, 2023

timmc-edx Aug 4, 2023

timmc-edx Aug 4, 2023

timmc-edx left a comment

fix: Update retry logic for certain ansible tasks. #6977

fix: Update retry logic for certain ansible tasks. #6977

Conversation

dianakhuang commented Aug 3, 2023 • edited Loading

Configuration Pull Request

rgraber left a comment

Choose a reason for hiding this comment

robrap commented Aug 3, 2023

robrap Aug 3, 2023

Choose a reason for hiding this comment

dianakhuang Aug 3, 2023

Choose a reason for hiding this comment

robrap Aug 3, 2023

Choose a reason for hiding this comment

timmc-edx Aug 4, 2023

Choose a reason for hiding this comment

timmc-edx Aug 4, 2023 • edited Loading

Choose a reason for hiding this comment

Overwriting a variable

timmc-edx Aug 4, 2023

Choose a reason for hiding this comment

Risks in re-using a name

Simple test

More complicated test

timmc-edx Aug 4, 2023

Choose a reason for hiding this comment

timmc-edx Aug 4, 2023

Choose a reason for hiding this comment

Risks in using a short name in a global scope

timmc-edx left a comment

Choose a reason for hiding this comment

dianakhuang commented Aug 3, 2023 •

edited

Loading

timmc-edx Aug 4, 2023 •

edited

Loading