Skip to content
This repository has been archived by the owner on May 6, 2024. It is now read-only.

fix: Update retry logic for certain ansible tasks. #6977

Merged
merged 1 commit into from
Aug 4, 2023

Conversation

dianakhuang
Copy link
Contributor

@dianakhuang dianakhuang commented Aug 3, 2023

When we added retry logic, we missed that we also needed to register a variable and add an until condition to ensure that retries happen. This should ensure that this matches.

edx/edx-arch-experiments#359

Ansible documentation: https://docs.ansible.com/ansible/latest/playbook_guide/playbooks_loops.html#retrying-a-task-until-a-condition-is-met

Configuration Pull Request

Make sure that the following steps are done before merging:

  • A SRE team member has approved the PR if it is code shared across multiple services and you don't own all of the services.
  • If you are making a complicated change, have you performed the proper testing specified on the Ops Ansible Testing Checklist? Adding a new variable does not require the full list (although testing on a sandbox is a great idea to ensure it links with your downstream code changes).
  • Think about how this change will affect Open edX operators. Have you updated the wiki page for the next Open edX release?

@dianakhuang dianakhuang force-pushed the diana/update-retry-logic branch from dfdb410 to 7b62b20 Compare August 3, 2023 16:27
Copy link
Contributor

@rgraber rgraber left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure if there's a consistent syntax since you're just checking whatever the job returned. Not being able to test is super annoying :/. Would SRE know?

@robrap
Copy link
Contributor

robrap commented Aug 3, 2023

I see. From ansible docs:

Older versions of Ansible used success and fail, but succeeded and failed use the correct tense. All of these options are now valid.

@@ -30,7 +30,9 @@
- name: "Take security updates during ansible runs"
command: "{{ item }}"
when: SECURITY_UPGRADE_ON_ANSIBLE
register: result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Based on the ansible docs, I wonder if result is the default, and you don't actually need to register it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it hurts to keep it around.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we agreed that @timmc-edx will do some local testing and help decide the best approach.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you're looking at the example at the end of the Conditions based on registered variables section, the result variable is explicitly registered by the first task and then used by the four following tasks (such that some subset of them will run).

There are a few things to consider here:

  • Can variables be overwritten?
  • Are there risks in re-using a name?
  • Are there risks in using a short name in a global scope?

Copy link
Contributor

@timmc-edx timmc-edx Aug 4, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overwriting a variable

Here's a test that overwrites the result variable and then prints out the stdout field of the result:

- shell: "echo first"
  register: result

- shell: "echo second"
  register: result

- debug:
    msg: "Done, with result stdout = {{ result.stdout }}"

Output:

TASK [timmc : shell] *********************************************************************************************************************************************************************
changed: [edx-timmc]

TASK [timmc : shell] *********************************************************************************************************************************************************************
changed: [edx-timmc]

TASK [timmc : debug] *********************************************************************************************************************************************************************
ok: [edx-timmc] => {
    "msg": "Done, with result stdout = second"
}

So yeah, we can overwrite it within a task. And we could use "result", but it would just be another variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risks in re-using a name

Are there risks in re-using that name across multiple tasks, including in multiple roles?

The main potential issue I see is that variable names are effectively global across the entire run. (Technically they are only set within the "play" scope, but they're global variables in the generic sense.) So if you set it in one role, you can read it in another role.

The danger I see here is that we would have to be absolutely sure to register it each time we want to use it -- otherwise we'd be retrying based on some previous task's outcome, which would of course already be a success. That could have weird results, like the later task failing but the failure being ignored because the until condition is returning true.

Simple test

Here's a simple test, with a second task that always fails, and that does register result properly:

- name: "First task"
  shell: "echo 'runs fine'"
  register: result

- name: "Second task"
  shell: "exit 1"
  register: result
  retries: 3
  until: result is success

Output:

TASK [timmc : First task] ****************************************************************************************************************************************************************
changed: [edx-timmc]

TASK [timmc : Second task] ***************************************************************************************************************************************************************
FAILED - RETRYING: Second task (3 retries left).
FAILED - RETRYING: Second task (2 retries left).
FAILED - RETRYING: Second task (1 retries left).
fatal: [edx-timmc]: FAILED! => {"attempts": 3, "changed": true, "cmd": "exit 1", "delta": "0:00:00.003823", "end": "2023-08-04 00:33:21.133466", "msg": "non-zero return code", "rc": 1, "start": "2023-08-04 00:33:21.129643", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

That's as expected. What if I drop the second register: result, though?

TASK [timmc : First task] ****************************************************************************************************************************************************************
changed: [edx-timmc]

TASK [timmc : Second task] ***************************************************************************************************************************************************************
fatal: [edx-timmc]: FAILED! => {"attempts": 1, "changed": true, "cmd": "exit 1", "delta": "0:00:00.004023", "end": "2023-08-04 00:32:16.587208", "msg": "non-zero return code", "rc": 1, "start": "2023-08-04 00:32:16.583185", "stderr": "", "stderr_lines": [], "stdout": "", "stdout_lines": []}

Well, I'm surprised -- it seems that if you don't register a variable, then retrying isn't enabled! But I'm not sure I want to rely on this.

More complicated test

A more concerning scenario would be where the second task fails on its first attempt, but passes on its second one. Here's what I came up with for testing it:

- name: "First task"
  shell: "rm -rf /tmp/counter /tmp/debug.log"
  register: result

- name: "Second task"
  shell:
    cmd: |
      if [[ ! -f /tmp/counter ]]; then
        echo "Simulating initial failure" >> /tmp/debug.log
        touch /tmp/counter
        exit 1
      else
        echo "Simulating subsequent success" >> /tmp/debug.log
        exit 0
      fi
    executable: /bin/bash
  register: result
  retries: 3
  until: result is success

It behaves as expected -- the second task fails, but then succeeds on retry. However:

  • If I use register: different in the second task, retries don't happen. So I think Ansible won't do retries unless all of the following are true:
    • until is present
    • register is present
    • The registered variable is _used) in the until clause
  • Out of curiosity, I also tried to use {{ result }} inside the second task's shell script. If it was registered under a different name in the second task, I got an error saying the variable was unset. If I instead register the same variable, I get the output of the previous task!

It seems like Ansible has some guardrails here against variable collisions specifically for the retries case, which probably means someone else shot themselves in the foot with this in the past. I'm not entirely sure if we want to rely on that behavior, but it would be a second factor of protection if we omit a register by mistake or make a typo.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(The "registered a variable. but under the wrong name" scenario works like this: You forget to do register: result and retries are therefore quietly disabled and then later someone adds a register: my_task_with_specific_result_name clause because they actually care about using the stdout or something in a later task. But it seems like Ansible is protecting against that as well.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Risks in using a short name in a global scope

The last concern is just about whether someone else is setting a variable named result somewhere and expecting it not to get stomped on.

Honestly, I don't think this one is a huge concern; result would be a pretty dodgy and unlikely name to use as a play variable. So at least I would hope no one is trying to use it non-locally. And anyone using it locally (one task sets it, and then another task immediately reads it) shouldn't be affected by this.

When we added retry logic, we missed that we also needed
to register a variable and add an until condition to ensure
that retries happen. This should ensure that this matches.

edx/edx-arch-experiments#359
@dianakhuang dianakhuang force-pushed the diana/update-retry-logic branch from 7b62b20 to bee7eaa Compare August 4, 2023 13:40
Copy link
Contributor

@timmc-edx timmc-edx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works too!

@dianakhuang dianakhuang merged commit 1f6b89f into master Aug 4, 2023
@dianakhuang dianakhuang deleted the diana/update-retry-logic branch August 4, 2023 14:10
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants