Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Scylla Ansible Node - Upgrade rework #144

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

fee-mendes
Copy link
Member

This commit is a rework on the steps to upgrade a cluster. In particular, the previous version didn't upgraded nodes serially, nor would abort the entire process in case of a failure, nor covered the scenarios described in #134 and #133

Upgrading a cluster programatically is complex in general, as we must ensure that the failure of a single node aborts the entire process. In that sense, we must also check several things such as whether the version we are upgrading to is a valid one or not, and whether we are already on 'latest' or not to prevent introducing an unecessary rolling restart to the cluster.

These aspects - and many others - make testing of this extremely complex, lengthy and time consuming. However, I have ensured we at least have "minimal" coverage for some common operations.

That said, the following tests have been run, whereas all tests were run concurrently in an Ubuntu 20.04/Debian 10/CentOS 7 mixed cluster:

Test 1 : Install 2021.1.1 == success
Test 2 : Downgrade to 2021.1.0 without specifying upgrade_version == expected failure
Test 3 : Upgrade to 2021.1.2 without specifying upgrade_version == expected failure
Test 4 : Upgrade to 2021.1.2 specifying upgrade_version == success
Test 5 : Upgrade to 2021.1.50 specifying upgrade_version == expected failure
Test 6 : Try to upgrade to 'latest' without specifying upgrade_version == expected failure
Test 7 : Try to upgrade to 'latest' specifying upgrade_version == success
Test 8 : Try to downgrade to 2021.1.11 with upgrade_version and NOT specifying upgrade_allow_user_manual_downgrade == expected failure
Test 9 : Try to downgrade to 2021.1.11 with upgrade_version and specifying upgrade_allow_user_manual_downgrade == success
Test 10 : Major upgrade to 2022.1 latest == Appears to work, can't confirm due to a bug in the release.

Fixes #133, #134

This commit is a rework on the steps to upgrade a cluster. In particular, the previous version didn't upgraded nodes serially, nor would abort the entire process in case of a failure.

Upgrading a cluster programatically is complex in general, as we must ensure that the failure of a single node aborts the entire process. In that sense, we must also check several things such as whether the version we are upgrading to is a valid one or not, and whether we are already on 'latest' or not to prevent introducing an unecessary rolling restart to the cluster.

These aspects - and many others - make testing of this extremely complex, lengthy and time consuming. However, I have ensured we at least have "minimal" coverage for some common operations.

That said, the following tests have been run, whereas all tests were run concurrently in an Ubuntu 20.04/Debian 10/CentOS 7 mixed cluster:

Test 1 : Install 2021.1.1 == success
Test 2 : Downgrade to 2021.1.0 without specifying upgrade_version == expected failure
Test 3 : Upgrade to 2021.1.2 without specifying upgrade_version == expected failure
Test 4 : Upgrade to 2021.1.2 specifying upgrade_version == success
Test 5 : Upgrade to 2021.1.50 specifying upgrade_version == expected failure
Test 6 : Try to upgrade to 'latest' without specifying upgrade_version == expected failure
Test 7 : Try to upgrade to 'latest' specifying upgrade_version == success
Test 8 : Try to downgrade to 2021.1.11 with upgrade_version and NOT specifying upgrade_allow_user_manual_downgrade == expected failure
Test 9 : Try to downgrade to 2021.1.11 with upgrade_version and specifying upgrade_allow_user_manual_downgrade == success
Test 10 : Major upgrade to 2022.1 latest == Appears to work, can't confirm due to a bug in the release.
@vladzcloudius vladzcloudius self-requested a review July 6, 2022 21:18

# Scylla installation validation
- name: Check existing Scylla installation operating system incompatibility
ansible.builtin.fail:
msg: "{{ ansible_facts['distribution'] }} {{ ansible_facts['distribution_version'] }} is not supported."
any_errors_fatal: True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not to put all these tasks in a block and define any_errors_fatal: True once instead repeating it over and over?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will make sense to put this in a separate patch because it is going to be a lot of diff and a little functional change in it. Therefore better not to mix it with other (functional) patches.


# Scylla installation validation
- name: Check existing Scylla installation operating system incompatibility
ansible.builtin.fail:
msg: "{{ ansible_facts['distribution'] }} {{ ansible_facts['distribution_version'] }} is not supported."
any_errors_fatal: True
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will make sense to put this in a separate patch because it is going to be a lot of diff and a little functional change in it. Therefore better not to mix it with other (functional) patches.

version: "{{ ansible_facts.packages[package_installed][0].version }}",
# Debian variants will return version == <release>-n.<date>.<commit>-n
# RHEL/CentOS will return version == <release>
version: "{{ (ansible_facts.packages[package_installed][0].version).split('-')[0] }}",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is actually very wrong: you want the whole string and not just the 2021.1.11 part - there may be multiple different versions with the same full version substring, e.g. custom builds.

Assuming that the part before the first - is always going to be unique is dangerous and not really necessary.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree partially. Although I tried to keep the former code "intact" as best as I could. I'll need to double check, but IIRC adding the whole string would basically mean that we shouldn't be relying on the package_facts module any longer (or maybe we can, I need to check).

It's worth to mention that - as explained under the comment section - that issue already exists. And Debian-variant upgrades were effectively broken :-P

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And Debian-variant upgrades were effectively broken :-P

I don't think they are broken anymore after #130

This code IS broken however - fixing it is one of the goals of this PR.

@@ -135,13 +143,13 @@
upgrade: true
}


# Scylla version validation
- name: Check if Scylla version was specified incorrectly
vars:
version_format: "{{ 'X.Y.Z' if scylla_upgrade['edition'] else 'UVWX.Y.Z' }}"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these checks (in the original code) are bullshit really.
Instead of restricting and then checking that the user follows our restrictions we should not restring them at all and instead try to resolve whatever version string he gives into a full version and make sure it resolves to exactly one such version - like I did for Debian.

After that all what's left is to compare it to a full version of the currently installed package.

So, the code that resolves a scylla version for Debian is already there. We need to create the same for RH.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You do need to cut major/minor version values for the full version - however this is trivial the moment you have the full version

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these checks (in the original code) are bullshit really.

Well, at least you mentioned it. I tried to be "gentler" ;-)

That said, I agree. In fact, I also want to remove the "supported" array section [the one with OS, releases and bla bla bla]. No one really should be maintaining that. And even then, if someone wants to run Scylla on Slackware, who are we to determine whether the role should work or not.

(scylla_detected['version'] == scylla_upgrade['version'] or
scylla_detected['major_version'] == scylla_upgrade['major_version'])
scylla_detected['major_version'] == scylla_upgrade['major_version'])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No real change here - just a whitespace added at the end of the line.

when: >
upgrade_major and
scylla_version != 'latest' and upgrade_major and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if it's 'latest'? Don't the same checks apply?
We should have resolved the actual version that needs to be installed ('latest' or not) by here and then execute the check as it was.

In particular scylla_detected['major_version'] has to have the corresponding major version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is no surprise, but if you stop and think this creates a chicken and the egg situation somehow:

If the user specified that he wants a major upgrade, then it means that we must re-create the repositories. If we are to re-create the repositories, then it means that we are going to install something new, and at this point we have passed a validation step. It makes no sense to create a repository, to query whatever latest is just to fail later.

In fact, the original writer of the upgrade process very likely fell under the same thinking. When one specifies latest today, the value will be automatically be selected from whatever scylla_support first, this - in my view - is worse, because one may want to run a major upgrade that is not going to the latest and greatest major. :-)

So you need to decide: Whether we go through the whole pain of re-creating repos just for checking if latest is latest, or if we consider latest in general as a dangerous and non-recommended value. ;-)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Installing multiple repos doesn't represent an issue as long as you don't install anything. Therefore I don't see any issue in adding a new repo when the play begins.

After that, if latest was requested - the Role needs to "resolve" it and from that point forward act the same way as if the version has been given explicitly.

@@ -186,120 +196,160 @@
id: "{{ scylla_upgrade['id'] }}",
upgrade: false
}
when: upgraded_version is version(detected_version,'<')
when: upgraded_version is version(detected_version,'<') and scylla_version != 'latest'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

again, why 'latest' is special?

Copy link
Member Author

@fee-mendes fee-mendes Jul 8, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


# If we are downgrading, build list of packages to downgrade
# RHEL Specific
- name: Create list of packages to downgrade
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this really needed? I thought you should be able to downgrade using a meta-package on RH, don't you?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will confirm. AFAIR yes, it was needed (at least on RH7).

loop: "{{ installed_pkgs | dict2items }}"
when: not scylla_upgrade['upgrade'] and ansible_os_family == 'RedHat'

# Debian specific
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This definitely deserves a separate block


- name: Fetch version parts of Scylla package
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a code duplication with what we have in Debian_install.yml - better reuse. You would probably need to split the Debian_install.yml into two parts: one that fetches the version and another that actually installs given a version.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed.

include_tasks: upgrade.yml
loop: "{{ groups['scylla'] }}"
loop_control:
loop_var: delegate_host

# Upgrade verification
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is no longer relevant IMO.


# ----- TODO: If the upgrade version is the latest one and if 'latest' was used to specify the version
# replace upgrade version with latest
- name: "Wait {{ upgrade_break_before_verification }} seconds before verifying the upgrade"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The whole section before this one is quite messy IMO.

Please, correct me if I'm wrong but the section above does following logical steps:

  1. Gets the full version and its parts and checks if upgrade is possible/valid:
    a) If the requested version is not greater than the "latest" - not needed if you do (b) below.
    b) (It should but you don't do that yet AFAIKT, at least for RH you don't): check that the requested version exists at all (my code for Debian you picked does that).
    c) Check if we do a switch to a different major version and only allow this if a special variable is set (to avoid accidental major upgrades).
  2. Check if any action is needed: if the upgrade version is equal to the already installed version. In your patch you seem to only check this for 'latest' however the same logic applies to any version.

So, the way the code is now (before and with your patch) it's very flat and all mixed up: Debian and RH sections are interleaving and appear in a flat (not structured) way. This not only makes it hard to read but also makes it hard to check because same operations and checks need to be done for both OS variants.

It would make sense IMO to rewrite the code the way that the logic described above would be implemented in the abstract way with minimum per-variant "branching".

The generic logic (like comparing that the requested and the installed versions are the same) should be written in a generic way and rely on facts initialized by a variant-specific code.

variant-specific code blocks should be taken into separate files and then "included" from the main.yml.

This way it would be much easier to both review and maintain such a code.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please, review the whole patch with the mindset described above.

It's probably a lot of work but I believe it's worth it - it's going to make life of all that would come after us MUCH easier.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

b) (It should but you don't do that yet AFAIKT, at least for RH you don't): check that the requested version exists at all (my code for Debian you picked does that).

We do check for it. latest_detected_release will return latest. Then, we can infer that any upgrade should be between our current version and whatever latest is. That check is done under:

   when: not upgrade_major and scylla_upgrade['version'] is version(latest_detected_release,'>=') and scylla_version != 'latest'

That means that:

  • If the user specified latest, we will fail if he's already latest in the subsequent check.
  • If the user specified a version, then:
    • If the version is >= latest, we fail
    • If the version == current, we fail
    • If the version < current, we force him to specify that he wants to downgrade

Other than that, your understanding at 1. is correct.

  1. Check if any action is needed: if the upgrade version is equal to the already installed version. In your patch you seem to only check this for 'latest' however the same logic applies to any version.

This check actually happens in line 204. Which is when the user has specified a version other than latest. :-)

Other than that, yes - I agree that its all messy and scattered around. Splitting it is actually fairly simple.

when: >
upgrade_major and
scylla_version != 'latest' and upgrade_major and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Installing multiple repos doesn't represent an issue as long as you don't install anything. Therefore I don't see any issue in adding a new repo when the play begins.

After that, if latest was requested - the Role needs to "resolve" it and from that point forward act the same way as if the version has been given explicitly.

fail:
msg: "'latest' was specified, but currently installed version {{ scylla_detected['version'] }} is the same as last available {{ latest_detected_release }}"
any_errors_fatal: True
when: not upgrade_major and scylla_detected['version'] is version(latest_detected_release,'>=') and scylla_version == 'latest'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, I mean we should detect that there is nothing to upgrade and don't upgrade - simply skip to the next task that follows the upgrade.

The same behavior should be implemented for any "destination" version - not just 'latest'. But this will come naturally the moment you stop special-case 'latest' as I explained above.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Scyll Node Role]: upgrade - check if upgrade is needed before backing up node's data
2 participants