Skip to content
This repository has been archived by the owner on May 23, 2024. It is now read-only.

Implement Health Checks w/ Rollback #29

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

imnotteixeira
Copy link
Collaborator

@imnotteixeira imnotteixeira commented Feb 12, 2022

Closes #28

Created a function to health check any web based service (via cURL)

This will try the configured endpoint for 5 minutes, on 10 second intervals. If the checks never return 200, the check fails and the new deployment rolls back (new container is removed, and old one is restarted)

Onboarded nijobs-be for this new feature in the project configs.

Copy link
Collaborator

@miguelpduarte miguelpduarte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I cannot find the health-checker.sh file, I think it was not commited by mistake (you may need to change .gitignore(?)).

I have not yet had time to review the changes and test some specificities in Docker/Bash behaviour myself, but am already leaving some general comments to get the ball rolling.

Also, it seems that CircleCI did not run for this PR, what gives? It should have ran shellcheck which would have caught some of the issues highlighted as comments. I guess we'll have to look into that.

deployments/README.md Outdated Show resolved Hide resolved
deployments/deploy-types.sh Outdated Show resolved Hide resolved
deployments/deploy-types.sh Outdated Show resolved Hide resolved
deployments/deploy-types.sh Outdated Show resolved Hide resolved
deployments/deploy-types.sh Outdated Show resolved Hide resolved
deployments/deploy-types.sh Outdated Show resolved Hide resolved
deployments/deploy-types.sh Outdated Show resolved Hide resolved
deployments/deploy-types.sh Outdated Show resolved Hide resolved
deployments/project-configs.sh Outdated Show resolved Hide resolved
deployments/deploy-types.sh Outdated Show resolved Hide resolved
@imnotteixeira
Copy link
Collaborator Author

I cannot find the health-checker.sh file, I think it was not commited by mistake (you may need to change .gitignore(?)).

I have not yet had time to review the changes and test some specificities in Docker/Bash behaviour myself, but am already leaving some general comments to get the ball rolling.

Also, it seems that CircleCI did not run for this PR, what gives? It should have ran shellcheck which would have caught some of the issues highlighted as comments. I guess we'll have to look into that.

Right, I'm still new to this git/github stuff, not sure why the file is missing 🙃

I'll try to add it back later

@imnotteixeira
Copy link
Collaborator Author

Added the missing file. git rebase -i is amazing. I will address the other comments eventually.

Added health checks for nijobs-be on deployment
@imnotteixeira imnotteixeira force-pushed the feature/health-checks branch from c5691de to f1788e6 Compare March 30, 2022 20:52
Copy link
Collaborator

@miguelpduarte miguelpduarte left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small changes, overall is looking good and almost ready to merge.

However, there are two issues, IMO:

  1. CI did not run for this PR, and there are a lot of issues that shellcheck flags (I noticed a half-dozen variable quoting issues, at least), so they should be fixed;
  2. We should consider the effects of a rollback in regards to affecting state that is not controlled by this deploy (e.g. external databases). Are there corner cases that should be considered? Is it possible to run, the app will not be healthy, and rolling back would result in dirty(er?) state?

Finally, these are kind of big changes so they should be thoroughly tested. @imnotteixeira did you test this locally? I can test the healthcheck part locally at least, I think, as I don't currently have the setup for the fully integrated test. Can anyone help with this?

deployments/health-checker.sh Outdated Show resolved Hide resolved

# According to 1 retry every 10 seconds, this will try for 5 minutes
local MAX_ATTEMPTS=26
local RETRY_INTERVAL_SECONDS=1
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo? The comment above mentions "every 10 seconds", but the value for this variable suggests 1 second of delay between retries.

Also, since these variables are constants we should use either declare -r or readonly here.

See this and this for more info.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oops, was testing this manually, so I wanted to to see it faster

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was not updated to use declare -r nor readonly. We should do that, but this is not reason enough to block the PR any longer if this is the only issue

@imnotteixeira imnotteixeira force-pushed the feature/health-checks branch 6 times, most recently from a3c9da2 to 9741b75 Compare April 13, 2022 13:46
@imnotteixeira imnotteixeira force-pushed the feature/health-checks branch from 9741b75 to ab268d0 Compare April 13, 2022 13:52
@imnotteixeira
Copy link
Collaborator Author

Some small changes, overall is looking good and almost ready to merge.

However, there are two issues, IMO:

1. CI did not run for this PR, and there are a lot of issues that shellcheck flags (I noticed a half-dozen variable quoting issues, at least), so they should be fixed;

2. We should consider the effects of a rollback in regards to affecting state that is not controlled by this deploy (e.g. external databases). Are there corner cases that should be considered? Is it possible to run, the app will not be healthy, and rolling back would result in dirty(er?) state?

Finally, these are kind of big changes so they should be thoroughly tested. @imnotteixeira did you test this locally? I can test the healthcheck part locally at least, I think, as I don't currently have the setup for the fully integrated test. Can anyone help with this?

I did test the health checker locally, but when trying to deploy something, it changes to the master branch, and loses the changes in this PR, so not sure how to do it

@miguelpduarte
Copy link
Collaborator

Some small changes, overall is looking good and almost ready to merge.
However, there are two issues, IMO:

1. CI did not run for this PR, and there are a lot of issues that shellcheck flags (I noticed a half-dozen variable quoting issues, at least), so they should be fixed;

2. We should consider the effects of a rollback in regards to affecting state that is not controlled by this deploy (e.g. external databases). Are there corner cases that should be considered? Is it possible to run, the app will not be healthy, and rolling back would result in dirty(er?) state?

Finally, these are kind of big changes so they should be thoroughly tested. @imnotteixeira did you test this locally? I can test the healthcheck part locally at least, I think, as I don't currently have the setup for the fully integrated test. Can anyone help with this?

I did test the health checker locally, but when trying to deploy something, it changes to the master branch, and loses the changes in this PR, so not sure how to do it

That should not happen. Are you trying to use this to deploy itself? Make sure to clone the repositories to deploy locally into the deployments folder, and then run the deploy script with the correct project name and branch.

@imnotteixeira
Copy link
Collaborator Author

@miguelpduarte Finally managed to test this, and it is working as expected!

Since I had some problem setting the env up, I got to verify it rolls back on service launch failure (cleans up new container, and in case there is an old one, reboots it)

Also checked it succeeds the health check when service actually works. I think we're green to merge this!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add health check when deploying services and rollback in case of errors
2 participants