-
Notifications
You must be signed in to change notification settings - Fork 36
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] refactor and add http and script checks #9
Conversation
README.md
Outdated
|
||
Run a listener on port 6000 that accepts all inbound HTTP connections for any URL. When the request is received, | ||
attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fails, return `HTTP | ||
attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fail, return `HTTP | ||
504 Gateway Not Found`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should be 504 Gateway Timeout
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch - thank you!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the simplicity of multiple --port
and --http
style arguments, but worry a little about the extensibility. Let me know what you think.
README.md
Outdated
@@ -42,19 +43,38 @@ health-checker [options] | |||
|
|||
| Option | Description | Default | |||
| ------ | ----------- | ------- | |||
| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | | | |||
| `--port` | The port number on which a TCP connection will be attempted. Can be specified multiple times. | | | |||
| `--http` | The url:port to check for a 2XX status code. Can be specified multiple times. | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, should we support other status codes that count as success? For example, Vault has several different status codes: https://www.vaultproject.io/api/system/health.html.
Similarly, does it make sense to be able to scan the body of the HTTP response for some regex? Might be another good, common sanity check.
For clarity, we don't need to support all of these features in the very next iteration, but we should think through an API / UX that makes those possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we should mention if HTTPS URLs will work here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: different status codes - I was thinking about possibly allowing anything but a 5XX as success. This can get messy though, even ELB vs ALB has this issue - for ELBs 3XX is considered healthy where ALB only reports healthy for 2XX.
That said, as I think about this - it makes more sense for now to just assume success for any status code other than a 5XX/4XX - the reason I am bundling 4XX is mainly because of 403, 404 and 400 status codes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: HTTPS - unless we know of a use case for this already - I'd like to just be more clear that it only supports HTTP at this time and track a separate feature/improvement if we have a need.
README.md
Outdated
| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | | | ||
| `--port` | The port number on which a TCP connection will be attempted. Can be specified multiple times. | | | ||
| `--http` | The url:port to check for a 2XX status code. Can be specified multiple times. | | | ||
| `--script` | Path to an executable script that should return with a 0 exit status if successful. Can be specified multiple times. | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't remember if health-checker supports this, but if not, we should file an issue to track it and come back to it later: timeouts. We should be sure to support (configurable) timeouts for making TCP connections, HTTP calls, and running scripts, guaranteeing that health-checker always responds within some SLA whether or not the underlying pieces work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added an issue here to track that separately: #10
Discussed this offline - taking a different more extensible approach. The proposed way of treating all the checks as different types is not usable with feature requests we already have. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking forward to seeing the examples of the yaml syntax
README.md
Outdated
@@ -42,19 +42,23 @@ health-checker [options] | |||
|
|||
| Option | Description | Default | |||
| ------ | ----------- | ------- | |||
| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | | | |||
| `--listener` | The IP address and port on which inbound HTTP connections will be accepted. | `0.0.0.0:5000` | |||
| `--config` | A YAML config file containing options and checks | | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps health-checker.yml
in the current working dir should be the default?
@@ -1,6 +1,6 @@ | |||
# health-checker | |||
|
|||
A simple HTTP server that will return `200 OK` if the given TCP ports are all successfully accepting connections. | |||
A simple HTTP server that will return `200 OK` if the configured checks are all successful. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd add a quick start right here so users get a very clear idea of what this is and, most importantly, why they'd want to use it.
## Quick start
Put the following in `health-checker.yml`:
<include simple example of a file with two health checks>
And run `health-checker`. Now, requests to `localhost:xxx` will return a 200 OK if X and Y checks pass.
README.md
Outdated
|
||
Parse configuration from `health-checker.yml` and run a listener that accepts all inbound HTTP connections for any URL. When | ||
the request is received, attempt to run all checks specified in `health-checker.yml`. If all checks succeed, return `HTTP 200 OK`. | ||
If any fail, return `HTTP 504 GATEWAY TIMEOUT`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't look like this paragraph adds much. I'd just say "see the examples folder for all examples."
README.md
Outdated
|
||
``` | ||
health-checker --listener "0.0.0.0:6000" --port 5432 --port 3306 | ||
health-checker --config health-checker.yml |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd put this right below the table of options.
@@ -12,14 +12,14 @@ the TCP Listeners of both services are successfully accepting connections. But t | |||
a single TCP port, or an HTTP(S) endpoint. As a result, our use case just isn't supported natively by AWS. | |||
|
|||
We wrote health-checker so that we could run a daemon on the server that reports the true health of the server by | |||
attempting to open a TCP connection to more than one port when it receives an inbound HTTP request on the given listener. | |||
checking more conditions than a just single port or HTTP request while still allowing for a single HTTP request on the given listener. | |||
|
|||
## How It Works |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps rename this section to configuring health-checker?
commands/flags.go
Outdated
} | ||
|
||
var checksFlag = cli.StringFlag{ | ||
Name: "checks", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
config
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So I renamed this as I think calling it checks makes more sense - I'm OK with going back to config - but I actually left out "configs" like the listener
and log-level
for example... those still get set via the cli args
commands/flags.go
Outdated
} | ||
checksFileContents, err := ioutil.ReadFile(checksFile) | ||
if err != nil { | ||
fmt.Print(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In Go, you should return an error whenever you hit one, not just log it.
commands/flags.go
Outdated
|
||
err = yaml.Unmarshal(checksFileContents, &checks) | ||
if err != nil{ | ||
panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return an error rather than panic
. That'll make testing much easier.
commands/flags.go
Outdated
} | ||
|
||
if len(checks.Ports) == 0 { | ||
panic(err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Return an error
commands/flags.go
Outdated
fmt.Print(err) | ||
} | ||
|
||
var checks Checks |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll probably want to extract the config file parsing into a separate method
examples/health-checks.yml.simple
Outdated
@@ -0,0 +1,3 @@ | |||
ports: | |||
- 5500 | |||
- 6500 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are these TCP checks? If so, what design were you thinking of for how to handle HTTP checks? And script based checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http:
http_checks:
- service1
host: "localhost"
port: 8080
success_status_codes: [200, 201, 202, 204, 429]
- service2
host: "localhost"
port: 8081
body_regex: "Healthy"
script:
scripts:
- /usr/local/bin/foo.sh
success_exit_code: 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice!
Minor suggestions:
Instead of just service1
, you probably want name: service1
. Same for the scripts: script: /usr/local/bin/foo.sh
. That's because the syntax you're using there in yaml is a list of maps, and each item in the map needs a key.
Also, for consistency, ports
should then become:
tcp:
- name: foo
host: localhost
port: 80
- name: bar
host: localhost
port: 443
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'll modify this and also put up a sample structure in the README and/or examples so it is clear...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Making good progress! The pieces are starting to come together :)
commands/flags.go
Outdated
} | ||
|
||
if len(checks.TcpChecks) + len(checks.HttpChecks) + len(checks.ScriptChecks) == 0 { | ||
return nil, err |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't return the previous error (which will be nil
), create a new one
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch!
server/server.go
Outdated
var waitGroup = sync.WaitGroup{} | ||
|
||
for _, port := range opts.Ports { | ||
for _, tcpCheck := range opts.Checks.TcpChecks { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So it looks like this is still doing solely TCP checks?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is - so my plan was to just iterate over each of the different Checks.* - maybe I need to think up a better solution for this one...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One way to do it w/o loads of copy/paste is to define a common interface
for all the types of checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thought about this some more - definitely want to switch this to using a common interface - it will then have a method set of (what i can think of so far):
DoCheck (as you mentioned)
ValidateCheck - this would be really handy inside the parseChecksFile to ensure the "required" properties of each check is set... otherwise it currently will error when it's too late (during healthchecks)
server/server.go
Outdated
} | ||
|
||
waitGroup.Done() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know it's not your code, but typically, waitGroup.Done()
should be in a defer
to ensure it's always called
@brikis98 what a difference the Remaining tasks (hoping I can get by Monday):
|
Nice. Looks like a good improvement 👍 |
@brikis98 I've got a lot more tests to write to get to an acceptable coverage % and I'm sure writing the tests will lead to some refactoring.. but wanted to see if I can get a re-CR as I work on those, when/if you are up for it. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, just did a pass. Overall, the design for the checks looks good, the code structure is reasonable, the tests for config parsing look good, and although I made a bunch of comments, just about all of them are nitpicky things that aren't very important. The only critical things left are (a) the error handling/logging for the script checks, which is the only significant comment on this PR (b) automated tests for the checks themselves. Everything else is optional and/or can be filed as issues for future work.
Run a listener on port 6000 that accepts all inbound HTTP connections for any URL. When the request is received, | ||
attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fails, return `HTTP | ||
504 Gateway Not Found`. | ||
#### Configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might want to add a sentence in this section that says something along the lines of, "You can use a yaml file to configure health-checker with the following settings:"
| `--listener` | The IP address and port on which inbound HTTP connections will be accepted. | `0.0.0.0:5000` | ||
| `--config` | A YAML file containing checks which will be evaluated | `health-checks.yml` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"See the configuration section below for details."
| `name` | (Required) Name of the health check | ||
| `host` | (Required) IP or hostname to check | ||
| `port` | (Required) `port` to check on `host` | ||
| `timeout` | Timeout for health check in seconds - if not specified defaults to 5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice docs 👍
| `host` | (Required) IP or hostname to check | ||
| `port` | (Required) `port` to check on `host` | ||
| `timeout` | Timeout for health check in seconds - if not specified defaults to 5 | ||
| `status_codes` | An array of status codes which should PASS health check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This should probably default to [200]
.
| `port` | (Required) `port` to check on `host` | ||
| `timeout` | Timeout for health check in seconds - if not specified defaults to 5 | ||
| `status_codes` | An array of status codes which should PASS health check | ||
| `body_regex` | **Will not be checked if `status_codes` specified** - A string to search for in the body of the response, if found will pass health check |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not check both the body and status codes if both are specified?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point - changing this!
if c.Name == "" { | ||
return &InvalidCheck{name: "script", key: "name"} | ||
} | ||
if c.Script == "" { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps check that this script exists and you have execute permissions?
ctx, cancel := context.WithTimeout(context.Background(), timeout) | ||
defer cancel() | ||
|
||
cmd := exec.CommandContext(ctx, c.Script) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I didn't know this existed. Nice find!
```yaml | ||
script: | ||
- name: scriptCheck1 | ||
script: /path/to/some/script.sh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like we should also support args
, a list of arguments to pass to the script.
defer cancel() | ||
|
||
cmd := exec.CommandContext(ctx, c.Script) | ||
_, err := cmd.Output() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Don't throw away the output from the command, as it's essential for debugging what happened. Instead, log that output.
return &CheckTimeout{name: c.Name, timeout: int(timeout)} | ||
} | ||
if err != nil { | ||
return &CheckFail{name: c.Name, reason: "non-zero exit code"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should log this error and return it to help with debugging!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it bad form to log it like this: https://github.com/gruntwork-io/health-checker/pull/9/files#diff-91bbeda7eb98a7adc57b9e47e2cf5c2bR79
It currently returns then logs it in that function
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that's totally fine. However, you should then return the underlying err
from this method (e.g., it could be an optional field in the CheckFail
struct that is rendered in its Error()
method) so it gets logged by DoCheck
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get it now - thanks again @brikis98 - working on making sure the error/logging loose ends are all tied up and getting those checks tested... I think I'll feel better about "production" ready after that and a final final final CR :P
This PR has gotten stale, I'm opting to close this out as I don't want this to deter others from contributing to and fixing the issues that this PR was set out to originally fix. A huge lesson learned on scoping here (I took on way too much for a single PR). I don't know of a clean way out of the mess I've created here other than starting over with tighter scoped PRs. |
The main goal of this PR is to refactor
health-checker
to be more extensible and add a couple more health check types (http and script) outlined in #2.Since this project is still under heavy development, there are some known issues that will also be fixed by this PR (see below), (yes I know... scope creep! :p)
Fixes: #7
Fixes: #8
Fixes: #10
Remaining tasks: