Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] refactor and add http and script checks #9

Closed
wants to merge 30 commits into from
Closed

[WIP] refactor and add http and script checks #9

wants to merge 30 commits into from

Conversation

sarkis
Copy link

@sarkis sarkis commented Feb 28, 2018

The main goal of this PR is to refactor health-checker to be more extensible and add a couple more health check types (http and script) outlined in #2.

Since this project is still under heavy development, there are some known issues that will also be fixed by this PR (see below), (yes I know... scope creep! :p)

Fixes: #7
Fixes: #8
Fixes: #10

Remaining tasks:

  • Cleanup - Ensure proper logging to stdout (via logrus)
    • Should have more Info logs (i.e. list all initialized health checks on server start as it is currently not clear what health checks will actually be checked until a request comes through to the server)
  • Tests (at least unit tests and decent coverage) - Started: but need more before this PR is ready for merge

README.md Outdated

Run a listener on port 6000 that accepts all inbound HTTP connections for any URL. When the request is received,
attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fails, return `HTTP
attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fail, return `HTTP
504 Gateway Not Found`.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be 504 Gateway Timeout

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch - thank you!

Copy link
Member

@brikis98 brikis98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the simplicity of multiple --port and --http style arguments, but worry a little about the extensibility. Let me know what you think.

README.md Outdated
@@ -42,19 +43,38 @@ health-checker [options]

| Option | Description | Default
| ------ | ----------- | -------
| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | |
| `--port` | The port number on which a TCP connection will be attempted. Can be specified multiple times. | |
| `--http` | The url:port to check for a 2XX status code. Can be specified multiple times. | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hm, should we support other status codes that count as success? For example, Vault has several different status codes: https://www.vaultproject.io/api/system/health.html.

Similarly, does it make sense to be able to scan the body of the HTTP response for some regex? Might be another good, common sanity check.

For clarity, we don't need to support all of these features in the very next iteration, but we should think through an API / UX that makes those possible.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, we should mention if HTTPS URLs will work here.

Copy link
Author

@sarkis sarkis Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: different status codes - I was thinking about possibly allowing anything but a 5XX as success. This can get messy though, even ELB vs ALB has this issue - for ELBs 3XX is considered healthy where ALB only reports healthy for 2XX.

That said, as I think about this - it makes more sense for now to just assume success for any status code other than a 5XX/4XX - the reason I am bundling 4XX is mainly because of 403, 404 and 400 status codes.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

re: HTTPS - unless we know of a use case for this already - I'd like to just be more clear that it only supports HTTP at this time and track a separate feature/improvement if we have a need.

README.md Outdated
| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | |
| `--port` | The port number on which a TCP connection will be attempted. Can be specified multiple times. | |
| `--http` | The url:port to check for a 2XX status code. Can be specified multiple times. | |
| `--script` | Path to an executable script that should return with a 0 exit status if successful. Can be specified multiple times. | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't remember if health-checker supports this, but if not, we should file an issue to track it and come back to it later: timeouts. We should be sure to support (configurable) timeouts for making TCP connections, HTTP calls, and running scripts, guaranteeing that health-checker always responds within some SLA whether or not the underlying pieces work.

Copy link
Author

@sarkis sarkis Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added an issue here to track that separately: #10

@sarkis
Copy link
Author

sarkis commented Feb 28, 2018

I like the simplicity of multiple --port and --http style arguments, but worry a little about the extensibility. Let me know what you think.

Discussed this offline - taking a different more extensible approach. The proposed way of treating all the checks as different types is not usable with feature requests we already have.

Copy link
Member

@brikis98 brikis98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looking forward to seeing the examples of the yaml syntax

README.md Outdated
@@ -42,19 +42,23 @@ health-checker [options]

| Option | Description | Default
| ------ | ----------- | -------
| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | |
| `--listener` | The IP address and port on which inbound HTTP connections will be accepted. | `0.0.0.0:5000`
| `--config` | A YAML config file containing options and checks | |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps health-checker.yml in the current working dir should be the default?

@@ -1,6 +1,6 @@
# health-checker

A simple HTTP server that will return `200 OK` if the given TCP ports are all successfully accepting connections.
A simple HTTP server that will return `200 OK` if the configured checks are all successful.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd add a quick start right here so users get a very clear idea of what this is and, most importantly, why they'd want to use it.

## Quick start

Put the following in `health-checker.yml`:

<include simple example of a file with two health checks>

And run `health-checker`. Now, requests to `localhost:xxx` will return a 200 OK if X and Y checks pass.

README.md Outdated

Parse configuration from `health-checker.yml` and run a listener that accepts all inbound HTTP connections for any URL. When
the request is received, attempt to run all checks specified in `health-checker.yml`. If all checks succeed, return `HTTP 200 OK`.
If any fail, return `HTTP 504 GATEWAY TIMEOUT`.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doesn't look like this paragraph adds much. I'd just say "see the examples folder for all examples."

README.md Outdated

```
health-checker --listener "0.0.0.0:6000" --port 5432 --port 3306
health-checker --config health-checker.yml
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd put this right below the table of options.

@@ -12,14 +12,14 @@ the TCP Listeners of both services are successfully accepting connections. But t
a single TCP port, or an HTTP(S) endpoint. As a result, our use case just isn't supported natively by AWS.

We wrote health-checker so that we could run a daemon on the server that reports the true health of the server by
attempting to open a TCP connection to more than one port when it receives an inbound HTTP request on the given listener.
checking more conditions than a just single port or HTTP request while still allowing for a single HTTP request on the given listener.

## How It Works
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps rename this section to configuring health-checker?

}

var checksFlag = cli.StringFlag{
Name: "checks",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

config?

Copy link
Author

@sarkis sarkis Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So I renamed this as I think calling it checks makes more sense - I'm OK with going back to config - but I actually left out "configs" like the listener and log-level for example... those still get set via the cli args

}
checksFileContents, err := ioutil.ReadFile(checksFile)
if err != nil {
fmt.Print(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In Go, you should return an error whenever you hit one, not just log it.


err = yaml.Unmarshal(checksFileContents, &checks)
if err != nil{
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return an error rather than panic. That'll make testing much easier.

}

if len(checks.Ports) == 0 {
panic(err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Return an error

fmt.Print(err)
}

var checks Checks
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll probably want to extract the config file parsing into a separate method

@@ -0,0 +1,3 @@
ports:
- 5500
- 6500
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are these TCP checks? If so, what design were you thinking of for how to handle HTTP checks? And script based checks?

Copy link
Author

@sarkis sarkis Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http:

http_checks:
  - service1
    host: "localhost"
    port: 8080
    success_status_codes: [200, 201, 202, 204, 429]
  - service2
    host: "localhost"
    port: 8081
    body_regex: "Healthy"

script:

scripts:
  - /usr/local/bin/foo.sh
    success_exit_code: 0

Copy link
Member

@brikis98 brikis98 Feb 28, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

Minor suggestions:

Instead of just service1, you probably want name: service1. Same for the scripts: script: /usr/local/bin/foo.sh. That's because the syntax you're using there in yaml is a list of maps, and each item in the map needs a key.

Also, for consistency, ports should then become:

tcp:
  - name: foo
    host: localhost
    port: 80
  - name: bar
    host: localhost
    port: 443

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll modify this and also put up a sample structure in the README and/or examples so it is clear...

Copy link
Member

@brikis98 brikis98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Making good progress! The pieces are starting to come together :)

}

if len(checks.TcpChecks) + len(checks.HttpChecks) + len(checks.ScriptChecks) == 0 {
return nil, err
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't return the previous error (which will be nil), create a new one

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch!

server/server.go Outdated
var waitGroup = sync.WaitGroup{}

for _, port := range opts.Ports {
for _, tcpCheck := range opts.Checks.TcpChecks {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So it looks like this is still doing solely TCP checks?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is - so my plan was to just iterate over each of the different Checks.* - maybe I need to think up a better solution for this one...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One way to do it w/o loads of copy/paste is to define a common interface for all the types of checks.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thought about this some more - definitely want to switch this to using a common interface - it will then have a method set of (what i can think of so far):

DoCheck (as you mentioned)
ValidateCheck - this would be really handy inside the parseChecksFile to ensure the "required" properties of each check is set... otherwise it currently will error when it's too late (during healthchecks)

server/server.go Outdated
}

waitGroup.Done()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know it's not your code, but typically, waitGroup.Done() should be in a defer to ensure it's always called

@sarkis
Copy link
Author

sarkis commented Mar 2, 2018

@brikis98 what a difference the Check interface made - this feels way more maintainable and extensible, thank you for the suggestion! Though I could probably use a CR when you have some free time as a lot changed (mostly removed a ton of code which is great)

Remaining tasks (hoping I can get by Monday):

  • Implement DoCheck() for HttpCheck and ScriptCheck
  • Implement ValidateCheck() method for each check and validate in parseChecksFile
  • Tests!

@brikis98
Copy link
Member

brikis98 commented Mar 2, 2018

Nice. Looks like a good improvement 👍

@sarkis sarkis changed the title WIP: add http and script checks WIP: Refactor and add http and script checks Mar 4, 2018
@sarkis sarkis changed the title WIP: Refactor and add http and script checks WIP: refactor and add http and script checks Mar 4, 2018
@sarkis sarkis changed the title WIP: refactor and add http and script checks [WIP] refactor and add http and script checks Mar 4, 2018
@sarkis sarkis closed this Mar 6, 2018
@sarkis sarkis reopened this Mar 8, 2018
@sarkis
Copy link
Author

sarkis commented Mar 10, 2018

@brikis98 I've got a lot more tests to write to get to an acceptable coverage % and I'm sure writing the tests will lead to some refactoring.. but wanted to see if I can get a re-CR as I work on those, when/if you are up for it.

Copy link
Member

@brikis98 brikis98 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, just did a pass. Overall, the design for the checks looks good, the code structure is reasonable, the tests for config parsing look good, and although I made a bunch of comments, just about all of them are nitpicky things that aren't very important. The only critical things left are (a) the error handling/logging for the script checks, which is the only significant comment on this PR (b) automated tests for the checks themselves. Everything else is optional and/or can be filed as issues for future work.

Run a listener on port 6000 that accepts all inbound HTTP connections for any URL. When the request is received,
attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fails, return `HTTP
504 Gateway Not Found`.
#### Configuration
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to add a sentence in this section that says something along the lines of, "You can use a yaml file to configure health-checker with the following settings:"

| `--listener` | The IP address and port on which inbound HTTP connections will be accepted. | `0.0.0.0:5000`
| `--config` | A YAML file containing checks which will be evaluated | `health-checks.yml`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"See the configuration section below for details."

| `name` | (Required) Name of the health check
| `host` | (Required) IP or hostname to check
| `port` | (Required) `port` to check on `host`
| `timeout` | Timeout for health check in seconds - if not specified defaults to 5
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice docs 👍

| `host` | (Required) IP or hostname to check
| `port` | (Required) `port` to check on `host`
| `timeout` | Timeout for health check in seconds - if not specified defaults to 5
| `status_codes` | An array of status codes which should PASS health check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should probably default to [200].

| `port` | (Required) `port` to check on `host`
| `timeout` | Timeout for health check in seconds - if not specified defaults to 5
| `status_codes` | An array of status codes which should PASS health check
| `body_regex` | **Will not be checked if `status_codes` specified** - A string to search for in the body of the response, if found will pass health check
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not check both the body and status codes if both are specified?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point - changing this!

if c.Name == "" {
return &InvalidCheck{name: "script", key: "name"}
}
if c.Script == "" {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps check that this script exists and you have execute permissions?

ctx, cancel := context.WithTimeout(context.Background(), timeout)
defer cancel()

cmd := exec.CommandContext(ctx, c.Script)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't know this existed. Nice find!

```yaml
script:
- name: scriptCheck1
script: /path/to/some/script.sh
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like we should also support args, a list of arguments to pass to the script.

defer cancel()

cmd := exec.CommandContext(ctx, c.Script)
_, err := cmd.Output()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't throw away the output from the command, as it's essential for debugging what happened. Instead, log that output.

return &CheckTimeout{name: c.Name, timeout: int(timeout)}
}
if err != nil {
return &CheckFail{name: c.Name, reason: "non-zero exit code"}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should log this error and return it to help with debugging!

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it bad form to log it like this: https://github.com/gruntwork-io/health-checker/pull/9/files#diff-91bbeda7eb98a7adc57b9e47e2cf5c2bR79

It currently returns then logs it in that function

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that's totally fine. However, you should then return the underlying err from this method (e.g., it could be an optional field in the CheckFail struct that is rendered in its Error() method) so it gets logged by DoCheck.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it now - thanks again @brikis98 - working on making sure the error/logging loose ends are all tied up and getting those checks tested... I think I'll feel better about "production" ready after that and a final final final CR :P

@autero1 autero1 mentioned this pull request Jan 8, 2019
@sarkis
Copy link
Author

sarkis commented Jan 14, 2019

This PR has gotten stale, I'm opting to close this out as I don't want this to deter others from contributing to and fixing the issues that this PR was set out to originally fix.

A huge lesson learned on scoping here (I took on way too much for a single PR). I don't know of a clean way out of the mess I've created here other than starting over with tighter scoped PRs.

@sarkis sarkis closed this Jan 14, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
3 participants