[WIP] refactor and add http and script checks #9

sarkis · 2018-02-28T17:19:53Z

The main goal of this PR is to refactor health-checker to be more extensible and add a couple more health check types (http and script) outlined in #2.

Since this project is still under heavy development, there are some known issues that will also be fixed by this PR (see below), (yes I know... scope creep! :p)

Fixes: #7
Fixes: #8
Fixes: #10

Remaining tasks:

Cleanup - Ensure proper logging to stdout (via logrus)
- Should have more Info logs (i.e. list all initialized health checks on server start as it is currently not clear what health checks will actually be checked until a request comes through to the server)
Tests (at least unit tests and decent coverage) - Started: but need more before this PR is ready for merge

mcalhoun · 2018-02-28T17:25:50Z

README.md


 Run a listener on port 6000 that accepts all inbound HTTP connections for any URL. When the request is received,
-attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fails, return `HTTP
+attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fail, return `HTTP
 504 Gateway Not Found`.


Should be 504 Gateway Timeout

good catch - thank you!

brikis98

I like the simplicity of multiple --port and --http style arguments, but worry a little about the extensibility. Let me know what you think.

brikis98 · 2018-02-28T17:43:16Z

README.md

@@ -42,19 +43,38 @@ health-checker [options]

 | Option | Description | Default 
 | ------ | ----------- | -------
-| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | | 
+| `--port` | The port number on which a TCP connection will be attempted. Can be specified multiple times. | |
+| `--http` | The url:port to check for a 2XX status code. Can be specified multiple times. | |


Hm, should we support other status codes that count as success? For example, Vault has several different status codes: https://www.vaultproject.io/api/system/health.html.

Similarly, does it make sense to be able to scan the body of the HTTP response for some regex? Might be another good, common sanity check.

For clarity, we don't need to support all of these features in the very next iteration, but we should think through an API / UX that makes those possible.

Also, we should mention if HTTPS URLs will work here.

re: different status codes - I was thinking about possibly allowing anything but a 5XX as success. This can get messy though, even ELB vs ALB has this issue - for ELBs 3XX is considered healthy where ALB only reports healthy for 2XX.

That said, as I think about this - it makes more sense for now to just assume success for any status code other than a 5XX/4XX - the reason I am bundling 4XX is mainly because of 403, 404 and 400 status codes.

re: HTTPS - unless we know of a use case for this already - I'd like to just be more clear that it only supports HTTP at this time and track a separate feature/improvement if we have a need.

brikis98 · 2018-02-28T17:46:09Z

README.md

-| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | | 
+| `--port` | The port number on which a TCP connection will be attempted. Can be specified multiple times. | |
+| `--http` | The url:port to check for a 2XX status code. Can be specified multiple times. | |
+| `--script` | Path to an executable script that should return with a 0 exit status if successful. Can be specified multiple times. | |


I can't remember if health-checker supports this, but if not, we should file an issue to track it and come back to it later: timeouts. We should be sure to support (configurable) timeouts for making TCP connections, HTTP calls, and running scripts, guaranteeing that health-checker always responds within some SLA whether or not the underlying pieces work.

Added an issue here to track that separately: #10

sarkis · 2018-02-28T18:53:32Z

I like the simplicity of multiple --port and --http style arguments, but worry a little about the extensibility. Let me know what you think.

Discussed this offline - taking a different more extensible approach. The proposed way of treating all the checks as different types is not usable with feature requests we already have.

brikis98

Looking forward to seeing the examples of the yaml syntax

brikis98 · 2018-02-28T19:35:24Z

README.md

@@ -42,19 +42,23 @@ health-checker [options]

 | Option | Description | Default 
 | ------ | ----------- | -------
-| `--port` | The port number on which a TCP connection will be attempted. Specify one or more times. | | 
-| `--listener` |  The IP address and port on which inbound HTTP connections will be accepted. | `0.0.0.0:5000`
+| `--config` | A YAML config file containing options and checks | |


Perhaps health-checker.yml in the current working dir should be the default?

brikis98 · 2018-02-28T19:39:33Z

README.md

@@ -1,6 +1,6 @@
 # health-checker

-A simple HTTP server that will return `200 OK` if the given TCP ports are all successfully accepting connections.
+A simple HTTP server that will return `200 OK` if the configured checks are all successful.


I'd add a quick start right here so users get a very clear idea of what this is and, most importantly, why they'd want to use it.

## Quick start Put the following in `health-checker.yml`: <include simple example of a file with two health checks> And run `health-checker`. Now, requests to `localhost:xxx` will return a 200 OK if X and Y checks pass.

brikis98 · 2018-02-28T19:40:10Z

README.md

+
+Parse configuration from `health-checker.yml` and run a listener  that accepts all inbound HTTP connections for any URL. When
+the request is received, attempt to run all checks specified in `health-checker.yml`. If all checks succeed, return `HTTP 200 OK`. 
+If any fail, return `HTTP 504 GATEWAY TIMEOUT`.


Doesn't look like this paragraph adds much. I'd just say "see the examples folder for all examples."

brikis98 · 2018-02-28T19:40:32Z

README.md


 ```
-health-checker --listener "0.0.0.0:6000" --port 5432 --port 3306
+health-checker --config health-checker.yml


I'd put this right below the table of options.

brikis98 · 2018-02-28T19:40:49Z

README.md

@@ -12,14 +12,14 @@ the TCP Listeners of both services are successfully accepting connections. But t
 a single TCP port, or an HTTP(S) endpoint. As a result, our use case just isn't supported natively by AWS.

 We wrote health-checker so that we could run a daemon on the server that reports the true health of the server by
-attempting to open a TCP connection to more than one port when it receives an inbound HTTP request on the given listener.
+checking more conditions than a just single port or HTTP request while still allowing for a single HTTP request on the given listener.

 ## How It Works


Perhaps rename this section to configuring health-checker?

brikis98 · 2018-02-28T22:32:46Z

commands/flags.go

+}
+
+var checksFlag = cli.StringFlag{
+	Name: "checks",


So I renamed this as I think calling it checks makes more sense - I'm OK with going back to config - but I actually left out "configs" like the listener and log-level for example... those still get set via the cli args

brikis98 · 2018-02-28T22:33:18Z

commands/flags.go

+	}
+	checksFileContents, err := ioutil.ReadFile(checksFile)
+	if err != nil {
+		fmt.Print(err)


In Go, you should return an error whenever you hit one, not just log it.

brikis98 · 2018-02-28T22:33:36Z

commands/flags.go

+
+	err = yaml.Unmarshal(checksFileContents, &checks)
+	if err != nil{
+		panic(err)


Return an error rather than panic. That'll make testing much easier.

brikis98 · 2018-02-28T22:33:47Z

commands/flags.go

+	}
+
+	if len(checks.Ports) == 0 {
+		panic(err)


Return an error

brikis98 · 2018-02-28T22:34:17Z

commands/flags.go

+		fmt.Print(err)
+	}
+
+	var checks Checks


You'll probably want to extract the config file parsing into a separate method

brikis98 · 2018-02-28T22:34:45Z

examples/health-checks.yml.simple

@@ -0,0 +1,3 @@
+ports:
+  - 5500
+  - 6500


Are these TCP checks? If so, what design were you thinking of for how to handle HTTP checks? And script based checks?

http:

http_checks: - service1 host: "localhost" port: 8080 success_status_codes: [200, 201, 202, 204, 429] - service2 host: "localhost" port: 8081 body_regex: "Healthy"

script:

scripts: - /usr/local/bin/foo.sh success_exit_code: 0

Nice!

Minor suggestions:

Instead of just service1, you probably want name: service1. Same for the scripts: script: /usr/local/bin/foo.sh. That's because the syntax you're using there in yaml is a list of maps, and each item in the map needs a key.

Also, for consistency, ports should then become:

tcp: - name: foo host: localhost port: 80 - name: bar host: localhost port: 443

I'll modify this and also put up a sample structure in the README and/or examples so it is clear...

brikis98

Making good progress! The pieces are starting to come together :)

brikis98 · 2018-03-01T01:01:06Z

commands/flags.go

+	}
+
+	if len(checks.TcpChecks) + len(checks.HttpChecks) + len(checks.ScriptChecks) == 0 {
+		return nil, err


Don't return the previous error (which will be nil), create a new one

good catch!

brikis98 · 2018-03-01T01:03:04Z

server/server.go

 	var waitGroup = sync.WaitGroup{}

-	for _, port := range opts.Ports {
+	for _, tcpCheck := range opts.Checks.TcpChecks {


So it looks like this is still doing solely TCP checks?

It is - so my plan was to just iterate over each of the different Checks.* - maybe I need to think up a better solution for this one...

One way to do it w/o loads of copy/paste is to define a common interface for all the types of checks.

Thought about this some more - definitely want to switch this to using a common interface - it will then have a method set of (what i can think of so far):

DoCheck (as you mentioned)
ValidateCheck - this would be really handy inside the parseChecksFile to ensure the "required" properties of each check is set... otherwise it currently will error when it's too late (during healthchecks)

brikis98 · 2018-03-01T01:03:41Z

server/server.go

 			}
-
 			waitGroup.Done()


I know it's not your code, but typically, waitGroup.Done() should be in a defer to ensure it's always called

sarkis · 2018-03-02T06:22:44Z

@brikis98 what a difference the Check interface made - this feels way more maintainable and extensible, thank you for the suggestion! Though I could probably use a CR when you have some free time as a lot changed (mostly removed a ton of code which is great)

Remaining tasks (hoping I can get by Monday):

~~Implement DoCheck() for HttpCheck and ScriptCheck~~
~~Implement ValidateCheck() method for each check and validate in parseChecksFile~~
Tests!

brikis98 · 2018-03-02T10:54:21Z

Nice. Looks like a good improvement 👍

sarkis · 2018-03-10T20:04:51Z

@brikis98 I've got a lot more tests to write to get to an acceptable coverage % and I'm sure writing the tests will lead to some refactoring.. but wanted to see if I can get a re-CR as I work on those, when/if you are up for it.

brikis98

OK, just did a pass. Overall, the design for the checks looks good, the code structure is reasonable, the tests for config parsing look good, and although I made a bunch of comments, just about all of them are nitpicky things that aren't very important. The only critical things left are (a) the error handling/logging for the script checks, which is the only significant comment on this PR (b) automated tests for the checks themselves. Everything else is optional and/or can be filed as issues for future work.

brikis98 · 2018-03-10T23:16:15Z

README.md

-Run a listener on port 6000 that accepts all inbound HTTP connections for any URL. When the request is received,
-attempt to open TCP connections to port 5432 and 3306. If both succeed, return `HTTP 200 OK`. If any fails, return `HTTP
-504 Gateway Not Found`.
+#### Configuration


You might want to add a sentence in this section that says something along the lines of, "You can use a yaml file to configure health-checker with the following settings:"

brikis98 · 2018-03-10T23:16:35Z

README.md

 | `--listener` |  The IP address and port on which inbound HTTP connections will be accepted. | `0.0.0.0:5000`
+| `--config` | A YAML file containing checks which will be evaluated | `health-checks.yml`


"See the configuration section below for details."

brikis98 · 2018-03-10T23:19:22Z

README.md

+| `name`    | (Required) Name of the health check
+| `host`    | (Required) IP or hostname to check
+| `port`    | (Required) `port` to check on `host`
+| `timeout` | Timeout for health check in seconds - if not specified defaults to 5


Nice docs 👍

brikis98 · 2018-03-10T23:20:37Z

README.md

+| `host`         | (Required) IP or hostname to check
+| `port`         | (Required) `port` to check on `host`
+| `timeout`      | Timeout for health check in seconds - if not specified defaults to 5
+| `status_codes` | An array of status codes which should PASS health check


This should probably default to [200].

brikis98 · 2018-03-10T23:21:02Z

README.md

+| `port`         | (Required) `port` to check on `host`
+| `timeout`      | Timeout for health check in seconds - if not specified defaults to 5
+| `status_codes` | An array of status codes which should PASS health check
+| `body_regex`   | **Will not be checked if `status_codes` specified** - A string to search for in the body of the response, if found will pass health check


Why not check both the body and status codes if both are specified?

Good point - changing this!

brikis98 · 2018-03-10T23:42:10Z

server/server.go

+	if c.Name == "" {
+		return &InvalidCheck{name: "script", key: "name"}
+	}
+	if c.Script == "" {


Perhaps check that this script exists and you have execute permissions?

brikis98 · 2018-03-10T23:43:55Z

server/server.go

+	ctx, cancel := context.WithTimeout(context.Background(), timeout)
+	defer cancel()
+
+	cmd := exec.CommandContext(ctx, c.Script)


I didn't know this existed. Nice find!

brikis98 · 2018-03-10T23:44:44Z

README.md

+```yaml
+script:
+  - name: scriptCheck1
+    script: /path/to/some/script.sh


Seems like we should also support args, a list of arguments to pass to the script.

brikis98 · 2018-03-10T23:45:44Z

server/server.go

+	defer cancel()
+
+	cmd := exec.CommandContext(ctx, c.Script)
+	_, err := cmd.Output()


Don't throw away the output from the command, as it's essential for debugging what happened. Instead, log that output.

brikis98 · 2018-03-10T23:46:34Z

server/server.go

+		return &CheckTimeout{name: c.Name, timeout: int(timeout)}
+	}
+	if err != nil {
+		return &CheckFail{name: c.Name, reason: "non-zero exit code"}


We should log this error and return it to help with debugging!

Is it bad form to log it like this: https://github.com/gruntwork-io/health-checker/pull/9/files#diff-91bbeda7eb98a7adc57b9e47e2cf5c2bR79

It currently returns then logs it in that function

Yes, that's totally fine. However, you should then return the underlying err from this method (e.g., it could be an optional field in the CheckFail struct that is rendered in its Error() method) so it gets logged by DoCheck.

I get it now - thanks again @brikis98 - working on making sure the error/logging loose ends are all tied up and getting those checks tested... I think I'll feel better about "production" ready after that and a final final final CR :P

sarkis · 2019-01-14T19:36:25Z

This PR has gotten stale, I'm opting to close this out as I don't want this to deter others from contributing to and fixing the issues that this PR was set out to originally fix.

A huge lesson learned on scoping here (I took on way too much for a single PR). I don't know of a clean way out of the mess I've created here other than starting over with tighter scoped PRs.

docs for new http and script checks

c5927cf

mcalhoun reviewed Feb 28, 2018

View reviewed changes

fix 504 http status

b452fc1

brikis98 reviewed Feb 28, 2018

View reviewed changes

a new approach using a config file

7359201

brikis98 reviewed Feb 28, 2018

View reviewed changes

sarkis added 2 commits February 28, 2018 12:24

changes from CR and better documentation

1a8dc74

convert current ports check to work with yaml file

dc2a791

brikis98 reviewed Feb 28, 2018

View reviewed changes

use the struct to do tcp checks

296b46f

brikis98 reviewed Mar 1, 2018

View reviewed changes

sarkis added 7 commits February 28, 2018 17:16

fixes from CR

f8e392f

add http checkers

b8ce7a9

adjust debug mode logic

667aa16

update README and cleanup

f6c2b8c

comments and cleanup

687b550

use common Check interface

de4dae1

no longer need this to be a variable

c115758

sarkis added 9 commits March 2, 2018 13:34

add HttpCheck DoCheck method

7b1127e

add DoCheck for ScriptCheck

c27b432

add ValidateCheck for all checks and run after Unmarshaling

607a644

add a default http behavior to pass on 200 if no conditions defined

e98dd19

add more info on configs in README

0321e74

add an advanced example

500434f

refactor a bit to make unit testing easier

2a8369d

error with a message out if bad or empty config file is passed

4eb333e

configurable timeout with a 5s default if not specified

fd09b6c

fix syntax error

c62118c

sarkis changed the title ~~WIP: add http and script checks~~ WIP: Refactor and add http and script checks Mar 4, 2018

sarkis changed the title ~~WIP: Refactor and add http and script checks~~ WIP: refactor and add http and script checks Mar 4, 2018

sarkis added 2 commits March 4, 2018 12:11

remove call to opts here as it will not exist in the if err != nil

faa39bd

add first internal unit test

61a145b

sarkis mentioned this pull request Mar 4, 2018

Running health-checker with HEALTH_CHECKER_DEBUG="true" fails to start #8

Open

test an invalid config

03544b5

sarkis changed the title ~~WIP: refactor and add http and script checks~~ [WIP] refactor and add http and script checks Mar 4, 2018

sarkis closed this Mar 6, 2018

sarkis reopened this Mar 8, 2018

sarkis added 4 commits March 10, 2018 09:05

cleanup tests and use subtests for better organization

308c457

error cleanup

00fc8e5

add more logging

cf2321b

rename test file

59c1e2a

brikis98 reviewed Mar 10, 2018

View reviewed changes

autero1 mentioned this pull request Jan 8, 2019

Add script checks #12

Merged

sarkis closed this Jan 14, 2019

@@ @@ -0,0 +1,3 @@ @@
+              ports:
+                - 5500
+                - 6500

		\| `--listener` \| The IP address and port on which inbound HTTP connections will be accepted. \| `0.0.0.0:5000`
		\| `--config` \| A YAML file containing checks which will be evaluated \| `health-checks.yml`

[WIP] refactor and add http and script checks #9

[WIP] refactor and add http and script checks #9

Conversation

sarkis commented Feb 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brikis98 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sarkis Feb 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sarkis Feb 28, 2018 • edited Loading

Choose a reason for hiding this comment

sarkis commented Feb 28, 2018

brikis98 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sarkis Feb 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sarkis Feb 28, 2018 • edited Loading

Choose a reason for hiding this comment

brikis98 Feb 28, 2018 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

brikis98 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sarkis commented Mar 2, 2018 • edited Loading

brikis98 commented Mar 2, 2018

sarkis commented Mar 10, 2018

brikis98 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

sarkis commented Jan 14, 2019

sarkis commented Feb 28, 2018 •

edited

Loading

sarkis Feb 28, 2018 •

edited

Loading

sarkis Feb 28, 2018 •

edited

Loading

sarkis Feb 28, 2018 •

edited

Loading

sarkis Feb 28, 2018 •

edited

Loading

brikis98 Feb 28, 2018 •

edited

Loading

sarkis commented Mar 2, 2018 •

edited

Loading