🔊 setup sns topic for alarm notifications; notify of stopped containers #406

Closed
wants to merge 21 commits

Conversation


@twentylemon (Member) commented Nov 14, 2021

Summary

Note: maybe do #423 first

I guess this fixes #384, though we can add more alarms. If you have better ideas for what to alarm on, let me know.

I spent a while looking into the different alarm options:

  • ec2 instance alarms: would allow terminating a dead instance (say, on CPU redline), but the alarm is per-instance and would have to be recreated whenever it actually triggered or the instance was otherwise replaced
  • autoscaling scaling alarms: would allow de-scaling and re-scaling, but require chaining alarms together (say, CPU redline -> downscale -> no instances -> upscale), which I didn't really like
  • ecs container alarms: what I ended up with

I figured:

  • ec2 already does health checks, and I don't really care if the autoscaling group replaces a host; there's nothing we can really do to react to such an alarm
  • we've configured health checks on the ecs containers, so they'll be restarted if health checks fail (or if the containers crash)
  • autoscaling alarms don't really matter if ecs is already going to restart tasks

So, being notified of unintentionally stopped ecs containers seemed like enough, at least for starters. See the tutorial I mostly followed.
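
For the curious, the rule boils down to matching ECS task state-change events where the task stopped and routing them to the sns topic. Here's a minimal boto3 sketch of the same wiring; the resource names are made up, and the PR declares all of this in cloudformation rather than through the API:

```python
import json

import boto3

events = boto3.client("events")
sns = boto3.client("sns")

# hypothetical topic name; the real one is declared in the cloudformation template
topic_arn = sns.create_topic(Name="alarm-notifications")["TopicArn"]

# match ECS task state changes where the task has stopped
events.put_rule(
    Name="ecs-task-stopped",
    EventPattern=json.dumps({
        "source": ["aws.ecs"],
        "detail-type": ["ECS Task State Change"],
        "detail": {"lastStatus": ["STOPPED"]},
    }),
    State="ENABLED",
)

# route matching events to the topic; the topic also needs a resource policy
# allowing events.amazonaws.com to publish to it
events.put_targets(
    Rule="ecs-task-stopped",
    Targets=[{"Id": "sns", "Arn": topic_arn}],
)
```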

Note this doesn't actually add any subscriptions; we can do that in follow-up commits. It'd just be a matter of specifying the AlarmSubscription* cloudformation parameters, which would likely be github secrets. We'll need a new webhook to publish to.
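
Adding a subscription later would amount to something like this sketch; the protocol and endpoint are hard-coded placeholders here, but they'd come from the AlarmSubscription* parameters:

```python
import boto3

sns = boto3.client("sns")

# placeholder values; these would come from the AlarmSubscription* parameters,
# which in turn would be github secrets
sns.subscribe(
    TopicArn="arn:aws:sns:us-east-1:123456789012:alarm-notifications",
    Protocol="email",
    Endpoint="someone@example.com",
)
```

Email subscriptions have to be confirmed by the recipient before sns delivers anything to them.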

Tested manually by creating the topic and rule and subscribing myself. I then ssh'd into the ec2 host and ran docker kill on a container; I received a single email for the dead container. See the #duckbutt channel in the beta server.

Checklist
  • this is a source code change
    • run format in the repository root
    • run pytest in the repository root
    • link to issue being fixed using a fixes keyword, if such an issue exists
    • update the wiki documentation if necessary
  • or, this is not a source code change

@twentylemon twentylemon added the operational Related to running/debugging/OE label Nov 14, 2021

codecov bot commented Nov 14, 2021

Codecov Report

Merging #406 (a0ac86d) into main (d75d727) will not change coverage.
The diff coverage is n/a.


@@           Coverage Diff           @@
##             main     #406   +/-   ##
=======================================
  Coverage   97.09%   97.09%           
=======================================
  Files          81       81           
  Lines        1581     1581           
  Branches      213      213           
=======================================
  Hits         1535     1535           
  Misses         37       37           
  Partials        9        9           


@twentylemon (Member Author) commented

Seems lambda's free tier is more than enough to cover alarms. Moving to draft, I'll fiddle with publishing to a discord webhook instead.

@twentylemon twentylemon marked this pull request as draft November 14, 2021 23:28
@twentylemon twentylemon marked this pull request as ready for review November 15, 2021 01:22
@twentylemon (Member Author) commented

> Seems lambda's free tier is more than enough to cover alarms. Moving to draft, I'll fiddle with publishing to a discord webhook instead.

Now it's all lambda. Neat.
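
For reference, the lambda amounts to something like the sketch below. This is an illustration only, not the code from the PR; the DISCORD_WEBHOOK_URL environment variable and the message format are assumptions.

```python
import json
import os
import urllib.request

# assumed environment variable; the real webhook url would be a github secret
# passed through cloudformation
WEBHOOK_URL = os.environ["DISCORD_WEBHOOK_URL"]


def handler(event, context):
    """Forward an ECS task state-change event to a discord webhook."""
    detail = event.get("detail", {})
    content = "container stopped: {} ({})".format(
        detail.get("group", "unknown task"),
        detail.get("stoppedReason", "no reason given"),
    )
    request = urllib.request.Request(
        WEBHOOK_URL,
        data=json.dumps({"content": content}).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    # discord responds 204 No Content on success
    with urllib.request.urlopen(request):
        pass
```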

@twentylemon
Copy link
Member Author

Revisiting after the merge: I'm not sure this is a great alarm. Stopped containers still get restarted by ECS, so it's an alarm we can't really act on.

Will rethink in the future. Just closing for now.

@twentylemon twentylemon closed this Dec 1, 2021
Labels
operational Related to running/debugging/OE

Linked issues
setup cloudwatch alarms