alerts to catch 4xx type errors
Signed-off-by: Kenny Leung <[email protected]>
k4leung4 committed Dec 19, 2024
1 parent 508651e commit 0419af1
Showing 3 changed files with 117 additions and 0 deletions.
4 changes: 4 additions & 0 deletions modules/alerting/README.md
@@ -28,6 +28,8 @@ No modules.
| [google_monitoring_alert_policy.cloud-run-scaling-failure](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.cloudrun_timeout](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.fatal](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.grpc_error_rate](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.http_error_rate](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.oom](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.panic](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
| [google_monitoring_alert_policy.panic-stacktrace](https://registry.terraform.io/providers/hashicorp/google/latest/docs/resources/monitoring_alert_policy) | resource |
@@ -48,6 +50,8 @@ No modules.
| <a name="input_failure_rate_exclude_services"></a> [failure\_rate\_exclude\_services](#input\_failure\_rate\_exclude\_services) | List of service names to exclude from the 5xx failure rate alert | `list(string)` | `[]` | no |
| <a name="input_failure_rate_ratio_threshold"></a> [failure\_rate\_ratio\_threshold](#input\_failure\_rate\_ratio\_threshold) | ratio threshold to alert for cloud run server failure rate. | `number` | `0.2` | no |
| <a name="input_global_only_alerts"></a> [global\_only\_alerts](#input\_global\_only\_alerts) | only enable global alerts. when true, only create alerts that are global. | `bool` | `false` | no |
| <a name="input_grpc_error_threshold"></a> [grpc\_error\_threshold](#input\_grpc\_error\_threshold) | threshold for grpc error. | `number` | `0.25` | no |
| <a name="input_http_error_threshold"></a> [http\_error\_threshold](#input\_http\_error\_threshold) | threshold for http error. | `number` | `0.25` | no |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | List of notification channels to alert. | `list(string)` | `[]` | no |
| <a name="input_notification_channels_email"></a> [notification\_channels\_email](#input\_notification\_channels\_email) | Email notification channel. | `list(string)` | `[]` | no |
| <a name="input_notification_channels_pagerduty"></a> [notification\_channels\_pagerduty](#input\_notification\_channels\_pagerduty) | PagerDuty notification channel. | `list(string)` | `[]` | no |
101 changes: 101 additions & 0 deletions modules/alerting/main.tf
@@ -15,6 +15,7 @@ locals {
locals {
squad_log_filter = var.squad == "" ? "" : "labels.squad=\"${var.squad}\""
name = var.squad == "" ? "global" : var.squad
metric_filter = var.squad == "" ? "" : "metric.labels.team=\"${var.squad}\""
}

locals {
@@ -916,3 +917,103 @@ resource "google_monitoring_alert_policy" "pinned" {
enabled = "true"
project = var.project_id
}

resource "google_monitoring_alert_policy" "http_error_rate" {
count = var.global_only_alerts ? 0 : 1

alert_strategy {
auto_close = "3600s" // 1 hour
}

combiner = "OR"

conditions {
condition_threshold {
aggregations {
alignment_period = "60s"
cross_series_reducer = "REDUCE_MEAN"
per_series_aligner = "ALIGN_RATE"
group_by_fields = [
"metric.label.team",
"metric.label.service_name",
]
}

comparison = "COMPARISON_GT"
duration = "300s"
# ignore registry service - valid 4xx use cases
# ignore prober - handled by prober alerts
# ignore 2xx and 3xx; only alert on 4xx and 5xx
filter = <<EOT
resource.type = "prometheus_target"
metric.type = "prometheus.googleapis.com/http_request_status_total/counter"
metric.labels.service_name != monitoring.regex.full_match(".*-registry")
metric.labels.service_name != monitoring.regex.full_match("prb-.*")
metric.labels.code != monitoring.regex.full_match("[23]..")
${local.metric_filter}
EOT

trigger {
count = "1"
}

threshold_value = var.http_error_threshold
}

display_name = "http error rate ${local.name}"
}
display_name = "http error rate ${local.name}"

notification_channels = length(var.notification_channels) != 0 ? var.notification_channels : local.slack

enabled = "true"
project = var.project_id
}

resource "google_monitoring_alert_policy" "grpc_error_rate" {
count = var.global_only_alerts ? 0 : 1

alert_strategy {
auto_close = "3600s" // 1 hour
}

combiner = "OR"

conditions {
condition_threshold {
aggregations {
alignment_period = "60s"
cross_series_reducer = "REDUCE_MEAN"
per_series_aligner = "ALIGN_RATE"
group_by_fields = [
"metric.label.team",
"metric.label.service_name",
]
}

comparison = "COMPARISON_GT"
duration = "300s"
# ignore the OK and AlreadyExists codes
filter = <<EOT
resource.type = "prometheus_target"
metric.type = "prometheus.googleapis.com/grpc_server_handled_total/counter"
metric.labels.grpc_code != monitoring.regex.full_match("OK|AlreadyExists")
${local.metric_filter}
EOT

trigger {
count = "1"
}

threshold_value = var.grpc_error_threshold
}

display_name = "grpc error rate ${local.name}"
}
display_name = "grpc error rate ${local.name}"

notification_channels = length(var.notification_channels) != 0 ? var.notification_channels : local.slack

enabled = "true"
project = var.project_id
}
12 changes: 12 additions & 0 deletions modules/alerting/variables.tf
@@ -114,3 +114,15 @@ variable "global_only_alerts" {
type = bool
default = false
}

variable "http_error_threshold" {
description = "threshold for http error."
type = number
default = 0.25
}

variable "grpc_error_threshold" {
description = "threshold for grpc error."
type = number
default = 0.25
}
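
The two new inputs can be overridden per caller. The following is a minimal sketch of a module call, not taken from this commit: the module label, the relative `source` path, the squad name, and the threshold values are all hypothetical, and the module may require other inputs not shown in this diff. Because both policies use `ALIGN_RATE`, the threshold appears to be compared against a per-second rate of matching requests (averaged per team and service), not a percentage.

```hcl
# Hypothetical caller of modules/alerting; names and values are illustrative only.
module "alerting" {
  source = "./modules/alerting" # assumed path relative to the caller

  project_id = var.project_id
  squad      = "example-squad" # hypothetical; an empty string yields the "global" name

  # Leave empty to fall back to the module's default Slack channel.
  notification_channels = []

  # Both default to 0.25 in variables.tf; override to tune alert sensitivity.
  http_error_threshold = 0.5
  grpc_error_threshold = 0.1

  # When true, the http_error_rate and grpc_error_rate policies are not created.
  global_only_alerts = false
}
```

Setting `squad` also populates `local.metric_filter`, which restricts both filters to series whose `team` label matches that squad.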
