fix: grafana 5xx errors #420

Merged
merged 6 commits into from
Mar 18, 2024
Changes from 5 commits
2 changes: 2 additions & 0 deletions terraform/ecs/README.md
@@ -81,6 +81,8 @@ This module creates an ECS cluster and an autoscaling group of EC2 instances to
| <a name="output_ecs_task_family"></a> [ecs\_task\_family](#output\_ecs\_task\_family) | The family of the task definition |
| <a name="output_load_balancer_arn"></a> [load\_balancer\_arn](#output\_load\_balancer\_arn) | The ARN of the load balancer |
| <a name="output_load_balancer_arn_suffix"></a> [load\_balancer\_arn\_suffix](#output\_load\_balancer\_arn\_suffix) | The ARN suffix of the load balancer |
+| <a name="output_log_group_app_arn"></a> [log\_group\_app\_arn](#output\_log\_group\_app\_arn) | The ARN of the log group for the app |
+| <a name="output_log_group_app_name"></a> [log\_group\_app\_name](#output\_log\_group\_app\_name) | The name of the log group for the app |
| <a name="output_service_security_group_id"></a> [service\_security\_group\_id](#output\_service\_security\_group\_id) | The ID of the security group for the service |
| <a name="output_target_group_arn"></a> [target\_group\_arn](#output\_target\_group\_arn) | The ARN of the target group |

10 changes: 10 additions & 0 deletions terraform/ecs/outputs.tf
@@ -32,3 +32,13 @@ output "load_balancer_arn_suffix" {
description = "The ARN suffix of the load balancer"
value = aws_lb.load_balancer.arn_suffix
}
+
+output "log_group_app_name" {
+description = "The name of the log group for the app"
+value = aws_cloudwatch_log_group.cluster.name
+}
+
+output "log_group_app_arn" {
+description = "The ARN of the log group for the app"
+value = aws_cloudwatch_log_group.cluster.arn
+}
3 changes: 3 additions & 0 deletions terraform/monitoring/README.md
@@ -27,11 +27,14 @@ Configure the Grafana dashboards for the application
## Inputs
| Name | Description | Type | Default | Required |
|------|-------------|------|---------|:--------:|
+| <a name="input_aws_account_id"></a> [aws\_account\_id](#input\_aws\_account\_id) | The AWS account ID. | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
| <a name="input_context"></a> [context](#input\_context) | Single object for setting entire context at once.<br>See description of individual variables for details.<br>Leave string and numeric variables as `null` to use default value.<br>Individual variable settings (non-null) override settings in context object,<br>except for attributes and tags, which are merged. | <pre lang="json">any</pre> | <pre lang="json">n/a</pre> | yes |
| <a name="input_ecs_cluster_name"></a> [ecs\_cluster\_name](#input\_ecs\_cluster\_name) | The name of the ECS cluster. | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
| <a name="input_ecs_service_name"></a> [ecs\_service\_name](#input\_ecs\_service\_name) | The name of the ECS service. | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
| <a name="input_ecs_target_group_arn"></a> [ecs\_target\_group\_arn](#input\_ecs\_target\_group\_arn) | The ARN of the ECS LB target group. | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
| <a name="input_load_balancer_arn"></a> [load\_balancer\_arn](#input\_load\_balancer\_arn) | The ARN of the load balancer. | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
+| <a name="input_log_group_app_arn"></a> [log\_group\_app\_arn](#input\_log\_group\_app\_arn) | The ARN of the log group for the app | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
+| <a name="input_log_group_app_name"></a> [log\_group\_app\_name](#input\_log\_group\_app\_name) | The name of the log group for the app | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
| <a name="input_monitoring_role_arn"></a> [monitoring\_role\_arn](#input\_monitoring\_role\_arn) | The ARN of the monitoring role. | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
| <a name="input_notification_channels"></a> [notification\_channels](#input\_notification\_channels) | The notification channels to send alerts to | <pre lang="json">list(any)</pre> | <pre lang="json">n/a</pre> | yes |
| <a name="input_prometheus_endpoint"></a> [prometheus\_endpoint](#input\_prometheus\_endpoint) | The endpoint for the Prometheus server. | <pre lang="json">string</pre> | <pre lang="json">n/a</pre> | yes |
83 changes: 43 additions & 40 deletions terraform/monitoring/dashboard.jsonnet
@@ -17,16 +17,19 @@ local ds = {
},
};
local vars = {
-namespace: 'Notify',
-environment: std.extVar('environment'),
-notifications: std.parseJson(std.extVar('notifications')),
-
-ecs_service_name: std.extVar('ecs_service_name'),
-ecs_cluster_name: std.extVar('ecs_cluster_name'),
-rds_cluster_id: std.extVar('rds_cluster_id'),
-redis_cluster_id: std.extVar('redis_cluster_id'),
-load_balancer: std.extVar('load_balancer'),
-target_group: std.extVar('target_group'),
+namespace: 'Notify',
+environment: std.extVar('environment'),
+notifications: std.parseJson(std.extVar('notifications')),
+
+ecs_service_name: std.extVar('ecs_service_name'),
+ecs_cluster_name: std.extVar('ecs_cluster_name'),
+rds_cluster_id: std.extVar('rds_cluster_id'),
+redis_cluster_id: std.extVar('redis_cluster_id'),
+load_balancer: std.extVar('load_balancer'),
+target_group: std.extVar('target_group'),
+log_group_app_name: std.extVar('log_group_app_name'),
+log_group_app_arn: std.extVar('log_group_app_arn'),
+aws_account_id: std.extVar('aws_account_id'),
};

////////////////////////////////////////////////////////////////////////////////
@@ -57,38 +60,39 @@ dashboard.new(
.addPanels(layout.generate_grid([
//////////////////////////////////////////////////////////////////////////////
row.new('Application'),
-panels.app.http_request_rate(ds, vars) { gridPos: pos._4 },
-panels.app.http_request_latency(ds, vars) { gridPos: pos._4 },
-
-panels.app.subscribed_topics(ds, vars) { gridPos: pos._4 },
-panels.app.subscribe_latency(ds, vars) { gridPos: pos._4 },
-
-panels.app.relay_incoming_message_rate(ds, vars) {gridPos: pos._6 },
-panels.app.relay_incoming_message_latency(ds, vars) {gridPos: pos._6 },
-panels.app.relay_incoming_message_server_errors(ds, vars) {gridPos: pos._6 },
-
-panels.app.relay_outgoing_message_rate(ds, vars) {gridPos: pos._6 },
-panels.app.relay_outgoing_message_latency(ds, vars) {gridPos: pos._6 },
-panels.app.relay_outgoing_message_failures(ds, vars) {gridPos: pos._6 },
-
-panels.app.postgres_query_rate(ds, vars) {gridPos: pos._6 },
-panels.app.postgres_query_latency(ds, vars) {gridPos: pos._6 },
-panels.app.keys_server_request_rate(ds, vars) {gridPos: pos._6 },
-panels.app.keys_server_request_latency(ds, vars) {gridPos: pos._6 },
-panels.app.registry_request_rate(ds, vars) {gridPos: pos._6 },
-panels.app.registry_request_latency(ds, vars) {gridPos: pos._6 },
-
-panels.app.relay_subscribe_rate(ds, vars) {gridPos: pos._6 },
-panels.app.relay_subscribe_latency(ds, vars) {gridPos: pos._6 },
-panels.app.relay_subscribe_failures(ds, vars) {gridPos: pos._6 },
+panels.app.http_request_rate(ds, vars) { gridPos: pos._3 },
+panels.app.http_request_latency(ds, vars) { gridPos: pos._3 },
+panels.lb.error_5xx(ds, vars) { gridPos: pos._3 },
+panels.lb.error_5xx_logs(ds, vars) { gridPos: pos._3 },
+
+panels.app.relay_incoming_message_rate(ds, vars) { gridPos: pos._6 },
+panels.app.relay_incoming_message_latency(ds, vars) { gridPos: pos._6 },
+panels.app.relay_incoming_message_server_errors(ds, vars) { gridPos: pos._6 },
+
+panels.app.relay_outgoing_message_rate(ds, vars) { gridPos: pos._6 },
+panels.app.relay_outgoing_message_latency(ds, vars) { gridPos: pos._6 },
+panels.app.relay_outgoing_message_failures(ds, vars) { gridPos: pos._6 },
+
+panels.app.postgres_query_rate(ds, vars) { gridPos: pos._6 },
+panels.app.postgres_query_latency(ds, vars) { gridPos: pos._6 },
+panels.app.keys_server_request_rate(ds, vars) { gridPos: pos._6 },
+panels.app.keys_server_request_latency(ds, vars) { gridPos: pos._6 },
+panels.app.registry_request_rate(ds, vars) { gridPos: pos._6 },
+panels.app.registry_request_latency(ds, vars) { gridPos: pos._6 },
+
+panels.app.relay_subscribe_rate(ds, vars) { gridPos: pos._6 },
+panels.app.relay_subscribe_latency(ds, vars) { gridPos: pos._6 },
+panels.app.relay_subscribe_failures(ds, vars) { gridPos: pos._6 },
+panels.app.subscribed_topics(ds, vars) { gridPos: pos._4 },
+panels.app.subscribe_latency(ds, vars) { gridPos: pos._4 },

row.new('Notification publisher background service'),
-panels.app.publishing_workers_count(ds, vars) {gridPos: pos._5 },
-panels.app.publishing_workers_errors(ds, vars) {gridPos: pos._5 },
-panels.app.publishing_workers_queued_size(ds, vars) {gridPos: pos._5 },
+panels.app.publishing_workers_count(ds, vars) { gridPos: pos._5 },
+panels.app.publishing_workers_errors(ds, vars) { gridPos: pos._5 },
+panels.app.publishing_workers_queued_size(ds, vars) { gridPos: pos._5 },

-panels.app.publishing_workers_processing_size(ds, vars) {gridPos: pos._5 },
-panels.app.publishing_workers_published_count(ds, vars) {gridPos: pos._5 },
+panels.app.publishing_workers_processing_size(ds, vars) { gridPos: pos._5 },
+panels.app.publishing_workers_published_count(ds, vars) { gridPos: pos._5 },

row.new('Deprecated metrics'),
panels.app.notify_latency(ds, vars) { gridPos: pos._4 },
@@ -120,5 +124,4 @@ dashboard.new(

panels.lb.healthy_hosts(ds, vars) { gridPos: pos._3 },
panels.lb.error_4xx(ds, vars) { gridPos: pos._3 },
-panels.lb.error_5xx(ds, vars) { gridPos: pos._3 },
]))
2 changes: 1 addition & 1 deletion terraform/monitoring/grafonnet-lib
15 changes: 9 additions & 6 deletions terraform/monitoring/main.tf
@@ -11,12 +11,15 @@ data "jsonnet_file" "dashboard" {
environment = module.this.stage
notifications = jsonencode(var.notification_channels)

-ecs_cluster_name = var.ecs_cluster_name
-ecs_service_name = var.ecs_service_name
-rds_cluster_id = var.rds_cluster_id
-redis_cluster_id = var.redis_cluster_id
-load_balancer = var.load_balancer_arn
-target_group = var.ecs_target_group_arn
+ecs_cluster_name = var.ecs_cluster_name
+ecs_service_name = var.ecs_service_name
+rds_cluster_id = var.rds_cluster_id
+redis_cluster_id = var.redis_cluster_id
+load_balancer = var.load_balancer_arn
+target_group = var.ecs_target_group_arn
+log_group_app_name = var.log_group_app_name
+log_group_app_arn = var.log_group_app_arn
+aws_account_id = var.aws_account_id
}
}

2 changes: 1 addition & 1 deletion terraform/monitoring/panels/lb/error_5xx.libsonnet
@@ -49,7 +49,7 @@ local _alert(namespace, env, notifications) = grafana.alert.new(
{
new(ds, vars)::
panels.timeseries(
-title = '5XX',
+title = 'HTTP 5xx Rate',
datasource = ds.cloudwatch,
)
.configure(_configuration)
34 changes: 34 additions & 0 deletions terraform/monitoring/panels/lb/error_5xx_logs.libsonnet
@@ -0,0 +1,34 @@
+local grafana = import '../../grafonnet-lib/grafana.libsonnet';
+local defaults = import '../../grafonnet-lib/defaults.libsonnet';
+
+local cloudwatch_target = import '../../grafonnet-lib/targets/cloudwatch.libsonnet';
+
+local panels = grafana.panels;
+local targets = grafana.targets;
+
+{
+new(ds, vars)::
+panels.timeseries(
+title = 'HTTP 5xx Errors',
+datasource = ds.cloudwatch,
+)
+.configure({
+fieldConfig: {},
+options: {
+showHeader: false,
+},
+})
+
+.addTarget(targets.cloudwatch(
+datasource = ds.cloudwatch,
+namespace = "",
+queryMode = cloudwatch_target.queryModes.Logs,
+logGroups = [{
+arn: vars.log_group_app_arn,
+name: vars.log_group_app_name,
+accountId: vars.aws_account_id,
+}],
+expression = 'fields @timestamp, @message, @logStream, @log\n| filter @message like /HTTP server error/\n| parse @message /^(?<LogTimestamp>[^\\s]+)/\n| display @message\n| sort LogTimestamp desc',
+refId = '5xx_Errors',
+))
+}
1 change: 1 addition & 0 deletions terraform/monitoring/panels/panels.libsonnet
@@ -60,6 +60,7 @@ local docdb_mem_threshold = units.size_bin(GiB = docdb_mem * 0.1);
active_connections: (import 'lb/active_connections.libsonnet' ).new,
error_4xx: (import 'lb/error_4xx.libsonnet' ).new,
error_5xx: (import 'lb/error_5xx.libsonnet' ).new,
+error_5xx_logs: (import 'lb/error_5xx_logs.libsonnet' ).new,
healthy_hosts: (import 'lb/healthy_hosts.libsonnet' ).new,
requests: (import 'lb/requests.libsonnet' ).new,
},
15 changes: 15 additions & 0 deletions terraform/monitoring/variables.tf
@@ -42,3 +42,18 @@ variable "load_balancer_arn" {
description = "The ARN of the load balancer."
type = string
}
+
+variable "log_group_app_name" {
+description = "The name of the log group for the app"
+type = string
+}
+
+variable "log_group_app_arn" {
+description = "The ARN of the log group for the app"
+type = string
+}
+
+variable "aws_account_id" {
+description = "The AWS account ID."
+type = string
+}
3 changes: 3 additions & 0 deletions terraform/res_monitoring.tf
@@ -13,4 +13,7 @@ module "monitoring" {
ecs_target_group_arn = module.ecs.target_group_arn
redis_cluster_id = module.redis.cluster_id
load_balancer_arn = module.ecs.load_balancer_arn_suffix
+log_group_app_name = module.ecs.log_group_app_name
+log_group_app_arn = module.ecs.log_group_app_arn
+aws_account_id = data.aws_caller_identity.this.account_id
}
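
The `aws_account_id` argument in the `monitoring` module call reads from `data.aws_caller_identity.this`, which is not declared anywhere in this diff. A minimal sketch of the declaration it presumably relies on, assuming it already exists elsewhere in the root module (its location is an assumption, not something this PR shows):

```hcl
# Assumption: declared once in the root module; not part of this PR's diff.
# The data source takes no arguments and exposes the account ID of the
# credentials Terraform is running with as `account_id`.
data "aws_caller_identity" "this" {}
```

If no such data source is declared, Terraform should reject the module call with a reference to an undeclared resource, so it is worth confirming before applying.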