A comprehensive Terraform solution for automated Datadog monitoring across AWS services
- ECS (Elastic Container Service): Container-level metrics and health monitoring
- RDS/Aurora: Database performance and health metrics
- Application Load Balancer: Request metrics and latency monitoring
- SQS/SNS: Message queue monitoring and dead letter queue alerts
- APM Integration: Full application performance monitoring
- Language-Specific Monitoring:
  - Java: JVM metrics, garbage collection, memory usage
  - Node.js: Event loop, heap memory, CPU utilization
- Log Management: Custom log patterns, error rate tracking
- Slack Integration: Automated alerts and notifications

Clone the repository and set up the Python environment:

    git clone https://github.com/Beast12/terraform-datadog-monitoring.git
    cd terraform-datadog-monitoring

    # Set up the Python environment
    python3 -m venv venv
    source venv/bin/activate
    pip install -r requirements.txt

Prerequisites:

- Python 3.x
- Terraform >= 1.0
- Valid Datadog account with API and APP keys
- AWS credentials with appropriate permissions
- Slack workspace (for notifications)
Create a configuration file in `monitor_configs/applications/`. Example configuration:

    name: "example-app-1"
    description: "Example API Service"
    type: "java"
    monitor_sets:
      infrastructure:
        ecs:
          enabled: true
          settings:
            services:
              example-app-1:
                thresholds:
                  cpu_percent: 85
                  memory_percent: 90
                  memory_available: 1024
                  network_errors: 20
                alert_settings:
                  include_tags: true
                  priority: "3"
              example-app-2:
                thresholds:
                  cpu_percent: 85
                  memory_percent: 90
                  memory_available: 1024
                  network_errors: 20
                alert_settings:
                  include_tags: true
                  priority: "3"
        alb:
          enabled: true
          settings:
            services:
              example-app-1:
                alb_name: "example-app-1-alb"
                thresholds:
                  request_count: 100
                  latency: 200
                  error_rate: 20
                alert_settings:
                  include_tags: true
                  priority: "3"
        db:
          enabled: true
          settings:
            databases:
              example-app-1:
                type: "rds"
                identifier: "example-app-1-placeholder"
                service_name: "example-app-1"
                thresholds:
                  cpu_percent: 80
                  memory_threshold: 2048
                  connection_threshold: 100
                alert_settings:
                  include_tags: true
                  priority: "3"
      messaging:
        sqs:
          enabled: true
          settings:
            queues:
              example-app-1-application-events:
                queue_name: "example-app-1-application-events"
                dlq_name: "example-app-1-application-events-dlq"
                service_name: "example-app-1"
                thresholds:
                  age_threshold: 300
                  depth_threshold: 1000
                  dlq_threshold: 1
                alert_settings:
                  include_tags: true
                  priority: "3"
        sns:
          enabled: false
          settings:
            topics:
              example-app-1:
                topic_name: "example-app-1-topic"
                service_name: "example-app-1"
                thresholds:
                  message_count_threshold: 100
                  age_threshold: 300
                alert_settings:
                  include_tags: true
                  priority: "3"
      application:
        apm:
          enabled: true
          services:
            example-app-1:
              thresholds:
                latency: 200 # in ms
                error_rate: 0.05 # 5% error rate
                throughput: 100 # requests per minute
              alert_settings:
                priority: "3"
                include_tags: true
            example-app-2:
              enabled: false
              thresholds:
                latency: 250
                error_rate: 0.07
                throughput: 120
              alert_settings:
                priority: "3"
                include_tags: true
        java:
          enabled: true
          services:
            example-app-1:
              thresholds:
                jvm_memory_used: 1700
                minor_gc_time: 200 # Set your desired threshold for minor GC
                major_gc_time: 150
              alert_settings:
                priority: "3"
            example-app-2:
              thresholds:
                jvm_memory_used: 1700
                minor_gc_time: 200 # Set your desired threshold for minor GC
                major_gc_time: 150
              alert_settings:
                priority: "3"
        logs:
          enabled: true
          services:
            example-app-1:
              custom_log_lines:
                - "Error getting balance for wallet"
              thresholds:
                critical: 20
                critical_recovery: 15
                warning: 10
                warning_recovery: 5
            example-app-2:
              custom_log_lines:
                - "io.venly.tokenapi.common.exception.WalletBusinessException: An unexpected error occurred. Please contact support!"
              thresholds:
                critical: 20
                critical_recovery: 15
                warning: 10
                warning_recovery: 5

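The log thresholds in the example come in four related values (`critical`, `critical_recovery`, `warning`, `warning_recovery`), and it is easy to get their order wrong when editing by hand. The following is a minimal sketch of a sanity check, assuming the monitor fires when a count rises above its threshold. The function name `check_thresholds` is hypothetical and is not part of this repository.

```python
def check_thresholds(thresholds: dict) -> list:
    """Return ordering problems for an 'alert when above' monitor (empty list = OK)."""
    # Each pair is (lower_key, higher_key): the first value should be strictly smaller.
    pairs = [
        ("warning", "critical"),
        ("critical_recovery", "critical"),
        ("warning_recovery", "warning"),
    ]
    problems = []
    for lower, higher in pairs:
        if lower not in thresholds or higher not in thresholds:
            continue  # only check the pairs that are actually configured
        if thresholds[lower] >= thresholds[higher]:
            problems.append(
                f"{lower}={thresholds[lower]} must be below {higher}={thresholds[higher]}"
            )
    return problems


# Values taken from the example configuration's logs section.
example = {"critical": 20, "critical_recovery": 15, "warning": 10, "warning_recovery": 5}
print(check_thresholds(example))  # []
```

A check like this could run before `terraform plan` so that a mistyped threshold is caught locally rather than at apply time.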
Create environment-specific settings in `monitor_configs/environments/`. Example environment configuration:

    environment: "qa"
    cluster_name: "example-app-1-qa-cluster"
    notification_channels:
      infrastructure:
        ecs: "slack-ecs-alerts-p2"
        alb: "slack-elb-alerts-p2"
        rds: "slack-rds-alerts-p2"
      messaging:
        sns: "slack-sns-alerts-p2"
        sqs: "slack-sqs-alerts-p2"
      application:
        java: "slack-apm-alerts-p2"
        node: "slack-apm-alerts-p2"
        apm: "slack-apm-alerts-p2"
        logs: "slack-logs-alerts-p2"
      default: "slack-ecs-alerts-p2"
    threshold_overrides:
      infrastructure:
        ecs:
          example-app-1:
            cpu_percent: 90
            memory_percent: 90
            memory_available: 2048
            network_errors: 10
            alert_settings:
              priority: "4"
          example-app-2:
            cpu_percent: 90
            memory_percent: 90
            memory_available: 2048
            network_errors: 15
            alert_settings:
              priority: "3"
        alb:
          example-app-1:
            enabled: false
        db:
          enabled: false
      messaging:
        sqs:
          enabled: false
        sns:
          enabled: false
      application:
        apm:
          enabled: true
          services:
            example-app-1:
              thresholds:
                latency: 150
                error_rate: 10
                throughput: 90
              alert_settings:
                priority: "4"
            example-app-2:
              thresholds:
                latency: 220
                error_rate: 10
                throughput: 110
              alert_settings:
                priority: "4"
        java:
          enabled: true
          services:
            example-app-1:
              thresholds:
                jvm_memory_used: 2048
              alert_settings:
                priority: "4"
            example-app-2:
              thresholds:
                jvm_memory_used: 2048
              alert_settings:
                priority: "4"
        logs:
          services:
            example-app-1:
              thresholds:
                critical: 50
                critical_recovery: 40
                warning: 35
                warning_recovery: 30
            example-app-2:
              thresholds:
                critical: 50
                critical_recovery: 40
                warning: 35
                warning_recovery: 30

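The environment file only lists the values that differ from the application file, which suggests the two are combined key by key. How this repository actually merges them is not shown here, so the sketch below is an assumption: a recursive merge in which override values win and unmentioned keys keep their application defaults. The helper name `deep_merge` is hypothetical.

```python
import copy


def deep_merge(base: dict, override: dict) -> dict:
    """Return a new dict where values from `override` replace those in `base`.

    Nested dicts are merged recursively; everything else is replaced outright.
    """
    merged = copy.deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = copy.deepcopy(value)
    return merged


# Java settings for example-app-1: application defaults, then the qa override.
app_defaults = {
    "thresholds": {"jvm_memory_used": 1700, "minor_gc_time": 200, "major_gc_time": 150},
    "alert_settings": {"priority": "3"},
}
qa_override = {
    "thresholds": {"jvm_memory_used": 2048},
    "alert_settings": {"priority": "4"},
}

effective = deep_merge(app_defaults, qa_override)
print(effective["thresholds"])      # {'jvm_memory_used': 2048, 'minor_gc_time': 200, 'major_gc_time': 150}
print(effective["alert_settings"])  # {'priority': '4'}
```

Under this assumption, the qa environment raises `jvm_memory_used` and lowers the alert priority while the GC thresholds stay at their application-level values.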
Set the following values before deploying:

- `DATADOG_API_KEY`: Your Datadog API key
- `DATADOG_APP_KEY`: Your Datadog application key
- `RESOURCES_DEPLOY_ROLE`: AWS IAM role ARN for deployment
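A missing key usually surfaces late, as an authentication error in the middle of a run. As a rough sketch, a pre-flight check could confirm that all three values are present. This assumes they are exposed to the process as environment variables, which the text above does not state; `missing_settings` is a hypothetical helper.

```python
import os

# Names taken from the list above.
REQUIRED = ("DATADOG_API_KEY", "DATADOG_APP_KEY", "RESOURCES_DEPLOY_ROLE")


def missing_settings(env=None) -> list:
    """Return the required names that are unset or empty in `env`."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED if not env.get(name)]


if __name__ == "__main__":
    missing = missing_settings()
    if missing:
        raise SystemExit("Missing required settings: " + ", ".join(missing))
    print("All required settings are present.")
```

Passing a plain dict instead of `os.environ` keeps the function easy to test without touching the real environment.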
To run the deployment workflow:

- Go to the "Actions" tab in your GitHub repository
- Select "Deploy Datadog Monitoring"
- Choose your options:
  - Action: apply/destroy
  - Application: specific app or all
  - Environment: specific environment or all
- Click "Run workflow"

| Priority | Severity | Use Case | Response Time |
|---|---|---|---|
| P1 | 🔴 Critical | Production-breaking issues | Immediate |
| P2 | 🟠 High | Significant service degradation | < 30 mins |
| P3 | 🟡 Medium | Performance issues | < 2 hours |
| P4 | 🟢 Low | Non-critical warnings | Next business day |

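The configuration files set `priority` as a quoted number such as `"3"` or `"4"`. Assuming those numbers correspond to the P1 to P4 rows of the table (the source does not say so explicitly), a small lookup can translate a configured priority into its severity and expected response time. `describe_priority` is a hypothetical helper.

```python
# Rows copied from the priority table: severity, use case, response time.
PRIORITIES = {
    "1": ("Critical", "Production-breaking issues", "Immediate"),
    "2": ("High", "Significant service degradation", "< 30 mins"),
    "3": ("Medium", "Performance issues", "< 2 hours"),
    "4": ("Low", "Non-critical warnings", "Next business day"),
}


def describe_priority(priority) -> str:
    """Describe a configured priority; raises KeyError for unknown values."""
    key = str(priority)
    severity, use_case, response = PRIORITIES[key]
    return f"P{key} ({severity}): {use_case}; response time: {response}"


print(describe_priority("3"))  # P3 (Medium): Performance issues; response time: < 2 hours
```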
- Start with conservative thresholds
- Adjust based on application behavior
- Use different thresholds for different environments
- Group related alerts
- Include relevant tags
- Set appropriate notification channels
- Maintain separate configurations per environment
- Use stricter thresholds in production
- Adjust notification priorities accordingly
- Fork the repository
- Create a feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
This project is licensed under the MIT License - see the LICENSE file for details.
- Create an issue for bug reports or feature requests
- Check existing issues for solutions
- Contact maintainers for critical issues
Made with ❤️ for DevOps