Add Status Code Metrics for API Calls in the Agent #1442

Merged
merged 8 commits into from
Dec 16, 2024

Contributor

@Paramadon Paramadon commented Nov 26, 2024

Description

Issue Overview

The agent currently lacks visibility into the status codes returned by its API calls, making it challenging to monitor API interactions and debug issues effectively.

Summary of Changes

This PR introduces functionality to collect and report status code metrics for the agent’s API clients, enhancing monitoring capabilities. The metrics cover the following APIs:

  • PutRetentionPolicy
  • DescribeInstances
  • DescribeTags
  • DescribeVolumes
  • DescribeContainerInstances
  • DescribeServices
  • DescribeTaskDefinition
  • ListServices
  • ListTasks
  • DescribeTasks
  • CreateLogGroup
  • CreateLogStream

Implementation Details

  1. Handler Integration

    • For OpenTelemetry (OTel) API calls, the agent’s health is determined using the awsmiddleware component ID. The appropriate status code handler is attached to each client to enable metric collection.
    • For Telegraf metrics, handlers are integrated into plugins like Prometheus during client initialization, ensuring metrics are captured at runtime.
  2. Metric Creation

    • The agent generates a JSON payload containing categorized status codes and sends it via PMD/PLE to the CloudwatchStatsHandler.
    • The CloudwatchStatsHandler processes the JSON and creates status code metrics, enabling enhanced visibility.
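
The metric creation step above can be illustrated with a minimal Go sketch. It is not the code from this PR: the type StatusCodeStats, the function names, and the five-bucket scheme (2xx, 3xx, 4xx, 5xx, other) are all assumptions made for illustration, since the description does not state how status codes are categorized. It only shows one way per-operation counters could be kept safely across goroutines and rendered into the "codes" payload shape.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sync"
)

// bucketCount matches the five values per operation seen in the sample
// payload. The meaning of each bucket below is an assumption.
const bucketCount = 5

// StatusCodeStats is a hypothetical, concurrency-safe recorder of
// status code counts keyed by a short operation name such as "di".
type StatusCodeStats struct {
	mu    sync.Mutex
	codes map[string][bucketCount]int
}

// NewStatusCodeStats returns an empty recorder.
func NewStatusCodeStats() *StatusCodeStats {
	return &StatusCodeStats{codes: make(map[string][bucketCount]int)}
}

// bucket maps an HTTP status code to a counter index.
// Assumed scheme: 0=2xx, 1=3xx, 2=4xx, 3=5xx, 4=anything else.
func bucket(status int) int {
	switch {
	case status >= 200 && status < 300:
		return 0
	case status >= 300 && status < 400:
		return 1
	case status >= 400 && status < 500:
		return 2
	case status >= 500 && status < 600:
		return 3
	default:
		return 4
	}
}

// Record increments the counter for one completed API call.
func (s *StatusCodeStats) Record(op string, status int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	counts := s.codes[op] // arrays are values: copy, update, store back
	counts[bucket(status)]++
	s.codes[op] = counts
}

// Payload renders the counters as {"codes": {...}} JSON.
func (s *StatusCodeStats) Payload() (string, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	b, err := json.Marshal(map[string]map[string][bucketCount]int{"codes": s.codes})
	return string(b), err
}

func main() {
	stats := NewStatusCodeStats()
	stats.Record("di", 200)
	stats.Record("di", 200)
	stats.Record("clg", 400)

	out, err := stats.Payload()
	if err != nil {
		panic(err)
	}
	fmt.Println(out) // prints {"codes":{"clg":[0,0,1,0,0],"di":[2,0,0,0,0]}}
}
```

A mutex is used here because handlers attached to several clients may record concurrently; whether the agent uses a mutex, atomics, or another mechanism is not stated in this description.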

Example Outputs

Example JSON Payload

Below is a sample JSON structure that categorizes status codes for various API calls:

{
  "codes": {
    "pmd": [1, 2, 3, 4, 5],
    "prp": [6, 7, 8, 9, 10],
    "di": [11, 12, 13, 14, 15],
    "dt": [16, 17, 18, 19, 20],
    "dv": [21, 22, 23, 24, 25],
    "dci": [26, 27, 28, 29, 30],
    "ds": [31, 32, 33, 34, 35],
    "dtd": [36, 37, 38, 39, 40],
    "dts": [41, 42, 43, 44, 45],
    "ls": [46, 47, 48, 49, 50],
    "lt": [51, 52, 53, 54, 55],
    "clg": [56, 57, 58, 59, 60],
    "cls": [61, 62, 63, 64, 65]
  }
}

Example Visualization

The screenshot below demonstrates the captured status codes in the monitoring interface:
[Screenshot: Status Code Metrics]


Testing Details

  • Configuration Used
    The agent was configured with the following agent.json file to validate metrics collection. (Some API calls were added temporarily to confirm that metrics are reported for every call; they have since been removed.)

    {
      "agent": {
        "debug": true,
        "aws_sdk_log_level": "LogDebugWithHTTPBody",
        "region": "us-west-2"
      },
      "metrics": {
        "append_dimensions": {
          "AutoScalingGroupName": "${aws:AutoScalingGroupName}",
          "ImageId": "${aws:ImageId}",
          "InstanceId": "${aws:InstanceId}",
          "InstanceType": "${aws:InstanceType}"
        },
        "metrics_collected": {
          "cpu": {
            "measurement": ["cpu_usage_idle", "cpu_usage_iowait", "cpu_usage_user", "cpu_usage_system"],
            "totalcpu": false,
            "metrics_collection_interval": 10
          },
          "disk": {
            "resources": ["*"],
            "measurement": ["used_percent", "inodes_free"],
            "metrics_collection_interval": 60
          },
          "diskio": {
            "resources": ["*"],
            "measurement": ["io_time", "write_bytes", "read_bytes", "writes", "reads"],
            "metrics_collection_interval": 60
          },
          "mem": {
            "measurement": ["mem_used_percent"],
            "metrics_collection_interval": 10
          },
          "netstat": {
            "measurement": ["tcp_established", "tcp_time_wait"],
            "metrics_collection_interval": 60
          },
          "swap": {
            "measurement": ["swap_used_percent"],
            "metrics_collection_interval": 10
          },
          "ethtool": {
            "interface_include": ["eth0", "eth1"],
            "metrics_include": ["bw_in_allowance_exceeded", "bw_out_allowance_exceeded", "pps_allowance_exceeded", "conntrack_allowance_exceeded", "linklocal_allowance_exceeded"]
          }
        }
      },
      "logs": {
        "metrics_collected": {
          "prometheus": {
            "prometheus_config_path": "/etc/cwagent/prometheus-config.yml",
            "log_group_name": "Prometheus",
            "emf_processor": {
              "metric_namespace": "CWAgent-Prometheus",
              "metric_unit": {
                "jvm_threads_current": "Count",
                "jvm_gc_collection_seconds_sum": "Milliseconds"
              }
            },
            "ecs_service_discovery": {
              "sd_frequency": "30s",
              "sd_target_cluster": "testCluster",
              "sd_cluster_region": "us-west-2",
              "sd_result_file": "/tmp/ecs-discovery.yml",
              "docker_label": {
                "sd_port_label": "ECS_PROMETHEUS_EXPORTER_PORT",
                "sd_metrics_path_label": "ECS_PROMETHEUS_METRICS_PATH",
                "sd_job_name_label": "job"
              }
            }
          }
        },
        "logs_collected": {
          "files": {
            "collect_list": [
              {
                "file_path": "/var/log/syslog",
                "log_group_name": "/aws/ec2/syslog",
                "log_stream_name": "{instance_id}",
                "retention_in_days": 30
              }
            ]
          }
        }
      }
    }
  • Tests Performed

    • Verified metrics collection for each listed API.
    • Modified agent logic to ensure all status codes were reported.

Tested on EKS by changing agent contrib to this branch.

Ex: replace github.com/open-telemetry/opentelemetry-collector-contrib/processor/resourcedetectionprocessor => github.com/amazon-contributing/opentelemetry-collector-contrib/processor/resourcedetectionprocessor v0.0.0-20241212025412-4bd7a1e9deed

[Screenshot 2024-12-16 at 7 17 54 PM]

Requirements

Before committing, complete the following:

  1. Run make fmt and make fmt-sh.
  2. Run make lint.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution under the terms of your choice.

@Paramadon Paramadon requested a review from a team as a code owner November 26, 2024 06:14
@Paramadon Paramadon changed the base branch from main to CodeHandler November 26, 2024 06:16
@Paramadon Paramadon changed the title Add Status Code Metrics for API Calls in Agent Add Status Code Metrics for API Calls in the Agent Nov 26, 2024
reload_interval: 0s
server_name_override: ""
write_buffer_size: 524288
awscloudwatch:
Contributor Author

The indentation change corrects our default indentation, which was altered previously. You can ignore the whitespace.

@@ -280,14 +281,12 @@ func (t *Tagger) ebsVolumesRetrieved() bool {

// Start acts as input validation and serves the purpose of updating ec2 tags and ebs volumes if necessary.
// It will be called when OTel is enabling each processor
-func (t *Tagger) Start(ctx context.Context, _ component.Host) error {
+func (t *Tagger) Start(ctx context.Context, host component.Host) error {
Contributor Author

Adding host so we can grab the handlers and attach them to the client.

Contributor

@mitali-salvi mitali-salvi left a comment

Why is this PR being merged into the CodeHandler branch and not main?

extension/agenthealth/handler/stats/provider/statuscode.go (outdated, resolved)
translator/translate/otel/pipeline/host/translator.go (outdated, resolved)
extension/agenthealth/config.go (outdated, resolved)
sdk/service/cloudwatchlogs/api.go (outdated, resolved)
plugins/outputs/cloudwatchlogs/pusher.go (outdated, resolved)
plugins/outputs/cloudwatchlogs/cloudwatchlogs.go (outdated, resolved)
plugins/inputs/prometheus/prometheus.go (outdated, resolved)
plugins/inputs/prometheus/prometheus.go (outdated, resolved)
plugins/processors/ec2tagger/ec2tagger.go (outdated, resolved)
plugins/processors/ec2tagger/ec2tagger.go (resolved)
plugins/processors/ec2tagger/ec2tagger.go (outdated, resolved)
@aws aws deleted a comment from Paramadon Nov 27, 2024
Contributor

@okankoAMZ okankoAMZ left a comment

I have 2 questions:

  1. Have you tested this on EKS?
  2. Have you run the Go race tests on this?

plugins/inputs/prometheus/prometheus.go (outdated, resolved)
internal/ecsservicediscovery/servicediscovery.go (outdated, resolved)
internal/ecsservicediscovery/servicediscovery.go (outdated, resolved)
@Paramadon Paramadon force-pushed the ApiStatusCodes branch 4 times, most recently from 5bdc971 to 23c2761 Compare December 16, 2024 13:51
@Paramadon Paramadon requested a review from dricross December 16, 2024 14:33
dricross previously approved these changes Dec 16, 2024
okankoAMZ previously approved these changes Dec 16, 2024
Contributor

@okankoAMZ okankoAMZ left a comment

Testing looks good

@Paramadon Paramadon dismissed stale reviews from okankoAMZ and dricross via 5f53ae5 December 16, 2024 19:50
@Paramadon Paramadon merged commit b462f7f into main Dec 16, 2024
7 checks passed
@Paramadon Paramadon deleted the ApiStatusCodes branch December 16, 2024 22:02