Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding Statuscode Handler to Agent Health Extension #1423

Merged
merged 98 commits into from
Dec 12, 2024
Merged

Conversation

Paramadon
Copy link
Contributor

@Paramadon Paramadon commented Nov 12, 2024


Description of the Issue

The CloudWatch Agent lacked an agenthealth configuration to monitor API health by tracking HTTP status codes for specific responses. Monitoring the status codes (200, 400, 408, 419, and 429) across all APIs is critical for diagnosing issues and ensuring comprehensive observability of API behaviors. Without this configuration, users were unable to quickly identify trends in API health metrics or correlate specific status codes with performance issues.


Changes Made

  • Added agenthealth Configuration:

    • Configured the CloudWatch Agent to track the following status codes across all APIs:
      • 200: Success
      • 400: Bad Request
      • 408: Request Timeout
      • 419: Authentication Timeout
      • 429: Too Many Requests
    • This ensures that health metrics for these common response codes are captured and reported.
  • Updated CloudWatch Agent Configuration JSON:

    • Introduced a new section under the metrics collection configuration to specify agenthealth metrics.
    • Ensured compatibility with existing metrics and logs collection by integrating the new configuration seamlessly.
  • Validated the Configuration:

    • Tested the updated configuration with the CloudWatch Agent to ensure the new metrics are collected and reported correctly.
    • Confirmed that the metrics appear as expected in CloudWatch for all specified APIs.

Impact

  • Provides real-time visibility into API status codes, helping identify anomalies or patterns that indicate issues.
  • Enhances observability for critical services by adding detailed health metrics.

Testing

  • Deployed the updated CloudWatch Agent health configuraion
  • Verified that metrics for status codes 200, 400, 408, 419, and 429 are captured in the header as you can see from the image below:
Screenshot 2024-11-15 at 4 36 21 PM

@Paramadon Paramadon requested a review from a team as a code owner November 12, 2024 20:47
@Paramadon Paramadon changed the title fixing issue Adding Agent Health Extension Nov 15, 2024
dricross
dricross previously approved these changes Dec 10, 2024
dricross
dricross previously approved these changes Dec 12, 2024
}

statsRequestHandlers, statsResponseHandlers = stats.NewHandlers(ah.logger, statsConfig, statusCodeEnabled, agentStatsEnabled)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Could just pass in the stats pointer directly to determine if it's enabled.

Suggested change
statsRequestHandlers, statsResponseHandlers = stats.NewHandlers(ah.logger, statsConfig, statusCodeEnabled, agentStatsEnabled)
statsRequestHandlers, statsResponseHandlers = stats.NewHandlers(ah.logger, ah.cfg.Stats, statusCodeEnabled)

Comment on lines +114 to +115
existingStats.Merge(newStats)
newStats = existingStats
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: newStats.Merge(existingStats) also works.

Comment on lines +36 to +48
var statsResponseHandlers []awsmiddleware.ResponseHandler
var statsRequestHandlers []awsmiddleware.RequestHandler
var statsConfig agent.StatsConfig
var agentStatsEnabled bool

if ah.cfg.Stats != nil {
statsConfig = *ah.cfg.Stats
agentStatsEnabled = true
} else {
agentStatsEnabled = false
}

statsRequestHandlers, statsResponseHandlers = stats.NewHandlers(ah.logger, statsConfig, statusCodeEnabled, agentStatsEnabled)
Copy link
Contributor

@musa-asad musa-asad Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: don't need to define variables

Suggested change
var statsResponseHandlers []awsmiddleware.ResponseHandler
var statsRequestHandlers []awsmiddleware.RequestHandler
var statsConfig agent.StatsConfig
var agentStatsEnabled bool
if ah.cfg.Stats != nil {
statsConfig = *ah.cfg.Stats
agentStatsEnabled = true
} else {
agentStatsEnabled = false
}
statsRequestHandlers, statsResponseHandlers = stats.NewHandlers(ah.logger, statsConfig, statusCodeEnabled, agentStatsEnabled)
var statsConfig agent.StatsConfig
var agentStatsEnabled bool
if ah.cfg.Stats != nil {
statsConfig = *ah.cfg.Stats
agentStatsEnabled = true
} else {
agentStatsEnabled = false
}
statsRequestHandlers, statsResponseHandlers := stats.NewHandlers(ah.logger, statsConfig, statusCodeEnabled, agentStatsEnabled)

(and you can remove the conditional from @jefchien's nit)

Stats: agent.StatsConfig{
Operations: []string{agent.AllowAllOperations},
},
Stats: nil,
Copy link
Contributor

@musa-asad musa-asad Dec 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Should IsStatusCodeEnabled be added for consistency?

Suggested change
Stats: nil,
IsStatusCodeEnabled: false,
Stats: nil,

Or remove Stats since it should set to nil regardless ?

Suggested change
Stats: nil,

Comment on lines +43 to +51
func NewTranslatorWithStatusCode(name component.DataType, operations []string, isStatusCodeEnabled bool) common.Translator[component.Config] {
return &translator{
name: name.String(),
operations: operations,
factory: agenthealth.NewFactory(),
isUsageDataEnabled: envconfig.IsUsageDataEnabled(),
isStatusCodeEnabled: isStatusCodeEnabled,
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why don't we just combine this with the NewTranslator function? Is there any case where this function is called when isStatusCodeEnabled is false?

Or you can do this instead.

Suggested change
func NewTranslatorWithStatusCode(name component.DataType, operations []string, isStatusCodeEnabled bool) common.Translator[component.Config] {
return &translator{
name: name.String(),
operations: operations,
factory: agenthealth.NewFactory(),
isUsageDataEnabled: envconfig.IsUsageDataEnabled(),
isStatusCodeEnabled: isStatusCodeEnabled,
}
}
func NewTranslatorWithStatusCode(name component.DataType, operations []string) common.Translator[component.Config] {
return &translator{
name: name.String(),
operations: operations,
factory: agenthealth.NewFactory(),
isUsageDataEnabled: envconfig.IsUsageDataEnabled(),
isStatusCodeEnabled: true,
}
}

@Paramadon Paramadon merged commit f3d2e33 into main Dec 12, 2024
7 checks passed
@Paramadon Paramadon deleted the CodeHandler branch December 12, 2024 14:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants