Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement JMX E2E Tests in Terraform Framework #443

Open
wants to merge 70 commits into
base: main
Choose a base branch
from
Open

Conversation

musa-asad
Copy link
Contributor

@musa-asad musa-asad commented Dec 9, 2024

Description of the issue

The CloudWatch Agent now supports JMX metrics on EKS. We need to add testing to verify the behavior for JMX metrics collection in EKS environments.

Description of changes

Co-PR: aws/amazon-cloudwatch-agent#1463.

  • Implemented test/e2e/jmx/jmx_test.go with the following functions:
    • ApplyHelm: Applies a helm release based on the specified resources passed in as variables, which sets up the CloudWatch Agent and deploys an annotated sample application.
    • TestResources: Tests to see if the appropriate resources have been deployed on the EKS cluster.
    • TestMetrics: Tests to see if metrics are being emitted properly based on the agent configuration file passed in.
  • Added test/e2e/jmx/resources/cwagent_configs/ and test/e2e/jmx/resources/sample_apps/ for custom agent configurations and sample applications to deploy.
    • Added a check to see if tomcat.sessions and tomcat.rejected_sessions increases above 0 accordingly.
  • Added GetMetricMaximum to util/awsservice/cloudwatchmetrics.go in order to find the maximum value for a metric being emitted.
  • Added nodes to generator/test_case_generator.go and generator/resources/eks_e2e_test_matrix.json to be able to use this as a configurable value from the generated matrix.

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

JVM & Tomcat

Screenshot 2024-12-16 at 11 28 22 PM

Kafka

Screenshot 2024-12-16 at 11 28 47 PM

Container Insights

Screenshot 2024-12-16 at 11 29 07 PM

@musa-asad musa-asad self-assigned this Dec 9, 2024
@musa-asad musa-asad changed the base branch from e2e to main December 16, 2024 21:52
@musa-asad musa-asad changed the title JMX E2E Tests Implement JMX E2E Tests in Terraform Framework Dec 17, 2024
@musa-asad musa-asad marked this pull request as ready for review December 17, 2024 04:55
@musa-asad musa-asad requested a review from a team as a code owner December 17, 2024 04:55
test/e2e/jmx/jmx_test.go Outdated Show resolved Hide resolved
test/e2e/jmx/jmx_test.go Show resolved Hide resolved
@musa-asad musa-asad requested a review from varunch77 December 17, 2024 22:25
Comment on lines 135 to 137
if err != nil {
t.Fatalf("Error building kubeconfig: %v", err)
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should use testify's require / assert methods instead of t.Fatal / t.Error

Suggested change
if err != nil {
t.Fatalf("Error building kubeconfig: %v", err)
}
require.NoError(t, err, "Error building kubeconfig")

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, I'll make this change

fmt.Println("Waiting for metrics to propagate...")
time.Sleep(5 * time.Minute)

os.Exit(m.Run())

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the helm resources ever get cleaned up?

Copy link
Contributor Author

@musa-asad musa-asad Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the exception of load balancers (which I do remove), they all should be removed upon the deletion of the cluster, but I'll look into handling this more elegantly.

helm := []string{
"helm", "upgrade", "--install", "amazon-cloudwatch-observability",
filepath.Join("..", "..", "..", "terraform", "eks", "e2e", "helm-charts", "charts", "amazon-cloudwatch-observability"),
"--values", filepath.Join("..", "..", "..", "terraform", "eks", "e2e", "helm-charts", "charts", "amazon-cloudwatch-observability", "values.yaml"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you need this line? I think helm will just use the values.yaml file from the chart path given above

Copy link
Contributor Author

@musa-asad musa-asad Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can remove it. Thanks.

EDIT: When I removed it, the agent configuration didn't apply, so I added it back.

}

fmt.Println("Waiting for CloudWatch Agent Operator to initialize...")
time.Sleep(300 * time.Second)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

does opertor need 5m to start?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll lower the time.

return fmt.Errorf("failed to apply sample app: %w\nOutput: %s", err, output)
}

wait := exec.Command("kubectl", "wait", "--for=condition=available", "--timeout=300s", fmt.Sprintf("deployment/%s", deploymentName), "-n", "default")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably nit but it looks like sample apps will always be type of deployment then do we need to add -deployment suffix to sample apps yaml files? They are all deployment types anyways.

Copy link
Contributor Author

@musa-asad musa-asad Dec 18, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I can remove that.


func testContainerInsightsMetrics(t *testing.T) {
t.Run("verify_containerinsights_metrics", func(t *testing.T) {
metricsToCheck := []struct {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this an existing convention? i'm just curious if there will be a case where metrics are under different namespaces

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. If the namespaces within each of these test functions isn't going to change, can we modify the code to just set it once in the function (maybe modifying the struct to get rid of namespace?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, this is an existing convention, these metrics will report to this namespace.


func testContainerInsightsMetrics(t *testing.T) {
t.Run("verify_containerinsights_metrics", func(t *testing.T) {
metricsToCheck := []struct {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1. If the namespaces within each of these test functions isn't going to change, can we modify the code to just set it once in the function (maybe modifying the struct to get rid of namespace?)

spec:
containers:
- name: zookeeper
image: wurstmeister/zookeeper:latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of using latest, can we pin a version so that it never changes?

Or maybe better yet -- we push an image to our own ECR repo?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, I'll add it to our own ECR repo.

spec:
containers:
- name: kafka
image: wurstmeister/kafka:latest
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

similar comment to above

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Will add.

@@ -134,6 +134,78 @@ func GetMetricStatistics(
return CwmClient.GetMetricStatistics(ctx, &metricStatsInput)
}

func GetMetricMaximum(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Curious to hear the reasoning behind this func...is it because we don't know the exact value of the metric? But we expect the value to never be above a certain amount?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reasoning behind this function was to get the highest metric value being emitted for tomcat.sessions / tomcat.rejected_sessions in order to check if that value is greater than zero since we need to verify we're able to get these metrics above zero. The reason I went for the maximum was just to be sure we check for a value that is absolutely expected to be above zero.

However, thinking through it, a more efficient solution would be to dedicate the function to check if it's above zero directly and exit out when we find that. I fixed the function to account for that.

As for the exact values for each metric, tomcat.sessions will max out to 2 since that's what I set it to in the application, but I didn't think it was necessary to directly check for this since the agent doesn't generate the metric, the application does. But it is important to make sure the agent is able to ship a non-zero value to CloudWatch. Another reason is a possible race condition where there are latency issues and it doesn't append to the right value in time compared to when we check the metric value, so it just seems more ideal to check if the value is above 0 instead of a specific value. tomcat.rejected_sessions is unpredictable since there isn't a max value for that, so it'll just keep increasing after the max sessions are hit.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants