
add integration tests for AWS Neuron #416

Merged: 8 commits into aws:main on Aug 21, 2024

Conversation

@aditya-purang (Contributor) commented Aug 9, 2024

Description of the issue

Adding integration tests for AWS Neuron Metrics

Description of changes

  • Added infrastructure code for the AWS Neuron metrics cluster
  • Added tests verifying the log schema, cluster dimension metrics, and EMF log frequency per minute for AWS Neuron metrics

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Tested with a ci-neuron-integ-test branch in the agent repo.

Test link: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/10328302776/job/28594690160

@aditya-purang aditya-purang marked this pull request as ready for review August 12, 2024 11:16
@aditya-purang aditya-purang requested a review from a team as a code owner August 12, 2024 11:16
@aditya-purang aditya-purang requested review from movence and removed request for straussb August 12, 2024 11:16
}
container {
name = "neuron-monitor-prometheus"
image = "506463145083.dkr.ecr.us-west-2.amazonaws.com/mocked-neuron-monitor:v2"
@movence (Contributor) commented Aug 13, 2024

So I assume this image is supposed to be built with dummy-neuron-monitor/Dockerfile and then pushed to a test-account ECR repo manually? Is there a way to run the python script directly on a container without needing a Docker image?

@aditya-purang (Contributor, Author) replied Aug 14, 2024

Should be possible, but we wanted to keep it as close as possible to the real implementation and deployment of the neuron monitor pod. Keeping it this way also makes it a bit quicker to replace the dummy image with a real one once we know how to set up a gpu-burn-like pod.

@movence (Contributor) replied

It feels like another thing to manage outside of the test itself when there is already a script doing exactly the same thing. Should we expect this test to still pass when we use an actual image?

@aditya-purang (Contributor, Author) replied

It would; the only problem is that it won't emit some of the metrics that can be sparse, like errors and error-correction events, so we might miss some regressions there. This way we inject those metrics in the Prometheus script and then assert that all the metrics are correct.

@movence (Contributor) replied

> Should be possible, but we wanted to keep it as close as possible to the real implementation and deployment of the neuron monitor pod.

How does this come close to the real implementation when the extra Docker image layer is just a wrapper around the python script underneath? I still think running the same python script directly on a container, or using the real neuron image, would be better from a management perspective. With the real neuron image, is that sparse-metric issue a known issue with no easy workaround or fix?

@aditya-purang (Contributor, Author) replied Aug 19, 2024

> How does this come close to the real implementation when the extra Docker image layer is just a wrapper around the python script underneath?

The Dockerfile used is identical to the neuron monitor one, and the python script is the same as the real neuron monitor's; the only difference is that instead of piping the output from a live neuron monitor, we continuously ingest the same output every 5s (which was generated from a neuron monitor JSON).

> With the real neuron image, is that sparse-metric issue a known issue with no easy workaround or fix?

Yes; it was a conscious decision not to emit continuous 0s for metrics that occur rarely.
Another case we want to cover is simulating multiple runtimes running on the same host, which we cannot test unless we actually run multiple runtimes (neuron burns) or mock their output. We haven't been able to run two runtimes on the same host even in our test cluster, so currently we have no way to catch regressions for that case with the actual neuron monitor.

@aditya-purang aditya-purang requested a review from movence August 14, 2024 12:41
"time"

"github.com/aws/amazon-cloudwatch-agent-test/environment"
. "github.com/aws/amazon-cloudwatch-agent-test/test/awsneuron/resources"

Should it be just -> "github.com/aws/amazon-cloudwatch-agent-test/test/awsneuron/resources" (a regular import rather than a dot import)?

@aditya-purang aditya-purang merged commit 766e41a into aws:main Aug 21, 2024
2 checks passed
3 participants