
add integration tests for AWS Neuron #416

Merged: 8 commits into aws:main on Aug 21, 2024

Conversation

@aditya-purang (Contributor) commented Aug 9, 2024

Description of the issue

Adding integration tests for AWS Neuron Metrics

Description of changes

  • Added infrastructure code for the AWS Neuron metrics cluster
  • Added tests verifying the log schema, cluster dimension metrics, and EMF log frequency per minute for AWS Neuron metrics

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Tested with a ci-neuron-integ-test branch in the agent repo.

Test link: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/10328302776/job/28594690160

@aditya-purang aditya-purang marked this pull request as ready for review August 12, 2024 11:16
@aditya-purang aditya-purang requested a review from a team as a code owner August 12, 2024 11:16
@aditya-purang aditya-purang requested review from movence and removed request for straussb August 12, 2024 11:16
}
container {
name = "neuron-monitor-prometheus"
image = "506463145083.dkr.ecr.us-west-2.amazonaws.com/mocked-neuron-monitor:v2"
@movence (Contributor) commented Aug 13, 2024

So I assume this image is supposed to be built with dummy-neuron-monitor/Dockerfile and then pushed to a test-account ECR repo manually? Is there a way to run the python script directly on a container without needing a Docker image?

@aditya-purang (Contributor, Author) replied Aug 14, 2024

Should be possible, but we wanted to keep it as close as possible to the real implementation and deployment of the neuron monitor pod. Keeping it this way also makes it a bit quicker to replace the dummy image with a real one once we know how to set up a gpu-burn-like pod.

@movence (Contributor) replied

It feels like another thing to manage outside of the test itself when there is already a script doing exactly the same thing. Should we expect this test to still pass when we use an actual image?

@aditya-purang (Contributor, Author) replied

It would; the only problem is that it won't emit some of the metrics that can be sparse, like errors and error-correction events, so we might miss some regressions there. This way we inject those metrics in the Prometheus script and then assert that all the metrics are correct.

@movence (Contributor) replied

> Should be possible, but we wanted to keep it as close as possible to the real implementation and deployment of the neuron monitor pod.

How does this come close to the real implementation when the extra Docker image layer is just a wrapper around the python script underneath? I still think running the same python script directly on a container, or using the real neuron image, would be better from a management perspective. With the real neuron image, is that sparse-metric issue a known issue with no easy workaround or fix?

@aditya-purang (Contributor, Author) replied Aug 19, 2024

> How does this come close to the real implementation when the extra Docker image layer is just a wrapper around the python script underneath?

The Dockerfile used is identical to the neuron monitor one, and the python script is the same as the real neuron monitor's; the only difference is that instead of piping the output from a live neuron monitor, we continuously ingest the same output every 5s (which was generated from a neuron monitor JSON).

> With the real neuron image, is that sparse-metric issue a known issue with no easy workaround or fix?

Yes; it was a conscious decision not to emit continuous 0s for metrics that occur rarely.
Another case we want to cover is simulating multiple runtimes running on the same host, which we cannot test unless we actually run multiple runtimes (neuron burns) or mock their output. We haven't been able to run two runtimes on the same host even in our test cluster, so currently we have no way to catch regressions for that case with the actual neuron monitor.

@aditya-purang aditya-purang requested a review from movence August 14, 2024 12:41
"time"

"github.com/aws/amazon-cloudwatch-agent-test/environment"
. "github.com/aws/amazon-cloudwatch-agent-test/test/awsneuron/resources"

Should it be just -> "github.com/aws/amazon-cloudwatch-agent-test/test/awsneuron/resources" (a regular import rather than a dot import)?

@aditya-purang aditya-purang merged commit 766e41a into aws:main Aug 21, 2024
2 checks passed
3 participants