add integration tests for AWS Neuron #416
Conversation
}
container {
  name  = "neuron-monitor-prometheus"
  image = "506463145083.dkr.ecr.us-west-2.amazonaws.com/mocked-neuron-monitor:v2"
So I assume this image is supposed to be built with dummy-neuron-monitor/Dockerfile and then pushed to a test account ECR repo manually? Is there a way to run the Python script directly in a container without needing a Docker image?
It should be possible, but we wanted to keep it as close as possible to the real implementation and deployment of the neuron monitor pod. Keeping it this way also makes it quicker to replace the dummy image with an actual one once we know how to set up a GPU-burn-like pod.
It feels like another thing to manage outside of the test itself when there is already a script doing exactly the same thing. Should we expect this test to still pass when we use an actual image?
It would; the only problem is that it won't emit some of the metrics that can be sparse, like errors and error correction events, so we might miss some regressions there. This way we inject those metrics in the Prometheus script and then assert that all the metrics are correct.
> It should be possible, but we wanted to keep it as close as possible to the real implementation and deployment of the neuron monitor pod.
How does it stay close to the real implementation when the extra Docker image layer is just a wrapper around the Python script underneath? I still think running the same Python script directly in a container, or using the real Neuron image, would be better from a management perspective. With the real Neuron image, is that sparse-metric issue a known issue with no easy workaround or fix?
> How does it stay close to the real implementation when the extra Docker image layer is just a wrapper around the Python script underneath?
The Dockerfile used is identical to the neuron monitor one, and the Python script is the same as the real neuron monitor's; the only difference is that instead of piping the output from a neuron monitor, we continuously ingest the same output every 5s (which was generated from a neuron monitor JSON).
> With the real Neuron image, is that sparse-metric issue a known issue with no easy workaround or fix?
Yeah, it was a conscious decision not to emit continuous 0s for metrics that rarely occur.
Another case we want to cover is simulating multiple runtimes running on the same host, which we cannot test unless we actually run multiple runtimes (neuron burns) or mock their output. We haven't been able to run two runtimes on the same host even in our test cluster, so currently we have no way to catch regressions for that case using the actual neuron monitor.
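For context on the replay approach described above, here is a minimal sketch of what such a mocked feeder could look like. This is an illustration only, not the actual dummy-neuron-monitor script: the file name neuron_monitor_sample.json, the port, the metric names, and the injected error counter are all assumptions made for the example.

```python
#!/usr/bin/env python3
"""Hypothetical sketch of a mocked neuron-monitor feeder: replays a canned
neuron-monitor JSON sample every 5 seconds as Prometheus metrics and injects
sparse error counters so the integration test can assert on them."""
import json
import time

from prometheus_client import Counter, Gauge, start_http_server

# Placeholder names for illustration only.
SAMPLE_FILE = "neuron_monitor_sample.json"
SCRAPE_PORT = 8000

core_utilization = Gauge(
    "neuroncore_utilization_ratio", "NeuronCore utilization", ["neuroncore"]
)
execution_errors = Counter(
    "execution_errors_total", "Injected sparse execution errors", ["error_type"]
)

def publish(sample: dict) -> None:
    # Re-publish the same canned utilization values on every iteration.
    for core, util in sample.get("neuroncore_utilization", {}).items():
        core_utilization.labels(neuroncore=core).set(util)
    # Inject rarely-occurring metrics that a real idle device would not emit,
    # so regressions in their pipeline are still caught by the assertions.
    execution_errors.labels(error_type="generic").inc(1)

def main() -> None:
    with open(SAMPLE_FILE) as f:
        sample = json.load(f)
    start_http_server(SCRAPE_PORT)
    while True:
        publish(sample)
        time.sleep(5)  # matches the 5s replay cadence described above

if __name__ == "__main__":
    main()
```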
"time" | ||
|
||
"github.com/aws/amazon-cloudwatch-agent-test/environment" | ||
. "github.com/aws/amazon-cloudwatch-agent-test/test/awsneuron/resources" |
Should it be just "github.com/aws/amazon-cloudwatch-agent-test/test/awsneuron/resources", i.e. a plain import without the leading dot?
Description of the issue
Adding integration tests for AWS Neuron Metrics
Description of changes
License
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Tests
Tested with a ci-neuron-integ-test branch in the agent repo.
Test link: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/10328302776/job/28594690160