Fix GPU E2E integ test #437

Merged
merged 6 commits into from
Dec 3, 2024

Conversation

movence

@movence movence commented Dec 2, 2024

Description of the issue

The GPU E2E integration test is failing because of missing metrics and dimensions.

Description of changes

  • Reorder the kubectl apply commands so that the GPU burn pod starts after the NVIDIA device plugin is installed
  • Update the metrics-dims sets to the latest, adding or dropping GPU count metrics and updating their names
  • Update the EKS and add-on versions to the latest
  • Use variables from the generated test matrix JSON instead of hard-coded values
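To make the ordering in the first bullet concrete, here is a minimal Go sketch that builds the kubectl invocations in the intended sequence. It is an illustration only, not the code changed in this PR: the manifest file names, the DaemonSet name, the `kube-system` namespace, and the `gpuSetupSteps` helper are all hypothetical, and the intermediate `kubectl rollout status` wait is an assumption added here to show one way of making sure the plugin is up before the workload starts.

```go
package main

import (
	"fmt"
	"strings"
)

// step is one kubectl invocation in the setup sequence.
type step struct {
	desc string
	args []string
}

// gpuSetupSteps returns the kubectl invocations in the order they should run.
// The manifest paths, DaemonSet name, and namespace are hypothetical
// placeholders, not values taken from the PR.
func gpuSetupSteps(pluginManifest, pluginDaemonSet, burnManifest string) []step {
	return []step{
		// 1. Install the NVIDIA device plugin first so GPUs become schedulable.
		{"install NVIDIA device plugin", []string{"apply", "-f", pluginManifest}},
		// 2. Assumed extra step: block until the plugin DaemonSet has rolled out.
		{"wait for device plugin rollout", []string{
			"rollout", "status", "daemonset/" + pluginDaemonSet,
			"-n", "kube-system", "--timeout=300s",
		}},
		// 3. Only then start the GPU burn workload that generates the metrics.
		{"start GPU burn workload", []string{"apply", "-f", burnManifest}},
	}
}

func main() {
	steps := gpuSetupSteps(
		"nvidia-device-plugin.yaml",      // hypothetical manifest
		"nvidia-device-plugin-daemonset", // hypothetical DaemonSet name
		"gpu-burn.yaml",                  // hypothetical manifest
	)
	// Print the commands rather than executing them, so the sketch runs anywhere.
	for _, s := range steps {
		fmt.Printf("# %s\nkubectl %s\n", s.desc, strings.Join(s.args, " "))
	}
}
```

The sketch prints the commands instead of running them; a real test harness would pass each `args` slice to its own command runner.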

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

GitHub test run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/12127752653

2024/12/02 15:46:15 >>>>>>>>>>>>>><<<<<<<<<<<<<<
2024/12/02 15:46:15 >>>>>>>>>>>>>>Successful<<<<<<<<<<<<<<
2024/12/02 15:46:15 ==============EKS_GPU_NVIDIA==============
2024/12/02 15:46:15 ==============Successful==============
ClusterName                                                         Successful  
container_gpu_memory_total                                          Successful  
container_gpu_memory_used                                           Successful  
container_gpu_power_draw                                            Successful  
container_gpu_temperature                                           Successful  
container_gpu_utilization                                           Successful  
container_gpu_memory_utilization                                    Successful  
pod_gpu_memory_total                                                Successful  
pod_gpu_memory_used                                                 Successful  
pod_gpu_power_draw                                                  Successful  
pod_gpu_temperature                                                 Successful  
pod_gpu_utilization                                                 Successful  
pod_gpu_memory_utilization                                          Successful  
pod_gpu_reserved_capacity                                           Successful  
pod_gpu_request                                                     Successful  
pod_gpu_usage_total                                                 Successful  
pod_gpu_limit                                                       Successful  
node_gpu_memory_total                                               Successful  
node_gpu_memory_used                                                Successful  
node_gpu_power_draw                                                 Successful  
node_gpu_temperature                                                Successful  
node_gpu_utilization                                                Successful  
node_gpu_memory_utilization                                         Successful  
node_gpu_usage_total                                                Successful  
node_gpu_limit                                                      Successful  
node_gpu_reserved_capacity                                          Successful  
ClusterName-Namespace                                               Successful  
pod_gpu_memory_total                                                Successful  
pod_gpu_memory_used                                                 Successful  
pod_gpu_power_draw                                                  Successful  
pod_gpu_temperature                                                 Successful  
pod_gpu_utilization                                                 Successful  
pod_gpu_memory_utilization                                          Successful  
ClusterName-Namespace-PodName                                       Successful  
pod_gpu_memory_total                                                Successful  
pod_gpu_memory_used                                                 Successful  
pod_gpu_power_draw                                                  Successful  
pod_gpu_temperature                                                 Successful  
pod_gpu_utilization                                                 Successful  
pod_gpu_memory_utilization                                          Successful  
pod_gpu_usage_total                                                 Successful  
pod_gpu_request                                                     Successful  
pod_gpu_reserved_capacity                                           Successful  
pod_gpu_limit                                                       Successful  
ClusterName-ContainerName-Namespace-PodName                         Successful  
container_gpu_memory_total                                          Successful  
container_gpu_memory_used                                           Successful  
container_gpu_power_draw                                            Successful  
container_gpu_temperature                                           Successful  
container_gpu_utilization                                           Successful  
container_gpu_memory_utilization                                    Successful  
ClusterName-ContainerName-FullPodName-Namespace-PodName             Successful  
container_gpu_memory_total                                          Successful  
container_gpu_memory_used                                           Successful  
container_gpu_power_draw                                            Successful  
container_gpu_temperature                                           Successful  
container_gpu_utilization                                           Successful  
container_gpu_memory_utilization                                    Successful  
ClusterName-ContainerName-FullPodName-GpuDevice-Namespace-PodName   Successful  
container_gpu_memory_total                                          Successful  
container_gpu_memory_used                                           Successful  
container_gpu_power_draw                                            Successful  
container_gpu_temperature                                           Successful  
container_gpu_utilization                                           Successful  
container_gpu_memory_utilization                                    Successful  
ClusterName-FullPodName-Namespace-PodName                           Successful  
pod_gpu_memory_total                                                Successful  
pod_gpu_memory_used                                                 Successful  
pod_gpu_power_draw                                                  Successful  
pod_gpu_temperature                                                 Successful  
pod_gpu_utilization                                                 Successful  
pod_gpu_memory_utilization                                          Successful  
pod_gpu_limit                                                       Successful  
pod_gpu_usage_total                                                 Successful  
pod_gpu_request                                                     Successful  
pod_gpu_reserved_capacity                                           Successful  
ClusterName-FullPodName-GpuDevice-Namespace-PodName                 Successful  
pod_gpu_memory_total                                                Successful  
pod_gpu_memory_used                                                 Successful  
pod_gpu_power_draw                                                  Successful  
pod_gpu_temperature                                                 Successful  
pod_gpu_utilization                                                 Successful  
pod_gpu_memory_utilization                                          Successful  
ClusterName-InstanceId-NodeName                                     Successful  
node_gpu_memory_total                                               Successful  
node_gpu_memory_used                                                Successful  
node_gpu_power_draw                                                 Successful  
node_gpu_temperature                                                Successful  
node_gpu_utilization                                                Successful  
node_gpu_memory_utilization                                         Successful  
node_gpu_limit                                                      Successful  
node_gpu_usage_total                                                Successful  
node_gpu_reserved_capacity                                          Successful  
ClusterName-GpuDevice-InstanceId-InstanceType-NodeName              Successful  
node_gpu_memory_total                                               Successful  
node_gpu_memory_used                                                Successful  
node_gpu_power_draw                                                 Successful  
node_gpu_temperature                                                Successful  
node_gpu_utilization                                                Successful  
node_gpu_memory_utilization                                         Successful  
emf-logs                                                            Successful  
2024/12/02 15:46:15 ==============================
2024/12/02 15:46:15 >>>>>>>>>>>>>>><<<<<<<<<<<<<<<
>>>> Finished GPU Container Insights TestSuite
--- PASS: TestGPUSuite (191.16s)
    --- PASS: TestGPUSuite/TestAllInSuite (191.16s)
PASS
ok  	github.com/aws/amazon-cloudwatch-agent-test/test/gpu	191.779s
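The log above groups metric names under dimension-set keys such as ClusterName-Namespace. As a rough sketch of how such a metrics-dims check could work, the Go example below builds a key by sorting and joining dimension names (an assumption inferred from the alphabetical order of the keys in the log) and reports which expected metrics were not observed. The `dimKey` and `missingMetrics` helpers and the `seen` sample data are hypothetical and are not taken from the test repository.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// dimKey builds a dimension-set key such as "ClusterName-Namespace" by
// sorting the dimension names and joining them with "-". This scheme is an
// assumption based on how the keys appear in the test log.
func dimKey(dims []string) string {
	sorted := append([]string(nil), dims...) // copy so the caller's slice is untouched
	sort.Strings(sorted)
	return strings.Join(sorted, "-")
}

// missingMetrics returns the expected metric names that are absent from seen,
// preserving the order of expected.
func missingMetrics(expected []string, seen map[string]bool) []string {
	var missing []string
	for _, name := range expected {
		if !seen[name] {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	key := dimKey([]string{"Namespace", "ClusterName"})

	// Expected metrics for this dimension set, as listed in the log above.
	expected := []string{
		"pod_gpu_memory_total",
		"pod_gpu_memory_used",
		"pod_gpu_power_draw",
		"pod_gpu_temperature",
		"pod_gpu_utilization",
		"pod_gpu_memory_utilization",
	}

	// Illustrative sample of observed metrics; one is deliberately left out.
	seen := map[string]bool{
		"pod_gpu_memory_total":       true,
		"pod_gpu_memory_used":        true,
		"pod_gpu_power_draw":         true,
		"pod_gpu_utilization":        true,
		"pod_gpu_memory_utilization": true,
	}

	if missing := missingMetrics(expected, seen); len(missing) > 0 {
		fmt.Printf("%s: missing %v\n", key, missing)
	} else {
		fmt.Printf("%s: Successful\n", key)
	}
}
```

Keeping the expected sets as plain data makes a change like the one in this PR, where metrics are added or dropped, a matter of editing the lists rather than the comparison logic.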

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@movence movence requested a review from a team as a code owner December 2, 2024 21:07
musa-asad
musa-asad previously approved these changes Dec 2, 2024
lisguo
lisguo previously approved these changes Dec 2, 2024
@movence movence dismissed stale reviews from lisguo and musa-asad via f60c452 December 3, 2024 03:12
@movence movence merged commit 618d981 into main Dec 3, 2024
2 checks passed
@lisguo

lisguo commented Dec 4, 2024

For additional context:

These tests needed to be updated after this change: aws/amazon-cloudwatch-agent@1f6c19c

musa-asad pushed a commit that referenced this pull request Dec 5, 2024
3 participants