Add integ test for Nvidia GPU in EKS #399

Merged
merged 1 commit into from
Apr 8, 2024

Contributor

@movence movence commented Apr 3, 2024

Description of changes

Add an integration test for GPUs on EKS. Highlights of this PR:

  • No GPU node is used for the test; an httpd server mocks the DCGM exporter instead
  • The httpd server serves the /metrics endpoint from a static Prometheus data file in which every metric value is 1
  • Terraform creates the CA and TLS certificates, then mounts them into the agent and httpd pods using Kubernetes secrets
  • The validation logic checks that each expected metric is present with its corresponding dimension sets

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

Integ test run: https://github.com/aws/amazon-cloudwatch-agent/actions/runs/8542742082

null_resource.validator (local-exec): 2024/04/03 10:53:45 >>>>>>>>>>>>>><<<<<<<<<<<<<<
null_resource.validator (local-exec): 2024/04/03 10:53:45 >>>>>>>>>>>>>>Successful<<<<<<<<<<<<<<
null_resource.validator (local-exec): 2024/04/03 10:53:45 ==============EKS_GPU_NVIDIA==============
null_resource.validator (local-exec): 2024/04/03 10:53:45 ==============Successful==============
null_resource.validator (local-exec): ClusterName-ContainerName-FullPodName-GpuDevice-Namespace-PodName   Successful
null_resource.validator (local-exec): container_gpu_memory_total                                          Successful
null_resource.validator (local-exec): container_gpu_memory_used                                           Successful
null_resource.validator (local-exec): container_gpu_power_draw                                            Successful
null_resource.validator (local-exec): container_gpu_temperature                                           Successful
null_resource.validator (local-exec): container_gpu_utilization                                           Successful
null_resource.validator (local-exec): container_gpu_memory_utilization                                    Successful
null_resource.validator (local-exec): ClusterName-FullPodName-GpuDevice-Namespace-PodName                 Successful
null_resource.validator (local-exec): pod_gpu_memory_total                                                Successful
null_resource.validator (local-exec): pod_gpu_memory_used                                                 Successful
null_resource.validator (local-exec): pod_gpu_power_draw                                                  Successful
null_resource.validator (local-exec): pod_gpu_temperature                                                 Successful
null_resource.validator (local-exec): pod_gpu_utilization                                                 Successful
null_resource.validator (local-exec): pod_gpu_memory_utilization                                          Successful
null_resource.validator (local-exec): ClusterName-InstanceId-NodeName                                     Successful
null_resource.validator (local-exec): node_gpu_memory_total                                               Successful
null_resource.validator (local-exec): node_gpu_memory_used                                                Successful
null_resource.validator (local-exec): node_gpu_power_draw                                                 Successful
null_resource.validator (local-exec): node_gpu_temperature                                                Successful
null_resource.validator (local-exec): node_gpu_utilization                                                Successful
null_resource.validator (local-exec): node_gpu_memory_utilization                                         Successful
null_resource.validator (local-exec): ClusterName-GpuDevice-InstanceId-InstanceType-NodeName              Successful
null_resource.validator (local-exec): node_gpu_memory_total                                               Successful
null_resource.validator (local-exec): node_gpu_memory_used                                                Successful
null_resource.validator (local-exec): node_gpu_power_draw                                                 Successful
null_resource.validator (local-exec): node_gpu_temperature                                                Successful
null_resource.validator (local-exec): node_gpu_utilization                                                Successful
null_resource.validator (local-exec): node_gpu_memory_utilization                                         Successful
null_resource.validator (local-exec): ClusterName-ContainerName-FullPodName-Namespace-PodName             Successful
null_resource.validator (local-exec): container_gpu_memory_total                                          Successful
null_resource.validator (local-exec): container_gpu_memory_used                                           Successful
null_resource.validator (local-exec): container_gpu_power_draw                                            Successful
null_resource.validator (local-exec): container_gpu_temperature                                           Successful
null_resource.validator (local-exec): container_gpu_utilization                                           Successful
null_resource.validator (local-exec): container_gpu_memory_utilization                                    Successful
null_resource.validator (local-exec): ClusterName-Namespace                                               Successful
null_resource.validator (local-exec): pod_gpu_memory_total                                                Successful
null_resource.validator (local-exec): pod_gpu_memory_used                                                 Successful
null_resource.validator (local-exec): pod_gpu_power_draw                                                  Successful
null_resource.validator (local-exec): pod_gpu_temperature                                                 Successful
null_resource.validator (local-exec): pod_gpu_utilization                                                 Successful
null_resource.validator (local-exec): pod_gpu_memory_utilization                                          Successful
null_resource.validator (local-exec): ClusterName-Namespace-PodName                                       Successful
null_resource.validator (local-exec): pod_gpu_memory_total                                                Successful
null_resource.validator (local-exec): pod_gpu_memory_used                                                 Successful
null_resource.validator (local-exec): pod_gpu_power_draw                                                  Successful
null_resource.validator (local-exec): pod_gpu_temperature                                                 Successful
null_resource.validator (local-exec): pod_gpu_utilization                                                 Successful
null_resource.validator (local-exec): pod_gpu_memory_utilization                                          Successful
null_resource.validator (local-exec): ClusterName-ContainerName-Namespace-PodName                         Successful
null_resource.validator (local-exec): container_gpu_memory_total                                          Successful
null_resource.validator (local-exec): container_gpu_memory_used                                           Successful
null_resource.validator (local-exec): container_gpu_power_draw                                            Successful
null_resource.validator (local-exec): container_gpu_temperature                                           Successful
null_resource.validator (local-exec): container_gpu_utilization                                           Successful
null_resource.validator (local-exec): container_gpu_memory_utilization                                    Successful
null_resource.validator (local-exec): ClusterName-FullPodName-Namespace-PodName                           Successful
null_resource.validator (local-exec): pod_gpu_memory_total                                                Successful
null_resource.validator (local-exec): pod_gpu_memory_used                                                 Successful
null_resource.validator (local-exec): pod_gpu_power_draw                                                  Successful
null_resource.validator (local-exec): pod_gpu_temperature                                                 Successful
null_resource.validator (local-exec): pod_gpu_utilization                                                 Successful
null_resource.validator (local-exec): pod_gpu_memory_utilization                                          Successful
null_resource.validator (local-exec): ClusterName                                                         Successful
null_resource.validator (local-exec): container_gpu_memory_total                                          Successful
null_resource.validator (local-exec): container_gpu_memory_used                                           Successful
null_resource.validator (local-exec): container_gpu_power_draw                                            Successful
null_resource.validator (local-exec): container_gpu_temperature                                           Successful
null_resource.validator (local-exec): container_gpu_utilization                                           Successful
null_resource.validator (local-exec): container_gpu_memory_utilization                                    Successful
null_resource.validator (local-exec): pod_gpu_memory_total                                                Successful
null_resource.validator (local-exec): pod_gpu_memory_used                                                 Successful
null_resource.validator (local-exec): pod_gpu_power_draw                                                  Successful
null_resource.validator (local-exec): pod_gpu_temperature                                                 Successful
null_resource.validator (local-exec): pod_gpu_utilization                                                 Successful
null_resource.validator (local-exec): pod_gpu_memory_utilization                                          Successful
null_resource.validator (local-exec): node_gpu_memory_total                                               Successful
null_resource.validator (local-exec): node_gpu_memory_used                                                Successful
null_resource.validator (local-exec): node_gpu_power_draw                                                 Successful
null_resource.validator (local-exec): node_gpu_temperature                                                Successful
null_resource.validator (local-exec): node_gpu_utilization                                                Successful
null_resource.validator (local-exec): node_gpu_memory_utilization                                         Successful
null_resource.validator (local-exec): emf-logs                                                            Successful
null_resource.validator (local-exec): 2024/04/03 10:53:45 ==============================
null_resource.validator (local-exec): 2024/04/03 10:53:45 >>>>>>>>>>>>>>><<<<<<<<<<<<<<<
null_resource.validator (local-exec): >>>> Finished GPU Container Insights TestSuite
null_resource.validator (local-exec): --- PASS: TestGPUSuite (188.82s)
null_resource.validator (local-exec):     --- PASS: TestGPUSuite/TestAllInSuite (188.82s)
null_resource.validator (local-exec): PASS
null_resource.validator (local-exec): ok  	github.com/aws/amazon-cloudwatch-agent-test/test/gpu	189.974s
null_resource.validator: Creation complete after 3m15s [id=6180149298917079822]

@movence movence requested a review from a team as a code owner April 3, 2024 15:07
@movence movence changed the title from "Add integ test for GPU" to "Add integ test for Nvidia GPU" Apr 3, 2024
@zhihonl
Contributor

zhihonl commented Apr 3, 2024

nit: Title should be "Add integ tests for Nvidia GPU in EKS"

@zhihonl
Contributor

zhihonl commented Apr 3, 2024

Can you link a full integration test run so we can check whether it breaks existing tests?

@movence
Contributor Author

movence commented Apr 3, 2024

Sure, added to the description. There are some failures, but they do not seem to be related to the GPU test changes.

@movence movence changed the title from "Add integ test for Nvidia GPU" to "Add integ test for Nvidia GPU in EKS" Apr 3, 2024
@zhihonl
Contributor

zhihonl commented Apr 4, 2024

https://github.com/aws/amazon-cloudwatch-agent/actions/runs/8542742082/job/23405557390

One of the eks_daemon tests failed. Is that just a flaky test?

@movence
Contributor Author

movence commented Apr 5, 2024

Seeing timeouts with:

module.windows.null_resource.fluentbit-windows (local-exec): Waiting for daemon set "fluent-bit-windows" rollout to finish: 0 of 1 updated pods are available...

which doesn't seem to be related to this PR, but I will let the change owner know about it.

Contributor

@Paramadon Paramadon left a comment


looks good!

@movence movence merged commit c638bfa into aws:main Apr 8, 2024
2 checks passed
@movence movence deleted the nvidia-gpu branch May 13, 2024 18:22
3 participants