Fix GPU E2E test failures #1457

Merged
merged 2 commits into from
Dec 4, 2024

@movence movence commented Dec 4, 2024

Description of the issue

  • kubectl rollout times out while checking the status of the NVIDIA device plugin:
Waiting for daemon set "nvidia-device-plugin-daemonset" rollout to finish: 0 of 1 updated pods are available...
error: timed out waiting for the condition
  • terraform destroy fails with an invalid value for the beta variable:
╷
│ Error: Invalid value for input variable
│ 
│   on variables.tf line 39:
│   39: variable "beta" {
│ 
│ Unsuitable value for var.beta set using -var="beta=...": a bool is
│ required.

Description of changes

  • Replace the kubectl rollout status check with a simple sleep
  • Drop the beta variable from the terraform commands
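
The two changes above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the helper names `wait_for_device_plugin` and `destroy_command`, the 60-second default delay, and the `-auto-approve` flag are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: helper names and the default delay are hypothetical,
# they are not taken from the PR diff.

# Pause for a fixed interval instead of polling `kubectl rollout status`,
# which timed out waiting for nvidia-device-plugin-daemonset.
wait_for_device_plugin() {
  delay_seconds="${1:-60}"  # assumed default; the PR does not state the value
  echo "sleeping ${delay_seconds}s for nvidia-device-plugin-daemonset"
  sleep "${delay_seconds}"
}

# Build the destroy command with no -var="beta=..." argument, so Terraform
# never receives a non-bool value for var.beta.
destroy_command() {
  echo "terraform destroy -auto-approve"
}
```

A caller would run `wait_for_device_plugin 120` before the GPU tests and `$(destroy_command)` during cleanup. A fixed sleep trades precision for simplicity: it cannot hang on a rollout condition, but it also does not confirm that the daemon set pods are actually available.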

License

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Tests

https://github.com/aws/amazon-cloudwatch-agent/actions/runs/12165006108

Requirements

Before committing the code, please complete the following steps.

  1. Run make fmt and make fmt-sh
  2. Run make lint

@movence movence requested a review from a team as a code owner December 4, 2024 19:24
@movence movence merged commit a44d883 into main Dec 4, 2024
7 checks passed
@movence movence deleted the gpu-e2e branch December 4, 2024 20:11