forked from open-telemetry/opentelemetry-collector-contrib
Add gpu count metrics including limit, request and total counts #214
Merged
movence
merged 3 commits into
amazon-contributing:aws-cwa-dev
from
movence:ci-gpu-count
May 29, 2024
movence
requested review from
mitali-salvi and
sky333999
and removed request for
mxiamxia
May 21, 2024 12:19
sky333999
reviewed
May 23, 2024
receiver/awscontainerinsightreceiver/internal/k8sapiserver/k8sapiserver.go
(5 resolved review threads, 2 of them outdated)
…d, node and cluster levels
remove bool flag for ConvertToOTLPMetrics
sky333999
approved these changes
May 24, 2024
mitali-salvi
approved these changes
May 29, 2024
movence
merged commit 2728c19 into
amazon-contributing:aws-cwa-dev
May 29, 2024
111 of 122 checks passed
** Revision **
Description:
Add NVIDIA GPU count metrics, including `_limit`, `_request` and `_total`, at the pod, node and cluster levels. With this change, the leader agent pod collects the GPU count metrics instead of each individual agent pod collecting them. This makes it possible to collect limit and request metrics for pods that are still in pending status because of a lack of available GPU devices. The leader agent still gathers information from pending pods for `limit` and `request`, but the `*_total` metric only includes GPU counts from running pods.

Testing:
Tested with a cluster that has 2 `g4dn.12xlarge` instances with 4 GPU devices each. There are 4 RUNNING pods (8 allocated GPU devices in total) and 2 PENDING pods requesting 2 GPU devices each.