Move GPU metrics for request, limit, usage & reserved_capacity into pod store #225
Description:
#214 introduced NVIDIA GPU count metrics to track request, limit & usage counts at the pod, node & cluster level. However, that implementation scraped all of the metrics for the entire cluster from the leader node and introduced a new `PodGPU` log event type. This change moves the scraping of these metrics into the pod store, so they are scraped on each agent/collector pod rather than only on the leader, and they are now emitted as part of the existing `Pod` and `Node` log event types (similar to the cpu & mem metrics). This change also tweaks the metrics to more closely match the cpu metrics.
Testing: Deployed changes to a test cluster running a mix of GPU and non-GPU nodes.