Add Neuron Scraper for scraping neuron monitor metrics #184

Merged: 62 commits, Mar 21, 2024

@sam6134 sam6134 commented Mar 7, 2024

Description: Adding Neuron Scraper configs and decorator

CR for initial review of the NeuronScraper:

  • Rebased the GPU code, and made the decorator generic
  • Added a new decorator to add Pod-Attributes to the metric
  • Made both the DCGM scraper and the NeuronScraper implement the SimpleScraper
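
To illustrate the second bullet, here is a minimal sketch of a generic decorator that adds pod attributes to datapoints. It is not the code from this PR: `Datapoint`, `PodInfo`, `PodLookup`, `staticLookup`, and `DecoratePodAttributes` are hypothetical names, and the real decorator works on the collector's own metric types. The attribute keys (`PodName`, `Namespace`, `ContainerName`, `neuron_device_index`) are taken from the test output in this description.

```go
package main

import "fmt"

// Datapoint is a hypothetical stand-in for one scraped metric datapoint.
type Datapoint struct {
	Attributes map[string]string
	Value      float64
}

// PodInfo holds the pod-level fields that appear in the test output.
type PodInfo struct {
	PodName       string
	Namespace     string
	ContainerName string
}

// PodLookup is a hypothetical interface that resolves a device ID to the
// pod currently using that device.
type PodLookup interface {
	PodForDevice(deviceID string) (PodInfo, bool)
}

// staticLookup is a fixed device-to-pod table, used only for this example.
type staticLookup map[string]PodInfo

func (s staticLookup) PodForDevice(deviceID string) (PodInfo, bool) {
	pod, ok := s[deviceID]
	return pod, ok
}

// DecoratePodAttributes adds pod attributes to every datapoint that carries
// the attribute named by deviceKey. Passing the key as a parameter is what
// lets one decorator serve both GPU and Neuron metrics in this sketch.
func DecoratePodAttributes(dps []Datapoint, lookup PodLookup, deviceKey string) {
	for i := range dps {
		deviceID, ok := dps[i].Attributes[deviceKey]
		if !ok {
			continue // no device attribute: leave the datapoint untouched
		}
		pod, found := lookup.PodForDevice(deviceID)
		if !found {
			continue // no pod is mapped to this device
		}
		dps[i].Attributes["PodName"] = pod.PodName
		dps[i].Attributes["Namespace"] = pod.Namespace
		dps[i].Attributes["ContainerName"] = pod.ContainerName
	}
}

func main() {
	dps := []Datapoint{
		{Attributes: map[string]string{"neuron_device_index": "0"}, Value: 6.315872e+06},
	}
	lookup := staticLookup{
		"0": {PodName: "trn1-mlp", Namespace: "default", ContainerName: "trn1-mlp"},
	}
	DecoratePodAttributes(dps, lookup, "neuron_device_index")
	fmt.Println(dps[0].Attributes)
}
```

Datapoints without a matching device or pod are left unchanged, so the decorator can run over metrics that are not tied to a pod.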

Testing: Deployed on a test cluster, consumed the logs, and printed the final output:

            "Metric_14": {
                "name": "neuroncore_memory_usage_tensors",
                "datapoints": [
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:0 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:0 neuroncore:0 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:1 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:0 neuroncore:1 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:10 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:5 neuroncore:10 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:11 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:5 neuroncore:11 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:12 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:6 neuroncore:12 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:13 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:6 neuroncore:13 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:14 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:7 neuroncore:14 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:15 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:7 neuroncore:15 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:16 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:8 neuroncore:16 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:17 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:8 neuroncore:17 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:18 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:9 neuroncore:18 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:19 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:9 neuroncore:19 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
                        "attributes": "map[ClusterName:my-trn1-cluster ContainerName:trn1-mlp DeviceId:2 FullPodName:trn1-mlp InstanceId:i-09679ee85eb4ec8ee K8sPodName:trn1-mlp Namespace:default PodName:trn1-mlp availability_zone:us-east-1c instance_id:i-09679ee85eb4ec8ee instance_type:trn1n.32xlarge kubernetes:{"container_name":"trn1-mlp","containerd":{"container_id":"7405c4b10700c84fafb45f7f8cc394f0af8f2cf7e0f6b0515b5d180bd4ce1297"},"labels":{"my-label1":"label1-value","my-label2":"label2-value"},"namespace_name":"default","pod_name":"trn1-mlp"} memory_location:None neuron_device_index:1 neuroncore:2 region:us-east-1 runtime_tag:367 subnet_id:subnet-06a7754948e8a000f]",
                        "value": 6.315872e+06,
                    },
                    {
...
....
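
Two patterns are visible in the datapoints shown above: the `DeviceId` values appear in string order (0, 1, 10, ..., 19, 2), and every `neuron_device_index` equals `neuroncore` divided by two, rounded down. The sketch below reproduces both. The constant `coresPerDevice` is an assumption inferred from this sample only, and `deviceIndexForCore` and `sortedNumerically` are hypothetical helpers, not functions from this PR.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
)

// coresPerDevice is an assumption inferred from the sample output, where
// cores 0 and 1 map to device 0, cores 10 and 11 map to device 5, and so on.
const coresPerDevice = 2

// deviceIndexForCore returns the device index implied by a core index,
// under the coresPerDevice assumption above.
func deviceIndexForCore(core int) int {
	return core / coresPerDevice
}

// sortedNumerically returns the decimal ID strings ordered by numeric value.
// It returns an error if any ID is not a valid integer.
func sortedNumerically(ids []string) ([]string, error) {
	nums := make([]int, len(ids))
	for i, id := range ids {
		n, err := strconv.Atoi(id)
		if err != nil {
			return nil, err
		}
		nums[i] = n
	}
	sort.Ints(nums)
	out := make([]string, len(nums))
	for i, n := range nums {
		out[i] = strconv.Itoa(n)
	}
	return out, nil
}

func main() {
	ids := []string{"2", "10", "0", "11", "1"}

	// Sorting as strings gives the order seen in the printed output.
	lexical := append([]string(nil), ids...)
	sort.Strings(lexical)
	fmt.Println("string order: ", lexical)

	numeric, err := sortedNumerically(ids)
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("numeric order:", numeric)

	for _, core := range []int{0, 1, 10, 19} {
		fmt.Printf("neuroncore %d -> neuron_device_index %d\n", core, deviceIndexForCore(core))
	}
}
```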

movence and others added 30 commits February 6, 2024 09:02
use constant variables
use the same scrape configs in dcgm scraper test
remove unnecessary attribute decoration for GPU metrics
add dcgm as source for dim
@sam6134 sam6134 requested a review from mxiamxia as a code owner March 7, 2024 18:17
@sam6134 sam6134 requested review from movence and straussb and removed request for mxiamxia March 7, 2024 18:17
@sam6134 sam6134 changed the title Ci neuron Add Neuron Scraper for scraping neuron monitor metrics Mar 12, 2024
straussb previously approved these changes Mar 13, 2024
@straussb straussb left a comment

For any future readers, note that this PR was continued from aditya-purang#1

@sam6134 sam6134 requested a review from movence March 19, 2024 16:52
@sam6134 sam6134 merged commit 7441665 into amazon-contributing:aws-cwa-dev Mar 21, 2024
42 of 67 checks passed