Abnormal gaps in monitoring data #2459

Open
safeAndSound3 opened this issue Feb 11, 2025 · 4 comments

@safeAndSound3

Question and Steps to reproduce

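Version: v7.7.2
Architecture: n9e (three nodes, high availability behind a VIP) + victoria (single-instance vminsert, vmselect, vmstorage) + categraf; database: three-node Galera; Redis: three-node Sentinel
Client nodes: 62, growing to about 300 later
Monitoring requirements: GPU and basic host information, plus syslog integration

Currently, DCGM-based GPU collection on the 62 machines shows gaps or no data on many nodes. Is this caused by the network or by insufficient resources?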

Image

Relevant logs and configurations

Architecture: n9e (three nodes, high availability behind a VIP) + victoria (single-instance vminsert, vmselect, vmstorage) + categraf; database: three-node Galera; Redis: three-node Sentinel

Version

Version: v7.7.2

@UlricQin
Member

UlricQin commented Feb 11, 2025

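What is the DCGM collection interval? Also, open the instant query page and query the metric that has gaps with a range selector (for example, metric_name{name="abc"}[5m]). Switch to the table view and check whether the samples returned match the reporting interval.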
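
The range-query check suggested above can also be done on exported timestamps. The following is a minimal sketch, not part of n9e or categraf: the function name `find_gaps`, the 15-second interval, and the 50% tolerance are all assumptions chosen for illustration. It takes the sample timestamps seen in the table view and reports every stretch where consecutive samples are further apart than expected.

```python
from typing import List, Tuple


def find_gaps(
    timestamps: List[float],
    expected_interval: float,
    tolerance: float = 0.5,
) -> List[Tuple[float, float, int]]:
    """Return (gap_start, gap_end, missed_samples) for every gap.

    A gap is any pair of consecutive samples whose distance exceeds
    expected_interval * (1 + tolerance). The tolerance is a hypothetical
    default that absorbs ordinary scrape jitter.
    """
    gaps = []
    ordered = sorted(timestamps)
    for prev, curr in zip(ordered, ordered[1:]):
        delta = curr - prev
        if delta > expected_interval * (1 + tolerance):
            # Estimate how many reports never arrived in between.
            missed = round(delta / expected_interval) - 1
            gaps.append((prev, curr, missed))
    return gaps


# Example: samples expected every 15 s, two reports missing after t=30.
samples = [0, 15, 30, 75, 90]
print(find_gaps(samples, expected_interval=15))
# [(30, 75, 2)]
```

If the gaps line up with the collection interval (whole samples missing), the loss is more likely on the collection side than in storage or query.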

@safeAndSound3
Author

safeAndSound3 commented Feb 12, 2025

Image

input.dcgm/exporter.toml

[[instances]]
# path to the file, that contains the DCGM fields to collect
collectors = "conf/input.dcgm/default-counters.csv"

# Enable kubernetes mapping metrics to kubernetes pods
kubernetes = true

# Choose Type of GPU ID to use to map kubernetes resources to pods. Possible values: "uid", "device-name"
kubernetes-gpu-id-type = "uid"

# Use old 1.x namespace
# use-old-namespace = false

cpu-devices = "f"

# gpu devices
devices = "f"

switch-devices = "f"

# ConfigMap <NAMESPACE>:<NAME> for metric data
configmap-data = "none"

# Connect to remote hostengine at <HOST>:<PORT>
remote-hostengine-info = "127.0.0.1:5555"

# Accept GPUs that are fake, for testing purposes only
# fake-gpus = false

# Replaces every blank space in the GPU model name with a dash, ensuring a continuous, space-free identifier.
# replace-blanks-in-model-name = false

@safeAndSound3
Author

safeAndSound3 commented Feb 12, 2025

Image
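Of the 62 machines, four have been normal the whole time. categraf runs from a shared directory, so the configuration is certainly identical on all 62 machines.
On the abnormal machines, nothing is collected at first; startup gets stuck at: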

Image

2025/02/12 09:34:40 exporter.go:159: Attemping to connect to remote hostengine at %!(EXTRA string=127.0.0.1:5555)
2025/02/12 09:36:40 exporter.go:170: DCGM successfully initialized!
2025/02/12 09:36:40 exporter.go:179: Not collecting DCP metrics:  Host engine connection invalid/disconnected
INFO[0120] Falling back to metric file 'conf/input.dcgm/default-counters.csv' 
INFO[0120] Initializing system entities of type: GPU    
2025/02/12 09:36:40 exporter.go:215: Not collecting GPU metrics; Error getting devices count: Host engine connection invalid/disconnected
2025/02/12 09:36:40 exporter.go:215: Not collecting NvSwitch metrics; no fields to watch for device type: 3
2025/02/12 09:36:40 exporter.go:215: Not collecting NvLink metrics; no fields to watch for device type: 6
2025/02/12 09:36:40 exporter.go:215: Not collecting CPU metrics; no fields to watch for device type: 7
2025/02/12 09:36:40 exporter.go:215: Not collecting CPU Core metrics; no fields to watch for device type: 8
INFO[0120] Kubernetes metrics collection enabled!       
2025/02/12 09:36:40 metrics_agent.go:321: I! input: local.dcgm started
2025/02/12 09:36:40 agent.go:46: I! [*agent.MetricsAgent] started
2025/02/12 09:36:40 agent.go:49: I! agent started

After restarting nvidia-dcgm, the connection may succeed, but the data pushed afterwards still arrives with gaps.
DCGM version: 1:3.3.8
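
Since the log above stalls at the connection to 127.0.0.1:5555, a quick TCP probe on an affected node can separate "nothing is listening" from "listening but slow or unhealthy". This is only a sketch under assumptions: `probe_tcp` is a hypothetical helper, the 3-second timeout is arbitrary, and a successful TCP connect shows only that some process accepts connections on that port, not that the DCGM host engine is working correctly.

```python
import socket
import time


def probe_tcp(host: str, port: int, timeout: float = 3.0):
    """Attempt a TCP connect and return (ok, seconds_elapsed, error_text)."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True, time.monotonic() - start, ""
    except OSError as exc:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False, time.monotonic() - start, str(exc)


if __name__ == "__main__":
    # Address taken from remote-hostengine-info in the posted config.
    ok, elapsed, err = probe_tcp("127.0.0.1", 5555)
    print(f"reachable={ok} elapsed={elapsed:.3f}s error={err!r}")
```

Running this on one of the four healthy nodes and on a failing node gives a direct comparison before and after restarting nvidia-dcgm.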

@safeAndSound3
Author

safeAndSound3 commented Feb 12, 2025

./categraf --test --debug --inputs dcgm

2025/02/12 09:33:53 main.go:149: I! runner.binarydir: /mnt/inaisfs/scripts/categraf
2025/02/12 09:33:53 main.go:150: I! runner.hostname: n30027
2025/02/12 09:33:53 main.go:151: I! runner.fd_limits: (soft=65536, hard=65536)
2025/02/12 09:33:53 main.go:152: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2025/02/12 09:33:53 provider_manager.go:60: I! use input provider: [local]
2025/02/12 09:33:53 prometheus_agent.go:19: I! prometheus scraping disabled!
2025/02/12 09:33:53 ibex_agent.go:19: I! ibex agent disabled!
2025/02/12 09:33:53 agent.go:38: I! agent starting
2025/02/12 09:33:53 exporter.go:159: Attemping to connect to remote hostengine at %!(EXTRA string=127.0.0.1:5555)
^C
root@n30027:/mnt/inaisfs/scripts/categraf# ./categraf --test --debug --inputs dcgm
2025/02/12 09:34:40 main.go:149: I! runner.binarydir: /mnt/inaisfs/scripts/categraf
2025/02/12 09:34:40 main.go:150: I! runner.hostname: n30027
2025/02/12 09:34:40 main.go:151: I! runner.fd_limits: (soft=65536, hard=65536)
2025/02/12 09:34:40 main.go:152: I! runner.vm_limits: (soft=unlimited, hard=unlimited)
2025/02/12 09:34:40 provider_manager.go:60: I! use input provider: [local]
2025/02/12 09:34:40 prometheus_agent.go:19: I! prometheus scraping disabled!
2025/02/12 09:34:40 ibex_agent.go:19: I! ibex agent disabled!
2025/02/12 09:34:40 agent.go:38: I! agent starting
2025/02/12 09:34:40 exporter.go:151: Starting dcgm-exporter
2025/02/12 09:34:40 exporter.go:155: &{CollectorsFile:conf/input.dcgm/default-counters.csv Address: CollectInterval:0 Kubernetes:true KubernetesGPUIdType:uid CollectDCP:true UseOldNamespace:false UseRemoteHE:true RemoteHEInfo:127.0.0.1:5555 GPUDevices:{Flex:true MajorRange:[] MinorRange:[]} SwitchDevices:{Flex:true MajorRange:[] MinorRange:[]} CPUDevices:{Flex:true MajorRange:[] MinorRange:[]} NoHostname:false UseFakeGPUs:false ConfigMapData:none MetricGroups:[] WebSystemdSocket:false WebConfigFile: XIDCountWindowSize:0 ReplaceBlanksInModelName:false Debug:true ClockEventsCountWindowSize:0}
2025/02/12 09:34:40 exporter.go:159: Attemping to connect to remote hostengine at %!(EXTRA string=127.0.0.1:5555)
2025/02/12 09:36:40 exporter.go:170: DCGM successfully initialized!
2025/02/12 09:36:40 exporter.go:179: Not collecting DCP metrics:  Host engine connection invalid/disconnected
INFO[0120] Falling back to metric file 'conf/input.dcgm/default-counters.csv' 
INFO[0120] Initializing system entities of type: GPU    
2025/02/12 09:36:40 exporter.go:215: Not collecting GPU metrics; Error getting devices count: Host engine connection invalid/disconnected
2025/02/12 09:36:40 exporter.go:215: Not collecting NvSwitch metrics; no fields to watch for device type: 3
2025/02/12 09:36:40 exporter.go:215: Not collecting NvLink metrics; no fields to watch for device type: 6
2025/02/12 09:36:40 exporter.go:215: Not collecting CPU metrics; no fields to watch for device type: 7
2025/02/12 09:36:40 exporter.go:215: Not collecting CPU Core metrics; no fields to watch for device type: 8
INFO[0120] Kubernetes metrics collection enabled!       
2025/02/12 09:36:40 metrics_agent.go:321: I! input: local.dcgm started
2025/02/12 09:36:40 agent.go:46: I! [*agent.MetricsAgent] started
2025/02/12 09:36:40 agent.go:49: I! agent started
2025/02/12 09:36:40 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:36:40 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 396.955µs
2025/02/12 09:36:55 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:36:55 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 141.283µs
2025/02/12 09:37:10 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:37:10 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 200.64µs
2025/02/12 09:37:25 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:37:25 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 158.02µs
2025/02/12 09:37:40 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:37:40 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 143.777µs
2025/02/12 09:37:55 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:37:55 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 171.759µs
2025/02/12 09:38:10 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:38:10 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 164.71µs
2025/02/12 09:38:25 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:38:25 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 178.765µs
2025/02/12 09:38:40 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:38:40 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 154.497µs
2025/02/12 09:38:55 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:38:55 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 146.864µs
2025/02/12 09:39:10 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:39:10 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 154.169µs
2025/02/12 09:39:25 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:39:25 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 170.616µs
2025/02/12 09:39:40 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:39:40 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 167.016µs
2025/02/12 09:39:55 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:39:55 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 140.978µs
2025/02/12 09:40:10 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:40:10 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 149.546µs
2025/02/12 09:40:25 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:40:25 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 170.043µs
2025/02/12 09:40:40 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:40:40 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 159.827µs
2025/02/12 09:40:55 metrics_reader.go:54: D! local.dcgm : before gather once
2025/02/12 09:40:55 metrics_reader.go:60: D! local.dcgm : after gather once, duration: 157.307µs
^C2025/02/12 09:40:59 main.go:131: I! received signal: interrupt
2025/02/12 09:40:59 agent.go:53: I! agent stopping
2025/02/12 09:40:59 agent.go:61: I! [*agent.MetricsAgent] stopped
2025/02/12 09:40:59 agent.go:64: I! agent stopped
2025/02/12 09:40:59 main.go:144: I! exited
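
To compare nodes, it may help to measure how long each one waits between the connection attempt and the initialization message. The sketch below is an illustration only: `seconds_between` is a hypothetical helper, and it assumes the timestamp format shown in the log above. Applied to the two relevant lines from this run, it reports the two-minute stall.

```python
import re
from datetime import datetime

# Matches the leading "YYYY/MM/DD HH:MM:SS " stamp used in the categraf log.
STAMP = re.compile(r"^(\d{4}/\d{2}/\d{2} \d{2}:\d{2}:\d{2}) ")


def seconds_between(log_text, start_marker, end_marker):
    """Seconds from the first line containing start_marker to the first
    later line containing end_marker, or None if either is missing."""
    start = None
    for line in log_text.splitlines():
        match = STAMP.match(line)
        if not match:
            continue  # skip lines without a timestamp, e.g. INFO[0120] ...
        ts = datetime.strptime(match.group(1), "%Y/%m/%d %H:%M:%S")
        if start is None and start_marker in line:
            start = ts
        elif start is not None and end_marker in line:
            return (ts - start).total_seconds()
    return None


log = """\
2025/02/12 09:34:40 exporter.go:159: Attemping to connect to remote hostengine at %!(EXTRA string=127.0.0.1:5555)
2025/02/12 09:36:40 exporter.go:170: DCGM successfully initialized!
"""
print(seconds_between(log, "connect to remote hostengine", "DCGM successfully initialized"))
# 120.0
```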
