nvidia
Folders and files
Name | Name | Last commit date | ||
---|---|---|---|---|
parent directory.. | ||||
NVIDIA GPU monitoring plugin for gmond ====================================== Installation instructions: * First install the Python Bindings for the NVIDIA Management Library: $ cd nvidia-ml-py-* $ sudo python setup.py install For the latest bindings see: http://pypi.python.org/pypi/nvidia-ml-py/ You can do a site install or place it in {libdir}/ganglia/python_modules * Copy python_modules/nvidia.py to {libdir}/ganglia/python_modules * Copy conf.d/nvidia.pyconf to /etc/ganglia/conf.d * Copy graph.d/* to {ganglia_webroot}/graph.d/ * A demo of what the GPU graphs look like is available here: http://ganglia.ddbj.nig.ac.jp/?c=research+month+gpu+queue&h=t135i&m=load_one&r=hour&s=by+name&hc=4&mc=2 By default all metrics that the management library could detect for your GPU are collected. For more information on what metrics are supported on what models, please refer to NVML documentation. The following metrics have been implemented: * gpu_num * gpu_driver * gpu_type * gpu_uuid * gpu_pci_id * gpu_mem_total * gpu_graphics_speed * gpu_sm_speed * gpu_mem_speed * gpu_max_graphics_speed * gpu_max_sm_speed * gpu_max_mem_speed * gpu_temp * gpu_util * gpu_mem_util * gpu_mem_used * gpu_fan * gpu_power_usage * gpu_perf_state * gpu_ecc_mode Version 2: The following metrics have been implemented: * gpu_max_graphics_speed * gpu_max_sm_speed * gpu_max_mem_speed * gpu_serial * gpu_power_man_mode * gpu_power_man_limit Version 3: The following metrics have been implemented: * gpu_ecc_db_error * gpu_ecc_sb_error’ * gpu_power_violation_report * gpu_encoder_util * gpu_decoder_util * gpu_bar1_memory * gpu_shutdown_temp * gpu_slowdown_temp