Add capability to publish metrics to prometheus (NVIDIA#2684)
### Description

One feature request is to add system metrics so that FLARE runtime metrics can be monitored via Prometheus + Grafana or other monitoring systems. This PR adds that missing piece.

A few pieces make this work:

1) JobMetricsCollector/SysMetricsCollector: each collector subscribes a callback to the ReservedTopic.APP_METRICS topic in the DataBus and receives the callback when the topic is published.
   - The SysMetricsCollector listens to parent-process events (system start/end, etc.) for the client and server processes.
   - The JobMetricsCollector listens to job-process events, mostly related to the job, task, etc.

2) StatsD-reporter: the statsd-reporter posts the metrics received (from the event callback) to the statsd-exporter interface, by default localhost:9125. The statsd-exporter exposes the `<host>:9102/metrics` web interface for Prometheus to scrape, and Prometheus can then be used as a data source for Grafana to visualize.

This is a standard setup. We added an example with a docker-compose file to illustrate the process.

## NVFLARE Monitoring Metrics

| Event | Metric Count | Metric Time Taken |
|-------|--------------|-------------------|
| SYSTEM_START | _system_start_count | |
| SYSTEM_END | _system_end_count | _system_time_taken |
| ABOUT_TO_START_RUN | _about_to_start_run_count | |
| START_RUN | _start_run_count | |
| ABOUT_TO_END_RUN | _about_to_end_run_count | |
| END_RUN | _end_run_count | _run_time_taken |
| CHECK_END_RUN_READINESS | _check_end_run_readiness_count | |
| SWAP_IN | _swap_in_count | |
| SWAP_OUT | _swap_out_count | |
| START_WORKFLOW | _start_workflow_count | |
| END_WORKFLOW | _end_workflow_count | _workflow_time_taken |
| ABORT_TASK | _abort_task_count | |
| FATAL_SYSTEM_ERROR | _fatal_system_error_count | |
| JOB_DEPLOYED | _job_deployed_count | |
| JOB_STARTED | _job_started_count | |
| JOB_COMPLETED | _job_completed_count | _job_time_taken |
| JOB_ABORTED | _job_aborted_count | |
| JOB_CANCELLED | _job_cancelled_count | |
| CLIENT_DISCONNECTED | _client_disconnected_count | |
| CLIENT_RECONNECTED | _client_reconnected_count | |
| BEFORE_PULL_TASK | _before_pull_task_count | |
| AFTER_PULL_TASK | _after_pull_task_count | _pull_task_time_taken |
| BEFORE_PROCESS_TASK_REQUEST | _before_process_task_request_count | |
| AFTER_PROCESS_TASK_REQUEST | _after_process_task_request_count | _process_task_request_time_taken |
| BEFORE_PROCESS_SUBMISSION | _before_process_submission_count | |
| AFTER_PROCESS_SUBMISSION | _after_process_submission_count | _process_submission_time_taken |
| BEFORE_TASK_DATA_FILTER | _before_task_data_filter_count | |
| AFTER_TASK_DATA_FILTER | _after_task_data_filter_count | _data_filter_time_taken |
| BEFORE_TASK_RESULT_FILTER | _before_task_result_filter_count | |
| AFTER_TASK_RESULT_FILTER | _after_task_result_filter_count | _result_filter_time_taken |
| BEFORE_TASK_EXECUTION | _before_task_execution_count | |
| AFTER_TASK_EXECUTION | _after_task_execution_count | _task_execution_time_taken |
| BEFORE_SEND_TASK_RESULT | _before_send_task_result_count | |
| AFTER_SEND_TASK_RESULT | _after_send_task_result_count | _send_task_result_time_taken |
| BEFORE_PROCESS_RESULT_OF_UNKNOWN_TASK | _before_process_result_of_unknown_task_count | |
| AFTER_PROCESS_RESULT_OF_UNKNOWN_TASK | _after_process_result_of_unknown_task_count | _process_result_of_unknown_task_time_taken |
| PRE_RUN_RESULT_AVAILABLE | _pre_run_result_available_count | |
| BEFORE_CHECK_CLIENT_RESOURCES | _before_check_client_resources_count | |
| AFTER_CHECK_CLIENT_RESOURCES | _after_check_client_resources_count | _check_client_resources_time_taken |
| SUBMIT_JOB | _submit_job_count | |
| DEPLOY_JOB_TO_SERVER | _deploy_job_to_server_count | |
| DEPLOY_JOB_TO_CLIENT | _deploy_job_to_client_count | |
| BEFORE_CHECK_RESOURCE_MANAGER | _before_check_resource_manager_count | |
| BEFORE_SEND_ADMIN_COMMAND | _before_send_admin_command_count | |
| BEFORE_CLIENT_REGISTER | _before_client_register_count | |
| AFTER_CLIENT_REGISTER | _after_client_register_count | client_register_time_taken |
| CLIENT_REGISTER_RECEIVED | _client_register_received_count | |
| CLIENT_REGISTER_PROCESSED | _client_register_processed_count | |
| CLIENT_QUIT | _client_quit_count | |
| SYSTEM_BOOTSTRAP | _system_bootstrap_count | |

These metrics can be separated into Job Metrics and System Metrics. System Metrics are associated with the Client and Server parent processes, while Job Metrics are associated with each job.

We support three different setups:

![setup-1](https://github.com/user-attachments/assets/c031cf99-a997-4d0d-9601-be1e71394bc3)

![setup-2](https://github.com/user-attachments/assets/dd37ac9b-32d3-4c6f-94f1-b2878dda1616)

![setup-3](https://github.com/user-attachments/assets/28182d8c-3672-41e9-9e3a-227c613ccf31)

Detailed examples for setups 1 and 2 are provided, using hello-pt.

### Types of changes
<!--- Put an `x` in all the boxes that apply, and remove the not applicable items -->
- [x] Non-breaking change (fix or new feature that would not break existing functionality).
- [ ] Breaking change (fix or new feature that would cause existing functionality to change).
- [ ] New tests added to cover the changes.
- [ ] Quick tests passed locally by running `./runtest.sh`.
- [ ] In-line docstrings updated.
- [ ] Documentation updated.
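
### Illustrative sketch: posting metrics to the statsd-exporter

As a rough illustration of the statsd-reporter step in the description, the sketch below formats a count metric and a time-taken metric in the StatsD line protocol and sends them over UDP to the default statsd-exporter address (localhost:9125). It is not the code from this PR: the function names, the `site_1` prefix, and the choice of the StatsD counter (`c`) and timer (`ms`) types are assumptions made for illustration only.

```python
import socket

STATSD_HOST = "localhost"  # default statsd-exporter host named in the description
STATSD_PORT = 9125         # default statsd-exporter port named in the description


def format_count(prefix: str, event: str, value: int = 1) -> str:
    """Build a StatsD counter line for an event (hypothetical helper).

    The metric name follows the '<prefix>_<event>_count' pattern suggested by
    the metrics table; the prefix itself is an assumption.
    """
    return f"{prefix}_{event.lower()}_count:{value}|c"


def format_time_taken(prefix: str, name: str, seconds: float) -> str:
    """Build a StatsD timer line (hypothetical helper).

    StatsD timers are expressed in milliseconds, so seconds are converted.
    """
    millis = round(seconds * 1000)
    return f"{prefix}_{name}_time_taken:{millis}|ms"


def send_lines(lines, host: str = STATSD_HOST, port: int = STATSD_PORT) -> None:
    """Send each metric line as one UDP datagram (fire and forget)."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        for line in lines:
            sock.sendto(line.encode("utf-8"), (host, port))
```

With a statsd-exporter listening on the default port, calling `send_lines([format_count("site_1", "SYSTEM_START")])` would deliver one counter increment, which Prometheus could then scrape from the exporter's `/metrics` endpoint. UDP is used here because it is the usual StatsD transport; whether the PR's reporter uses a client library instead is not stated in the description.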