Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve monitoring information of WMAgent components #12302

Open
wants to merge 5 commits into
base: master
Choose a base branch
from

Conversation

vkuznet
Copy link
Contributor

@vkuznet vkuznet commented Mar 11, 2025

Fixes #12145

Status

ready

Description

To improve WMagent component's monitoring we introduced the following changes:

  • Add new procfs module with processStatus function based on /proc info for given PID. This function returns information about process threads which later can be used by WMStats. For example, here is addition to agentInfo dictionary we supply to CouchDB:
  "components": {
...
    "DBS3Upload": [
      {"process": "python", "pid": "1921054", "status": "S (sleeping)",  "type": "process"},
      {"process": "python", "pid": "1921059", "status": "S (sleeping)",  "type": "thread" }
    ],
...
}

It shows the main process and its threads, in this case main process pid is 1921054, and pid of its threads (in this case only one additional thread) is 1921059 which is in sleeping state.

  • add processThreadsInfo function in Utils/ProcessStats.py module which provides process and thread monitoring information:
{"process_name": .., "pid": .., "status": .., "ppid": .., "cmdline":.., 
 "cpu_usage_percent": .., "memory_usage_percent": .., "memory_rss": .., 
 "num_open_files": .., "num_connections": .., 
 "threads": [..]}

and thread info:

{"thread_id": .., "user_time":.., "system_time":.., "cpu_usage_percent": ..,
  "memory_usage_bytes":.., "num_open_files":.., "num_connections":..,
  "state":.., "name":..}
  • update WMComponent/AnalyticsDataCollector/DataCollectAPI.py parts with information about dead threads via threadsDetails function within agentInfo['down_component_detail'] used by WMStats web UI
  • update bin/wmcoreD where we provide status of component's process along with its threads, e.g.
manage status
...
Component:WorkQueueManager 3545240 running with threads [3545241, 3545242, 3545243, 3545244]
Component:DBS3Upload 3545249 running with threads [3545254]
...

Each component now reports all its threads and if thread is missing/died/stuck it will provide proper message about it, e.g. component can be in partially-running state where its main process and some of the threads are ok but other threads misbehave

  • introduced trheads.json file in component area along with Daemon.xml to capture initial conditions of component state. This information is used by aforementioned new functionality to determine if original threads are running or died

  • propagate thread information into WMComponent/AnalyticsDataCollector/DataCollectAPI.py who use it and analyze state of specific component and properly report its state (with new threads information) to CouchDB.

Note: The information reported in WMStats web UI agentInfo tab can be accessed via the following URL: https://xxxx.cern.ch/couchdb/wmstats/_design/WMStats/_view/agentInfo which return nested dictionary which is used by WMStats JavaScript engine to report this information in agentInfo tab.

Pylint note: I noticed degradation in modified bin/wmcoreD codebase which is not related to proposed changes. For instance, there are underlined variables used in this codebase. And, I explicitly did not touch those errors/warnings which most likely come from stricter rules imposed by pylint over time. If those should be addressed I suggest to make an explicit request for them and I'll be happy to fix those issues.

Is it backward compatible (if not, which system it affects?)

YES

Related PRs

External dependencies / deployment changes

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: failed
    • 4 warnings and errors that must be fixed
    • 4 warnings
    • 50 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/463/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 34 comments to review
  • Pycodestyle check: succeeded
    • 3 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/464/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests added
    • 4 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 34 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/465/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: failed
    • 1 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 34 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/471/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests added
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 34 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/472/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 2 tests added
    • 2 changes in unstable tests
  • Python3 Pylint check: succeeded
    • 4 warnings
    • 34 comments to review
  • Pycodestyle check: succeeded
    • 1 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/476/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests added
    • 3 changes in unstable tests
  • Python3 Pylint check: failed
    • 14 warnings and errors that must be fixed
    • 5 warnings
    • 130 comments to review
  • Pycodestyle check: succeeded
    • 89 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/483/artifact/artifacts/PullRequestReport.html

@dmwm-bot
Copy link

Jenkins results:

  • Python3 Unit tests: succeeded
    • 3 tests added
    • 5 changes in unstable tests
  • Python3 Pylint check: failed
    • 20 warnings and errors that must be fixed
    • 10 warnings
    • 136 comments to review
  • Pycodestyle check: succeeded
    • 103 comments to review

Details at https://cmssdt.cern.ch/dmwm-jenkins/view/All/job/WMCore-PR-Report/519/artifact/artifacts/PullRequestReport.html

Copy link
Contributor

@amaltaro amaltaro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Without having a deep look into the details of the worker threads implementation, but seeing each component worker thread inherits a BaseWorkerThread:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py

and runs as a daemon, spawning a Daemon.xml file, e.g.:

(WMAgent-2.3.9.2) [xxx@vocms0xxx:current]$ cat install/WorkQueueManager/Daemon.xml

Can't we consolidate most of this development into this module:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Agent/Daemon/Details.py
?

Or perhaps we could have the relevant worker thread monitoring logic in the BaseWorkerThread itself:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py

In addition, I see the Harness module is used for starting up a component:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Agent/Harness.py
but I think it is a higher layer and would not fit well for the worker threads monitoring.

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 26, 2025

src/python/WMCore/WorkerThreads/BaseWorkerThread.py

Alan (@amaltaro), please go through the review process to avoid triggering unnecessary work. In my view, what you suggest does not fit in existing architectural design because we don't have ability to communicate with threads from external process. In other words, our components and pollers are python processes rather (HTTP) services where you can request some monitoring metrics. In external process with thread model what can be done is to push metrics somewhere, e.g. we can push such information to internal database, or Monit infrastructure and then use them afterwards. At the moment we rely on process ID (extracted from Daemon.xml) to inspect its status about run-time environment of components/daemons.

I provided a stand-along functions for that and they are used in Details.py as you mentioned in your comment. Therefore, I recommend that you review the current changes, and if you have specific idea how to implement thread monitoring via integration with [WMCore/WorkerThreads/BaseWorkerThread.py](https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py) I suggest that you outline your idea as proposal for such monitoring. For instance

  • add process metrics to BaseWorkerThread
  • how to push/collect these metrics elsewhere (please specify where)
  • how to extract these metrics from elsewhere and use it for monitoring of running components/daemons. For instance:
    • how to use these metrics in manage status
    • how to use these metrics in WMComponent/AnalyticsDataCollector/DataCollectAPI.py which report them to CouchDB, i.e. how it can communicate with run-time environment (BaseWorkerThread.py)
    • maybe we can push metrics directly from BaseWorkerThread.py to CouchDB, is it desired?

My point is that without global picture with all details of monitoring metrics workflows it is very hard to understand what you have in mind, and simply mentioning different parts of WMCore does not help to understand if it is feasible to accomplish with current architectural design.

@amaltaro
Copy link
Contributor

@vkuznet I am puzzled with the fragmentation of process-related code. Despite the extensive PR description (thanks!), please help me understand the following:
a) what is wmcoreD bin used for? I guess it is used by the manage script, hence the fragmentation from the components codebase
b) can't we merge the functionality in the new module ProcFS into ProcessStats? Is there any reason to keep those separated?
c) instead of extending Utilities module, can't we use the Daemon/Details module - which already loads the Daemon.xml and gives you the ProcessID element?
d) in DataCollectAPI, instead of relying in a broken information from the database - and the new data struct created - don't you think it is better to persist the actual worker thread ID in the database (wm_workers table?)

Checking the information already avaialble in WMStats agentInfo view, I see:

        "workers": [
          {
            "name": "AgentStatusPoller",
            "last_updated": 1743074171,
            "state": "Running",
            "poll_interval": 300,
            "cycle_time": 1.5446
          },
          {
            "name": "AnalyticsPoller",
            "last_updated": 1743074124,
            "state": "Running",
            "poll_interval": 600,
            "cycle_time": 0.8201
          },
...

and I would be in favor of always reporting the state of all worker threads (+ main component), instead of just publishing when one of the threads is down.

What do you think?

@mapellidario
Copy link
Member

Before I leave a review, I have a curiosity, mainly for my education, and if it has already been discussed i am sorry if i missed it. Is it there any reason why you use read the content of the /proc/$pid files in some places and use the python module psutils somewhere else? do they provide different information? if psutil can provide all the information we need, i would be in favor of never reading directly from /prod/$pid files

@vkuznet
Copy link
Contributor Author

vkuznet commented Mar 30, 2025

@amaltaro , here are my responses to your questions:
a) the wmcoreD is used within manage script, e.g.

start_agent(){
    if  _init_valid $wmaInitUsing ; then
        echo "Starting WMAgent..."
        wmcoreD --start --config=$WMA_CONFIG_DIR/config.py
    else
        echo "ERROR: This agent is not fully initialized. Cannot use it."
        return $(false)
    fi
}

b) we can merge ProcFS into ProcessStats but they have code not related to each other. The former keeps code to work with Linux file system, while later the code to about process monitoring. Because of this reason I kept them separately but I don't mind to merge them.
c) the only reason I added extractFromXML due to its generic nature, i.e. this function can be used with any XML parsing task. Moving this from Utilities to Daemon/Details is acceptable but based on my previous reviews you always wanted to keep pieces in right places, and to me keeping generic function in Utilities is more appropriate then in Daemon/Details. And, I do use processStatus function I provided in Daemon/Details. If my answers are not satisfactory to you please elaborate more how do you prefer relocate the code. In my view everything is in right places and used accordingly.
d) I don't really have a clue why we keep process information in a database as it is outdated due to different time it is used and collected. My understanding that we want to get up-to-date info about component to report it to the couch. The database write seems useless to me since knowing process ID we can get up-to-date process information. Regarding structure used agentInfo I intentionally kept it backward compatible to avoid even more changes on wmstats JavaScript (client) side. And, so far we only improving what we have. Instead the entire process of collecting info is outdated and I'm in favor of collecting metrics and pushing them to local Prometheus instances (which then can push it to CMS Monitoring one). Here again, I tried to minimize the effort (which seems already extensive).

@mapellidario , answering your question. Even though psutil is well known python package the Linux /proc file system is foundation of everything in Linux OS. I trust it more :) That said, I provided both ways, either use ProcFS module based on /proc FS, or use psutil approach in ProcessStats module.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve and simplify status check of WMAgent components
5 participants