Improve monitoring information of WMAgent components #12302
base: master
Conversation
Without having had a deep look at the details of the worker threads implementation, I see that each component worker thread inherits from BaseWorkerThread:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py
and runs as a daemon, spawning a Daemon.xml file, e.g.:
(WMAgent-2.3.9.2) [xxx@vocms0xxx:current]$ cat install/WorkQueueManager/Daemon.xml
Can't we consolidate most of this development into this module:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Agent/Daemon/Details.py
?
Or perhaps we could have the relevant worker thread monitoring logic in the BaseWorkerThread itself:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WorkerThreads/BaseWorkerThread.py
In addition, I see the Harness module is used for starting up a component:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Agent/Harness.py
but I think it sits at a higher layer and would not fit well for the worker thread monitoring.
Alan (@amaltaro), please go through the review process to avoid triggering unnecessary work. In my view, what you suggest does not fit the existing architectural design, because we have no ability to communicate with threads from an external process. In other words, our components and pollers are Python processes rather than (HTTP) services from which you can request monitoring metrics. With an external process and a thread model, what can be done is to push metrics somewhere, e.g. to an internal database or to the MONIT infrastructure, and then use them afterwards. At the moment we rely on the process ID (extracted from Daemon.xml) to inspect the run-time environment of components/daemons. I provided stand-alone functions for that and they are used in
My point is that without a global picture, with all the details of the monitoring-metrics workflow, it is very hard to understand what you have in mind, and simply mentioning different parts of WMCore does not help to determine whether it is feasible with the current architectural design.
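As a concrete illustration of the PID-based approach mentioned above, here is a minimal sketch of reading a component PID from its Daemon.xml and checking its state via /proc. The XML element and attribute names are assumptions made for illustration; the real schema and parsing live in WMCore/Agent/Daemon/Details.py.

```python
# Minimal sketch, assuming Daemon.xml stores the PID in a <ProcessID Value="..."/>
# element; the actual schema and parsing live in WMCore/Agent/Daemon/Details.py.
import xml.dom.minidom


def readPidFromDaemonXml(daemonXml):
    """Return the process ID recorded in a Daemon.xml file (assumed schema)."""
    doc = xml.dom.minidom.parse(daemonXml)
    node = doc.getElementsByTagName("ProcessID")[0]
    return int(node.getAttribute("Value"))


def processState(pid):
    """Return the state letter (R, S, D, Z, ...) from /proc/<pid>/stat, or None."""
    try:
        with open("/proc/%s/stat" % pid) as fstat:
            # /proc/<pid>/stat is "<pid> (<comm>) <state> ...", and <comm> may
            # contain spaces, so split on the closing parenthesis first
            return fstat.read().rsplit(")", 1)[1].split()[0]
    except IOError:
        return None  # the process is gone


if __name__ == "__main__":
    pid = readPidFromDaemonXml("install/WorkQueueManager/Daemon.xml")
    print("component pid %s is in state %s" % (pid, processState(pid)))
```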
@vkuznet I am puzzled by the fragmentation of process-related code. Despite the extensive PR description (thanks!), please help me understand the following: checking the information already available in WMStats
and I would be in favor of always reporting the state of all worker threads (+ the main component), instead of only publishing when one of the threads is down. What do you think?
Before I leave a review, I have a curiosity, mainly for my own education; if it has already been discussed, I am sorry I missed it. Is there any reason why you read the content of the
@amaltaro, here are my responses to your questions:
b) we can merge
@mapellidario, answering your question. Even though
Fixes #12145
Status
ready
Description
To improve the monitoring of WMAgent components we introduced the following changes:

- added a `processStatus` function based on `/proc` info for a given PID. It returns information about the process threads, which can later be used by WMStats; for example, the `agentInfo` dictionary we supply to CouchDB now shows the main process and its threads (in the example captured here the main process PID is 1921054 and its single additional thread, PID 1921059, is in a sleeping state). A sketch of the `/proc`-based approach is shown after this list.
- added a `processThreadsInfo` function in the `Utils/ProcessStats.py` module which provides process-level as well as per-thread monitoring information.
- extended `WMComponent/AnalyticsDataCollector/DataCollectAPI.py` with information about dead threads, provided via the `threadsDetails` function within `agentInfo['down_component_detail']` and used by the WMStats web UI.
- updated `bin/wmcoreD` to provide the status of a component's process along with its threads. Each component now reports all of its threads, and if a thread is missing, died, or is stuck, a proper message is reported; e.g. a component can be in a `partially-running` state where its main process and some of its threads are fine but other threads misbehave.
- introduced a `threads.json` file in the component area, next to `Daemon.xml`, to capture the initial conditions of the component state. This information is used by the new functionality above to determine whether the original threads are still running or have died (see the second sketch after this list).
- propagated the thread information into `WMComponent/AnalyticsDataCollector/DataCollectAPI.py`, which uses it to analyze the state of a specific component and properly report that state (with the new thread information) to CouchDB.
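A minimal sketch of the `/proc`-based technique described in the first item above, assuming we only need the name, state and thread count of a process plus the state of each of its threads. The function names and dictionary keys are illustrative and are not the actual `processStatus`/`processThreadsInfo` API in `Utils/ProcessStats.py`.

```python
# Illustrative sketch: collect the state of a process and of each of its threads
# from /proc; not the actual Utils/ProcessStats.py implementation.
import os


def readStatus(path):
    """Parse a /proc .../status file into a small dictionary."""
    info = {}
    with open(path) as fobj:
        for line in fobj:
            key, _, value = line.partition(":")
            if key in ("Name", "State", "Threads"):
                info[key.lower()] = value.strip()
    return info


def processStatusSketch(pid):
    """Return process info plus a {tid: state} map for all of its threads."""
    info = readStatus("/proc/%s/status" % pid)
    info["threads"] = {}
    for tid in os.listdir("/proc/%s/task" % pid):
        tinfo = readStatus("/proc/%s/task/%s/status" % (pid, tid))
        info["threads"][int(tid)] = tinfo.get("state")
    return info


if __name__ == "__main__":
    # inspect the current process as a self-contained example
    print(processStatusSketch(os.getpid()))
```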
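And a second sketch of how an initial thread snapshot written next to `Daemon.xml` (the `threads.json` mentioned above) could be compared against the live threads to decide between `running`, `partially-running` and `down`. The file layout and state labels follow the description but are assumptions made for illustration only.

```python
# Illustrative sketch: write a thread snapshot at component start-up and later
# compare it with the live threads to classify the component state.
import json
import os


def writeThreadSnapshot(pid, snapshotFile):
    """Record the thread IDs present when the component starts."""
    tids = sorted(int(t) for t in os.listdir("/proc/%s/task" % pid))
    with open(snapshotFile, "w") as fobj:
        json.dump({"pid": pid, "threads": tids}, fobj)


def componentState(snapshotFile):
    """Return (state, missing-threads) by comparing the snapshot with live threads."""
    with open(snapshotFile) as fobj:
        snap = json.load(fobj)
    pid = snap["pid"]
    if not os.path.exists("/proc/%s" % pid):
        return "down", snap["threads"]
    live = set(int(t) for t in os.listdir("/proc/%s/task" % pid))
    missing = [t for t in snap["threads"] if t not in live]
    if missing:
        return "partially-running", missing
    return "running", []
```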
Note: the information reported in the WMStats web UI agentInfo tab can be accessed via the following URL:
https://xxxx.cern.ch/couchdb/wmstats/_design/WMStats/_view/agentInfo
which returns the nested dictionary that the WMStats JavaScript engine uses to render the agentInfo tab.
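For completeness, a possible way to inspect that view programmatically; the host name is a placeholder (as in the URL above) and any authentication required by the deployment (e.g. X509 certificates) is deliberately omitted.

```python
# Illustrative only: fetch the agentInfo CouchDB view and print the row keys.
import json
import urllib.request

VIEW_URL = "https://xxxx.cern.ch/couchdb/wmstats/_design/WMStats/_view/agentInfo"

with urllib.request.urlopen(VIEW_URL) as resp:
    data = json.loads(resp.read().decode("utf-8"))

# CouchDB views return {"total_rows": ..., "rows": [{"key": ..., "value": ...}]}
for row in data.get("rows", []):
    print(row.get("key"))
```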
Pylint note: I noticed a degradation in the modified `bin/wmcoreD` codebase which is not related to the proposed changes; for instance, there are underlined variables used in this codebase. I explicitly did not touch those errors/warnings, which most likely come from stricter rules imposed by pylint over time. If they should be addressed, please make an explicit request and I'll be happy to fix those issues.

Is it backward compatible (if not, which system it affects?)
YES
Related PRs
External dependencies / deployment changes