-
Notifications
You must be signed in to change notification settings - Fork 7.3k
kernel: workq: introduce work timeout: #88345
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
kernel: workq: introduce work timeout: #88345
Conversation
7a199dc
to
ceb645c
Compare
added option to exclude the timeout entirely so the feature has zero overhead if not used. |
ceb645c
to
1e9f16f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nitpicks, but this seems unobjectionable
kernel/Kconfig
Outdated
@@ -600,6 +603,13 @@ config SYSTEM_WORKQUEUE_NO_YIELD | |||
cooperative and a sequence of work items is expected to complete | |||
without yielding. | |||
|
|||
config SYSTEM_WORKQUEUE_WORK_TIMEOUT_MS | |||
int "Select system work queue work timeout in milliseconds" | |||
default 10000 if DEBUG |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CONFIG_DEBUG is a control over compiler optimizations, it's the wrong tunable to use here. You probably want CONFIG_ASSERT to gate the default.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to ASSERT and decreased the time to 5000ms (still far longer than I think is reasonable but hopefully users will adjust it lower)
kernel/work.c
Outdated
bool k_sys_work_queue_is_blocked(void) | ||
{ | ||
return flag_test(&k_sys_work_q.flags, K_WORK_QUEUE_BLOCKED_BIT); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not really seeing the point of this as an application API? I mean, you're not supposed to be blocked at all. One doesn't normally write is_my_code_broken()
predicates, nor call them in working code.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is for use in tests, can move it to an internal header :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we use k_oops() on block this API can be removed entirely
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
API removed, thread is aborted if blocked, so a user who cares could join or check if the work queue thread is running :)
kernel/work.c
Outdated
if (name != NULL) { | ||
LOG_WRN("queue %s blocked by work %p with handler %p", name, work, handler); | ||
} else { | ||
LOG_WRN("queue %p blocked by work %p with handler %p", queue, work, handler); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the k_oops() from the last PR might be a better choice, though maybe configurable. A mere warning message isn't likely loud enough for something we almost all agree is a should-never-happen condition.
Alternatively make the callback be settable by the app and have the default blow up, but let apps do what they want?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will add k_oops() (or panic?), this will also make the code a tiny but simpler since there is no unblock scenario :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
changed strategy a bit, since the timeout is run from a _timeout, k_oops() makes no sense since it is not run from the same thread as the work queue, so instead, abort the work queue and let the kernel handle it (essential thread would result in k_panic())
f76fe37
to
b575ee5
Compare
Some more nitpicks - sorry - otherwise looking good |
Also, would be really good to add a test for this! |
b575ee5
to
417e7ab
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm out of nitpicks, this looks very reasonable
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lingering nits, but I guess I had better not block
if (name != NULL) { | ||
LOG_ERR("queue %s blocked by work %p with handler %p", name, work, handler); | ||
} else { | ||
LOG_ERR("queue %p blocked by work %p with handler %p", queue, work, handler); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Still not crazy about this, as it needlessly duplicates a nearly identical string in the string table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you solve it? the string is about 30 bytes, and stored in ROM, the added operations of conditionally copying the thread name or a queue pointer, to a RAM buffer, to then pass to the LOG_ERR could easily take up the same or more ROM (while also adding complexity, the worst kind, involving null terminated strings)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(LOG_ERR is probably adding code as well given the use of VARGS)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How would you solve it?
Using the code suggestion here:
#88345 (comment)
Introduce work timeout, which is an optional workqueue configuration which enables monitoring for work items which take longer than expected. This could be due to long running or deadlocked handlers. Signed-off-by: Bjarki Arge Andreasen <bjarki.andreasen@nordicsemi.no>
417e7ab
to
90c81b4
Compare
The following west manifest projects have changed revision in this Pull Request:
⛔ DNM label due to: 1 project with PR revision Note: This message is automatically posted and updated by the Manifest GitHub Action. |
Addressed some nits and extended zephyr-lang-rust to init the new work_timeout_ms parameter :) See zephyrproject-rtos/zephyr-lang-rust#86 |
Add workqueue work timeout test suite. Signed-off-by: Bjarki Arge Andreasen <bjarki.andreasen@nordicsemi.no>
90c81b4
to
bdc13c7
Compare
This is going to be a bit challenging to get through with the needed rust change, since requiring a lock-step commit to support an API change like this doesn't seem very practical. I think the Rust change needs to go through, with an understanding of the conditional (and probably the field itself conditionally enabled), so that the Rust module will support all combinations of this change not being present, and it being enabled/disabled. |
Its not an issue, I expanded in a comment in the Rust module PR. No conditional is needed. |
Introduce work timeout, which is an optional workqueue configuration which enables monitoring for work items which take longer than expected. This could be due to long running or deadlocked handlers.
This is a far more permissive alternative to #87522. It allows blocking items, as long as they don't take longer than specified. Feature works on a per workqueue basis. If the workqueue is blocked, an ERR log will be printed and the work queue thread will be aborted.
Example output from test suite
if the aborted thread is the essential system workqueue thread, kernel explodes.