Skip to content

kernel: workq: introduce work timeout: #88345

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

bjarki-andreasen
Copy link
Collaborator

@bjarki-andreasen bjarki-andreasen commented Apr 9, 2025

Introduce work timeout, which is an optional workqueue configuration which enables monitoring for work items which take longer than expected. This could be due to long running or deadlocked handlers.

This is a far more permissive alternative to #87522. It allows blocking items, as long as they don't take longer than specified. Feature works on a per workqueue basis. If the workqueue is blocked, an ERR log will be printed and the work queue thread will be aborted.

Example output from test suite

[00:00:01.010,000] <wrn> os: queue sysworkq blocked by work 0x805a0e0 with handler 0x80494a6

if the aborted thread is the essential system workqueue thread, kernel explodes.

@bjarki-andreasen
Copy link
Collaborator Author

added option to exclude the timeout entirely so the feature has zero overhead if not used.

Copy link
Collaborator

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nitpicks, but this seems unobjectionable

kernel/Kconfig Outdated
@@ -600,6 +603,13 @@ config SYSTEM_WORKQUEUE_NO_YIELD
cooperative and a sequence of work items is expected to complete
without yielding.

config SYSTEM_WORKQUEUE_WORK_TIMEOUT_MS
int "Select system work queue work timeout in milliseconds"
default 10000 if DEBUG
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CONFIG_DEBUG is a control over compiler optimizations, it's the wrong tunable to use here. You probably want CONFIG_ASSERT to gate the default.

Copy link
Collaborator Author

@bjarki-andreasen bjarki-andreasen Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed to ASSERT and decreased the time to 5000ms (still far longer than I think is reasonable but hopefully users will adjust it lower)

kernel/work.c Outdated
bool k_sys_work_queue_is_blocked(void)
{
return flag_test(&k_sys_work_q.flags, K_WORK_QUEUE_BLOCKED_BIT);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really seeing the point of this as an application API? I mean, you're not supposed to be blocked at all. One doesn't normally write is_my_code_broken() predicates, nor call them in working code.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for use in tests, can move it to an internal header :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we use k_oops() on block this API can be removed entirely

Copy link
Collaborator Author

@bjarki-andreasen bjarki-andreasen Apr 10, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

API removed, thread is aborted if blocked, so a user who cares could join or check if the work queue thread is running :)

kernel/work.c Outdated
if (name != NULL) {
LOG_WRN("queue %s blocked by work %p with handler %p", name, work, handler);
} else {
LOG_WRN("queue %p blocked by work %p with handler %p", queue, work, handler);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like the k_oops() from the last PR might be a better choice, though maybe configurable. A mere warning message isn't likely loud enough for something we almost all agree is a should-never-happen condition.

Alternatively make the callback be settable by the app and have the default blow up, but let apps do what they want?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will add k_oops() (or panic?), this will also make the code a tiny but simpler since there is no unblock scenario :)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed strategy a bit, since the timeout is run from a _timeout, k_oops() makes no sense since it is not run from the same thread as the work queue, so instead, abort the work queue and let the kernel handle it (essential thread would result in k_panic())

@bjarki-andreasen bjarki-andreasen force-pushed the workq-work-timeout branch 2 times, most recently from f76fe37 to b575ee5 Compare April 10, 2025 07:32
@cfriedt
Copy link
Member

cfriedt commented Apr 10, 2025

Some more nitpicks - sorry - otherwise looking good

@cfriedt
Copy link
Member

cfriedt commented Apr 10, 2025

Also, would be really good to add a test for this!

andyross
andyross previously approved these changes Apr 10, 2025
Copy link
Collaborator

@andyross andyross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm out of nitpicks, this looks very reasonable

cfriedt
cfriedt previously approved these changes Apr 10, 2025
Copy link
Member

@cfriedt cfriedt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lingering nits, but I guess I had better not block

Comment on lines +620 to +624
if (name != NULL) {
LOG_ERR("queue %s blocked by work %p with handler %p", name, work, handler);
} else {
LOG_ERR("queue %p blocked by work %p with handler %p", queue, work, handler);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Still not crazy about this, as it needlessly duplicates a nearly identical string in the string table.

Copy link
Collaborator Author

@bjarki-andreasen bjarki-andreasen Apr 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you solve it? the string is about 30 bytes, and stored in ROM, the added operations of conditionally copying the thread name or a queue pointer, to a RAM buffer, to then pass to the LOG_ERR could easily take up the same or more ROM (while also adding complexity, the worst kind, involving null terminated strings)

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(LOG_ERR is probably adding code as well given the use of VARGS)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How would you solve it?

Using the code suggestion here:
#88345 (comment)

Introduce work timeout, which is an optional workqueue configuration
which enables monitoring for work items which take longer than
expected. This could be due to long running or deadlocked handlers.

Signed-off-by: Bjarki Arge Andreasen <bjarki.andreasen@nordicsemi.no>
@bjarki-andreasen bjarki-andreasen dismissed stale reviews from cfriedt and andyross via 90c81b4 April 11, 2025 07:14
Copy link

github-actions bot commented Apr 11, 2025

The following west manifest projects have changed revision in this Pull Request:

Name Old Revision New Revision Diff
zephyr-lang-rust zephyrproject-rtos/zephyr-lang-rust@d4f9036 (v4.1-branch) zephyrproject-rtos/zephyr-lang-rust#86 zephyrproject-rtos/zephyr-lang-rust#86/files

DNM label due to: 1 project with PR revision

Note: This message is automatically posted and updated by the Manifest GitHub Action.

@bjarki-andreasen
Copy link
Collaborator Author

bjarki-andreasen commented Apr 11, 2025

Addressed some nits and extended zephyr-lang-rust to init the new work_timeout_ms parameter :) See zephyrproject-rtos/zephyr-lang-rust#86

@github-actions github-actions bot added manifest manifest-zephyr-lang-rust DNM (manifest) This PR should not be merged (controlled by action-manifest) labels Apr 11, 2025
Add workqueue work timeout test suite.

Signed-off-by: Bjarki Arge Andreasen <bjarki.andreasen@nordicsemi.no>
@d3zd3z
Copy link
Collaborator

d3zd3z commented Apr 30, 2025

This is going to be a bit challenging to get through with the needed rust change, since requiring a lock-step commit to support an API change like this doesn't seem very practical.

I think the Rust change needs to go through, with an understanding of the conditional (and probably the field itself conditionally enabled), so that the Rust module will support all combinations of this change not being present, and it being enabled/disabled.

@bjarki-andreasen
Copy link
Collaborator Author

bjarki-andreasen commented May 1, 2025

This is going to be a bit challenging to get through with the needed rust change, since requiring a lock-step commit to support an API change like this doesn't seem very practical.

I think the Rust change needs to go through, with an understanding of the conditional (and probably the field itself conditionally enabled), so that the Rust module will support all combinations of this change not being present, and it being enabled/disabled.

Its not an issue, I expanded in a comment in the Rust module PR. No conditional is needed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: Kernel DNM (manifest) This PR should not be merged (controlled by action-manifest) manifest manifest-zephyr-lang-rust
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants