kernel: workq: introduce work timeout: #88345

bjarki-andreasen · 2025-04-09T08:46:16Z

Introduce work timeout, an optional workqueue configuration that monitors for work items that take longer than expected, for example because of long-running or deadlocked handlers.

This is a far more permissive alternative to #87522. It allows blocking items, as long as they don't take longer than specified. The feature works on a per-workqueue basis. If the workqueue is blocked, an ERR log is printed and the work queue thread is aborted.

Example output from test suite

[00:00:01.010,000] <wrn> os: queue sysworkq blocked by work 0x805a0e0 with handler 0x80494a6

If the aborted thread is the essential system workqueue thread, the kernel panics.

bjarki-andreasen · 2025-04-09T09:21:39Z

Added an option to exclude the timeout entirely, so the feature has zero overhead if not used.

andyross

Nitpicks, but this seems unobjectionable

andyross · 2025-04-09T18:20:51Z

kernel/Kconfig

@@ -600,6 +603,13 @@ config SYSTEM_WORKQUEUE_NO_YIELD
 	  cooperative and a sequence of work items is expected to complete
 	  without yielding.

+config SYSTEM_WORKQUEUE_WORK_TIMEOUT_MS
+	int "Select system work queue work timeout in milliseconds"
+	default 10000 if DEBUG


CONFIG_DEBUG controls compiler optimizations; it's the wrong tunable to use here. You probably want CONFIG_ASSERT to gate the default.

bjarki-andreasen · 2025-04-10

Changed to ASSERT and decreased the time to 5000 ms (still far longer than I think is reasonable, but hopefully users will adjust it lower).

andyross · 2025-04-09T18:23:46Z

kernel/work.c

+bool k_sys_work_queue_is_blocked(void)
+{
+	return flag_test(&k_sys_work_q.flags, K_WORK_QUEUE_BLOCKED_BIT);
+}


Not really seeing the point of this as an application API? I mean, you're not supposed to be blocked at all. One doesn't normally write is_my_code_broken() predicates, nor call them in working code.

bjarki-andreasen · 2025-04-10

This is for use in tests; I can move it to an internal header :)

If we use k_oops() on block, this API can be removed entirely.

API removed. The thread is aborted if blocked, so a user who cares could join the work queue thread or check whether it is still running :)

andyross · 2025-04-09T18:25:20Z

kernel/work.c

+	if (name != NULL) {
+		LOG_WRN("queue %s blocked by work %p with handler %p", name, work, handler);
+	} else {
+		LOG_WRN("queue %p blocked by work %p with handler %p", queue, work, handler);


Seems like the k_oops() from the last PR might be a better choice, though maybe configurable. A mere warning message isn't likely loud enough for something we almost all agree is a should-never-happen condition.

Alternatively make the callback be settable by the app and have the default blow up, but let apps do what they want?

bjarki-andreasen · 2025-04-10

I will add k_oops() (or panic?). This will also make the code a tiny bit simpler, since there is no unblock scenario :)

Changed strategy a bit. The timeout handler runs from a _timeout, not from the work queue's own thread, so k_oops() makes no sense there. Instead, the work queue thread is aborted and the kernel handles the rest (aborting an essential thread results in k_panic()).
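For readers following the thread, the strategy above can be modeled on a host machine. This is a rough sketch only, not the code in this PR and not Zephyr API: `struct model_queue`, `model_work_begin()`, `model_work_end()` and `model_timeout_check()` are hypothetical names, and polling a millisecond counter stands in for the kernel `_timeout` firing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical host-side model of a work queue with a work timeout. */
struct model_queue {
	uint32_t work_timeout_ms; /* 0 means monitoring is disabled */
	bool essential;           /* models an essential thread */
	bool armed;               /* a handler is currently being monitored */
	uint32_t started_ms;      /* time the running handler started */
	bool aborted;             /* models the queue thread being aborted */
	bool panicked;            /* models the kernel panicking */
};

/* Called when the queue starts running a work item: arm the timeout. */
static void model_work_begin(struct model_queue *q, uint32_t now_ms)
{
	assert(q != NULL);
	q->started_ms = now_ms;
	q->armed = (q->work_timeout_ms != 0U);
}

/* Called when the handler returns in time: disarm the timeout. */
static void model_work_end(struct model_queue *q)
{
	assert(q != NULL);
	q->armed = false;
}

/*
 * Stands in for the timeout expiry. It runs outside the queue's own
 * thread, so it cannot "oops" that thread; it aborts the queue instead.
 */
static void model_timeout_check(struct model_queue *q, uint32_t now_ms)
{
	assert(q != NULL);
	if (!q->armed || q->aborted) {
		return;
	}
	/* Unsigned subtraction stays correct across counter wraparound. */
	if ((uint32_t)(now_ms - q->started_ms) >= q->work_timeout_ms) {
		q->armed = false;
		q->aborted = true;
		/* Aborting an essential thread is modeled as a panic. */
		q->panicked = q->essential;
	}
}
```

In this model a handler that returns before the deadline never trips the check, and a timeout of 0 disables monitoring entirely, which mirrors the "zero overhead if not used" intent only loosely.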

cfriedt · 2025-04-10T10:15:34Z

Some more nitpicks - sorry - otherwise looking good

cfriedt · 2025-04-10T11:59:13Z

Also, would be really good to add a test for this!

andyross

I'm out of nitpicks, this looks very reasonable

cfriedt

Lingering nits, but I guess I had better not block

cfriedt · 2025-04-10T14:00:00Z

kernel/work.c

+	if (name != NULL) {
+		LOG_ERR("queue %s blocked by work %p with handler %p", name, work, handler);
+	} else {
+		LOG_ERR("queue %p blocked by work %p with handler %p", queue, work, handler);
+	}


Still not crazy about this, as it needlessly duplicates a nearly identical string in the string table.

bjarki-andreasen · 2025-04-11

How would you solve it? The string is about 30 bytes and is stored in ROM. The added operations of conditionally copying the thread name or a queue pointer to a RAM buffer, and then passing that buffer to LOG_ERR, could easily take up the same or more ROM (while also adding complexity of the worst kind, involving null-terminated strings).

(LOG_ERR is probably adding code as well given the use of VARGS)
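To make the trade-off concrete, here is a rough host-side sketch of the buffer-based alternative described above. It is not the code suggestion linked in this thread and not the PR's code: `queue_label()` and `format_blocked()` are hypothetical names, and plain `snprintf()` stands in for `LOG_ERR`.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: returns the queue name when there is one, otherwise
 * formats the queue address into the caller's buffer and returns that.
 */
static const char *queue_label(const char *name, void *queue,
			       char *buf, size_t len)
{
	assert(buf != NULL && len > 0U);
	if (name != NULL) {
		return name;
	}
	(void)snprintf(buf, len, "%p", queue);
	return buf;
}

/* Formats the message with a single format string for both cases. */
static int format_blocked(char *out, size_t len, const char *name,
			  void *queue, void *work, void *handler)
{
	char label[32]; /* scratch space for the formatted queue address */

	return snprintf(out, len, "queue %s blocked by work %p with handler %p",
			queue_label(name, queue, label, sizeof(label)),
			work, handler);
}
```

Only one format string ends up in the string table, but the sketch pays for it with a stack buffer and an extra formatting call. Whether that is a net ROM saving depends on the toolchain and logging backend, which is the concern raised above.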

cfriedt · 2025-04-15

> How would you solve it?

Using the code suggestion here: #88345 (comment)

Introduce work timeout, which is an optional workqueue configuration which enables monitoring for work items which take longer than expected. This could be due to long running or deadlocked handlers. Signed-off-by: Bjarki Arge Andreasen <bjarki.andreasen@nordicsemi.no>

github-actions · 2025-04-11T07:15:50Z

The following west manifest projects have changed revision in this Pull Request:

| Name | Old Revision | New Revision | Diff |
| --- | --- | --- | --- |
| zephyr-lang-rust | zephyrproject-rtos/zephyr-lang-rust@`d4f9036` (`v4.1-branch`) | zephyrproject-rtos/zephyr-lang-rust#86 | zephyrproject-rtos/zephyr-lang-rust#86/files |

⛔ DNM label due to: 1 project with PR revision

Note: This message is automatically posted and updated by the Manifest GitHub Action.

bjarki-andreasen · 2025-04-11T07:15:52Z

Addressed some nits and extended zephyr-lang-rust to init the new work_timeout_ms parameter :) See zephyrproject-rtos/zephyr-lang-rust#86

Add workqueue work timeout test suite. Signed-off-by: Bjarki Arge Andreasen <bjarki.andreasen@nordicsemi.no>

d3zd3z · 2025-04-30T19:41:13Z

This is going to be a bit challenging to get through with the needed rust change, since requiring a lock-step commit to support an API change like this doesn't seem very practical.

I think the Rust change needs to go through, with an understanding of the conditional (and probably the field itself conditionally enabled), so that the Rust module will support all combinations of this change not being present, and it being enabled/disabled.

bjarki-andreasen · 2025-05-01T07:41:23Z

> This is going to be a bit challenging to get through with the needed rust change, since requiring a lock-step commit to support an API change like this doesn't seem very practical.
>
> I think the Rust change needs to go through, with an understanding of the conditional (and probably the field itself conditionally enabled), so that the Rust module will support all combinations of this change not being present, and it being enabled/disabled.

It's not an issue; I expanded on this in a comment in the Rust module PR. No conditional is needed.

bjarki-andreasen mentioned this pull request Apr 9, 2025

System workqueue: Prevent blocking API calls #87522

Closed

pdgendt reviewed Apr 9, 2025

bjarki-andreasen force-pushed the workq-work-timeout branch from 7a199dc to ceb645c Compare April 9, 2025 09:20

bjarki-andreasen force-pushed the workq-work-timeout branch from ceb645c to 1e9f16f Compare April 9, 2025 13:36

bjarki-andreasen marked this pull request as ready for review April 9, 2025 13:46

github-actions bot added the area: Kernel label Apr 9, 2025

github-actions bot requested review from andyross, ceolin, cfriedt, dcpleung, nashif, npitre, peter-mitsis and TaiJuWu April 9, 2025 13:47

github-actions bot assigned andyross and peter-mitsis Apr 9, 2025

andyross reviewed Apr 9, 2025

bjarki-andreasen force-pushed the workq-work-timeout branch 2 times, most recently from f76fe37 to b575ee5 Compare April 10, 2025 07:32

cfriedt reviewed Apr 10, 2025

cfriedt reviewed Apr 10, 2025

bjarki-andreasen force-pushed the workq-work-timeout branch from b575ee5 to 417e7ab Compare April 10, 2025 13:19

andyross previously approved these changes Apr 10, 2025

cfriedt previously approved these changes Apr 10, 2025

bjarki-andreasen dismissed stale reviews from cfriedt and andyross via 90c81b4 April 11, 2025 07:14

bjarki-andreasen force-pushed the workq-work-timeout branch from 417e7ab to 90c81b4 Compare April 11, 2025 07:14

github-actions bot added manifest manifest-zephyr-lang-rust DNM (manifest) This PR should not be merged (controlled by action-manifest) labels Apr 11, 2025

tests: kernel: workq: add work_timeout test suite

bdc13c7

bjarki-andreasen force-pushed the workq-work-timeout branch from 90c81b4 to bdc13c7 Compare April 11, 2025 09:57

teburd approved these changes May 2, 2025
