[jobs] When cancelling 2000+ jobs at once, CPU usage is too high #4649

cg505 · 2025-02-04T21:45:05Z

If 2500-3000 managed jobs are running and they all finish or get cancelled at once, some of them will hit FAILED_CONTROLLER.
Based on testing, the CPU appears to be pinned, which causes downstream failures, mostly around database reads and writes:

SQLite timeout settings assume that once the database lock is held, writes are typically very fast. However, under high CPU load the process holding the lock may struggle to get enough CPU time to finish writing the transaction and release the lock, which makes the situation worse.
It also seems possible for a long-running process to crash out of a transaction and leave the database lock held on global_user_state. We are not using safe_cursor for global_user_state, so there may be an issue here. I have not investigated much further.
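
To illustrate both failure modes, here is a minimal sketch using only the Python standard library `sqlite3` module. It is not the project's `safe_cursor` implementation; `cursor_with_cleanup`, `demo`, the `jobs` table, and the timeout values are all hypothetical names and numbers chosen for illustration. It shows a writer that fails with a lock timeout while another connection holds the write lock, and a context manager that always ends the transaction and closes the connection so the lock cannot be left behind by an exception.

```python
import contextlib
import os
import sqlite3
import tempfile


@contextlib.contextmanager
def cursor_with_cleanup(db_path, timeout_s=5.0):
    """Hypothetical helper: yield a cursor, then always end the transaction.

    Commits on success, rolls back on any exception, and closes the
    connection either way, so a failing caller cannot keep the lock.
    """
    conn = sqlite3.connect(db_path, timeout=timeout_s)
    cursor = conn.cursor()
    try:
        yield cursor
        conn.commit()
    except BaseException:
        conn.rollback()
        raise
    finally:
        cursor.close()
        conn.close()


def demo():
    # Throwaway database standing in for a state DB (assumed schema).
    db_path = os.path.join(tempfile.mkdtemp(), 'state.db')
    with cursor_with_cleanup(db_path) as cur:
        cur.execute('CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)')
        cur.execute("INSERT INTO jobs VALUES (1, 'RUNNING')")

    # Simulate a slow or stuck writer: take the write lock and hold it.
    holder = sqlite3.connect(db_path, isolation_level=None)
    holder.execute('BEGIN IMMEDIATE')
    holder.execute("UPDATE jobs SET status = 'CANCELLING' WHERE id = 1")

    # A second writer with a short busy timeout gives up while the lock is held.
    blocked = False
    try:
        with cursor_with_cleanup(db_path, timeout_s=0.1) as cur:
            cur.execute("UPDATE jobs SET status = 'CANCELLED' WHERE id = 1")
    except sqlite3.OperationalError:
        blocked = True

    # Once the holder ends its transaction, the same write succeeds.
    holder.execute('ROLLBACK')
    holder.close()
    with cursor_with_cleanup(db_path, timeout_s=0.1) as cur:
        cur.execute("UPDATE jobs SET status = 'CANCELLED' WHERE id = 1")

    with cursor_with_cleanup(db_path) as cur:
        row = cur.execute('SELECT status FROM jobs WHERE id = 1').fetchone()
    return blocked, row[0]
```

Raising the busy timeout only helps if the lock holder eventually gets scheduled and finishes, which is why the sketch pairs the timeout with guaranteed rollback and close rather than relying on the timeout alone.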

cg505 changed the title ~~When cancelling 2000+ jobs at once, CPU usage is too high~~ [jobs] When cancelling 2000+ jobs at once, CPU usage is too high Feb 4, 2025

cg505 added the gcp label Feb 7, 2025 — with Linear

cg505 self-assigned this Feb 7, 2025
