If 2500-3000 managed jobs are running and they all finish or get cancelled at once, some will hit FAILED_CONTROLLER.
Based on testing, the CPU appears to be pinned, which causes downstream failures, mostly around database reads and writes:
SQLite timeout settings assume that once the database lock is held, writes are typically very fast. However, under high CPU load the process holding the lock may struggle to get enough CPU time to write the transaction and release the lock, which makes the contention worse.
It seems possible for a long-running process to crash out of a transaction while still holding the database lock on global_user_state. We are not using safe_cursor for global_user_state, so there may be an issue here. I have not investigated much further.
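To make the two failure modes above concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is not the project's code: `safe_cursor_sketch`, the `jobs` table, and the status values are hypothetical names made up for illustration, and the helper is only an approximation of what a "safe cursor" pattern might look like. The sketch shows (1) a second writer timing out with "database is locked" while another connection holds the write lock, and (2) a transaction that fails partway being rolled back and its connection closed so the lock is released.

```python
import contextlib
import os
import sqlite3
import tempfile


@contextlib.contextmanager
def safe_cursor_sketch(db_path, timeout=10.0):
    """Hypothetical helper: commit on success, roll back on error, always close."""
    # `timeout` is how long sqlite3 waits for a lock before raising.
    conn = sqlite3.connect(db_path, timeout=timeout)
    cursor = conn.cursor()
    try:
        yield cursor
        conn.commit()
    except BaseException:
        # Roll back so a failed transaction does not keep the write lock.
        conn.rollback()
        raise
    finally:
        cursor.close()
        conn.close()


def read_status(db_path):
    with safe_cursor_sketch(db_path) as cur:
        return cur.execute('SELECT status FROM jobs WHERE id = 1').fetchone()[0]


def demo():
    db_path = os.path.join(tempfile.mkdtemp(), 'state.db')
    with safe_cursor_sketch(db_path) as cur:
        cur.execute('CREATE TABLE jobs (id INTEGER PRIMARY KEY, status TEXT)')
        cur.execute("INSERT INTO jobs VALUES (1, 'RUNNING')")

    # Simulate a slow writer: it takes the write lock and does not commit.
    holder = sqlite3.connect(db_path, isolation_level=None)
    holder.execute('BEGIN IMMEDIATE')
    holder.execute("UPDATE jobs SET status = 'CANCELLING' WHERE id = 1")

    # A second writer with a short timeout gives up while the lock is held.
    blocked = False
    try:
        with safe_cursor_sketch(db_path, timeout=0.2) as cur:
            cur.execute("UPDATE jobs SET status = 'CANCELLED' WHERE id = 1")
    except sqlite3.OperationalError:
        blocked = True

    holder.execute('ROLLBACK')
    holder.close()

    # A transaction that fails partway is rolled back by the helper.
    try:
        with safe_cursor_sketch(db_path) as cur:
            cur.execute("UPDATE jobs SET status = 'FAILED' WHERE id = 1")
            raise RuntimeError('simulated crash mid-transaction')
    except RuntimeError:
        pass
    status_after_crash = read_status(db_path)

    # The lock was released, so a later writer succeeds.
    with safe_cursor_sketch(db_path) as cur:
        cur.execute("UPDATE jobs SET status = 'SUCCEEDED' WHERE id = 1")
    return blocked, status_after_crash, read_status(db_path)


if __name__ == '__main__':
    print(demo())
```

A longer `timeout` only helps if the lock holder actually makes progress. When the holder is starved of CPU, waiting writers can still hit the timeout, which is consistent with the behavior described above.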
cg505 changed the title from "When cancelling 2000+ jobs at once, CPU usage is too high" to "[jobs] When cancelling 2000+ jobs at once, CPU usage is too high" on Feb 4, 2025