[jobs] When cancelling 2000+ jobs at once, CPU usage is too high #4649

Open
cg505 opened this issue Feb 4, 2025 · 0 comments

cg505 (Collaborator) commented Feb 4, 2025

If 2500-3000 managed jobs are running and they all finish or get cancelled at once, some will hit FAILED_CONTROLLER.
Based on testing, the CPU appears to be pinned, which causes downstream failures, mostly around database reads and writes:

  • SQLite timeout settings assume that once the database lock is held, writes are typically very fast. However, under high CPU load the process holding the lock may struggle to get enough CPU time to write the transaction and release the lock, which makes the situation worse.
  • It seems possible for a long-running process to crash out of a transaction and keep holding the database lock on global_user_state. We are not using safe_cursor for global_user_state, so there may be an issue here. I have not investigated much further.
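
To make the two failure modes above concrete, here is a minimal sketch using only Python's standard `sqlite3` module. It is not SkyPilot's actual code: `safe_cursor_sketch`, the `jobs` table, and the timeout values are all hypothetical names and numbers chosen for illustration. It shows (1) a writer timing out because another connection holds the write lock inside an open transaction, and (2) a context manager that always rolls back and closes the connection, so a crash mid-transaction does not leave the lock held.

```python
import contextlib
import sqlite3


@contextlib.contextmanager
def safe_cursor_sketch(db_path, timeout=10.0):
    """Hypothetical helper: commit on success, roll back on any error,
    and always close the connection so the lock is released."""
    conn = sqlite3.connect(db_path, timeout=timeout)
    cursor = conn.cursor()
    try:
        yield cursor
        conn.commit()
    except BaseException:
        conn.rollback()
        raise
    finally:
        cursor.close()
        conn.close()


def demo(db_path):
    # Hypothetical table, only for this illustration.
    with safe_cursor_sketch(db_path) as cur:
        cur.execute(
            "CREATE TABLE IF NOT EXISTS jobs "
            "(id INTEGER PRIMARY KEY, status TEXT)")

    # Failure mode 1: a connection starts a write and never finishes it.
    # The INSERT opens an implicit transaction that holds the write lock.
    leaky = sqlite3.connect(db_path, timeout=0.1)
    leaky.execute("INSERT INTO jobs (status) VALUES ('CANCELLING')")

    blocked = False
    try:
        # A second writer waits up to `timeout` seconds, then gives up.
        with safe_cursor_sketch(db_path, timeout=0.1) as cur:
            cur.execute("INSERT INTO jobs (status) VALUES ('CANCELLED')")
    except sqlite3.OperationalError:  # "database is locked"
        blocked = True
    finally:
        leaky.rollback()
        leaky.close()

    # Failure mode 2, handled: the writer crashes mid-transaction, but the
    # context manager rolls back and closes, so the lock is released.
    try:
        with safe_cursor_sketch(db_path) as cur:
            cur.execute("INSERT INTO jobs (status) VALUES ('CANCELLING')")
            raise RuntimeError("simulated crash")
    except RuntimeError:
        pass

    # The next writer succeeds even with a short timeout.
    with safe_cursor_sketch(db_path, timeout=0.1) as cur:
        cur.execute("INSERT INTO jobs (status) VALUES ('CANCELLED')")

    with safe_cursor_sketch(db_path) as cur:
        cur.execute("SELECT status FROM jobs")
        rows = cur.fetchall()
    return blocked, rows
```

Note that a longer timeout only helps if the lock holder actually makes progress. When the CPU is pinned, the holder may not be scheduled often enough to finish within any reasonable timeout, so reducing the number of concurrent writers is likely the more effective fix.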
cg505 changed the title from "When cancelling 2000+ jobs at once, CPU usage is too high" to "[jobs] When cancelling 2000+ jobs at once, CPU usage is too high" Feb 4, 2025
cg505 added the gcp label Feb 7, 2025 (with Linear)
cg505 self-assigned this Feb 7, 2025