[JENKINS-41127] Prevent flyweight tasks for jobs with concurrent execution disabled from running concurrently #3562
Conversation
Reworded the changelog entry. WDYT?
@daniel-beck Thanks, I think your version is much better!
Seems reasonable to me, good catch!
I added a basic regression test for the issue, and verified locally that it fails without my fix and passes with it.
@daniel-beck @oleg-nenashev WDYT? Merge-worthy?
I am planning on merging this today towards the next weekly if there are no objections.
@dwnusbaum This might interact badly with JENKINS-45571.
Daniel's point is that if JENKINS-45571 is caused by … Without knowing more about the cause of JENKINS-45571, I don't know how best to prevent it from happening. We could merge this anyway, and if users start hitting JENKINS-45571 then we know what the proximate cause is, but we'd still need to identify the root cause and fix it. We could try a preemptive workaround here, like checking …
The problem is that we have no idea about the scope of JENKINS-45571. It might be extremely rare, or affect nearly every serious Pipeline user. @dwnusbaum If the weeklies get regression reports that can be traced back to this change, are you prepared to look into them?
@daniel-beck @dwnusbaum I am quite confident from a deep investigation that https://issues.jenkins-ci.org/browse/JENKINS-45571 is caused by the Pipeline failing to mark its AsynchronousExecution as done (…).

Edit: Now, fixing that is much harder than identifying what caused the scenario (it's quite a long chain to figure out what prevents that callback from being issued, since it's at the end of a long chain of very complex, threaded logic). But I really and honestly cannot say what the interaction with queuing will be here, unfortunately -- I have not dived into that end of things lately. Knowing the above, does it help/hurt with this particular PR?
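As context for readers, here is a minimal, self-contained toy model of the completion-callback contract being described: an executor running asynchronous work stays occupied until the work explicitly calls back, so a swallowed callback leaves the executor dangling forever. `ToyExecutor` and its methods are hypothetical illustrations, not Jenkins' actual AsynchronousExecution API.

```java
import java.util.concurrent.CompletableFuture;

// Toy model (hypothetical names) of the completion-callback contract:
// the executor is only released when someone invokes completed().
public class AsyncExecutionDemo {
    static class ToyExecutor {
        private final CompletableFuture<Void> done = new CompletableFuture<>();

        // The body returns immediately, but the executor stays occupied
        // until the asynchronous work issues the completion callback.
        void start(Runnable body) {
            body.run(); // kicks off background work and returns right away
        }

        // The callback the asynchronous work is expected to issue when it finishes.
        void completed() {
            done.complete(null);
        }

        boolean isStillOccupied() {
            return !done.isDone();
        }
    }

    public static void main(String[] args) {
        ToyExecutor executor = new ToyExecutor();
        executor.start(() -> { /* long-running pipeline work happens elsewhere */ });

        // Happy path: the work calls back and the executor is released.
        // If a bug somewhere in the long chain of threaded logic swallows
        // this call, the executor is never released -- the dangling-executor
        // symptom described above.
        executor.completed();

        System.out.println("occupied: " + executor.isStillOccupied()); // prints: occupied: false
    }
}
```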
Thanks, but I think you just got lucky here, @daniel-beck, since I was looking into this one already (please see the edit above, though).
Yes, given that the only way users have noticed it so far is by seeing lots of …

It seems for now like we should wait on jenkinsci/workflow-cps-plugin#234, because I don't think we have a good way to prevent this issue from core's side of things.
It means that we have a lot more confidence that this PR would cause some builds to be blocked forever for anyone hitting the bugs you are looking into, so I think we should definitely hold off for now.
@dwnusbaum If that's the case, then we have a general design problem with OneOffExecutor use -- they are expected to stick around for some time because they track the internal Pipeline execution and complete when it does. The fix won't change that behavior -- it will only ensure that the executor is killed when the Pipeline completes, rather than managing to dangle around due to the bug.
Ok, looking at this some more, it might not be a problem. The only difference in behavior this PR would cause with JENKINS-45571 is that if the stuck executor's work unit is in pendings at the start of Queue#maintain, it would no longer be removed by the lostPending logic. However, Executor#run calls Queue#onStartExecuting, which removes the executor's task from pendings as soon as execution actually starts, so a stuck executor's work unit should not be sitting in pendings in the first place.

Now, if the task is still running on the Executor while the executor is stuck, then the Queue will still be blocked, but it will be blocked because the task is still running, not because of anything this PR changes.

Summary: I don't think the executors stuck waiting for … would behave any differently with this change.
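A tiny sketch of the sequence just described, under the assumption above; plain strings and a `List` are hypothetical stand-ins for the real work units and Queue.pendings:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical walk-through of the lifecycle described above,
// not Jenkins' real Queue/Executor classes.
public class PendingLifecycleDemo {
    public static void main(String[] args) {
        List<String> pendings = new ArrayList<>();
        pendings.add("flyweight-task");

        // Executor#run -> Queue#onStartExecuting: the task leaves pendings
        // as soon as the executor actually starts executing it.
        pendings.remove("flyweight-task");

        // If the executor gets stuck *after* this point (the JENKINS-45571
        // scenario), its work unit is no longer in pendings, so the
        // lost-pendings logic this PR touches never sees it either way.
        System.out.println("in pendings while stuck? "
                + pendings.contains("flyweight-task")); // prints: false
    }
}
```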
@dwnusbaum If you can deploy a snapshot of this fix, and tell me what state the OneOffExecutor and the flyweight task should be in within the queue, I can test against my JENKINS-45571 reproduction case.
@svanoort Run your reproduction case, and while a …
I uploaded the snapshots as …
Looks like I'm going to have to do some extra finagling to make this work, using https://github.com/jenkinsci/jenkins-test-harness/pull/100/files after all -- since the WarExploder isn't happy about this. Got it most of the way set up, but it is going to be a Monday Thing for the remainder.
@svanoort Next Monday perhaps? 😃 |
@svanoort @dwnusbaum So looking forward to having it fixed!
@svanoort In my case, the Throttle Concurrent Builds plugin prevents new builds from running because there are OneOffExecutors that are never removed from the master node. A workaround is to run a cleanup pipeline that removes such flyweights when there are no builds of my pipeline running. The problem is there's always a build running (well, almost)... (As another workaround, I can increase the concurrent builds limit for the throttle category, but that's not an automated solution and is troubling me and my fellow Jenkins users.) Can you please tell me if it's possible to tell apart stuck flyweights of the same pipeline from the ones that run normally?
@atikhono I don't think that issue (likely the same cause as JENKINS-45571) would be fixed by this change. This change is just about keeping pipelines from running concurrently (JENKINS-41127), but before we merge we want to make sure that it won't make things worse for users like you who are hitting JENKINS-45571, because we don't have a fix for that issue yet. |
@dwnusbaum @daniel-beck No matter what I do, I cannot get the custom WAR incrementals to play nicely with the IntelliJ debugger for some reason. I've gone as far as using a custom test harness built off of jenkinsci/jenkins-test-harness#99. It seems a lot like https://issues.jenkins-ci.org/browse/JENKINS-45245
You guys have any ideas, or is this one where we either wait for ideas from @jglick or hope for the best from a review alone? |
I was able to sync up with Sam and run his reproduction for JENKINS-45571 on my machine. Thankfully, when the executor is stuck, its work item is not in the Queue at all, so the change to the lostPending logic in this PR does not apply to that scenario.
I think this is ready to be merged. Any objections? |
@dwnusbaum I'm fine with it, but am hoping @daniel-beck feels the same |
This investigation appears to adequately address my concerns, thanks for clarifying that! Otherwise I haven't had the time to look into this further, so ±0 |
@daniel-beck Ok, I will go ahead and merge towards this weekly, and if there are any regressions I will make sure to look into them. |
The CI failure appears to just be due to ci.jenkins.io running out of space.
See JENKINS-41127.
Previously, if a flyweight task was in `Queue.pendings` at the start of a call to `Queue#maintain`, then it would be removed from `pendings` as part of the `lostPending` logic, because tasks running on a `OneOffExecutor` (flyweight tasks) were not tracked, and after being removed it was not added back to the queue (which appears incorrect as previously noted, though I'm not sure exactly how it should be fixed; perhaps as simple as just adding the return value to the queue as in the other parts of `Queue#maintain`?): jenkins/core/src/main/java/hudson/model/Queue.java, line 1503 in 1a11fa3.

Because of this, if there were two builds in the queue for a pipeline with concurrent execution disabled, one still in `blocked` and the other in `pendings` at the start of `Queue#maintain`, then the pending build would be removed from the queue (but would continue executing), and the previously blocked build would no longer be blocked and would start executing concurrently.

I was only able to reproduce the bug with Pipeline, but the issue should in theory affect other job types that use flyweight executors, such as matrix projects.
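To make the mechanism above concrete, here is a minimal, self-contained sketch of the described behavior. All names (`LostPendingsDemo`, `lostPendings`, the string task IDs) are hypothetical stand-ins, not Jenkins' real `Queue` implementation; the sketch only models the idea that a pending item survives maintenance when some scanned executor is observed working on it.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model (hypothetical names) of the lostPending logic described
// above: a pending item is kept only if some executor in the scan is
// observed working on it, so skipping one-off (flyweight) executors
// causes their pending items to be dropped.
public class LostPendingsDemo {
    static List<String> lostPendings(List<String> pendings,
                                     List<String> observedWorkUnits) {
        List<String> lost = new ArrayList<>(pendings);
        lost.removeAll(observedWorkUnits); // items someone is working on are not "lost"
        return lost;
    }

    public static void main(String[] args) {
        List<String> pendings = List.of("pipeline#2");

        // Bug: the scan covers only regular executors, so the flyweight
        // executor running pipeline#2 is invisible and its pending item
        // is treated as lost (and, per the description, never re-queued).
        List<String> regularExecutorsOnly = List.of();
        System.out.println(lostPendings(pendings, regularExecutorsOnly)); // [pipeline#2]

        // Fix direction: also observe work units on one-off executors,
        // so the pending item survives and keeps the second build blocked.
        List<String> allExecutorsIncludingOneOff = List.of("pipeline#2");
        System.out.println(lostPendings(pendings, allExecutorsIncludingOneOff)); // []
    }
}
```

Under this toy model, widening the scan so that flyweight executors' work units are observed too appears to be the direction this PR takes for the real `lostPending` logic.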
I am not sure if there is a good way to create a regression test for the change, but am looking into it.
Proposed changelog entries
Bug: Some types of builds, like pipelines, would sometimes run concurrently even when that was disabled.
(Suggestions to improve the changelog entry are very welcome)
Submitter checklist
* Use the `Internal:` prefix if the change has no user-visible impact (API, test frameworks, etc.)

Desired reviewers
@reviewbybees @jennbriden