
[JENKINS-41127] Prevent flyweight tasks for jobs with concurrent execution disabled from running concurrently #3562

Merged: 2 commits, Aug 4, 2018

Conversation

@dwnusbaum (Member) commented Jul 24, 2018

See JENKINS-41127.

Previously, if a flyweight task was in Queue.pendings at the start of a call to Queue#maintain, it would be removed from pendings as part of the lostPendings logic, because tasks running on a OneOffExecutor (i.e. flyweight tasks) were not tracked. After being removed, the task is not added back to the queue, which appears incorrect as previously noted, though I'm not sure exactly how it should be fixed (is it as simple as adding the return value to the queue, as in the other parts of Queue#maintain?):

makeBuildable(p); // TODO whatever this is for, the return value is being ignored, so this does nothing at all

Because of this, if there were two builds in the queue for a pipeline with concurrent execution disabled, one still blocked and the other in pendings at the start of Queue#maintain, then the pending build would be removed from the queue (but would continue executing), and the previously blocked build would no longer be blocked and would start executing concurrently.
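
For context, the bookkeeping in question looks roughly like this (a heavily simplified sketch rather than the actual Queue#maintain source; the loop over Computer#getOneOffExecutors illustrates how this PR starts tracking flyweight tasks):

// Heavily simplified sketch of the lostPendings pass (hypothetical shape, not the real source)
List<BuildableItem> lostPendings = new ArrayList<>(pendings);
for (Computer c : Jenkins.get().getComputers()) {
    for (Executor e : c.getExecutors()) {              // previously: only regular executors were checked
        WorkUnit wu = e.getCurrentWorkUnit();
        if (wu != null) {
            lostPendings.remove(wu.context.item);      // still running somewhere, so not "lost"
        }
    }
    for (OneOffExecutor e : c.getOneOffExecutors()) {  // with this PR: flyweight tasks are tracked too
        WorkUnit wu = e.getCurrentWorkUnit();
        if (wu != null) {
            lostPendings.remove(wu.context.item);
        }
    }
}
for (BuildableItem p : lostPendings) {                 // anything left here is treated as lost...
    pendings.remove(p);                                // ...and silently dropped from the queue
    makeBuildable(p);                                  // return value ignored, as noted above
}

With the flyweight executor's current work unit accounted for, the pending item is no longer treated as lost, so the blocked sibling build stays blocked as intended.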

I was only able to reproduce the bug with Pipeline, but the issue should in theory affect other job types that use flyweight executors such as matrix projects.

I am not sure if there is a good way to create a regression test for the change, but am looking into it.

Proposed changelog entries

  • Bug: Some types of builds, like pipelines, would sometimes run concurrently even when that was disabled.

    (Suggestions to improve the changelog entry are very welcome)

Submitter checklist

  • JIRA issue is well described
  • Changelog entry appropriate for the audience affected by the change (users or developers, depending on the change). Examples
    * Use the Internal: prefix if the change has no user-visible impact (API, test frameworks, etc.)
  • Appropriate autotests, or an explanation as to why this change has no tests
  • For dependency updates: links to external changelogs and, if possible, full diffs

Desired reviewers

@reviewbybees @jennbriden

@dwnusbaum requested review from svanoort and abayer July 24, 2018 22:05
@daniel-beck (Member) commented Jul 24, 2018

Reworded the changelog entry. WDYT?

@dwnusbaum (Member, Author)

@daniel-beck Thanks, I think your version is much better!

@svanoort (Member) left a comment

Seems reasonable to me, good catch!

@dwnusbaum (Member, Author)

I added a basic regression test for the issue, and verified locally that it fails without my fix and passes with it.

@svanoort (Member)

@daniel-beck @oleg-nenashev WDYT? Merge-worthy?

@dwnusbaum (Member, Author)

I am planning on merging this today towards the next weekly if there are no objections.

@daniel-beck (Member)

@dwnusbaum This might interact badly with JENKINS-45571.

@dwnusbaum (Member, Author)

Daniel's point is that if JENKINS-45571 is caused by OneOffExecutors not being cleaned up correctly for some reason, then this change may cause new builds of flyweight tasks to be blocked indefinitely waiting for those stuck executors, when previously the stuck executors would just have been ignored.

Without knowing more about the cause of JENKINS-45571, I don't know how best to prevent it from happening. We could merge this anyway, and if users start hitting JENKINS-45571 then we will know the proximate cause, but we'd still need to identify the root cause and fix it. We could try a preemptive workaround here, like checking Executor#isLikelyStuck for OneOffExecutors before including them in the lostPendings logic, but that feels seriously hacky. Any thoughts?
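
For concreteness, that workaround would amount to something like the following inside the lostPendings loop sketched in the description (an illustration of the idea only, not a proposal):

for (OneOffExecutor e : c.getOneOffExecutors()) {
    if (e.isLikelyStuck()) {
        continue;  // treat a likely-stuck flyweight executor as if it were not running anything,
                   // preserving today's behavior for installations hitting JENKINS-45571
    }
    WorkUnit wu = e.getCurrentWorkUnit();
    if (wu != null) {
        lostPendings.remove(wu.context.item);
    }
}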

@daniel-beck (Member)

The problem is that we have no idea about the scope of JENKINS-45571. It might be extremely rare, or affect nearly every serious Pipeline user. @dwnusbaum If the weeklies get regression reports that can be traced back to this change, are you prepared to look into them?

@daniel-beck requested a review from svanoort July 27, 2018 20:03
@svanoort (Member) commented Jul 27, 2018

@daniel-beck @dwnusbaum I am quite confident from a deep investigation that https://issues.jenkins-ci.org/browse/JENKINS-45571 is caused by the Pipeline failing to mark its AsynchronousExecution as done (AsynchronousExecution#completed), which means the OneOffExecutor is never removed.

Edit: Now, fixing that is much harder than identifying what caused the scenario (figuring out what prevents that callback from being issued is tricky, since it sits at the end of a long chain of very complex, threaded logic).

But I really and honestly cannot say what the interaction with Queuing will be here unfortunately -- have not dived into that end of things lately. Knowing the above, does it help/hurt with this particular PR?

@daniel-beck (Member)

[screenshot]

Responsiveness trophy! 🏆

@svanoort (Member)

Thanks, but I think you just got lucky here @daniel-beck since I was looking into this one already (please see the edit above though).

@dwnusbaum (Member, Author)

> The problem is that we have no idea about the scope of JENKINS-45571. It might be extremely rare, or affect nearly every serious Pipeline user.

Yes, given that the only way users have noticed it so far is by seeing lots of likelyStuck: true executors in the API, which I expect very few users to ever look at, it seems like it might be affecting quite a few users who just don't realize it.

It seems for now like we should wait on jenkinsci/workflow-cps-plugin#234, because I don't think we have a good way to prevent this issue from core's side of things.

@dwnusbaum added the on-hold label ("This pull request depends on another event/release, and it cannot be merged right now") Jul 27, 2018
@dwnusbaum (Member, Author)

> But I really and honestly cannot say what the interaction with Queuing will be here unfortunately -- have not dived into that end of things lately. Knowing the above, does it help/hurt with this particular PR?

It means that we have a lot more confidence that this PR would cause some builds to be blocked forever for anyone hitting the bugs you are looking into, so I think we should definitely hold off for now.

@svanoort (Member)

@dwnusbaum If that's the case, then we have a general design problem with OneOffExecutor use -- they are expected to stick around for some time, because they track the internal Pipeline execution and complete when it does. The fix won't change that behavior -- it only ensures that the executor is killed when the Pipeline completes, rather than dangling around due to the bug.

@dwnusbaum (Member, Author) commented Jul 27, 2018

Ok, looking at this some more, it might not be a problem. The only difference in behavior this PR would cause with JENKINS-45571 is that if the stuck executor's work unit is in Queue.pendings, then other builds of the same pipeline would be blocked due to this change (if concurrent builds are disabled). The reason is that, because the executor for the item is now looped over, the item gets removed from lostPendings correctly and so is no longer removed from Queue.pendings.

Executor#run calls Queue#onStartExecuting, which removes the executor's task from Queue.pendings long before it calls Queue#execute, so the executor's work unit should be removed from Queue.pendings before it even starts executing. Whether or not Executor#completedAsynchronous is called to clean up the executor shouldn't matter.

Now, if the task is still running on the Executor while the executor is stuck, then the Queue will still be blocked, but it will be blocked because of WorkflowJob#getCauseOfBlockage, and this would happen with or without my patch.

Summary: I don't think the executors stuck waiting for Executor#completedAsynchronous will ever have a work item that is still in Queue.pendings, so I don't think this PR will interact badly with JENKINS-45571.

@svanoort (Member)

@dwnusbaum If you can deploy a snapshot of this fix, and tell me what state the OneOffExecutor and the flyweight task should be in within the queue, I can test against my JENKINS-45571 reproduction case.

@dwnusbaum (Member, Author) commented Jul 27, 2018

@svanoort Run your reproduction case, and while a OneOffExecutor is stuck, check that the task on that executor is not in Queue#getPendingItems. You should be able to just add assertNull(Queue.getItem(Executor.getCurrentWorkUnit().context.task)) to your test, since the task shouldn't be inside any of the Queue's lists at that point, but if that fails, you can split it into the following to figure out exactly where it is:

// assuming `executor` is the stuck OneOffExecutor and `queue` is Jenkins.get().getQueue()
Queue.BuildableItem item = executor.getCurrentWorkUnit().context.item;
assertThat(queue.getPendingItems(), not(hasItem(item)));
assertThat(queue.getBuildableItems(), not(hasItem(item)));
assertThat(queue.getUnblockedItems(), not(hasItem(item)));
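
(hasItem checks membership anywhere in the collection, whereas Hamcrest's contains matches only a collection consisting of exactly the given elements in order, so not(hasItem(...)) is the assertion we want here.)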

I uploaded the snapshots as jenkins-core:2.135-20180727.210759-1 and jenkins-war:2.135-20180727.210819-1.

@svanoort (Member)

Looks like I'm going to have to do some extra finagling to make this work, using https://github.com/jenkinsci/jenkins-test-harness/pull/100/files after all -- since the WarExploder isn't happy about this.

Got it most of the way set up, but it's going to be a Monday Thing for the remainder.

@daniel-beck (Member)

@svanoort Next Monday perhaps? 😃

@atikhono commented Aug 3, 2018

@svanoort @dwnusbaum So looking forward to having this fixed!

@atikhono commented Aug 3, 2018

@svanoort In my case, the Throttle Concurrent Builds plugin prevents new builds from running because there are OneOffExecutors that are never removed from the master node. A workaround is to run a cleanup pipeline that removes such flyweights when there are no builds of my pipeline running. The problem is that there's always a build running (well, almost)... (As another workaround, I can increase the concurrent builds limit for the throttle category, but that's not an automated solution and is troubling me and my fellow Jenkins users.) Can you please tell me whether it's possible to tell stuck flyweights of a pipeline apart from the ones that are running normally?

@dwnusbaum (Member, Author)

> the Throttle Concurrent Builds plugin prevents new builds from running because there are OneOffExecutors that are never removed from the master node.

@atikhono I don't think that issue (likely the same cause as JENKINS-45571) would be fixed by this change. This change is just about keeping pipelines from running concurrently (JENKINS-41127), but before we merge we want to make sure that it won't make things worse for users like you who are hitting JENKINS-45571, because we don't have a fix for that issue yet.

@svanoort (Member) commented Aug 3, 2018

@dwnusbaum @daniel-beck No matter what I do, I cannot get the custom WAR incrementals to play nicely with the IntelliJ debugger for some reason. I've gone as far as using a custom test harness built off of jenkinsci/jenkins-test-harness#99

It seems a lot like https://issues.jenkins-ci.org/browse/JENKINS-45245

java.lang.AssertionError: /Users/svanoort/.m2/repository/org/jenkins-ci/main/jenkins-core/2.135-SNAPSHOT/jenkins-core-2.135-20180727.210759-1.jar is not in the expected location, and jenkins-war-*.war was not in /Applications/IntelliJ IDEA CE 2.app/Contents/lib/idea_rt.jar:/Applications/IntelliJ IDEA CE 2.app/Contents/plugins/junit/lib/junit-rt.jar: YADDA YADDA

at org.jvnet.hudson.test.WarExploder.explode(WarExploder.java:119)
at org.jvnet.hudson.test.WarExploder.(WarExploder.java:68)
at org.jvnet.hudson.test.JenkinsRule.createWebServer(JenkinsRule.java:668)
at org.jvnet.hudson.test.JenkinsRule.newHudson(JenkinsRule.java:612)
at org.jvnet.hudson.test.JenkinsRule.before(JenkinsRule.java:390)
at org.jvnet.hudson.test.JenkinsRule$1.evaluate(JenkinsRule.java:543)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:298)
at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:292)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.lang.Thread.run(Thread.java:745)

You guys have any ideas, or is this one where we either wait for ideas from @jglick or hope for the best from a review alone?

@dwnusbaum (Member, Author) commented Aug 3, 2018

I was able to sync up with Sam and run his reproduction for JENKINS-45571 on my machine. Thankfully, when the executor is stuck, its work item is not in the Queue at all, so the change to the lostPendings logic here should not change the behavior. As noted in my earlier comment, the existing behavior could still be problematic, although in my reproduction WorkflowRun#isBuilding returns false, so that specific case shouldn't be a problem:

> If the task is still running on the Executor while the executor is stuck (WorkflowRun#isBuilding returns true), then new tasks may still be blocked, but they will be blocked because of WorkflowJob#getCauseOfBlockage, and this would happen with or without my patch.

I think this is ready to be merged. Any objections?

@svanoort (Member) commented Aug 3, 2018

@dwnusbaum I'm fine with it, but am hoping @daniel-beck feels the same

@daniel-beck (Member) commented Aug 3, 2018

This investigation appears to adequately address my concerns, thanks for clarifying that! Otherwise I haven't had the time to look into this further, so ±0

@dwnusbaum (Member, Author)

@daniel-beck Ok, I will go ahead and merge towards this weekly, and if there are any regressions I will make sure to look into them.

@dwnusbaum added the ready-for-merge label ("The PR is ready to go, and it will be merged soon if there is no negative feedback") and removed the on-hold label Aug 4, 2018
@dwnusbaum (Member, Author)

The CI failure appears to just be due to ci.jenkins.io running out of space (java.io.IOException: No space left on device); the tests are otherwise passing.

@dwnusbaum merged commit 6df66e9 into jenkinsci:master Aug 4, 2018
@dwnusbaum deleted the JENKINS-41127 branch August 4, 2018 01:58