[JENKINS-55287] Improve error messages for nonresumable Pipelines #363
Also fixes tracking of whether PERFORMANCE_OPTIMIZED Pipelines are able to resume. Previously, they were always marked as resumable because the build was saved right after adding the FlowStartNode, causing the persistedClean flag to be set to true. Now we only modify persistedClean when the build has already completed or when Jenkins is shutting down.
initializeStorage(); // Throws exception and bombs out if we can't load FlowNodes
} catch (Exception ex) {
LOGGER.log(Level.WARNING, "Error initializing storage and loading nodes, will try to create placeholders for: "+this, ex);
if (!canResume()) {
Alternatively, we could check this even before we attempt to initialize storage and just bail out right away. I don't know if that is too pessimistic. Maybe there is a case where the Pipeline is not resumable but the flow nodes are all up to date, in which case the optimistic behavior here is useful because you will still be able to see the existing flow graph?
createPlaceholderNodes(ex);
return;
} catch (Exception ex2) {
This is all mostly refactoring. The old code had this structure:
try {
    try {
        // code that can throw
    } catch (first) {
        // code that can throw
        return
    }
} catch (second) {
    throw second // first exception, if any, is totally lost
}
Now it has this structure:
try {
    // code that can throw
} catch (first) {
    try {
        // code that can throw
    } catch (second) {
        second.addSuppressed(first)
        throw second
    }
    return
}
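To make the new structure concrete, here is a minimal runnable sketch of the same pattern. The class and method names (SuppressedDemo, loadNodes, createPlaceholders, onLoad) are hypothetical stand-ins, not the plugin's code; only Throwable.addSuppressed and Throwable.getSuppressed are real JDK calls.

```java
import java.io.IOException;

// Hypothetical stand-ins for the two steps that can throw.
class SuppressedDemo {
    static void loadNodes() throws Exception {
        throw new IOException("could not load flow nodes");
    }

    static void createPlaceholders() throws Exception {
        throw new IllegalStateException("could not create placeholders");
    }

    // Mirrors the new structure: the fallback runs inside the first catch block,
    // so the first exception is still in scope when the fallback fails.
    static void onLoad() throws Exception {
        try {
            loadNodes();
        } catch (Exception first) {
            try {
                createPlaceholders();
            } catch (Exception second) {
                second.addSuppressed(first); // keep the original failure attached
                throw second;
            }
            return;
        }
    }

    public static void main(String[] args) {
        try {
            onLoad();
        } catch (Exception e) {
            System.out.println("thrown: " + e.getMessage());
            for (Throwable s : e.getSuppressed()) {
                System.out.println("suppressed: " + s.getMessage());
            }
        }
    }
}
```

With this shape, a stack trace for the second exception also lists the first one under "Suppressed:", so neither failure is lost.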
src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
@@ -1526,10 +1540,6 @@ public boolean isPaused() {
return false;
}

private void setPersistedClean(boolean persistedClean) { // Workaround for some issues with anonymous classes.
Unused
src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
@@ -244,7 +243,6 @@ public void inProgressNormal() throws Exception {
}

@Test
@Ignore
These tests have been ignored since they were added. I am not sure why, but I went ahead and unignored them.
You probably already knew this, but I just noticed that these test cases seem to be duplicated in workflow-job-plugin's CpsPersistenceTest. Once the dust settles, it might be worth doing a sweep of the two copies to ensure that they are in sync with each other.
Yeah, I think I will probably delete the copies in workflow-job after this. Ideally we would run the PCT in the CI builds here with just the core Pipeline plugins so that changes are tested against the newest versions of everything.
@Test
@Ignore
public void inProgressMaxPerfDirtyShutdown() throws Exception {
This is the exact situation reported in JENKINS-55287 as far as I can tell.
src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
…out after resuming or unpausing a build
// Before we resume, we need to unset persistedClean in case Jenkins restarts again.
persistedClean = null;
saveOwner();
Should add a test case for this. The test would use a PERFORMANCE_OPTIMIZED Pipeline, restart Jenkins normally, then restart Jenkins again with a hard shutdown and make sure we get the error message that the Pipeline can't be resumed rather than a raw exception.
I came up with a test case, but it doesn't fail without this line because of a random save that happens when CpsFlowExecution.notifyListeners is called. IIUC, #234 was trying to stop that save from occurring for PERFORMANCE_OPTIMIZED Pipelines, so I will look into making related changes either in this PR or before this PR is merged.
// Pausing the build sets persistedClean to true so the build can resume, so if we unpause the build
// we need to unset persistedClean again.
persistedClean = null;
saveOwner();
We should add a test case for this. The Pipeline would need to be paused, then unpaused, and then we do a hard shutdown and make sure we get the error message that the Pipeline can't be resumed rather than a raw exception.
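As a rough illustration of the scenario such a test would cover, the flag's lifecycle can be modeled in isolation. This is a toy model only: ToyExecution and its methods are hypothetical and do not use the plugin's classes or a real Jenkins test harness, and the resume check is simplified to the PERFORMANCE_OPTIMIZED case.

```java
// Toy model only: not the plugin's CpsFlowExecution.
class ToyExecution {
    // null = no clean save recorded, TRUE = state was fully saved
    private Boolean persistedClean = null;

    // Pausing persists everything, so the build could resume from disk.
    void pause() { persistedClean = Boolean.TRUE; }

    // Once the build runs again, the saved state goes stale, so unset the flag.
    void unpause() { persistedClean = null; }

    // Simplified assumption: resumable only if a clean save was recorded.
    boolean canResume() { return Boolean.TRUE.equals(persistedClean); }

    // What the user should see after a hard shutdown and restart.
    String resumeAfterRestart() {
        return canResume()
                ? "resumed"
                : "cannot resume: build was not saved cleanly before shutdown";
    }

    public static void main(String[] args) {
        ToyExecution e = new ToyExecution();
        e.pause();
        e.unpause();
        // Hard shutdown happens here; nothing gets a chance to save.
        System.out.println(e.resumeAfterRestart());
    }
}
```

Without the reset in unpause(), the model would report the build as resumable after the hard shutdown, which is the situation the review comment wants a test to guard against.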
if (isComplete() || this.getDurabilityHint().isPersistWithEveryStep()) {
// Nothing to persist OR we've already persisted it along the way.
return;
}
LOGGER.log(Level.INFO, "Attempting to checkpoint all data for {0}{1}", new Object[] {
The logging-related changes were originally part of #354 (there are some slight modifications here), so it might be easier to review that PR first.
Underlying issue was fixed by jenkinsci#368.
I don't remember exactly what was wrong with this approach, but back in February I spent a bit more time looking into this and decided it would be better to just immediately throw an exception from See my JENKINS-55287-2 branch for what that approach looked like. I can't remember why I didn't file a PR for that branch at the time. I think that I might have thought we needed a simultaneous patch to |
See JENKINS-55287. Subsumes #354 (and will subsume #234 eventually).
Based on the cases I've seen, I think the root cause of JENKINS-55287 is that Jenkins is crashing, so it is expected that PERFORMANCE_OPTIMIZED Pipelines are unable to resume. From that perspective, the issue is more that the error message is misleading and causes users to think they have hit a bug in Pipeline, when really this case is expected and the fix is to investigate and understand why Jenkins is crashing.
The field CpsFlowExecution.persistedClean was supposed to help detect these cases and print a more relevant error message, but there were two issues:
- … CpsFlowExecution.saveOwner that occurs right after the build starts. In the cases of JENKINS-55287 I have seen, the only persisted flow node is always the FlowStartNode, meaning that this is the only time the build was ever saved, and so the build is always considered resumable unless Jenkins shuts down normally and there is an error while saving the build in CpsFlowExecution.suspendAll. To fix this we only modify persistedClean if Jenkins is shutting down.
- … PERFORMANCE_OPTIMIZED pipelines, see [JENKINS-53358] Remove bogus persistence calls due to notifyListeners #234. I don't think it makes sense to try to update the logic here without fixing that first, so I will look into fixing that separately.
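The rule described for the first issue (only touch persistedClean when the build is complete or Jenkins is shutting down) can be sketched as a small guard. This is an illustrative toy, not the plugin's code: ToyPersistedCleanTracker, onSave, and its parameters are hypothetical names.

```java
// Toy model only: the class, fields, and method names here are hypothetical.
class ToyPersistedCleanTracker {
    Boolean persistedClean = null; // null means "no clean save recorded"
    boolean complete = false;

    // Called whenever the build record is written to disk.
    void onSave(boolean jenkinsShuttingDown, boolean saveSucceeded) {
        if (complete || jenkinsShuttingDown) {
            persistedClean = saveSucceeded;
        }
        // Otherwise leave the flag alone: a mid-build save (such as the one
        // right after the start node is added) says nothing about whether
        // the running program state ever reached disk.
    }

    boolean canResume() { return Boolean.TRUE.equals(persistedClean); }

    public static void main(String[] args) {
        ToyPersistedCleanTracker t = new ToyPersistedCleanTracker();
        t.onSave(false, true); // save right after the build starts
        System.out.println("after start save: " + t.canResume());
        t.onSave(true, true);  // save during a normal shutdown
        System.out.println("after shutdown save: " + t.canResume());
    }
}
```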