-
Notifications
You must be signed in to change notification settings - Fork 195
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[JENKINS-55287] update exception warning for failure to load flow node #589
[JENKINS-55287] update exception warning for failure to load flow node #589
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the fact that this message shows up in this case at all is a bug, and it might be possible to hit this exception in other cases, but this seems like the most prudent fix at this point.
The intention seems to have been for
workflow-cps-plugin/src/main/java/org/jenkinsci/plugins/workflow/cps/CpsFlowExecution.java
Line 745 in f82abbe
throw new IOException("Cannot resume build -- was not cleanly saved when Jenkins shut down."); |
@@ -676,7 +676,7 @@ protected synchronized void initializeStorage() throws IOException { | |||
h.setForDeserialize(storage.getNode(entry.getValue())); | |||
heads.put(h.getId(), h); | |||
} else { | |||
throw new IOException("Tried to load head FlowNodes for execution "+this.owner+" but FlowNode was not found in storage for head id:FlowNodeId "+entry.getKey()+":"+entry.getValue()); | |||
throw new IOException("Flow node could not be loaded because Jenkins did not shut down cleanly and the Pipeline Durability was not set to MAX_SURVIVABILITY"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might tweak the message a bit to mirror the terminology on https://www.jenkins.io/doc/book/pipeline/scaling-pipeline/ and log the FlowHead:FlowNode
info if the logger has Level.FINE
enabled in case someone is debugging things at some point (untested):
throw new IOException("Flow node could not be loaded because Jenkins did not shut down cleanly and the Pipeline Durability was not set to MAX_SURVIVABILITY"); | |
throw new IOException("Cannot resume build because Jenkins did not shut down cleanly and the Pipeline durability setting was not set to MAX_SURVIVABILITY" + (LOGGER.isLoggable(Level.FINE) ? " (Missing FlowNode " + entry.getValue() + " for FlowHead " + entry.getKey() + ")" : "")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this might happen even if you are using MAX_SURVIVABILITY
in obscure cases, but I'm not sure. Maybe check the actual durability setting and only add the "and the Pipeline durability setting was not set to MAX_SURVIVABILITY" part of the message if that is in fact true. Probably only matters if anyone in the Jira issue mentioned seeing this exception when using MAX_SURVIVABILITY
, otherwise it could be improved later as needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also hesitate to suggest a remediation without actually knowing that it suffices.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤷
if (getDurabilityHint() != FlowDurabilityHint.MAX_SURVIVABILITY) { | ||
suggestDurability = "Try setting pipeline durability to MAX_SURVIVABILITY. "; | ||
} | ||
throw new IOException("Cannot resume build because Jenkins did not shut down cleanly. " + suggestDurability + (LOGGER.isLoggable(Level.FINE) ? "(Missing FlowNode " + entry.getValue() + " for FlowHead " + entry.getKey() + ")" : "")); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is not correct. Jenkins may have shut down correctly, we can not know.
All we know is a file we expect to be there is not. This could be disk corruption outside Jenkins or a restore from a non atomic backup....
We should say we can not find a file (and which file, log it add a chained FileNotFoundExcption?). If you are not using a durable mode we should say to use it, otherwise this is expected for a non clean shutdown.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion, the specific XML file that is missing is completely irrelevant to users (and probably even to plugin developers). I would guess that 95% or more of the time users only see this exception when Jenkins crashed while using the PERFORMANCE_OPTIMIZED
durability setting, and in that scenario the new message is strictly an improvement over the old one since the behavior is typical and expected.
If you really want to improve the logic to diagnose the scenario more precisely, then I would recommend looking into #363 or master...dwnusbaum:JENKINS-55287-2 like I mentioned in #589 (review), but it gets complex fast because this code is undertested and there are many subtle issues with the existing logic and its interaction with WorkflowRun.onLoad
, and I am not confident that either of those changes would fix the existing bugs without introducing new issues, which is why I abandoned them.
If you want, you could keep the existing logic more or less but make the message a lot more detailed and hedge all recommendations, but in practice IDK if it really makes a difference:
if (durabilitySetting != MAX_SURVIVABILITY) {
throw new AbortException("Cannot resume build because flow node X could not be loaded. " +
"This is expected to happen when using the " + durabilitySetting + " durability setting and Jenkins is not shut down cleanly. " +
"Consider investigating to understand if Jenkins was not shut down cleanly or switching to the MAX_SURVIVABILITY durability setting which should prevent this issue in most cases." );
} else {
throw new AbortException("Cannot resume build because flow node X could not be loaded.");
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you want, you could keep the existing logic more or less but make the message a lot more detailed and hedge all recommendations, but in practice IDK if it really makes a difference:
that was what I was trying to say - thanks for reading my mind :)
See https://issues.jenkins.io/browse/JENKINS-55287
Update the exception message when FlowNode fails to load because Jenkins did not shut down cleanly. Suggest pipeline durability MAX_SURVIVABILITY if not already set.