-
Notifications
You must be signed in to change notification settings - Fork 695
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle DescribeInstances failures when creating new instance #414
base: master
Are you sure you want to change the base?
Conversation
BTW, I haven't actually run or built this code. I don't have a setup to build Jenkins plugins. |
Looks like you have a compiler error. You can build the plugin by running |
Sometimes DescribeInstances takes a very long time to return data for new instances. Wait up to 1 minute in this case before assuming that the instance is dead.
e9c9cf6
to
929d580
Compare
fetchLiveInstanceData(true); | ||
try { | ||
// Wait up to 1 minute for the instance to show up | ||
fetchLiveInstanceData(true, 60); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
When does this get called? If its during a provision operation, having such a long timeout might not be ideal as during provisioning the queue lock is held which means during this 1 minute builds you start won't be queued up and builds can't be picked up by nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not totally sure what Jenkins context this is called in, but I'm pretty sure it doesn't block other builds from queuing. That said, it already blocks for up to 25 seconds. This just extends it to 60 seconds.
We could try leaving out this commit and just having the commit that tries to terminate the instance if it couldn't get the information about the instance. That would hopefully keep it from getting deadlocked.
Yet another possible fix is for the EC2 plugin to keep track of which instances it launched and actively terminate any other instances it finds in EC2 with the same details. I didn't look into that, though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we keep the timeout to be the same and explore increasing it if it does not solve the issue completely
If DescribeInstances did not return any data, send a termination request for it. If it shows up later, it would be included in capacity calculations and Jenkins could refuse to instantiate new nodes if the maximum capacity was reached.
929d580
to
3d39e47
Compare
* it doesn't affect capacity calculations. | ||
*/ | ||
LOGGER.log(Level.WARNING, "Failed to get instance data for new instance " + getInstanceId() + ", terminating"); | ||
terminateInstance(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@thoulen also made a point that this terminateInstance call can fail if the api gets throttled
Sometimes when launching a new instance,
DescribeInstances
does not report data within the amount of time thatgetInstanceWithRetry
waits. However, the instance is created and eventually Jenkins can't create any more nodes because we're at maximum capacity.This PR attempts 2 fixes:
Increase the timeout waiting for
DescribeInstances
to return data when creating a new instanceIf no data could be found, send a termination request for the request so it doesn't pollute the capacity accounting.