Assume instance exists within eventual-consistency grace period #1024
base: master
[APPROVALNOTIFIER] This PR is NOT APPROVED
This issue is currently awaiting triage.
/assign mmerkes
pkg/providers/v1/instances_v2.go
Outdated
@@ -44,6 +49,11 @@ func (c *Cloud) getProviderID(ctx context.Context, node *v1.Node) (string, error
// InstanceExists returns true if the instance for the given node exists according to the cloud provider.
// Use the node.name or node.spec.providerID field to find the node in the cloud provider.
func (c *Cloud) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	if time.Since(node.CreationTimestamp.Time) < instanceExistsGracePeriod {
Should we apply this check only if the node is not being deleted, i.e. only when the deletion timestamp is not set?
pkg/providers/v1/instances_v2.go
Outdated
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/types"
	cloudprovider "k8s.io/cloud-provider"
)

const (
	instanceExistsGracePeriod = 2 * time.Minute
How did we come up with this time period?
Closing this in favor of kubernetes/kubernetes#127424
Rebooting this.
c463646 to 82999f0
82999f0 to 26c8821
/retest
26c8821 to b25a71e
pkg/providers/v1/instances_v2.go
Outdated
@@ -45,6 +47,12 @@ func (c *Cloud) getProviderID(ctx context.Context, node *v1.Node) (string, error
// InstanceExists returns true if the instance for the given node exists according to the cloud provider.
// Use the node.name or node.spec.providerID field to find the node in the cloud provider.
func (c *Cloud) InstanceExists(ctx context.Context, node *v1.Node) (bool, error) {
	if time.Since(node.CreationTimestamp.Time) < c.nodeEventualConsistencyGracePeriod {
		// recently-launched EC2 instances may not appear in `ec2:DescribeInstances`
		// we return an error to cause the cloud-node-lifecycle-controller to ignore this node
Is there any downside to just forcing it to fail? Could increase the error logs. Can we just move this check down into code so that we only return an error if it's within the grace period AND it doesn't have the eventually consistent fields set? That would drastically reduce the noise.
Yep, had the same thought. We have to handle `InvalidInstanceId.NotFound` differently in the `Instances` v1/v2 impls, since we only have the context to apply this grace period in v2.
b25a71e to d936e99
d936e99 to 8282b31
@cartermckinnon: The following tests failed, say
What type of PR is this?
/kind bug
What this PR does / why we need it:
For a short period of time after an EC2 instance is launched, the `ec2:DescribeInstances` API may not return details for the instance. When this happens, the node lifecycle controller may delete the `Node` erroneously, believing the corresponding EC2 instance does not exist.

This PR adds a configurable "grace period" after `Node` creation, during which the EC2 instance is assumed to exist.

Which issue(s) this PR fixes:
Fixes #
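To illustrate why returning an error protects the `Node`, here is a simplified sketch of how a lifecycle-style loop could treat the three possible answers from an existence check. It is an assumption-laden model and not the controller's actual code: `existsFunc`, `errNotYetVisible`, and `nodesToDelete` are hypothetical names, and node names stand in for full `Node` objects.

```go
package main

import "errors"

// errNotYetVisible is a hypothetical error a provider could return for a
// node that is still inside its grace period.
var errNotYetVisible = errors.New("instance may not be visible yet")

// existsFunc is a hypothetical stand-in for an InstanceExists-style check.
type existsFunc func(nodeName string) (bool, error)

// nodesToDelete returns the nodes this simplified loop would delete: only
// those whose check succeeded and reported that the instance does not exist.
func nodesToDelete(nodes []string, exists existsFunc) []string {
	var doomed []string
	for _, name := range nodes {
		ok, err := exists(name)
		if err != nil {
			// The check failed: leave the node alone and retry on a later pass.
			continue
		}
		if !ok {
			doomed = append(doomed, name)
		}
	}
	return doomed
}
```

In this model a `(false, nil)` answer for a freshly launched instance gets the node deleted, while an error leaves it in place until the next pass.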
Special notes for your reviewer:
This is possible in our `InstancesV2` implementation, because the controller will pass us the `Node`: https://github.com/kubernetes/cloud-provider/blob/912e64449ce4cb3645436a768d4a8d5c834652ed/controllers/nodelifecycle/node_lifecycle_controller.go#L253
`InstancesV2`.

Does this PR introduce a user-facing change?: