Allow scaling events to be logged during cluster validation #16354

rifelpet · 2024-02-15T02:04:51Z

The e2e upgrade jobs that have been migrated to the new prow cluster are failing validation partway through the rolling update:

I0209 13:36:37.740937 14123 instancegroups.go:560] Cluster did not pass validation within deadline: InstanceGroup "nodes-us-west-2a" did not have enough nodes 1 vs 4.

We can fetch the scaling activities on the ASG, which should indicate whether the AWS autoscaling service is failing to launch nodes for some reason (resource quota, capacity, etc.).

This PR allows those events to be logged at the --v=4 level, and sets that level in the upgrade scripts.

I included autoscaling activity for both AWS and GCE. Other providers can add their implementations separately.

k8s-ci-robot · 2024-02-15T02:04:53Z

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

k8s-ci-robot · 2024-02-15T02:06:19Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from rifelpet. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

rifelpet · 2024-02-15T02:21:02Z

trying this out:

/test pull-kops-e2e-aws-upgrade-k127-ko127-to-klatest-kolatest-many-addons

though this job sounds invalid, since a k8s 1.27 cluster can't be upgraded to k8s latest (1.29)

justinsb · 2024-02-18T00:04:30Z

I like the idea!

> though this job sounds invalid, since a k8s 1.27 cluster can't be upgraded to k8s latest (1.29)

(Do we know that it can't? It's not considered a supported path, but in practice most skip-upgrades do work)

justinsb

Cool idea!

justinsb · 2024-02-18T00:06:32Z

upup/pkg/fi/cloudup/gce/compute.go

@@ -709,6 +710,17 @@ func (c *instanceGroupManagerClientImpl) List(ctx context.Context, project, zone
 	return ms, nil
 }

+func (c *instanceGroupManagerClientImpl) ListErrors(ctx context.Context, project, zone, name string) ([]*compute.InstanceManagedByIgmError, error) {


Aside: I sort of regret our having this layer in GCE, I'm not sure it adds much!

justinsb · 2024-02-18T00:07:53Z

upup/pkg/fi/cloudup/awsup/aws_cloud.go

@@ -1196,13 +1196,29 @@ func awsBuildCloudInstanceGroup(ctx context.Context, c AWSCloud, cluster *kops.C
 		return nil, fmt.Errorf("failed to fetch instances: %v", err)
 	}

+	scalingReq := &autoscaling.DescribeScalingActivitiesInput{


I don't know whether this gets expensive, but I wonder if we should plumb through a flag here as to whether we want this information (or add a method or func to CloudInstanceGroup that could query it on-demand)

justinsb · 2024-02-18T00:10:11Z

upup/pkg/fi/cloudup/awsup/aws_cloud.go

+		for _, activity := range p.Activities {
+			event := cloudinstances.ScalingEvent{
+				Timestamp:   aws.TimeValue(activity.StartTime),
+				Description: aws.StringValue(activity.Description),


Looks like there's some extra fields; I'm not sure if Description has all/most of the information, but you might consider including the Raw any field (which will then get printed!)

k8s-triage-robot · 2024-05-18T00:46:20Z

The Kubernetes project currently lacks enough contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle stale
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot · 2024-06-17T01:01:46Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Mark this PR as fresh with /remove-lifecycle rotten
Close this PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-ci-robot · 2024-07-06T06:23:22Z

@rifelpet: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
pull-kops-e2e-k8s-aws-calico-k8s-infra	`39246bf`	link	true	`/test pull-kops-e2e-k8s-aws-calico-k8s-infra`

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

k8s-triage-robot · 2024-08-05T07:04:58Z

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the PR is closed

You can:

Reopen this PR with /reopen
Mark this PR as fresh with /remove-lifecycle rotten
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot · 2024-08-05T07:05:03Z

@k8s-triage-robot: Closed this PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

rifelpet added 5 commits February 14, 2024 19:57

Include recent scaling events in CloudInstanceGroup

9fc0365

Populate scaling events in GCE

bd049f5

Populate scaling events in AWS

56b53f8

Log scaling events at v=4 level

9b6c417

Increase log verbosity on rolling-update in upgrade tests

39246bf

k8s-ci-robot requested review from hakman and johngmyers February 15, 2024 02:06

k8s-ci-robot added the area/provider/gcp Issues or PRs related to gcp provider label Feb 15, 2024

justinsb reviewed Feb 18, 2024

k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2024

k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 17, 2024

k8s-ci-robot closed this Aug 5, 2024

rifelpet mentioned this pull request Nov 6, 2024

Reduce number of nodes in manyaddons tests #16934

Merged
