Plan status should reflect errors from upgrade jobs #221

tsde opened this issue Jan 20, 2023 · 0 comments

Is your feature request related to a problem? Please describe.
When one or more upgrade jobs fail, the Plan status does not reflect the failure in the upgrade process.

Describe the solution you'd like
The Plan status should indicate that a failure occurred in the upgrade process, at least by updating the status.conditions.type and status.conditions.reason fields. This would make it easier to track down a failure in the node upgrade process.

Describe alternatives you've considered
Of course, inspecting the status of the jobs themselves is one way to get this information (see the sketch below), but since the Plan is driving these jobs, surfacing the failure in its status as well would be a nice addition.
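
For reference, this is roughly where the failure currently surfaces: the underlying batch/v1 Job reports it in its own status once the backoff limit is exceeded. A trimmed, illustrative sketch of such a Job status (exact message, counts and timestamps will differ):

status:
  conditions:
  - type: Failed
    status: "True"
    reason: BackoffLimitExceeded
    message: Job has reached the specified backoff limit
  failed: 3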

Additional context
To reproduce, simply apply the following Plan, whose upgrade command forces the job to fail (also set the SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT env var in the controller's ConfigMap to a low value like 2 to avoid waiting forever; a sketch of that setting follows the Plan below):

---
apiVersion: upgrade.cattle.io/v1
kind: Plan
metadata:
  name: test
  labels:
    rke2-upgrade: agent
spec:
  concurrency: 1
  nodeSelector:
    matchExpressions:
      - {key: node-role.kubernetes.io/control-plane, operator: NotIn, values: ["true"]}
  prepare:
    command:
      - sh
      - "-c"
      - "/bin/true"
    image: rancher/rke2-upgrade
  serviceAccountName: default
  cordon: true
  drain:
    force: true
  upgrade:
    image: rancher/rke2-upgrade
    command:
      - sh
      - "-c"
      - "/bin/false"
  version: v1.25.5+rke2r2
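
For completeness, here is a minimal sketch of the backoff-limit setting mentioned above. The ConfigMap name and namespace below are assumptions based on a typical upstream system-upgrade-controller install and may differ in your deployment; only the SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT env var itself is taken from this report:

apiVersion: v1
kind: ConfigMap
metadata:
  name: default-controller-env   # assumed name; may differ in your deployment
  namespace: system-upgrade      # assumed namespace; may differ in your deployment
data:
  SYSTEM_UPGRADE_JOB_BACKOFF_LIMIT: "2"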

Once the retries are exhausted, the job ends up in a failed state, while the Plan status only shows the following:

status:
  applying:
  - my-agent-node
  conditions:
  - lastUpdateTime: "2023-01-20T10:46:31Z"
    reason: Version
    status: "True"
    type: LatestResolved
  latestHash: 1f7d06e28d958431705fcce82a14c95ac4cc4b695cfb0cdf3e088bfb
  latestVersion: v1.25.5-rke2r2

There is no clear indication of the failure whatsoever. Setting the reason and type fields (and perhaps other custom fields, such as the failing job's name) to reflect the failure would be nice, for example along the lines sketched below.
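
As a rough illustration only (the condition type, reason and message below are suggestions, not fields the controller currently sets), a failed upgrade could surface in the Plan status along these lines:

status:
  conditions:
  - lastUpdateTime: "2023-01-20T10:46:31Z"
    reason: Version
    status: "True"
    type: LatestResolved
  - lastUpdateTime: "2023-01-20T10:52:00Z"
    type: Complete            # suggested condition; name is illustrative
    status: "False"
    reason: JobFailed         # suggested reason; name is illustrative
    message: upgrade job for node my-agent-node exceeded its backoff limit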

Thanks
