Handle unexpected node reboots #421

Draft
wants to merge 1 commit into
base: main
from

Conversation

sairameshv
Member

Potentially fixes #336

make test and make lint pass locally.
make test-e2e-kind-emulated passes with some tweaks to the config files.

/hold

  • until a few more e2e test scenarios are added
  • until the change is thoroughly tested on a real GPU, covering all the scenarios

Signed-off-by: Sai Ramesh Vanka <[email protected]>
@openshift-ci openshift-ci bot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. labels Jan 31, 2025

openshift-ci bot commented Jan 31, 2025

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

@openshift-ci openshift-ci bot added the size/S Denotes a PR that changes 10-29 lines, ignoring generated files. label Jan 31, 2025

openshift-ci bot commented Jan 31, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: sairameshv

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Jan 31, 2025
@harche
Contributor

harche commented Jan 31, 2025

@sairameshv not sure if I got this correctly, but reading this comment #336 (comment), I get the impression that injecting code into the daemonset to delete all allocations during the daemonset bootstrap would fix the issue. Is that right?

@harche
Contributor

harche commented Jan 31, 2025

by that I meant, what happens if you fetch the corresponding instaslice object and delete all allocations somewhere here, https://github.com/openshift/instaslice-operator/blob/main/internal/controller/daemonset/instaslice_daemonset.go#L98 ?
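
A minimal sketch of the idea above, assuming a pared-down stand-in for the Instaslice object: fetch the object for the node at daemonset bootstrap and clear every recorded allocation. The `Instaslice`, `Allocation`, `Store`, and `memStore` types and the `ClearAllocationsOnBootstrap` function are hypothetical names used for illustration only; they are not the operator's actual types or API.

```go
package main

import "errors"

// Allocation is a hypothetical, simplified record of one slice handed to a pod.
type Allocation struct {
	PodName string
	Profile string
}

// Instaslice is a hypothetical stand-in for the per-node Instaslice object.
// The real custom resource has more fields; only what the sketch needs is shown.
type Instaslice struct {
	NodeName    string
	Allocations map[string]Allocation // keyed by pod UID (assumption)
}

// Store abstracts fetching and updating the object (assumed interface).
// In the operator this role would be played by a Kubernetes client.
type Store interface {
	Get(nodeName string) (*Instaslice, error)
	Update(obj *Instaslice) error
}

// ErrNotFound is returned when no object exists for the node.
var ErrNotFound = errors.New("instaslice object not found")

// memStore is an in-memory Store used only to exercise the sketch.
type memStore struct {
	objs map[string]*Instaslice
}

func (m *memStore) Get(nodeName string) (*Instaslice, error) {
	obj, ok := m.objs[nodeName]
	if !ok {
		return nil, ErrNotFound
	}
	return obj, nil
}

func (m *memStore) Update(obj *Instaslice) error {
	m.objs[obj.NodeName] = obj
	return nil
}

// ClearAllocationsOnBootstrap removes every allocation recorded for the node
// and reports how many were removed. It skips the update when there is
// nothing to clear, so repeated bootstraps are cheap.
func ClearAllocationsOnBootstrap(s Store, nodeName string) (int, error) {
	obj, err := s.Get(nodeName)
	if err != nil {
		return 0, err
	}
	removed := len(obj.Allocations)
	if removed == 0 {
		return 0, nil
	}
	obj.Allocations = map[string]Allocation{}
	if err := s.Update(obj); err != nil {
		return 0, err
	}
	return removed, nil
}
```

The sketch clears allocations on every daemonset start, not only after a reboot; whether that is acceptable is one of the open questions in this thread.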

@sairameshv
Member Author

> by that I meant, what happens if you fetch the corresponding instaslice object and delete all allocations somewhere here, https://github.com/openshift/instaslice-operator/blob/main/internal/controller/daemonset/instaslice_daemonset.go#L98 ?

If we delete all the previous allocations when the daemonset comes up after a reboot, can we gate the pods again and create the slices? Any thoughts @asm582?

@asm582
Contributor

asm582 commented Jan 31, 2025

> by that I meant, what happens if you fetch the corresponding instaslice object and delete all allocations somewhere here, https://github.com/openshift/instaslice-operator/blob/main/internal/controller/daemonset/instaslice_daemonset.go#L98 ?
>
> If we delete all the previous allocations when the daemonset comes up after a reboot, can we gate the pods again and create the slices? Any thoughts @asm582?

Good point, all partitions will be deleted by the hardware after a reboot. I think if an operator or deployment is used to spawn pods, then we will always have new ungated pods. We need to come up with a solution for plain pods.
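
The distinction drawn above, between pods spawned by an operator or deployment and plain pods, could be detected from owner references. The following is only a sketch under that assumption: `Pod` and `OwnerReference` are simplified, hypothetical stand-ins for the Kubernetes types, and `HasController` and `PlainPods` are illustrative helper names, not functions from this repository.

```go
package main

// OwnerReference is a simplified stand-in for the Kubernetes owner reference.
// Controller is a pointer because the field is optional.
type OwnerReference struct {
	Kind       string
	Name       string
	Controller *bool
}

// Pod is a simplified stand-in carrying only the fields the sketch needs.
type Pod struct {
	Name            string
	OwnerReferences []OwnerReference
}

// HasController reports whether any owner reference marks a managing
// controller, as is the case for pods created through a deployment.
func HasController(p Pod) bool {
	for _, ref := range p.OwnerReferences {
		if ref.Controller != nil && *ref.Controller {
			return true
		}
	}
	return false
}

// PlainPods returns the pods that no controller manages. These are the pods
// that nothing recreates after a reboot and that need separate handling.
func PlainPods(pods []Pod) []Pod {
	var plain []Pod
	for _, p := range pods {
		if !HasController(p) {
			plain = append(plain, p)
		}
	}
	return plain
}
```

Filtering this way only identifies the plain pods; what to do with them after a reboot is still the unresolved part.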
