Training Operator crashes when submitting PyTorchJob with elasticPolicy but without worker template defined #2278

Closed
alenawang opened this issue Oct 8, 2024 · 2 comments · Fixed by #2320

@alenawang
Contributor

What happened?

When a PyTorchJob is submitted with an elasticPolicy set but only a master template (no worker template defined), the Training Operator crashes with a nil pointer dereference that seems to come from this line.
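For readers unfamiliar with this class of crash, here is a minimal sketch of how it can arise in Go. It is not the operator's actual code: `JobSpec`, `ReplicaSpec`, `ElasticPolicy`, and `workerReplicasUnsafe` are simplified, hypothetical stand-ins, and the sketch assumes the replica specs are kept in a map of pointers keyed by replica type.

```go
package main

import "fmt"

// Hypothetical, simplified stand-ins for the operator's API types.
type ReplicaSpec struct {
	Replicas *int32
}

type ElasticPolicy struct {
	MaxReplicas *int32
}

type JobSpec struct {
	ElasticPolicy *ElasticPolicy
	ReplicaSpecs  map[string]*ReplicaSpec
}

// workerReplicasUnsafe illustrates the suspected failure mode: looking up a
// missing key in a map of pointers yields nil, and reading a field through
// that nil pointer panics at runtime.
func workerReplicasUnsafe(spec JobSpec) int32 {
	worker := spec.ReplicaSpecs["Worker"] // nil when no worker template exists
	return *worker.Replicas               // nil pointer dereference
}

// didPanic runs f and reports whether it panicked.
func didPanic(f func()) (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	f()
	return false
}

func main() {
	one := int32(1)
	// Elastic policy set, but only a master replica is defined.
	masterOnly := JobSpec{
		ElasticPolicy: &ElasticPolicy{MaxReplicas: &one},
		ReplicaSpecs:  map[string]*ReplicaSpec{"Master": {Replicas: &one}},
	}
	fmt.Println("panicked:", didPanic(func() { workerReplicasUnsafe(masterOnly) }))
}
```

In a long-running controller, an unrecovered panic like this takes down the whole process, which would match the crash described above.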

What did you expect to happen?

We expect the Training Operator not to crash. We are interested in contributing a fix, but are unsure which approach makes more sense: disallowing an elasticPolicy when only a master replica is defined, or updating the logic to allow it. If we block the behavior, we could add the check to the validating webhook so the user is informed of the error. If we allow it, we would also need to update the piece that sets the default c10d store, since it is currently set on worker 0. What are your thoughts?
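To make the "block the behavior" option concrete, here is a rough sketch of the kind of check a validating webhook could run. It is only an illustration under stated assumptions: the types, the `"Worker"` map key, `validateElastic`, and the error message are hypothetical and do not reflect the operator's real API or the fix that was eventually merged.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical, simplified stand-ins for the operator's API types.
type ReplicaSpec struct {
	Replicas *int32
}

type ElasticPolicy struct {
	MaxReplicas *int32
}

type JobSpec struct {
	ElasticPolicy *ElasticPolicy
	ReplicaSpecs  map[string]*ReplicaSpec
}

// errElasticNeedsWorker is a hypothetical error for the rejected case.
var errElasticNeedsWorker = errors.New("elasticPolicy requires a Worker replica spec")

// validateElastic rejects specs that set an elastic policy without defining
// a worker replica. Specs without an elastic policy pass unchanged.
func validateElastic(spec JobSpec) error {
	if spec.ElasticPolicy == nil {
		return nil
	}
	if worker, ok := spec.ReplicaSpecs["Worker"]; !ok || worker == nil {
		return errElasticNeedsWorker
	}
	return nil
}

func main() {
	one := int32(1)
	masterOnly := JobSpec{
		ElasticPolicy: &ElasticPolicy{MaxReplicas: &one},
		ReplicaSpecs:  map[string]*ReplicaSpec{"Master": {Replicas: &one}},
	}
	if err := validateElastic(masterOnly); err != nil {
		fmt.Println("rejected:", err)
	}
}
```

Rejecting the job at admission time means the user sees a clear error at submission instead of the controller failing later during reconciliation.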

Environment

Kubernetes version:
v1.29.8

Training Operator version:
v1-855e096, also tested a local build using the latest on master

Training Operator Python SDK version:
N/A

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

@kuizhiqing
Member

In elastic mode, the job should always include a worker section.

Feel free to add validation logic in this case.

Looking forward to your contribution.

@tenzen-y
Member

tenzen-y commented Nov 5, 2024

/remove-label lifecycle/needs-triage
