What happened?
When submitting a PyTorchJob with an elasticPolicy set but only a master template (no worker template defined), the Training Operator crashes with a nil pointer dereference that appears to originate from this line.
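For reference, a minimal manifest of the shape that triggers the crash might look like the following sketch (the name and image are illustrative placeholders; the key point is an elasticPolicy combined with a Master-only pytorchReplicaSpecs):

```yaml
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: elastic-master-only        # illustrative name
spec:
  elasticPolicy:
    rdzvBackend: c10d
    minReplicas: 1
    maxReplicas: 2
  pytorchReplicaSpecs:
    Master:                        # note: no Worker spec defined
      replicas: 1
      template:
        spec:
          containers:
            - name: pytorch
              image: example.com/pytorch-job:latest   # placeholder image
```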
What did you expect to happen?
We expect the Training Operator not to crash. We are interested in contributing a fix, but we are unsure whether it would make more sense to disallow setting an elasticPolicy when only a master replica is defined, or to update the logic to support this configuration. If we block the behavior, we could add the check to the validating webhook so the user is informed of the error. If we allow the behavior, we would also need to update this piece, since the default c10d store is set on worker 0. What are your thoughts?
Environment
Kubernetes version: v1.29.8
Training Operator version: v1-855e096; also tested a local build from the latest commit on master
Training Operator Python SDK version:
N/A
Impacted by this bug?
Give it a 👍. We prioritize the issues with the most 👍.