What happened?
When elasticPolicy is set on the manifest but the user does not pass in minReplicas or maxReplicas explicitly, the PET_NNODES env var is set to x:x, where x is the number of worker replicas only; it does not seem to include the master replica in this count. When elasticPolicy is not set, PET_NNODES is set to a single number that is the master plus the number of worker replicas, which seems correct.
What did you expect to happen?
We expected PET_NNODES to be set to x:x, where x is the total number of replicas (master + workers). Does this make sense? If so, we would be interested in contributing this fix.
Environment
Kubernetes version: v1.29.8
Training Operator version: v1-855e096, also tested a local build using the latest on master
Training Operator Python SDK version: N/A
Impacted by this bug?
Give it a 👍 We prioritize the issues with most 👍
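To make the discrepancy concrete, here is a minimal sketch of the two values described above. It is written in Python for brevity and is not the operator's code: the function names and the example of 1 master plus 3 workers are illustrative assumptions, and the "observed" function only mirrors the behavior reported in this issue.

```python
def observed_pet_nnodes(master: int, workers: int, elastic: bool) -> str:
    """PET_NNODES as reported in this issue (hypothetical helper, not operator code)."""
    if elastic:
        # Reported behavior with minReplicas/maxReplicas unset:
        # x:x where x counts the worker replicas only.
        return f"{workers}:{workers}"
    # Without elasticPolicy: a single number, master + workers.
    return str(master + workers)


def expected_pet_nnodes(master: int, workers: int, elastic: bool) -> str:
    """PET_NNODES as we would expect it (hypothetical helper)."""
    total = master + workers
    # With elasticPolicy we expect x:x where x is the total replica count.
    return f"{total}:{total}" if elastic else str(total)


if __name__ == "__main__":
    # Example: 1 master and 3 workers, elasticPolicy set, no min/max given.
    print("observed:", observed_pet_nnodes(1, 3, elastic=True))
    print("expected:", expected_pet_nnodes(1, 3, elastic=True))
```

For 1 master and 3 workers with elasticPolicy set, the reported value corresponds to 3:3, while we would expect 4:4; without elasticPolicy both give 4.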