PET_NNODES env var for PyTorchJobs is incorrect when elasticPolicy is set #2277

alenawang · 2024-10-08T19:04:39Z

What happened?

When elasticPolicy is set on the manifest but the user does not pass minReplicas or maxReplicas explicitly, the PET_NNODES env var is set to x:x, where x is the number of worker replicas only; the master replica does not seem to be included in this count. When elasticPolicy is not set, PET_NNODES is set to a single number equal to the master plus the number of worker replicas, which seems correct.

What did you expect to happen?

We expected PET_NNODES to be set to x:x where x is the total number of replicas (master + workers). Does this make sense? If so we would be interested in contributing this fix.
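To make the difference concrete, here is a minimal Python sketch of the behaviour described above. It is only a model of this report, not the operator's actual code: the function names and parameters are hypothetical, and it covers only the case where minReplicas and maxReplicas are left unset.

```python
def reported_pet_nnodes(master: int, workers: int, elastic: bool) -> str:
    """Value of PET_NNODES as observed in this report (hypothetical model)."""
    if not elastic:
        # No elasticPolicy: a single number, master + workers.
        return str(master + workers)
    # elasticPolicy set, no explicit min/max: only workers are counted.
    return f"{workers}:{workers}"


def expected_pet_nnodes(master: int, workers: int, elastic: bool) -> str:
    """Value of PET_NNODES that this report expects (hypothetical model)."""
    total = master + workers
    if not elastic:
        return str(total)
    # elasticPolicy set, no explicit min/max: count every replica.
    return f"{total}:{total}"


if __name__ == "__main__":
    # One master and three workers, elasticPolicy set without min/max.
    print(reported_pet_nnodes(1, 3, elastic=True))
    print(expected_pet_nnodes(1, 3, elastic=True))
```

With one master and three workers, the two functions disagree only in the elastic case; without elasticPolicy both return the same single number.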

Environment

Kubernetes version:
v1.29.8

Training Operator version:
v1-855e096, also tested a local build using the latest on master

Training Operator Python SDK version:
N/A

Impacted by this bug?

Give it a 👍. We prioritize the issues with the most 👍.

kuizhiqing · 2024-10-10T02:38:13Z

Thanks for this feedback.

Actually, by design, there is no need to set a master at all; see https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/imagenet/imagenet.yaml and https://github.com/kubeflow/training-operator/blob/master/examples/pytorch/elastic/echo/echo.yaml.

This design makes sense, since in the elastic scenario all nodes are treated equally.
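As a rough illustration of that point, the sketch below represents a trimmed-down, worker-only job spec as a Python dict and compares the worker count with the total replica count. The dict is a simplified stand-in, not a copy of the linked examples, and `replica_counts` is a hypothetical helper written for this comment.

```python
# Simplified, hypothetical job spec: elasticPolicy set, Worker replicas only.
worker_only_spec = {
    "elasticPolicy": {"rdzvBackend": "c10d"},
    "pytorchReplicaSpecs": {
        "Worker": {"replicas": 3},
    },
}

# The same spec with a Master added, as in the original report.
with_master_spec = {
    "elasticPolicy": {"rdzvBackend": "c10d"},
    "pytorchReplicaSpecs": {
        "Master": {"replicas": 1},
        "Worker": {"replicas": 3},
    },
}


def replica_counts(spec: dict) -> tuple[int, int]:
    """Return (worker replicas, total replicas) for a spec dict."""
    replica_specs = spec["pytorchReplicaSpecs"]
    workers = replica_specs.get("Worker", {"replicas": 0})["replicas"]
    total = sum(rs["replicas"] for rs in replica_specs.values())
    return workers, total


if __name__ == "__main__":
    print(replica_counts(worker_only_spec))
    print(replica_counts(with_master_spec))
```

In the worker-only layout the two counts are equal, so counting workers alone gives the full node count; they diverge only when a Master is also declared.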

alenawang added kind/bug lifecycle/needs-triage labels Oct 8, 2024
