[Bug]: dstack chooses a fleet with too few instances #2221

Open
jvstme opened this issue Jan 23, 2025 · 0 comments
Labels: bug (Something isn't working), fleets, ssh-fleets

jvstme commented Jan 23, 2025

Steps to reproduce

  1. Create a project without cloud backends.

    The same can be reproduced with cloud fleets; see Additional information below.

  2. Get an on-prem fleet with one instance and another fleet with two instances.

    > dstack fleet
     FLEET      INSTANCE  BACKEND       RESOURCES                  PRICE  STATUS  CREATED    
     on-prem-2  0         ssh (remote)  2xCPU, 1GB, 35.2GB (disk)  $0.0   idle    1 hour ago 
                1         ssh (remote)  2xCPU, 1GB, 35.2GB (disk)  $0.0   idle    1 hour ago 
     on-prem-1  0         ssh (remote)  2xCPU, 1GB, 35.1GB (disk)  $0.0   idle    7 mins ago
  3. Try running a task with two nodes or a service with two replicas.

    type: service
    replicas: 2
    port: 12345
    commands:
      - sleep infinity
    resources:
      memory: 0.5GB..
      disk: 10GB..

Actual behaviour

dstack may assign the run to the fleet with one instance. The second job will then fail because the fleet does not have enough instances.

> dstack apply                       

 #  BACKEND  REGION  INSTANCE  RESOURCES                  SPOT  PRICE       
 1  ssh      remote  instance  2xCPU, 1GB, 35.2GB (disk)  no    $0     idle 
 2  ssh      remote  instance  2xCPU, 1GB, 35.2GB (disk)  no    $0     idle 
 3  ssh      remote  instance  2xCPU, 1GB, 35.1GB (disk)  no    $0     idle 

Submit a new run? [y/n]: y
 NAME               BACKEND       INSTANCE  RESOURCES                  RESERVATION  PRICE  STATUS      SUBMITTED  ERROR                              
 happy-pug-1                                                                               failed      22:24      JOB_FAILED                         
   replica=0 job=0  ssh (remote)  instance  2xCPU, 1GB, 35.1GB (disk)               $0.0   terminated  22:24      TERMINATED_BY_SERVER               
   replica=1 job=0                                                                         failed      22:24      FAILED_TO_START_DUE_TO_NO_CAPACITY

Sometimes dstack chooses the correct fleet, so you may need to re-create one of the fleets a few times before the issue reproduces.

Expected behaviour

dstack chooses the fleet with two instances and both jobs are provisioned successfully.

If there are no fleets with enough capacity, dstack shows no offers and the run fails before submitting the jobs.
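
The expected selection rule can be sketched as follows. This is only an illustration of the behaviour described above, not dstack's actual implementation: the `Fleet` class, the `choose_fleet` function, and the tightest-fit tie-break are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Fleet:
    # Hypothetical model for illustration; not dstack's data structure.
    name: str
    idle_instances: int


def choose_fleet(fleets: list[Fleet], required_instances: int) -> Optional[Fleet]:
    """Return a fleet that can host every job of the run, or None."""
    # Only fleets with enough idle instances for the whole run qualify.
    candidates = [f for f in fleets if f.idle_instances >= required_instances]
    if not candidates:
        # No fleet has enough capacity: show no offers, do not submit jobs.
        return None
    # Assumed tie-break: prefer the tightest fit to keep larger fleets free.
    return min(candidates, key=lambda f: f.idle_instances)


# Fleets from the reproduction steps: two idle instances and one idle instance.
fleets = [Fleet("on-prem-2", 2), Fleet("on-prem-1", 1)]

chosen = choose_fleet(fleets, required_instances=2)
print(chosen.name if chosen else "no offers")
```

With the fleets from the reproduction steps, a run with two replicas selects `on-prem-2`, and a run that needs three instances gets no fleet at all.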

dstack version

0.18.36

Server logs

Additional information

The same can be reproduced with cloud fleets using --reuse.

> dstack fleet
 FLEET    INSTANCE  BACKEND           RESOURCES                          PRICE    STATUS  CREATED    
 cloud-1  0         aws (eu-north-1)  2xCPU, 8GB, 100.0GB (disk), SPOT   $0.029   idle    2 mins ago 
 cloud-2  0         aws (eu-north-1)  4xCPU, 16GB, 100.0GB (disk), SPOT  $0.0603  idle    1 min ago  
          1         aws (eu-north-1)  4xCPU, 16GB, 100.0GB (disk), SPOT  $0.0603  idle    1 min ago  

> dstack apply --reuse                   

 #  BACKEND  REGION      INSTANCE   RESOURCES                    SPOT  PRICE         
 1  aws      eu-north-1  m5.large   2xCPU, 8GB, 100.0GB (disk)   yes   $0.029   idle 
 2  aws      eu-north-1  m5.xlarge  4xCPU, 16GB, 100.0GB (disk)  yes   $0.0603  idle 
 3  aws      eu-north-1  m5.xlarge  4xCPU, 16GB, 100.0GB (disk)  yes   $0.0603  idle 

Submit a new run? [y/n]: y
 NAME               BACKEND           INSTANCE  RESOURCES                      RESERVATION  PRICE   STATUS      SUBMITTED  ERROR                              
 fuzzy-fish-1                                                                                       failed      22:58      JOB_FAILED                         
   replica=0 job=0  aws (eu-north-1)  m5.large  2xCPU, 8GB, 100.0GB (disk),                 $0.029  terminated  22:58      TERMINATED_BY_SERVER               
                                                SPOT                                                                                                          
   replica=1 job=0                                                                                  failed      22:58      FAILED_TO_START_DUE_TO_NO_CAPACITY

jvstme added the bug, fleets, and ssh-fleets labels on Jan 23, 2025