diff --git a/docs/docs/concepts/fleets.md b/docs/docs/concepts/fleets.md
index 225cfb777..336a48370 100644
--- a/docs/docs/concepts/fleets.md
+++ b/docs/docs/concepts/fleets.md
@@ -38,23 +38,19 @@ Define a fleet configuration as a YAML file in your project directory. The file
 
-#### Placement
+#### Placement { #cloud-placement }
 
 To ensure instances are interconnected (e.g., for [distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`.
 This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity
 
 ??? info "AWS"
-    `dstack` automatically enables [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}
-    for the instance types that support it:
-    `p5.48xlarge`, `p4d.24xlarge`, `g4dn.12xlarge`, `g4dn.16xlarge`, `g4dn.8xlarge`, `g4dn.metal`,
-    `g5.12xlarge`, `g5.16xlarge`, `g5.24xlarge`, `g5.48xlarge`, `g5.8xlarge`, `g6.12xlarge`,
-    `g6.16xlarge`, `g6.24xlarge`, `g6.48xlarge`, `g6.8xlarge`, and `gr6.8xlarge`.
-
+    `dstack` automatically enables the Elastic Fabric Adapter for all
+    [EFA-capable instance types :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types){:target="_blank"}.
     Currently, only one EFA interface is enabled per instance, regardless of its maximum capacity.
     This will change once [this issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1804){:target="_blank"} is resolved.
 
-> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, and `oci`
+> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, `oci`, and `vultr`
 > backends.
 
 #### Resources
@@ -245,7 +241,7 @@ Define a fleet configuration as a YAML file in your project directory. The file
 
 3. The user specified should have passwordless `sudo` access.
 
-#### Placement
+#### Placement { #ssh-placement }
 
 If the hosts are interconnected (i.e. share the same network), set `placement` to `cluster`.
 This is required if you'd like to use the fleet for [distributed tasks](tasks.md#distributed-tasks).
diff --git a/docs/docs/concepts/tasks.md b/docs/docs/concepts/tasks.md
index 16c612b2b..f3ef35230 100644
--- a/docs/docs/concepts/tasks.md
+++ b/docs/docs/concepts/tasks.md
@@ -71,7 +71,7 @@ application.
 
 By default, a task runs on a single node.
 However, you can run it on a cluster of nodes by specifying `nodes`.
 
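As an illustrative aside (not part of the diff): a run like the task below can only be scheduled on a fleet with `cluster` placement, per the fleets.md hunk above. A minimal fleet configuration might look like the following sketch; the fleet name, node count, and GPU size are assumptions, not values taken from the diff.

```yaml
type: fleet
# Illustrative name
name: my-cluster-fleet

# Two interconnected instances provisioned in the same backend and region
nodes: 2
placement: cluster

resources:
  gpu: 24GB
```

Applying such a file with `dstack apply` provisions the interconnected instances that a distributed task can then reuse.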
+
```yaml type: task @@ -81,33 +81,59 @@ name: train-distrib # The size of the cluster nodes: 2 -python: "3.10" +python: "3.12" -# Commands of the task +# Commands to run on each node commands: + - git clone https://github.com/pytorch/examples.git + - cd examples/distributed/ddp-tutorial-series - pip install -r requirements.txt - torchrun - --nproc_per_node=$DSTACK_GPUS_PER_NODE - --node_rank=$DSTACK_NODE_RANK + --nproc-per-node=$DSTACK_GPUS_PER_NODE + --node-rank=$DSTACK_NODE_RANK --nnodes=$DSTACK_NODES_NUM - --master_addr=$DSTACK_MASTER_NODE_IP - --master_port=8008 resnet_ddp.py - --num_epochs 20 + --master-addr=$DSTACK_MASTER_NODE_IP + --master-port=12345 + multinode.py 50 10 resources: gpu: 24GB + # Uncomment if using multiple GPUs + #shm_size: 24GB ```
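Another illustrative aside (not part of the diff): the surrounding page notes that `dstack` also works with `accelerate`, and the same `DSTACK_*` variables map onto its launcher flags. The sketch below is an assumption; `train.py`, the port, and the exact flags should be checked against the installed `accelerate` version.

```yaml
commands:
  - pip install accelerate
  # Total processes = all GPUs across all nodes; one machine rank per node
  - accelerate launch
    --multi_gpu
    --num_machines $DSTACK_NODES_NUM
    --num_processes $DSTACK_GPUS_NUM
    --machine_rank $DSTACK_NODE_RANK
    --main_process_ip $DSTACK_MASTER_NODE_IP
    --main_process_port 12345
    train.py
```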
-All you need to do is pass the corresponding environment variables such as
-`DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, `DSTACK_NODES_NUM`,
-`DSTACK_MASTER_NODE_IP`, and `DSTACK_GPUS_NUM` (see [System environment variables](#system-environment-variables)).
+Nodes can communicate using their private IP addresses.
+Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODE_RANK`, and other
+[System environment variables](#system-environment-variables)
+to discover IP addresses and other details.
+
+??? info "Network interface"
+    Distributed frameworks usually detect the correct network interface automatically,
+    but sometimes you need to specify it explicitly.
+
+    For example, with PyTorch and the NCCL backend, you may need
+    to add these commands to tell NCCL to use the private interface:
+
+    ```yaml
+    commands:
+      - apt-get install -y iproute2
+      - >
+        if [[ $DSTACK_NODE_RANK == 0 ]]; then
+          export NCCL_SOCKET_IFNAME=$(ip -4 -o addr show | fgrep $DSTACK_MASTER_NODE_IP | awk '{print $2}')
+        else
+          export NCCL_SOCKET_IFNAME=$(ip route get $DSTACK_MASTER_NODE_IP | sed -E 's/.*?dev (\S+) .*/\1/;t;d')
+        fi
+      # ... The rest of the commands
+    ```
 
 !!! info "Fleets"
-    To ensure all nodes are provisioned into a cluster placement group and to enable the highest level of inter-node
-    connectivity (incl. support for [EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}),
-    create a [fleet](fleets.md) via a configuration before running a disstributed task.
+    Distributed tasks can only run on fleets with
+    [cluster placement](fleets.md#cloud-placement).
+    While `dstack` can provision such fleets automatically, it is
+    recommended to create them via a fleet configuration
+    to ensure the highest level of inter-node connectivity.
 
 `dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and any other distributed frameworks.
@@ -303,7 +329,7 @@ If you don't assign a value to an environment variable (see `HF_TOKEN` above),
     | `DSTACK_NODES_NUM`      | The number of nodes in the run |
     | `DSTACK_GPUS_PER_NODE`  | The number of GPUs per node |
     | `DSTACK_NODE_RANK`      | The rank of the node |
-    | `DSTACK_MASTER_NODE_IP` | The internal IP address the master node |
+    | `DSTACK_MASTER_NODE_IP` | The internal IP address of the master node |
     | `DSTACK_NODES_IPS`      | The list of internal IP addresses of all nodes delimited by "\n" |
 
 ### Spot policy
diff --git a/docs/docs/reference/environment-variables.md b/docs/docs/reference/environment-variables.md
index e94c5cf44..319bbbe0d 100644
--- a/docs/docs/reference/environment-variables.md
+++ b/docs/docs/reference/environment-variables.md
@@ -45,31 +45,33 @@ tasks, and services:
 - `DSTACK_NODES_NUM`{ #DSTACK_NODES_NUM } – The number of nodes in the run
 - `DSTACK_GPUS_PER_NODE`{ #DSTACK_GPUS_PER_NODE } – The number of GPUs per node
 - `DSTACK_NODE_RANK`{ #DSTACK_NODE_RANK } – The rank of the node
-- `DSTACK_NODE_RANK`{ #DSTACK_NODE_RANK } – The internal IP address the master node.
+- `DSTACK_MASTER_NODE_IP`{ #DSTACK_MASTER_NODE_IP } – The internal IP address of the master node.
 
-    Below is an example of using `DSTACK_NODES_NUM`, `DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, and `DSTACK_NODE_RANK`
+    Below is an example of using `DSTACK_NODES_NUM`, `DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, and `DSTACK_MASTER_NODE_IP`
     for distributed training:
 
     ```yaml
-    type: task
-    name: train-distrib
-
-    # The number of instances in the cluster
-    nodes: 2
-
-    python: "3.10"
-    commands:
-      - pip install -r requirements.txt
-      - torchrun
-        --nproc_per_node=$DSTACK_GPUS_PER_NODE
-        --node_rank=$DSTACK_NODE_RANK
-        --nnodes=$DSTACK_NODES_NUM
-        --master_addr=$DSTACK_MASTER_NODE_IP
-        --master_port=8008
-        resnet_ddp.py --num_epochs 20
-
-    resources:
-      gpu: 24GB
+    type: task
+    name: train-distrib
+
+    nodes: 2
+    python: "3.12"
+
+    commands:
+      - git clone https://github.com/pytorch/examples.git
+      - cd examples/distributed/ddp-tutorial-series
+      - pip install -r requirements.txt
+      - torchrun
+        --nproc-per-node=$DSTACK_GPUS_PER_NODE
+        --node-rank=$DSTACK_NODE_RANK
+        --nnodes=$DSTACK_NODES_NUM
+        --master-addr=$DSTACK_MASTER_NODE_IP
+        --master-port=12345
+        multinode.py 50 10
+
+    resources:
+      gpu: 24GB
+      shm_size: 24GB
     ```
 
 - `DSTACK_NODES_IPS`{ #DSTACK_NODES_IPS } – The list of internal IP addresses of all nodes delimited by `"\n"`.
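A final illustrative aside (not part of the diff): because `DSTACK_NODES_IPS` is newline-delimited, a task can write it straight to a hostfile for tools that expect one host per line. The task name and file path below are arbitrary placeholders.

```yaml
type: task
# Illustrative name
name: write-hostfile

nodes: 2

commands:
  # Quoting preserves the newlines between node IPs
  - echo "$DSTACK_NODES_IPS" > hostfile
  - cat hostfile
```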