[Docs]: Update Distributed Tasks docs (#2212)
- Mention Vultr.
- Use an example from the official PyTorch repo.
  The previous example did not specify which repo
  to use.
- Add an example of detecting the private network
  interface.
- Clarify some statements.
jvstme authored Jan 24, 2025
1 parent 372ae41 commit f148a3e
Showing 3 changed files with 69 additions and 45 deletions.
14 changes: 5 additions & 9 deletions docs/docs/concepts/fleets.md
@@ -38,23 +38,19 @@ Define a fleet configuration as a YAML file in your project directory. The file

</div>

#### Placement
#### Placement { #cloud-placement }

To ensure instances are interconnected (e.g., for
[distributed tasks](tasks.md#distributed-tasks)), set `placement` to `cluster`.
This ensures all instances are provisioned in the same backend and region with optimal inter-node connectivity.

??? info "AWS"
`dstack` automatically enables [Elastic Fabric Adapter :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}
for the instance types that support it:
`p5.48xlarge`, `p4d.24xlarge`, `g4dn.12xlarge`, `g4dn.16xlarge`, `g4dn.8xlarge`, `g4dn.metal`,
`g5.12xlarge`, `g5.16xlarge`, `g5.24xlarge`, `g5.48xlarge`, `g5.8xlarge`, `g6.12xlarge`,
`g6.16xlarge`, `g6.24xlarge`, `g6.48xlarge`, `g6.8xlarge`, and `gr6.8xlarge`.

`dstack` automatically enables the Elastic Fabric Adapter for all
[EFA-capable instance types :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types){:target="_blank"}.
Currently, only one EFA interface is enabled per instance, regardless of its maximum capacity.
This will change once [this issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1804){:target="_blank"} is resolved.

> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, and `oci`
> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, `oci`, and `vultr`
> backends.
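
For example, a cloud fleet with cluster placement could be defined as follows. This is a minimal sketch: the fleet name, node count, and resources are placeholders, and the configuration is assumed to be saved as `fleet.dstack.yml`.

<div editor-title="fleet.dstack.yml">

```yaml
type: fleet
name: my-cluster-fleet

# The number of interconnected instances to provision
nodes: 2
# Provision all instances in the same backend and region
placement: cluster

resources:
  gpu: 24GB
```

</div>

Applying this configuration with `dstack apply` provisions both instances as a single cluster.
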
#### Resources
@@ -245,7 +241,7 @@ Define a fleet configuration as a YAML file in your project directory. The file

3.&nbsp;The user specified should have passwordless `sudo` access.

#### Placement
#### Placement { #ssh-placement }

If the hosts are interconnected (i.e. share the same network), set `placement` to `cluster`.
This is required if you'd like to use the fleet for [distributed tasks](tasks.md#distributed-tasks).
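
For reference, an SSH fleet with cluster placement might look like the sketch below, where the host addresses, user, and key path are placeholders:

<div editor-title="ssh-fleet.dstack.yml">

```yaml
type: fleet
name: my-ssh-fleet

placement: cluster

# Hosts that share the same private network
ssh_config:
  user: ubuntu
  identity_file: ~/.ssh/id_rsa
  hosts:
    - 3.255.177.51
    - 3.255.177.52
```

</div>
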
56 changes: 41 additions & 15 deletions docs/docs/concepts/tasks.md
@@ -71,7 +71,7 @@ application.
By default, a task runs on a single node.
However, you can run it on a cluster of nodes by specifying `nodes`.

<div editor-title="examples/fine-tuning/train.dstack.yml">
<div editor-title="train.dstack.yml">

```yaml
type: task
@@ -81,33 +81,59 @@ name: train-distrib
# The size of the cluster
nodes: 2
python: "3.10"
python: "3.12"
# Commands of the task
# Commands to run on each node
commands:
- git clone https://github.com/pytorch/examples.git
- cd examples/distributed/ddp-tutorial-series
- pip install -r requirements.txt
- torchrun
--nproc_per_node=$DSTACK_GPUS_PER_NODE
--node_rank=$DSTACK_NODE_RANK
--nproc-per-node=$DSTACK_GPUS_PER_NODE
--node-rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master_addr=$DSTACK_MASTER_NODE_IP
--master_port=8008 resnet_ddp.py
--num_epochs 20
--master-addr=$DSTACK_MASTER_NODE_IP
--master-port=12345
multinode.py 50 10
resources:
gpu: 24GB
# Uncomment if using multiple GPUs
#shm_size: 24GB
```

</div>

All you need to do is pass the corresponding environment variables such as
`DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, `DSTACK_NODES_NUM`,
`DSTACK_MASTER_NODE_IP`, and `DSTACK_GPUS_NUM` (see [System environment variables](#system-environment-variables)).
Nodes can communicate using their private IP addresses.
Use `DSTACK_MASTER_NODE_IP`, `DSTACK_NODE_RANK`, and other
[System environment variables](#system-environment-variables)
to discover IP addresses and other details.

??? info "Network interface"
Distributed frameworks usually detect the correct network interface automatically,
but sometimes you need to specify it explicitly.

For example, with PyTorch and the NCCL backend, you may need
to add these commands to tell NCCL to use the private interface:

```yaml
commands:
- apt-get install -y iproute2
- >
if [[ $DSTACK_NODE_RANK == 0 ]]; then
export NCCL_SOCKET_IFNAME=$(ip -4 -o addr show | fgrep $DSTACK_MASTER_NODE_IP | awk '{print $2}')
else
export NCCL_SOCKET_IFNAME=$(ip route get $DSTACK_MASTER_NODE_IP | sed -E 's/.*?dev (\S+) .*/\1/;t;d')
fi
# ... The rest of the commands
```

!!! info "Fleets"
To ensure all nodes are provisioned into a cluster placement group and to enable the highest level of inter-node
connectivity (incl. support for [EFA :material-arrow-top-right-thin:{ .external }](https://aws.amazon.com/hpc/efa/){:target="_blank"}),
create a [fleet](fleets.md) via a configuration before running a distributed task.
Distributed tasks can only run on fleets with
[cluster placement](fleets.md#cloud-placement).
While `dstack` can provision such fleets automatically, it is
recommended to create them via a fleet configuration
to ensure the highest level of inter-node connectivity.
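
For example, assuming the fleet configuration is saved as `fleet.dstack.yml` and the task as `train.dstack.yml` (hypothetical file names), the order of operations could be:

```shell
# Provision the cluster fleet first
dstack apply -f fleet.dstack.yml

# Then submit the distributed task, which runs on the fleet
dstack apply -f train.dstack.yml
```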

`dstack` is easy to use with `accelerate`, `torchrun`, Ray, Spark, and other distributed frameworks.
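
As an illustration, `accelerate` can consume the same environment variables through its launcher flags. The snippet below is a sketch only: the training script name (`train.py`) and the port are arbitrary placeholders.

```yaml
commands:
  - pip install accelerate
  # Map dstack's variables onto accelerate's multi-node launcher flags
  - accelerate launch
    --num_machines $DSTACK_NODES_NUM
    --machine_rank $DSTACK_NODE_RANK
    --num_processes $DSTACK_GPUS_NUM
    --main_process_ip $DSTACK_MASTER_NODE_IP
    --main_process_port 12345
    --multi_gpu
    train.py
```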

@@ -303,7 +329,7 @@ If you don't assign a value to an environment variable (see `HF_TOKEN` above),
| `DSTACK_NODES_NUM` | The number of nodes in the run |
| `DSTACK_GPUS_PER_NODE` | The number of GPUs per node |
| `DSTACK_NODE_RANK` | The rank of the node |
| `DSTACK_MASTER_NODE_IP` | The internal IP address the master node |
| `DSTACK_MASTER_NODE_IP` | The internal IP address of the master node |
| `DSTACK_NODES_IPS` | The list of internal IP addresses of all nodes delimited by "\n" |

### Spot policy
44 changes: 23 additions & 21 deletions docs/docs/reference/environment-variables.md
@@ -45,31 +45,33 @@ tasks, and services:
- `DSTACK_NODES_NUM`{ #DSTACK_NODES_NUM } – The number of nodes in the run
- `DSTACK_GPUS_PER_NODE`{ #DSTACK_GPUS_PER_NODE } – The number of GPUs per node
- `DSTACK_NODE_RANK`{ #DSTACK_NODE_RANK } – The rank of the node
- `DSTACK_NODE_RANK`{ #DSTACK_NODE_RANK } – The internal IP address the master node.
- `DSTACK_MASTER_NODE_IP`{ #DSTACK_MASTER_NODE_IP } – The internal IP address of the master node.

Below is an example of using `DSTACK_NODES_NUM`, `DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, and `DSTACK_NODE_RANK`
Below is an example of using `DSTACK_NODES_NUM`, `DSTACK_GPUS_PER_NODE`, `DSTACK_NODE_RANK`, and `DSTACK_MASTER_NODE_IP`
for distributed training:

```yaml
type: task
name: train-distrib
# The number of instances in the cluster
nodes: 2
python: "3.10"
commands:
- pip install -r requirements.txt
- torchrun
--nproc_per_node=$DSTACK_GPUS_PER_NODE
--node_rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master_addr=$DSTACK_MASTER_NODE_IP
--master_port=8008
resnet_ddp.py --num_epochs 20
resources:
gpu: 24GB
type: task
name: train-distrib
nodes: 2
python: "3.12"
commands:
- git clone https://github.com/pytorch/examples.git
- cd examples/distributed/ddp-tutorial-series
- pip install -r requirements.txt
- torchrun
--nproc-per-node=$DSTACK_GPUS_PER_NODE
--node-rank=$DSTACK_NODE_RANK
--nnodes=$DSTACK_NODES_NUM
--master-addr=$DSTACK_MASTER_NODE_IP
--master-port=12345
multinode.py 50 10
resources:
gpu: 24GB
shm_size: 24GB
```

- `DSTACK_NODES_IPS`{ #DSTACK_NODES_IPS } – The list of internal IP addresses of all nodes delimited by `"\n"`.
