---
title: "g. Create Ray Cluster"
date: 2022-08-19
weight: 80
---
To create a Ray cluster, we need a `.yaml` file with the necessary configuration. A complete list of configuration options can be found in the Ray documentation.

Copy the following configuration to `cluster.yaml`:
```yaml
cluster_name: workshop

# The node config specifies the launch config and physical instance type.
available_node_types:
    ray.head:
        node_config:
            SubnetIds: [SUBNET]
            ImageId: AMI_ID
            IamInstanceProfile:
                Arn: RAY_HEAD_IAM_ROLE_ARN
            InstanceType: c5.2xlarge
    ray.worker.gpu:
        min_workers: 2
        max_workers: 2
        node_config:
            SubnetIds: [SUBNET]
            ImageId: AMI_ID
            IamInstanceProfile:
                Arn: RAY_WORKER_IAM_ROLE_ARN
            InstanceType: g4dn.2xlarge

head_node_type: ray.head

# Cloud-provider specific configuration.
provider:
    type: aws
    region: us-west-2
    # Availability zone(s), comma-separated, that nodes may be launched in.
    availability_zone: us-west-2a
    cache_stopped_nodes: False # If not present, the default is True.
    security_group:
        GroupName: ray-cluster-sg
    cloudwatch:
        agent:
            config: "cloudwatch-agent-config.json"

# How Ray will authenticate with newly launched nodes.
auth:
    ssh_user: ubuntu

# List of shell commands to run to set up nodes.
setup_commands:
    - FSXL_MOUNT_COMMAND

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
    - ray stop
    - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
```
| Placeholder | Replace With |
|---|---|
| SUBNET | subnet-xxxxxxxxxxxxxxxxx (a public subnet in the availability_zone specified in the .yaml file) |
| RAY_HEAD_IAM_ROLE_ARN | arn:aws:iam::xxxxxxxxxxxx:instance-profile/ray-head |
| RAY_WORKER_IAM_ROLE_ARN | arn:aws:iam::xxxxxxxxxxxx:instance-profile/ray-worker |
| FSXL_MOUNT_COMMAND | the sudo mount command from the FSxL console (see below) |
Use the ARNs for the instance profiles obtained in the last step of section b.
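If you no longer have those values handy, you can look them up again with the AWS CLI. A minimal sketch, assuming the instance profiles are named `ray-head` and `ray-worker` as in section b and the CLI is configured for `us-west-2`:

```bash
# Instance profile ARNs (names assumed from section b).
aws iam get-instance-profile --instance-profile-name ray-head \
    --query 'InstanceProfile.Arn' --output text
aws iam get-instance-profile --instance-profile-name ray-worker \
    --query 'InstanceProfile.Arn' --output text

# Candidate subnets in us-west-2a for the SUBNET placeholder;
# pick a public one (MapPublicIpOnLaunch is True).
aws ec2 describe-subnets \
    --filters "Name=availability-zone,Values=us-west-2a" \
    --query 'Subnets[].[SubnetId,MapPublicIpOnLaunch]' --output table
```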
To get the mount command for FSxL, navigate to the ray-fsx file system we created in section c and click Attach. This opens an information pane. Copy the command from step 3 under Attach instructions and add it to the `.yaml` file in place of FSXL_MOUNT_COMMAND. It looks something like this:

```bash
sudo mount -t lustre -o noatime,flock fs-xxxxxxxxxxxxxxxxx.fsx.us-west-2.amazonaws.com@tcp:/xxxxxxxx /fsx
```
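Note that `mount` requires the mount point to exist. If the AMI you use does not already contain a `/fsx` directory, one option is to create it in the same setup command. A hedged sketch of what the `setup_commands` entry might then look like (the file system ID and mount name are placeholders):

```yaml
setup_commands:
    # Create the mount point if needed, then mount FSx for Lustre.
    - sudo mkdir -p /fsx && sudo mount -t lustre -o noatime,flock fs-xxxxxxxxxxxxxxxxx.fsx.us-west-2.amazonaws.com@tcp:/xxxxxxxx /fsx
```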
Before we can launch the cluster, we also have to install Ray in Cloud9:

```bash
pip install boto3 ray[default]
```
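To confirm the installation, you can check the CLI version:

```bash
ray --version
```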
Now we're ready to spin up the cluster:

```bash
ray up -y cluster.yaml
```
The command exits once the head node is set up; the worker nodes are launched after that. Launching all the nodes takes 5-10 minutes in total.
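If you want to watch the autoscaler bring the workers up, you can tail its logs from Cloud9 with Ray's monitor subcommand:

```bash
ray monitor cluster.yaml
```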
To check the status of the cluster, we can log in to the head node using the following command:

```bash
ray attach cluster.yaml
```
And, from inside the head node, execute:

```bash
ray status
```
You should see output like this:
```text
======== Autoscaler status: 2022-08-27 20:48:19.055954 ========
Node status
---------------------------------------------------------------
Healthy:
 2 ray.worker.gpu
 1 ray.head
Pending:
 (no pending nodes)
Recent failures:
 (no failures)

Resources
---------------------------------------------------------------
Usage:
 0.0/24.0 CPU
 0.0/2.0 GPU
 0.0/2.0 accelerator_type:T4
 0.00/53.336 GiB memory
 0.00/22.405 GiB object_store_memory

Demands:
 (no resource demands)
```
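As an optional smoke test, you can also submit a trivial GPU task from the head node and check that it lands on the workers. A minimal sketch (the `gpu_hostname` function is ours, not part of the workshop):

```bash
python - <<'EOF'
import socket
import ray

# Connect to the running cluster from the head node.
ray.init(address="auto")

@ray.remote(num_gpus=1)
def gpu_hostname():
    # Runs on a node with a free GPU, i.e. one of the g4dn workers.
    return socket.gethostname()

# With two GPU workers, the two tasks should report two different hosts.
print(ray.get([gpu_hostname.remote() for _ in range(2)]))
EOF
```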