First draft of Ray workshop #285

Status: Open. Wants to merge 3 commits into base branch `frameworks`.
10 changes: 10 additions & 0 deletions content/05-hpc-ray-workshop/01-prerequisites.md
---
title: "a. Prerequisites"
date: 2022-08-18
weight: 20
---

- Set up VPC (TODO add link to instructions; temporary link: https://pr-282.db63t2jjt7llc.amplifyapp.com/05-batch-mnp-train-gpu/00-create-vpc-subnet.html)

- Set up Cloud9 Environment (TODO add link to instructions; temporary link: https://www.hpcworkshops.com/02-aws-getting-started/04-start_cloud9.html)

84 changes: 84 additions & 0 deletions content/05-hpc-ray-workshop/02-create-iam-roles.md
---
title: "b. Create IAM Roles"
date: 2022-08-18
weight: 30
tags: ["Ray", "IAM"]
---

By default, Ray creates an IAM role with some managed policies and attaches it to the head node at cluster creation time. No role is created for the worker nodes. To have more granular control over policies and permissions for both the head and worker nodes, we will create two IAM roles (ray-head, ray-worker) to be used at cluster creation time.

In order to create an IAM role from the command line, we need to define a trust policy. Save the following trust policy to a file named **ray_trust_policy.json**:

```json
{
"Version": "2008-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
```

Execute the following commands to create the IAM role for the head node along with an instance profile for it:

```bash
aws iam create-role --role-name ray-head --assume-role-policy-document file://ray_trust_policy.json
aws iam create-instance-profile --instance-profile-name ray-head
aws iam add-role-to-instance-profile --instance-profile-name ray-head --role-name ray-head
```

Next, execute the following commands to create the IAM role for the worker nodes along with an instance profile for it:

```bash
aws iam create-role --role-name ray-worker --assume-role-policy-document file://ray_trust_policy.json
aws iam create-instance-profile --instance-profile-name ray-worker
aws iam add-role-to-instance-profile --instance-profile-name ray-worker --role-name ray-worker
```

The head and worker nodes need permission to access S3, FSx for Lustre, and CloudWatch, so we will attach the relevant managed policies to both roles. In addition, the head node needs to be able to launch EC2 instances.

The following commands attach the necessary policies to the IAM role for the head node:

```bash
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess --role-name ray-head
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name ray-head
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess --role-name ray-head
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy --role-name ray-head
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonSSMFullAccess --role-name ray-head
```
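
The worker nodes need the same S3, FSx, and CloudWatch access described above. The workshop text implies these attachments but does not list the commands, so here is a minimal sketch for the **ray-worker** role (the EC2 and SSM policies are only needed on the head node):

```bash
# Attach the data-access and monitoring policies to the worker role
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name ray-worker
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess --role-name ray-worker
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy --role-name ray-worker
```
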
Apart from these managed policies, we also need to allow the head node to pass an IAM role to EC2 instances. Save the following policy to **ray_pass_role_policy.json**:

```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": [
"arn:aws:iam::*:role/ray-worker"
]
}
]
}
```

Now, attach this policy to the ray-head role:

```bash
aws iam put-role-policy --role-name ray-head --policy-name ray-pass-role-policy --policy-document file://ray_pass_role_policy.json
```
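
As an optional sanity check (not in the original instructions), you can list what is now attached to the head role:

```bash
# Managed policies attached to ray-head
aws iam list-attached-role-policies --role-name ray-head
# Inline policies (should include ray-pass-role-policy)
aws iam list-role-policies --role-name ray-head
```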

We will need the ARNs of the instance profiles for these roles later when creating the cluster. Execute the following to retrieve them:

```bash
aws iam get-instance-profile --instance-profile-name ray-head --output text --query 'InstanceProfile.Arn'
aws iam get-instance-profile --instance-profile-name ray-worker --output text --query 'InstanceProfile.Arn'
```
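
If you prefer, capture these ARNs in shell variables so they are easy to paste into the cluster configuration later (a small convenience sketch; the variable names are arbitrary):

```bash
# Save the instance-profile ARNs for use when filling in cluster.yaml
RAY_HEAD_PROFILE_ARN=$(aws iam get-instance-profile --instance-profile-name ray-head --output text --query 'InstanceProfile.Arn')
RAY_WORKER_PROFILE_ARN=$(aws iam get-instance-profile --instance-profile-name ray-worker --output text --query 'InstanceProfile.Arn')
echo "$RAY_HEAD_PROFILE_ARN" "$RAY_WORKER_PROFILE_ARN"
```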
20 changes: 20 additions & 0 deletions content/05-hpc-ray-workshop/03-security-groups.md
---
title: "c. Security Groups"
date: 2022-08-18
weight: 40
tags: ["Ray", "Security Groups"]
---

The default security group in a VPC does not allow SSH, which the head node needs in order to connect to the worker nodes. We also need a security group with the permissions required to mount the FSx filesystem. It is straightforward to create these security groups from the AWS EC2 console.

Create a security group with the following inbound rules and call it **ray-cluster-sg**:

![ray-cluster-sg-inbound-rules](/images/hpc-ray-workshop/ray-cluster-sg-inbound-rules.png)

Leave the outbound rules as default.

Next, create a security group for FSx for Lustre with the following inbound rules and call it **ray-fsx-sg**:

![ray-fsx-sg-inbound-rules](/images/hpc-ray-workshop/ray-fsx-sg-inbound-rules.png)

Again, leave the outbound rules as default.
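
If you prefer the CLI over the console, the sketch below shows roughly equivalent commands. It assumes your VPC ID is in `VPC_ID`, and that the inbound rules shown in the screenshots boil down to SSH within the cluster group and the standard Lustre ports (988 and 1018-1023) for the FSx group; adjust it to match the screenshots.

```bash
# Cluster security group: allow SSH between cluster nodes (self-referencing rule)
CLUSTER_SG=$(aws ec2 create-security-group --group-name ray-cluster-sg \
  --description "Ray cluster nodes" --vpc-id "$VPC_ID" --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id "$CLUSTER_SG" \
  --protocol tcp --port 22 --source-group "$CLUSTER_SG"

# FSx security group: Lustre traffic (TCP 988 and 1018-1023) from the cluster nodes
FSX_SG=$(aws ec2 create-security-group --group-name ray-fsx-sg \
  --description "FSx for Lustre" --vpc-id "$VPC_ID" --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id "$FSX_SG" \
  --protocol tcp --port 988 --source-group "$CLUSTER_SG"
aws ec2 authorize-security-group-ingress --group-id "$FSX_SG" \
  --protocol tcp --port 1018-1023 --source-group "$CLUSTER_SG"
```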
18 changes: 18 additions & 0 deletions content/05-hpc-ray-workshop/04-create-fsx.md
---
title: "d. Create FSx for Luster Filesystem"
date: 2022-08-18
weight: 50
tags: ["Ray", "FSx"]
---

We will use the AWS console to create the FSx for Lustre filesystem.

- Navigate to the Amazon FSx console and click Create file system
- Select Amazon FSx for Lustre and click Next
- For the file system name, enter **ray-fsx**
- Set the storage capacity to 1.2 TB and leave the other values as default
- Under Network & Security settings, select ray-vpc, ray-cluster-sg, and a subnet.

![ray-fsx-network-setting](/images/hpc-ray-workshop/ray-fsx-network-setting.png)

FSx for Lustre can only exist in a single Availability Zone, so we will spin up the Ray cluster in the same subnet as the one used here.
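
If you prefer the CLI over the console, a rough equivalent is sketched below. It assumes `SUBNET_ID` and `FSX_SG_ID` (the security group chosen above) are set, and it leaves the deployment type and other options at their defaults, as the console flow does:

```bash
# Create a 1.2 TB (1200 GiB) FSx for Lustre file system in the chosen subnet
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids "$SUBNET_ID" \
  --security-group-ids "$FSX_SG_ID" \
  --tags Key=Name,Value=ray-fsx
```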
70 changes: 70 additions & 0 deletions content/05-hpc-ray-workshop/05-create-ami.md
---
title: "e. Create AMI"
date: 2022-08-19
weight: 60
tags: ["Ray", "AMI"]
---

It’s preferable to create an AMI with the required packages pre-installed and use it for all cluster nodes. This makes it much faster to spin up a cluster compared to installing all the packages on the fly at cluster creation time.

Launch an EC2 instance from the AWS console:

- Select the AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) AMI to start with
- Select the g4dn.xlarge instance type
- Select the key pair you created earlier in this workshop
- Keep the rest of the settings as default and launch the instance

It takes a few minutes for the instance to become available. Once it is ready, SSH to it from the Cloud9 terminal using the private IP address of the instance you just created:

```bash
ssh -i your_key.pem ubuntu@<private-ip-address>
```
We are going to install the following packages on this instance:

- Anaconda
- Ray 2.0
- PyTorch
- FSx for Lustre client

Use the following commands to complete the setup.

Update the system:
```bash
sudo apt update && sudo apt upgrade -y
```

Set up conda:
```bash
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh -b -p $HOME/anaconda3
$HOME/anaconda3/bin/conda init
source ~/.bashrc
pip install --upgrade pip
```

Install ray:
```bash
pip install "ray[air]==2.0"
```

Install PyTorch:
```bash
conda install -y pytorch torchvision cudatoolkit=11.6 -c pytorch -c conda-forge
```
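
Before baking the AMI, it is worth a quick check that PyTorch sees the GPU (an optional verification step, not part of the original instructions):

```bash
# Should print the PyTorch version followed by True on the g4dn instance
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```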

Install the FSx for Lustre client (Ubuntu 20.04):
```bash
wget -O - https://fsx-lustre-client-repo-public-keys.s3.amazonaws.com/fsx-ubuntu-public-key.asc | gpg --dearmor | sudo tee /usr/share/keyrings/fsx-ubuntu-public-key.gpg >/dev/null
sudo bash -c 'echo "deb [signed-by=/usr/share/keyrings/fsx-ubuntu-public-key.gpg] https://fsx-lustre-client-repo.s3.amazonaws.com/ubuntu focal main" > /etc/apt/sources.list.d/fsxlustreclientrepo.list && apt-get update'
sudo apt install -y lustre-client-modules-$(uname -r)
```
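
Optionally, confirm that the Lustre client module loads against the running kernel:

```bash
# Load the Lustre client module and check that it is present
sudo modprobe lustre
lsmod | grep lustre
```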

Create a mount point for FSx for Lustre:
```bash
sudo mkdir /fsx
sudo chmod 777 /fsx
```

After this setup, exit the instance.

Navigate to the EC2 console, select the instance, and click the Actions button. From the dropdown, select Image and templates, then click Create image. In the Create image wizard, just provide a name and description and click Create image. It can take up to 10 minutes to create the AMI. To check the progress, click AMIs under Images in the left pane; you will see the new AMI in the list. Once the status of the AMI changes from Pending to Available, terminate the EC2 instance. Note that we will need the AMI ID later.
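
The same image can be created from the CLI if you still have the instance ID at hand (a sketch; `INSTANCE_ID` and the image name are placeholders):

```bash
# Create an AMI from the prepared instance and print the new AMI ID
aws ec2 create-image --instance-id "$INSTANCE_ID" \
  --name ray-workshop-ami --description "Ray workshop base image"
# Poll its state; wait for "available" before terminating the instance
aws ec2 describe-images --owners self \
  --filters Name=name,Values=ray-workshop-ami \
  --query 'Images[0].State' --output text
```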
48 changes: 48 additions & 0 deletions content/05-hpc-ray-workshop/06-cloudwatch-agent.md
---
title: "f. Set up CloudWatch metrics"
date: 2022-08-19
weight: 70
tags: ["Ray", "CloudWatch"]
---

The AWS CloudWatch agent is already installed on the Ubuntu AMI we used in the previous step. To set up CloudWatch for the Ray cluster, we need to specify all the metrics we wish to send to CloudWatch in a JSON config file. Save the following JSON to **cloudwatch-agent-config.json**:

```json
{
"agent": {
"metrics_collection_interval": 10,
"run_as_user": "root"
},
"metrics": {
"namespace": "ray-{cluster_name}-CWAgent",
"append_dimensions": {
"InstanceId": "${aws:InstanceId}"
},
"metrics_collected": {
"cpu": {
"measurement": [
"usage_active",
"usage_system",
"usage_user"
]
},
"nvidia_gpu": {
"measurement": [
"utilization_gpu",
"utilization_memory",
"memory_used"
],
"metrics_collection_interval": 10
},
"mem": {
"measurement": [
"mem_used_percent"
],
"metrics_collection_interval": 10
}
}
}
}
```

We will use this file in the cluster configuration.
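
Since a malformed config prevents the agent from starting, it can be worth validating the JSON before launching the cluster (a quick optional check):

```bash
# Fails with a parse error if the JSON is malformed
python3 -m json.tool cloudwatch-agent-config.json > /dev/null && echo "config OK"
```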
127 changes: 127 additions & 0 deletions content/05-hpc-ray-workshop/07-create-ray-cluster.md
---
title: "g. Create Ray Cluster"
date: 2022-08-19
weight: 80
tags: ["Ray", "Cluster"]
---

To create a Ray cluster, we need a .yaml file with the necessary configuration. A complete list of configuration options can be found [here](https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html).

Copy the following configuration to **cluster.yaml**:

```yaml
cluster_name: workshop

# The node config specifies the launch config and physical instance type.
available_node_types:
  ray.head:
    node_config:
      SubnetIds: [SUBNET]
      ImageId: AMI_ID
      IamInstanceProfile:
        Arn: RAY_HEAD_IAM_ROLE_ARN
      InstanceType: c5.2xlarge

  ray.worker.gpu:
    min_workers: 2
    max_workers: 2
    node_config:
      SubnetIds: [SUBNET]
      ImageId: AMI_ID
      IamInstanceProfile:
        Arn: RAY_WORKER_IAM_ROLE_ARN
      InstanceType: g4dn.2xlarge

head_node_type: ray.head

# Cloud-provider specific configuration.
provider:
  type: aws
  region: us-west-2
  # Availability zone(s), comma-separated, that nodes may be launched in.
  availability_zone: us-west-2a
  cache_stopped_nodes: False # If not present, the default is True.
  security_group:
    GroupName: ray-cluster-sg
  cloudwatch:
    agent:
      config: "cloudwatch-agent-config.json"

# How Ray will authenticate with newly launched nodes.
auth:
  ssh_user: ubuntu

# List of shell commands to run to set up nodes.
setup_commands:
  - FSXL_MOUNT_COMMAND

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
```

|Placeholder |Replace with |
|------------ |-------------- |
|SUBNET |subnet-xxxxxxxxxxxxxxxxx (public subnet for the availability_zone specified in the .yaml file) |
|RAY_HEAD_IAM_ROLE_ARN |arn:aws:iam::xxxxxxxxxxxx:instance-profile/ray-head |
|RAY_WORKER_IAM_ROLE_ARN |arn:aws:iam::xxxxxxxxxxxx:instance-profile/ray-worker |
|FSXL_MOUNT_COMMAND | sudo mount command from FSxL console (see below) |


Use the ARNs of the instance profiles obtained in the last step of section **b**.

To get the mount command for FSx for Lustre, navigate to the **ray-fsx** file system we created in section **d** and click Attach. This opens an information pane. Copy the command from step 3 under Attach instructions and add it to the .yaml file. It looks something like this:
```bash
sudo mount -t lustre -o noatime,flock fs-xxxxxxxxxxxxxxxxx.fsx.us-west-2.amazonaws.com@tcp:/xxxxxxxx /fsx
```

Before we can launch the cluster, we also have to install Ray in Cloud9:
```bash
pip install boto3 "ray[default]"
```

Now we're ready to spin up the cluster:
```bash
ray up -y cluster.yaml
```
The command exits once the head node is set up; the worker nodes are launched after that. The whole process takes 5-10 minutes to launch all the nodes.
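
To follow the autoscaler while it brings up the worker nodes, you can tail its logs from Cloud9 (optional; `ray monitor` is part of the Ray cluster launcher CLI):

```bash
# Stream the autoscaler log from the head node (Ctrl-C to stop)
ray monitor cluster.yaml
```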

To check the status of the cluster, we can log in to the head node using the following command:
```bash
ray attach cluster.yaml
```

And, from inside the head node, execute:
```bash
ray status
```

You should see output like this:
```bash
======== Autoscaler status: 2022-08-27 20:48:19.055954 ========
Node status
---------------------------------------------------------------
Healthy:
2 ray.worker.gpu
1 ray.head
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources
---------------------------------------------------------------
Usage:
0.0/24.0 CPU
0.0/2.0 GPU
0.0/2.0 accelerator_type:T4
0.00/53.336 GiB memory
0.00/22.405 GiB object_store_memory

Demands:
(no resource demands)
```