First draft of Ray workshop #285

Status: Open. Wants to merge 3 commits into base branch `frameworks`.
10 changes: 10 additions & 0 deletions content/05-hpc-ray-workshop/01-prerequisites.md
---
title: "a. Prerequisites"
date: 2022-08-18
weight: 20
---

- Set up VPC (TODO add link to instructions; temporary link: https://pr-282.db63t2jjt7llc.amplifyapp.com/05-batch-mnp-train-gpu/00-create-vpc-subnet.html)

- Set up Cloud9 Environment (TODO add link to instructions; temporary link: https://www.hpcworkshops.com/02-aws-getting-started/04-start_cloud9.html)

84 changes: 84 additions & 0 deletions content/05-hpc-ray-workshop/02-create-iam-roles.md
---
title: "b. Create IAM Roles"
date: 2022-08-18
weight: 30
tags: ["Ray", "IAM"]
---

By default, Ray creates an IAM role with some managed policies and attaches it to the head node at cluster creation time. No role is created for the worker nodes. To have more granular control over policies and permissions for both the head and worker nodes, we will create two IAM roles (ray-head, ray-worker) to be used at cluster creation time.

In order to create an IAM role from the command line, we need to define a trust policy. Save the following trust policy to a file named **ray_trust_policy.json**:

```json
{
"Version": "2008-10-17",
"Statement": [
{
"Effect": "Allow",
"Principal": {
"Service": "ec2.amazonaws.com"
},
"Action": "sts:AssumeRole"
}
]
}
```

Execute the following commands to create the IAM role for the head node along with an instance profile for it:

```bash
aws iam create-role --role-name ray-head --assume-role-policy-document file://ray_trust_policy.json
aws iam create-instance-profile --instance-profile-name ray-head
aws iam add-role-to-instance-profile --instance-profile-name ray-head --role-name ray-head
```

Next, execute the following commands to create the IAM role for the worker nodes along with an instance profile for it:

```bash
aws iam create-role --role-name ray-worker --assume-role-policy-document file://ray_trust_policy.json
aws iam create-instance-profile --instance-profile-name ray-worker
aws iam add-role-to-instance-profile --instance-profile-name ray-worker --role-name ray-worker
```

The head and worker nodes need permission to access S3, FSx for Lustre, and CloudWatch, so we will attach the relevant managed policies to both roles. In addition, the head node needs to be able to launch EC2 instances.

The following commands attach the necessary policies to the IAM role for the head node:

```bash
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess --role-name ray-head
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name ray-head
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess --role-name ray-head
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy --role-name ray-head
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonSSMFullAccess --role-name ray-head
```
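
The worker nodes need the same S3, FSx, and CloudWatch access described above. The workshop text implies these attachments but does not list the commands, so here is a minimal sketch for the **ray-worker** role (the EC2 and SSM policies are only needed on the head node):

```bash
# Attach the data-access and monitoring policies to the worker role
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name ray-worker
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess --role-name ray-worker
aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy --role-name ray-worker
```
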
Apart from these managed policies, we also need to allow the head node to pass an IAM role to EC2 instances. Save the following policy to **ray_pass_role_policy.json**:

```json
{
"Version": "2012-10-17",
"Statement": [
{
"Effect": "Allow",
"Action": [
"iam:PassRole"
],
"Resource": [
"arn:aws:iam::*:role/ray-worker"
]
}
]
}
```

Now, attach this policy to the ray-head role:

```bash
aws iam put-role-policy --role-name ray-head --policy-name ray-pass-role-policy --policy-document file://ray_pass_role_policy.json
```
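
As an optional sanity check (not in the original instructions), you can list what is now attached to the head role:

```bash
# Managed policies attached to ray-head
aws iam list-attached-role-policies --role-name ray-head
# Inline policies (should include ray-pass-role-policy)
aws iam list-role-policies --role-name ray-head
```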

We will need the ARNs of the instance profiles for these roles later when creating the cluster. Execute the following to retrieve them:

```bash
aws iam get-instance-profile --instance-profile-name ray-head --output text --query 'InstanceProfile.Arn'
aws iam get-instance-profile --instance-profile-name ray-worker --output text --query 'InstanceProfile.Arn'
```
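
If you prefer, capture these ARNs in shell variables so they are easy to paste into the cluster configuration later (a small convenience sketch; the variable names are arbitrary):

```bash
# Save the instance-profile ARNs for use when filling in cluster.yaml
RAY_HEAD_PROFILE_ARN=$(aws iam get-instance-profile --instance-profile-name ray-head --output text --query 'InstanceProfile.Arn')
RAY_WORKER_PROFILE_ARN=$(aws iam get-instance-profile --instance-profile-name ray-worker --output text --query 'InstanceProfile.Arn')
echo "$RAY_HEAD_PROFILE_ARN" "$RAY_WORKER_PROFILE_ARN"
```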
20 changes: 20 additions & 0 deletions content/05-hpc-ray-workshop/03-security-groups.md
---
title: "c. Security Groups"
date: 2022-08-18
weight: 40
tags: ["Ray", "Security Groups"]
---

The default security group in a VPC does not allow SSH, which the head node needs in order to connect to the worker nodes. We also need a security group with the permissions required to mount the FSx filesystem. It is straightforward to create these security groups from the AWS EC2 console.

Create a security group with the following inbound rules and call it **ray-cluster-sg**:

![ray-cluster-sg-inbound-rules](/images/hpc-ray-workshop/ray-cluster-sg-inbound-rules.png)

Leave the outbound rules as default.

Next, create a security group for FSx for Lustre with the following inbound rules and call it **ray-fsx-sg**:

![ray-fsx-sg-inbound-rules](/images/hpc-ray-workshop/ray-fsx-sg-inbound-rules.png)

Again, leave the outbound rules as default.
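
If you prefer the CLI over the console, the sketch below shows roughly equivalent commands. It assumes your VPC ID is in `VPC_ID`, and that the inbound rules shown in the screenshots boil down to SSH within the cluster group and the standard Lustre ports (988 and 1018-1023) for the FSx group; adjust it to match the screenshots.

```bash
# Cluster security group: allow SSH between cluster nodes (self-referencing rule)
CLUSTER_SG=$(aws ec2 create-security-group --group-name ray-cluster-sg \
  --description "Ray cluster nodes" --vpc-id "$VPC_ID" --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id "$CLUSTER_SG" \
  --protocol tcp --port 22 --source-group "$CLUSTER_SG"

# FSx security group: Lustre traffic (TCP 988 and 1018-1023) from the cluster nodes
FSX_SG=$(aws ec2 create-security-group --group-name ray-fsx-sg \
  --description "FSx for Lustre" --vpc-id "$VPC_ID" --query GroupId --output text)
aws ec2 authorize-security-group-ingress --group-id "$FSX_SG" \
  --protocol tcp --port 988 --source-group "$CLUSTER_SG"
aws ec2 authorize-security-group-ingress --group-id "$FSX_SG" \
  --protocol tcp --port 1018-1023 --source-group "$CLUSTER_SG"
```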
18 changes: 18 additions & 0 deletions content/05-hpc-ray-workshop/04-create-fsx.md
---
title: "d. Create FSx for Luster Filesystem"
date: 2022-08-18
weight: 50
tags: ["Ray", "FSx"]
---

We will use the AWS console to create the FSx for Lustre filesystem.

- Navigate to the Amazon FSx console and click Create file system
- Select Amazon FSx for Lustre and click Next
- For the file system name, enter **ray-fsx**
- Set the storage capacity to 1.2 TB and leave the other values as default
- Under Network & Security settings, select ray-vpc, ray-cluster-sg, and a subnet.

![ray-fsx-network-setting](/images/hpc-ray-workshop/ray-fsx-network-setting.png)

FSx for Lustre can only exist in a single Availability Zone, so we will spin up the Ray cluster in the same subnet as the one used here.
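
If you prefer the CLI over the console, a rough equivalent is sketched below. It assumes `SUBNET_ID` and `FSX_SG_ID` (the security group chosen above) are set, and it leaves the deployment type and other options at their defaults, as the console flow does:

```bash
# Create a 1.2 TB (1200 GiB) FSx for Lustre file system in the chosen subnet
aws fsx create-file-system \
  --file-system-type LUSTRE \
  --storage-capacity 1200 \
  --subnet-ids "$SUBNET_ID" \
  --security-group-ids "$FSX_SG_ID" \
  --tags Key=Name,Value=ray-fsx
```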
70 changes: 70 additions & 0 deletions content/05-hpc-ray-workshop/05-create-ami.md
---
title: "e. Create AMI"
date: 2022-08-19
weight: 60
tags: ["Ray", "AMI"]
---

It’s preferable to create an AMI with the required packages pre-installed and use it for all cluster nodes. This makes it much faster to spin up a cluster compared to installing all the packages on the fly at cluster creation time.

Launch an EC2 instance from the AWS console:

- Select the AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) AMI to start with
- Select the g4dn.xlarge instance type
- Select the key pair you created earlier in this workshop
- Keep the rest of the settings as default and launch the instance

It takes a few minutes for the instance to become available. Once it is ready, SSH to it from the Cloud9 terminal using the private IP address of the instance you just created:

```bash
ssh -i your_key.pem ubuntu@<private-ip-address>
```
We are going to install the following packages on this instance:

- Anaconda
- Ray 2.0
- PyTorch
- FSx for Lustre client

Use the following commands to complete the setup.

Update the system:
```bash
sudo apt update && sudo apt upgrade -y
```

Set up conda:
```bash
wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh
bash Anaconda3-2022.05-Linux-x86_64.sh -b -p $HOME/anaconda3
$HOME/anaconda3/bin/conda init
source ~/.bashrc
pip install --upgrade pip
```

Install ray:
```bash
pip install "ray[air]==2.0"
```

Install PyTorch:
```bash
conda install -y pytorch torchvision cudatoolkit=11.6 -c pytorch -c conda-forge
```
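
Before baking the AMI, it is worth a quick check that PyTorch sees the GPU (an optional verification step, not part of the original instructions):

```bash
# Should print the PyTorch version followed by True on the g4dn instance
python -c "import torch; print(torch.__version__, torch.cuda.is_available())"
```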

Install the FSx for Lustre client (Ubuntu 20.04):
```bash
wget -O - https://fsx-lustre-client-repo-public-keys.s3.amazonaws.com/fsx-ubuntu-public-key.asc | gpg --dearmor | sudo tee /usr/share/keyrings/fsx-ubuntu-public-key.gpg >/dev/null
sudo bash -c 'echo "deb [signed-by=/usr/share/keyrings/fsx-ubuntu-public-key.gpg] https://fsx-lustre-client-repo.s3.amazonaws.com/ubuntu focal main" > /etc/apt/sources.list.d/fsxlustreclientrepo.list && apt-get update'
sudo apt install -y lustre-client-modules-$(uname -r)
```
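
Optionally, confirm that the Lustre client module loads against the running kernel:

```bash
# Load the Lustre client module and check that it is present
sudo modprobe lustre
lsmod | grep lustre
```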

Create a mount point for FSx for Lustre:
```bash
sudo mkdir /fsx
sudo chmod 777 /fsx
```

After this setup, exit the instance.

Navigate to the EC2 console, select the instance, and click the Actions button. From the dropdown, select Image and templates, then click Create image. In the Create image wizard, just provide a name and description and click Create image. It can take up to 10 minutes to create the AMI. To check the progress, click AMIs under Images in the left pane; you will see the new AMI in the list. Once the status of the AMI changes from Pending to Available, terminate the EC2 instance. Note that we will need the AMI ID later.
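
The same image can be created from the CLI if you still have the instance ID at hand (a sketch; `INSTANCE_ID` and the image name are placeholders):

```bash
# Create an AMI from the prepared instance and print the new AMI ID
aws ec2 create-image --instance-id "$INSTANCE_ID" \
  --name ray-workshop-ami --description "Ray workshop base image"
# Poll its state; wait for "available" before terminating the instance
aws ec2 describe-images --owners self \
  --filters Name=name,Values=ray-workshop-ami \
  --query 'Images[0].State' --output text
```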
48 changes: 48 additions & 0 deletions content/05-hpc-ray-workshop/06-cloudwatch-agent.md
---
title: "f. Set up CloudWatch metrics"
date: 2022-08-19
weight: 70
tags: ["Ray", "CloudWatch"]
---

The AWS CloudWatch agent is already installed on the Ubuntu AMI we used in the previous step. To set up CloudWatch for the Ray cluster, we need to specify all the metrics we wish to send to CloudWatch in a JSON config file. Save the following JSON to **cloudwatch-agent-config.json**:

```json
{
"agent": {
"metrics_collection_interval": 10,
"run_as_user": "root"
},
"metrics": {
"namespace": "ray-{cluster_name}-CWAgent",
"append_dimensions": {
"InstanceId": "${aws:InstanceId}"
},
"metrics_collected": {
"cpu": {
"measurement": [
"usage_active",
"usage_system",
"usage_user"
]
},
"nvidia_gpu": {
"measurement": [
"utilization_gpu",
"utilization_memory",
"memory_used"
],
"metrics_collection_interval": 10
},
"mem": {
"measurement": [
"mem_used_percent"
],
"metrics_collection_interval": 10
}
}
}
}
```

We will use this file in the cluster configuration.
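
Since a malformed config prevents the agent from starting, it can be worth validating the JSON before launching the cluster (a quick optional check):

```bash
# Fails with a parse error if the JSON is malformed
python3 -m json.tool cloudwatch-agent-config.json > /dev/null && echo "config OK"
```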
127 changes: 127 additions & 0 deletions content/05-hpc-ray-workshop/07-create-ray-cluster.md
---
title: "g. Create Ray Cluster"
date: 2022-08-19
weight: 80
tags: ["Ray", "Cluster"]
---

To create a Ray cluster, we need a .yaml file with the necessary configuration. A complete list of configuration options can be found [here](https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html).

Copy the following configuration to **cluster.yaml**:

```yaml
cluster_name: workshop

# The node config specifies the launch config and physical instance type.
available_node_types:
  ray.head:
    node_config:
      SubnetIds: [SUBNET]
      ImageId: AMI_ID
      IamInstanceProfile:
        Arn: RAY_HEAD_IAM_ROLE_ARN
      InstanceType: c5.2xlarge

  ray.worker.gpu:
    min_workers: 2
    max_workers: 2
    node_config:
      SubnetIds: [SUBNET]
      ImageId: AMI_ID
      IamInstanceProfile:
        Arn: RAY_WORKER_IAM_ROLE_ARN
      InstanceType: g4dn.2xlarge

head_node_type: ray.head

# Cloud-provider specific configuration.
provider:
  type: aws
  region: us-west-2
  # Availability zone(s), comma-separated, that nodes may be launched in.
  availability_zone: us-west-2a
  cache_stopped_nodes: False # If not present, the default is True.
  security_group:
    GroupName: ray-cluster-sg
  cloudwatch:
    agent:
      config: "cloudwatch-agent-config.json"

# How Ray will authenticate with newly launched nodes.
auth:
  ssh_user: ubuntu

# List of shell commands to run to set up nodes.
setup_commands:
  - FSXL_MOUNT_COMMAND

# Command to start ray on the head node. You don't need to change this.
head_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml

# Command to start ray on worker nodes. You don't need to change this.
worker_start_ray_commands:
  - ray stop
  - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379
```

|Placeholder |Replace with |
|------------ |-------------- |
|SUBNET |subnet-xxxxxxxxxxxxxxxxx (public subnet for the availability_zone specified in the .yaml file) |
|RAY_HEAD_IAM_ROLE_ARN |arn:aws:iam::xxxxxxxxxxxx:instance-profile/ray-head |
|RAY_WORKER_IAM_ROLE_ARN |arn:aws:iam::xxxxxxxxxxxx:instance-profile/ray-worker |
|FSXL_MOUNT_COMMAND | sudo mount command from FSxL console (see below) |


Use the ARNs of the instance profiles obtained in the last step of section **b**.

To get the mount command for FSx for Lustre, navigate to the **ray-fsx** file system we created in section **d** and click Attach. This opens an information pane. Copy the command from step 3 under Attach instructions and add it to the .yaml file. It looks something like this:
```bash
sudo mount -t lustre -o noatime,flock fs-xxxxxxxxxxxxxxxxx.fsx.us-west-2.amazonaws.com@tcp:/xxxxxxxx /fsx
```

Before we can launch the cluster, we also have to install Ray in Cloud9:
```bash
pip install boto3 "ray[default]"
```

Now we're ready to spin up the cluster:
```bash
ray up -y cluster.yaml
```
The command exits once the head node is set up; the worker nodes are launched after that. The whole process takes 5-10 minutes to launch all the nodes.
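
To follow the autoscaler while it brings up the worker nodes, you can tail its logs from Cloud9 (optional; `ray monitor` is part of the Ray cluster launcher CLI):

```bash
# Stream the autoscaler log from the head node (Ctrl-C to stop)
ray monitor cluster.yaml
```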

To check the status of the cluster, we can log in to the head node using the following command:
```bash
ray attach cluster.yaml
```

And, from inside the head node, execute:
```bash
ray status
```

You should see output like this:
```bash
======== Autoscaler status: 2022-08-27 20:48:19.055954 ========
Node status
---------------------------------------------------------------
Healthy:
2 ray.worker.gpu
1 ray.head
Pending:
(no pending nodes)
Recent failures:
(no failures)

Resources
---------------------------------------------------------------
Usage:
0.0/24.0 CPU
0.0/2.0 GPU
0.0/2.0 accelerator_type:T4
0.00/53.336 GiB memory
0.00/22.405 GiB object_store_memory

Demands:
(no resource demands)
```