diff --git a/content/02-aws-getting-started/05-summary.md b/content/02-aws-getting-started/99-summary.md similarity index 100% rename from content/02-aws-getting-started/05-summary.md rename to content/02-aws-getting-started/99-summary.md diff --git a/content/05-hpc-ray-workshop/01-prerequisites.md b/content/05-hpc-ray-workshop/01-prerequisites.md new file mode 100644 index 00000000..4f3d5103 --- /dev/null +++ b/content/05-hpc-ray-workshop/01-prerequisites.md @@ -0,0 +1,10 @@ +--- +title: "a. Prerequisites" +date: 2022-08-18 +weight: 20 +--- + +- Set up VPC (TODO add link to instructions; temporary link: https://pr-282.db63t2jjt7llc.amplifyapp.com/05-batch-mnp-train-gpu/00-create-vpc-subnet.html) + +- Set up Cloud9 Environment (TODO add link to instructions; temporary link: https://www.hpcworkshops.com/02-aws-getting-started/04-start_cloud9.html) + diff --git a/content/05-hpc-ray-workshop/02-create-iam-roles.md b/content/05-hpc-ray-workshop/02-create-iam-roles.md new file mode 100644 index 00000000..f0f769f6 --- /dev/null +++ b/content/05-hpc-ray-workshop/02-create-iam-roles.md @@ -0,0 +1,84 @@ +--- +title: "b. Create IAM Roles" +date: 2022-08-18 +weight: 30 +tags: ["Ray", "IAM"] +--- + +By default, Ray creates an IAM role with some managed policies and attaches it to the head node at cluster creation time. No role is created for the worker nodes. However, to have more granular control over policies and permissions for both the head and worker nodes, we will create two IAM roles (ray-head, ray-worker) to be used at cluster creation time. + +In order to create an IAM role from the command line, we need to define a trust policy. 
Save the following trust policy to the **ray_trust_policy.json** file: + +```json +{ + "Version": "2008-10-17", + "Statement": [ + { + "Effect": "Allow", + "Principal": { + "Service": "ec2.amazonaws.com" + }, + "Action": "sts:AssumeRole" + } + ] +} +``` + +Execute the following commands to create the IAM role for the head node and also the instance profile for this role: + +```bash +aws iam create-role --role-name ray-head --assume-role-policy-document file://ray_trust_policy.json +aws iam create-instance-profile --instance-profile-name ray-head +aws iam add-role-to-instance-profile --instance-profile-name ray-head --role-name ray-head +``` + +Next, execute the following commands to create the IAM role for worker nodes and also the instance profile for this role: + +```bash +aws iam create-role --role-name ray-worker --assume-role-policy-document file://ray_trust_policy.json +aws iam create-instance-profile --instance-profile-name ray-worker +aws iam add-role-to-instance-profile --instance-profile-name ray-worker --role-name ray-worker +``` + +The head and worker nodes need permission to access S3, FSxL and CloudWatch, so we will attach the relevant managed policies to these roles. Apart from this, the head node also needs to be able to spin up EC2 instances. 
+ +The following commands attach the necessary policies to the IAM role for the head node: + +```bash +aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess --role-name ray-head +aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonS3FullAccess --role-name ray-head +aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonFSxFullAccess --role-name ray-head +aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy --role-name ray-head +aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AmazonSSMFullAccess --role-name ray-head +``` +Apart from these managed policies, we also need to give the head node permission to pass an IAM role to EC2 instances. Save the following policy to **ray_pass_role_policy.json**: + +```json +{ + "Version": "2012-10-17", + "Statement": [ + { + "Effect": "Allow", + "Action": [ + "iam:PassRole" + ], + "Resource": [ + "arn:aws:iam::*:role/ray-worker" + ] + } + ] +} +``` + +Now, attach this policy to the ray-head role: + +```bash +aws iam put-role-policy --role-name ray-head --policy-name ray-pass-role-policy --policy-document file://ray_pass_role_policy.json +``` + +We will need the ARNs of the instance profiles for these roles later when creating the cluster. Execute the following to get these ARNs: + +``` +aws iam get-instance-profile --instance-profile-name ray-head --output text --query 'InstanceProfile.Arn' +aws iam get-instance-profile --instance-profile-name ray-worker --output text --query 'InstanceProfile.Arn' +``` diff --git a/content/05-hpc-ray-workshop/03-security-groups.md b/content/05-hpc-ray-workshop/03-security-groups.md new file mode 100644 index 00000000..a20c9279 --- /dev/null +++ b/content/05-hpc-ray-workshop/03-security-groups.md @@ -0,0 +1,20 @@ +--- +title: "c. 
Security Groups" +date: 2022-08-18 +weight: 40 +tags: ["Ray", "Security Groups"] +--- + +The default security group in a VPC does not have the ssh permission that the head node needs to connect to the worker nodes. We also need a security group with permissions to mount the FSx filesystem. It is straightforward to create security groups from the AWS EC2 console. + +Create a security group with the following inbound rules and call it **ray-cluster-sg**: + +![ray-cluster-sg-inbound-rules](/images/hpc-ray-workshop/ray-cluster-sg-inbound-rules.png) + +Leave the outbound rules as default. + +Next, create a security group for FSxL with the following inbound/outbound rules and call it **ray-fsx-sg**: + +![ray-fsx-sg-inbound-rules](/images/hpc-ray-workshop/ray-fsx-sg-inbound-rules.png) + +Again, leave the outbound rules as default. diff --git a/content/05-hpc-ray-workshop/04-create-fsx.md b/content/05-hpc-ray-workshop/04-create-fsx.md new file mode 100644 index 00000000..11643976 --- /dev/null +++ b/content/05-hpc-ray-workshop/04-create-fsx.md @@ -0,0 +1,18 @@ +--- +title: "d. Create FSx for Lustre Filesystem" +date: 2022-08-18 +weight: 50 +tags: ["Ray", "FSx"] +--- + +We will use the AWS console to create FSxL. + +- Navigate to the Amazon FSx console and click on Create file system +- Select Amazon FSx for Lustre and click Next +- For filesystem name, choose **ray-fsx** +- Next set the storage capacity to 1.2 TB. Leave other values as default. +- Under Network & Security settings, select ray-vpc, ray-cluster-sg and a subnet. + +![ray-fsx-network-setting](/images/hpc-ray-workshop/ray-fsx-network-setting.png) + +FSxL can only exist in one Availability Zone. Therefore, we will spin up the Ray cluster in the same subnet as the one used here. 
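Because the cluster has to live in the same subnet as the file system, it can help to look that subnet up programmatically rather than copying it from the console. The following is a minimal sketch, not part of the workshop itself: the helper names `fsx_subnet_ids` and `fetch_fsx_subnet_ids` are hypothetical, it assumes the console stores the file system name as a `Name` tag, and it only reads the first page of `describe_file_systems` results.

```python
def fsx_subnet_ids(describe_response, name):
    """Return the subnet IDs of the file system whose Name tag equals `name`.

    `describe_response` is expected to have the shape returned by the FSx
    DescribeFileSystems API (a dict with a "FileSystems" list).
    """
    for fs in describe_response.get("FileSystems", []):
        # Tags arrive as a list of {"Key": ..., "Value": ...} dicts.
        tags = {t["Key"]: t["Value"] for t in fs.get("Tags", [])}
        if tags.get("Name") == name:
            return list(fs.get("SubnetIds", []))
    return []  # no file system with that name was found


def fetch_fsx_subnet_ids(name="ray-fsx"):
    """Query FSx with boto3 (assumed to be installed and configured)."""
    import boto3  # imported lazily so the pure helper above needs no AWS SDK

    response = boto3.client("fsx").describe_file_systems()
    return fsx_subnet_ids(response, name)
```

Calling `fetch_fsx_subnet_ids()` from Cloud9 should return a list such as `["subnet-..."]`, which is the value to reuse wherever the cluster configuration asks for a subnet.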
diff --git a/content/05-hpc-ray-workshop/05-create-ami.md b/content/05-hpc-ray-workshop/05-create-ami.md new file mode 100644 index 00000000..263c4eb6 --- /dev/null +++ b/content/05-hpc-ray-workshop/05-create-ami.md @@ -0,0 +1,70 @@ +--- +title: "e. Create AMI" +date: 2022-08-19 +weight: 60 +tags: ["Ray", "AMI"] +--- + +It’s preferable to create an AMI with the required packages pre-installed and use it for all cluster nodes. This makes it much faster to spin up a cluster compared to installing all the packages on the fly at the time of cluster creation. + +Launch an EC2 instance from the AWS console. + +- Select AWS Deep Learning Base AMI GPU CUDA 11 (Ubuntu 20.04) AMI to start with +- Select g4dn.xlarge instance type +- Select the key-pair you created earlier in this workshop +- Keep the rest of the settings as default and launch the instance + +It takes a few minutes for the instance to be available. Once the instance is ready, ssh to this instance from the Cloud9 terminal using the private IP address of the instance just created: + +```bash +ssh -i your_key.pem ubuntu@10.0.xxx.xxx +``` +We are going to install the following packages on this instance: + +- anaconda +- ray (2.0) +- PyTorch +- FSxL client + +Use the following list of commands to complete the setup: + +Update system: +```bash +sudo apt update && sudo apt upgrade -y +``` + +Set up conda: +```bash +wget https://repo.anaconda.com/archive/Anaconda3-2022.05-Linux-x86_64.sh +bash Anaconda3-2022.05-Linux-x86_64.sh -b -p $HOME/anaconda3 +./anaconda3/bin/conda init +source .bashrc +pip install --upgrade pip +``` + +Install ray: +```bash +pip install ray[air]==2.0 +``` + +Install PyTorch: +```bash +conda install -y pytorch torchvision cudatoolkit=11.6 -c pytorch -c conda-forge +``` + +Install FSx for Lustre client (Ubuntu 20.04): +```bash +wget -O - https://fsx-lustre-client-repo-public-keys.s3.amazonaws.com/fsx-ubuntu-public-key.asc | gpg --dearmor | sudo tee /usr/share/keyrings/fsx-ubuntu-public-key.gpg 
>/dev/null +sudo bash -c 'echo "deb [signed-by=/usr/share/keyrings/fsx-ubuntu-public-key.gpg] https://fsx-lustre-client-repo.s3.amazonaws.com/ubuntu focal main" > /etc/apt/sources.list.d/fsxlustreclientrepo.list && apt-get update' +sudo apt install -y lustre-client-modules-$(uname -r) +``` + +Create mount point for FSxL +```bash +sudo mkdir /fsx +sudo chmod 777 /fsx +``` + +After this setup, exit the instance. + +Navigate to the EC2 console, select the instance and click on Actions button. From the dropdown, select Image and templates, and click on Create image. In the create image wizard, just provide a name and description and click on Create image. This process can take up to 10 minutes to create an AMI. To check the progress, click AMIs under Images in the left pan. You will see new AMI in the list. Once the status of AMI changes to from Pending to Available, terminate the EC2 instance. Note that we will need the AMI id for later use. diff --git a/content/05-hpc-ray-workshop/06-cloudwatch-agent.md b/content/05-hpc-ray-workshop/06-cloudwatch-agent.md new file mode 100644 index 00000000..0620cbbb --- /dev/null +++ b/content/05-hpc-ray-workshop/06-cloudwatch-agent.md @@ -0,0 +1,48 @@ +--- +title: "f. Set up CloudWatch metrics" +date: 2022-08-19 +weight: 70 +tags: ["Ray", "CloudWatch"] +--- + +AWS CloudWatch agent is already installed on the ubuntu AMI we used in the previous step. To setup CloudWatch in Ray cluster, we need to specify the all the metrics we wish to send to the CloudWatch in a config file. This is done by creating a json file. 
Save the following JSON to **cloudwatch-agent-config.json**: + +```json +{ + "agent": { + "metrics_collection_interval": 10, + "run_as_user": "root" + }, + "metrics": { + "namespace": "ray-{cluster_name}-CWAgent", + "append_dimensions": { + "InstanceId": "${aws:InstanceId}" + }, + "metrics_collected": { + "cpu": { + "measurement": [ + "usage_active", + "usage_system", + "usage_user" + ] + }, + "nvidia_gpu": { + "measurement": [ + "utilization_gpu", + "utilization_memory", + "memory_used" + ], + "metrics_collection_interval": 10 + }, + "mem": { + "measurement": [ + "mem_used_percent" + ], + "metrics_collection_interval": 10 + } + } + } +} +``` + +We will use this file in the cluster configuration. diff --git a/content/05-hpc-ray-workshop/07-create-ray-cluster.md b/content/05-hpc-ray-workshop/07-create-ray-cluster.md new file mode 100644 index 00000000..85795b67 --- /dev/null +++ b/content/05-hpc-ray-workshop/07-create-ray-cluster.md @@ -0,0 +1,127 @@ +--- +title: "g. Create Ray Cluster" +date: 2022-08-19 +weight: 80 +tags: ["Ray", "Cluster"] +--- + +To create a Ray cluster, we need a .yaml file with the necessary configuration. A complete list of configuration options can be found [here](https://docs.ray.io/en/latest/cluster/vms/references/ray-cluster-configuration.html). + +Copy the following configuration to **cluster.yaml**: + +```yaml +cluster_name: workshop + +# The node config specifies the launch config and physical instance type. +available_node_types: + ray.head: + node_config: + SubnetIds: [SUBNET] + ImageId: AMI_ID + IamInstanceProfile: + Arn: RAY_HEAD_IAM_ROLE_ARN + InstanceType: c5.2xlarge + + ray.worker.gpu: + min_workers: 2 + max_workers: 2 + node_config: + SubnetIds: [SUBNET] + ImageId: AMI_ID + IamInstanceProfile: + Arn: RAY_WORKER_IAM_ROLE_ARN + InstanceType: g4dn.2xlarge + +head_node_type: ray.head +# Cloud-provider specific configuration. 
+provider: + type: aws + region: us-west-2 + # Availability zone(s), comma-separated, that nodes may be launched in. + availability_zone: us-west-2a + cache_stopped_nodes: False # If not present, the default is True. + security_group: + GroupName: ray-cluster-sg + cloudwatch: + agent: + config: "cloudwatch-agent-config.json" + +# How Ray will authenticate with newly launched nodes. +auth: + ssh_user: ubuntu + +# List of shell commands to run to set up nodes. +setup_commands: + - FSXL_MOUNT_COMMAND + +# Command to start ray on the head node. You don't need to change this. +head_start_ray_commands: + - ray stop + - ulimit -n 65536; ray start --head --port=6379 --autoscaling-config=~/ray_bootstrap_config.yaml +# Command to start ray on worker nodes. You don't need to change this. +worker_start_ray_commands: + - ray stop + - ulimit -n 65536; ray start --address=$RAY_HEAD_IP:6379 +``` + +|Placeholder |Replace With | +|------------ |-------------- | +|SUBNET |subnet-xxxxxxxxxxxxxxxxx (public subnet for the availability_zone specified in the .yaml file) | +|RAY_HEAD_IAM_ROLE_ARN |arn:aws:iam::xxxxxxxxxxxx:instance-profile/ray-head | +|RAY_WORKER_IAM_ROLE_ARN |arn:aws:iam::xxxxxxxxxxxx:instance-profile/ray-worker | +|FSXL_MOUNT_COMMAND | sudo mount command from FSxL console (see below) | + + +Use the ARNs for the instance profiles obtained in the last step of section **b**, and replace AMI_ID with the ID of the AMI created in section **e**. + +To get the mount command for FSxL, navigate to the **ray-fsx** file system we created in section **d** and click Attach. This will show an information pane. Copy the command from step 3 under Attach instructions and add it to the .yaml file. 
It looks something like this: +```bash +sudo mount -t lustre -o noatime,flock fs-xxxxxxxxxxxxxxxxx.fsx.us-west-2.amazonaws.com@tcp:/xxxxxxxx /fsx +``` + +Before we can launch the cluster, we also have to install ray in Cloud9: +```bash +pip install boto3 ray[default] +``` + +Now we're ready to spin up the cluster: +```bash +ray up -y cluster.yaml +``` +The command will exit once the head node is set up. The worker nodes are launched after that. The whole process takes 5-10 min to launch all the nodes. + +To check the status of the cluster, we can log in to the head node using the following command: +```bash +ray attach cluster.yaml +``` + +And, from inside the head node, execute: +```bash +ray status +``` + +You should see an output like this: +```bash +======== Autoscaler status: 2022-08-27 20:48:19.055954 ======== +Node status +--------------------------------------------------------------- +Healthy: + 2 ray.worker.gpu + 1 ray.head +Pending: + (no pending nodes) +Recent failures: + (no failures) + +Resources +--------------------------------------------------------------- +Usage: + 0.0/24.0 CPU + 0.0/2.0 GPU + 0.0/2.0 accelerator_type:T4 + 0.00/53.336 GiB memory + 0.00/22.405 GiB object_store_memory + +Demands: + (no resource demands) +``` diff --git a/content/05-hpc-ray-workshop/08-data-prep.md b/content/05-hpc-ray-workshop/08-data-prep.md new file mode 100644 index 00000000..0dfcae88 --- /dev/null +++ b/content/05-hpc-ray-workshop/08-data-prep.md @@ -0,0 +1,90 @@ +--- +title: "h. Prepare Training Data" +date: 2022-08-27 +weight: 90 +--- + +For the model training part of the workshop, we will use the Tiny ImageNet dataset, which consists of 100,000 images in 200 classes. We can download the data directly to the FSxL filesystem mounted on all the nodes in the cluster. We created the mount directory (`/fsx`) when creating the AMI in step **e**. 
Executing the following command will run wget on the head node to download the data to the `/fsx` directory: + +```bash +ray exec cluster.yaml 'wget http://cs231n.stanford.edu/tiny-imagenet-200.zip -P /fsx/' +``` + +Next, unzip the data file residing in the `/fsx` directory: + +```bash +ray exec cluster.yaml 'unzip -d /fsx /fsx/tiny-imagenet-200.zip && rm /fsx/tiny-imagenet-200.zip' +``` + +We can check the contents of the data directory by executing the `ls` command on the head node: + +```bash +ray exec cluster.yaml 'ls /fsx/tiny-imagenet-200' +``` + +In our training code, we will use the ImageFolder class from PyTorch to ingest this dataset. The ImageFolder class expects all the images to be stored in separate folders for each class. The structure should look like this: + +``` +. +|-- train +| |-- class1 +| | |-- image1.jpeg +| | |-- image2.jpeg +| | |-- image3.jpeg +. +| |-- class2 +| | |-- image1.jpeg +| | |-- image2.jpeg +| | |-- image3.jpeg +. +. +``` + +The `val` folder in the Tiny ImageNet dataset does not have this structure, so we have to rearrange the images in the val directory. This can be done by running a simple Python script. Copy the following code to the **data-prep.py** file: + +```python +import os +import ray + +def main(): + ray.init(address="auto") + + root_dir = '/fsx/tiny-imagenet-200/val/' + annotation_file = 'val_annotations.txt' + with open(root_dir + annotation_file) as f: + """ + lines in the val_annotations.txt file: + val_0.JPEG n03444034 0 32 44 62 + val_1.JPEG n04067472 52 55 57 59 + val_2.JPEG n04070727 4 0 60 55 + """ + lines = f.read().split('\n') + lines = lines[:-1] # last line is empty + + data = {} + for line in lines: + file, label = line.split('\t')[:2] + data[file] = label + + # create the directories. 
labels are the directory names + labels = set(data.values()) + for label in labels: + os.mkdir(root_dir + label) + + # move files from images folder to the new directories + for file in data: + src = root_dir + 'images/' + file + dst = root_dir + data[file] + '/' + file # root_dir already ends with '/' + os.replace(src, dst) + + os.rmdir(root_dir + 'images') + os.remove(root_dir + annotation_file) + +if __name__ == "__main__": + main() +``` + +Finally, execute this code on the ray cluster: +```bash +ray submit cluster.yaml data-prep.py +``` diff --git a/content/05-hpc-ray-workshop/09-train-model.md b/content/05-hpc-ray-workshop/09-train-model.md new file mode 100644 index 00000000..bf177ea2 --- /dev/null +++ b/content/05-hpc-ray-workshop/09-train-model.md @@ -0,0 +1,157 @@ +--- +title: "i. Train ResNet18 Model" +date: 2022-08-27 +weight: 100 +tags: ["PyTorch", "ResNet", "ResNet18"] +--- + +We will now train a ResNet18 model on the Tiny ImageNet dataset. The training will run on the two GPU worker nodes in the ray cluster launched in section **g**. + +The Python code for the training is mostly standard PyTorch model training code with some additional ray code to set up the distributed training on the ray cluster. The following is the complete code which we will run on the cluster. 
Create **train.py** file and copy the code to this file: + +(TODO move this code to a downloadable file) + +```python +import ray +from ray.train.torch import TorchTrainer +from ray.train.torch import prepare_data_loader, prepare_model +from ray.air.config import ScalingConfig +from ray.air import session + +import torch +from torch.utils.data import DataLoader +from torchvision import datasets, models, transforms + +def imagenet_data_creator(config): + train_transform = transforms.Compose([ + transforms.RandomHorizontalFlip(), + transforms.RandomResizedCrop(224), + transforms.ToTensor() + ]) + val_transform = transforms.Compose([ + transforms.Resize(256), + transforms.CenterCrop(224), + transforms.ToTensor() + ]) + + train_data = datasets.ImageFolder(config['traindir'], transform=train_transform) + val_data = datasets.ImageFolder(config['valdir'], transform=val_transform) + + train_loader = DataLoader( + train_data, + config['batch_size'], + num_workers=config['num_data_workers'], + pin_memory=True, + shuffle=True + ) + val_loader = DataLoader( + val_data, + config['batch_size'], + num_workers=config['num_data_workers'], + pin_memory=True + ) + return train_loader, val_loader + +def train_epoch(dataloader, model, loss_fn, optimizer): + size = len(dataloader.dataset) // session.get_world_size() + model.train() + + for batch, (X, y) in enumerate(dataloader): + # Compute prediction error + pred = model(X) + loss = loss_fn(pred, y) + # Backpropagation + optimizer.zero_grad() + loss.backward() + optimizer.step() + + if batch % 100 == 0: + loss, current = loss.item(), batch * len(X) + print(f"train loss: {loss:>5f}, batch [{current:>5d}/{size:>5d}]") + +def validate_epoch(dataloader, model, loss_fn): + size = len(dataloader.dataset) // session.get_world_size() + num_batches = len(dataloader) + model.eval() + test_loss, correct = 0, 0 + with torch.no_grad(): + for X, y in dataloader: + pred = model(X) + test_loss += loss_fn(pred, y).item() + correct += (pred.argmax(1) == 
y).type(torch.float).sum().item() + + test_loss /= num_batches + correct /= size + print(f"Test Accuracy: {(100 * correct):>0.1f}%, ", f"Avg loss: {test_loss:>8f} \n") + return test_loss + +def train_func(config): + # Create data loaders. + train_dataloader, val_dataloader = imagenet_data_creator(config) + train_dataloader = prepare_data_loader(train_dataloader) + val_dataloader = prepare_data_loader(val_dataloader) + + # Create model. + model = models.resnet18() + model = prepare_model(model) + + loss_fn = torch.nn.CrossEntropyLoss() + optimizer = torch.optim.SGD(model.parameters(), lr=config['lr']) + + loss_results = [] + for _ in range(config['epochs']): + train_epoch(train_dataloader, model, loss_fn, optimizer) + loss = validate_epoch(val_dataloader, model, loss_fn) + session.report({"loss": loss}) # report metrics through the Ray AIR session + loss_results.append(loss) + + return loss_results + +def main(config): + ray.init(address='auto', log_to_driver=True) + + scaling_config = ScalingConfig( + num_workers=config['num_ray_workers'], + use_gpu=config['use_gpu'], + resources_per_worker=config['resources_per_worker'] + ) + trainer = TorchTrainer( + train_func, + train_loop_config=config, + scaling_config=scaling_config + ) + + result = trainer.fit() + print(f"Loss results: {result}") + +if __name__ == "__main__": + config = { + # ray related config + 'num_ray_workers': 2, + 'use_gpu': True, + 'resources_per_worker': {'CPU': 4, 'GPU': 1}, + # pytorch related config + 'traindir':'/fsx/tiny-imagenet-200/train', + 'valdir': '/fsx/tiny-imagenet-200/val', + 'batch_size': 64, # per worker batch size + 'num_data_workers': 4, + 'lr': 1e-3, + 'epochs': 1, + } + + main(config) +``` +To submit this code to the ray cluster, execute the following: + +```bash +ray submit cluster.yaml train.py +``` + +This training job will take 7-8 min for one epoch. + +While the job is running, we can monitor GPU usage in CloudWatch. Navigate to the CloudWatch console and select All metrics from the left pane. 
You will find the `ray-workshop-CWAgent` namespace under Custom namespaces. Select this namespace and then click on the first group in the next step. You should see the following metrics for each GPU in the cluster: + +![cloudwatch_gpu_metrics](/images/hpc-ray-workshop/cloudwatch_gpu_metrics.png) + +Select the GPU utilization metrics from this list. Next, select the Graphed metrics tab and change the averaging time from 5 min to 10 sec. Keep in mind that it might take several minutes before the GPU utilization shows up in CloudWatch. It takes some time for the training to start on the cluster, and CloudWatch metrics are also delayed by almost one minute. + diff --git a/content/05-hpc-ray-workshop/10-cleanup.md b/content/05-hpc-ray-workshop/10-cleanup.md new file mode 100644 index 00000000..c6008717 --- /dev/null +++ b/content/05-hpc-ray-workshop/10-cleanup.md @@ -0,0 +1,17 @@ +--- +title: "j. Cleanup" +date: 2022-08-27 +weight: 110 +--- + +Once our training is finished, we can clean up all the resources. + +- Shut down the cluster using the following command: + +```bash +ray down -y cluster.yaml +``` + +- Navigate to the FSx console to delete the filesystem. Select the **ray-fsx** file system from the list, and click the Actions button. From the dropdown, select Delete file system. + +- Navigate to the EC2 console and click AMIs under Images in the left pane. Select **ray_workshop_ami** from the list and click the Actions button. From the dropdown, select Deregister AMI. diff --git a/content/05-hpc-ray-workshop/_index.md b/content/05-hpc-ray-workshop/_index.md new file mode 100644 index 00000000..533613c3 --- /dev/null +++ b/content/05-hpc-ray-workshop/_index.md @@ -0,0 +1,25 @@ +--- +title: "Ray Clusters on Amazon EC2" +date: 2022-08-18 +weight: 50 +pre: "Part III ⁃ " +tags: ["Ray", "Overview"] +--- + +#### Ray Cluster in a Nutshell + +Ray is a distributed computing platform that can be used to scale Python applications with minimal effort. 
It provides a unified way to scale Python and AI applications from a laptop to a cluster. It is designed to be general-purpose and can run any kind of workload. Ray consists of a core distributed runtime and a toolkit of libraries (Ray AIR) for simplifying ML compute. + +![ray-cluster-arch](/images/hpc-ray-workshop/ray_air.png) + +#### What you will do in this part of the lab + +In this workshop, you will learn how to set up a Ray cluster on [Amazon EC2](https://aws.amazon.com/ec2/), and train a [PyTorch](https://pytorch.org/) model. The workshop includes the following steps: + +- Create IAM roles to be used by the head and worker nodes in the cluster +- Set up security groups +- Create an AMI to be used by head and worker nodes in the cluster +- Set up [Amazon FSx for Lustre](https://aws.amazon.com/fsx/lustre/) (FSxL) filesystem +- Set up [Amazon CloudWatch](https://aws.amazon.com/pm/cloudwatch/) agent for resource monitoring +- Spin up Ray cluster +- Train PyTorch ResNet18 model on Tiny ImageNet dataset diff --git a/content/05-after-event/04-pcluster-stacks.md b/content/99-after-event/04-pcluster-stacks.md similarity index 100% rename from content/05-after-event/04-pcluster-stacks.md rename to content/99-after-event/04-pcluster-stacks.md diff --git a/content/05-after-event/05-connect-pcmanager.md b/content/99-after-event/05-connect-pcmanager.md similarity index 100% rename from content/05-after-event/05-connect-pcmanager.md rename to content/99-after-event/05-connect-pcmanager.md diff --git a/content/05-after-event/_index.md b/content/99-after-event/_index.md similarity index 100% rename from content/05-after-event/_index.md rename to content/99-after-event/_index.md diff --git a/static/images/hpc-ray-workshop/cloudwatch_gpu_metrics.png b/static/images/hpc-ray-workshop/cloudwatch_gpu_metrics.png new file mode 100755 index 00000000..f4f12bbc Binary files /dev/null and b/static/images/hpc-ray-workshop/cloudwatch_gpu_metrics.png differ diff --git 
a/static/images/hpc-ray-workshop/ray-cluster-sg-inbound-rules.png b/static/images/hpc-ray-workshop/ray-cluster-sg-inbound-rules.png new file mode 100755 index 00000000..ee157840 Binary files /dev/null and b/static/images/hpc-ray-workshop/ray-cluster-sg-inbound-rules.png differ diff --git a/static/images/hpc-ray-workshop/ray-cluster-sg-outbound-rules.png b/static/images/hpc-ray-workshop/ray-cluster-sg-outbound-rules.png new file mode 100755 index 00000000..c37c7ce4 Binary files /dev/null and b/static/images/hpc-ray-workshop/ray-cluster-sg-outbound-rules.png differ diff --git a/static/images/hpc-ray-workshop/ray-fsx-network-setting.png b/static/images/hpc-ray-workshop/ray-fsx-network-setting.png new file mode 100755 index 00000000..42b11a70 Binary files /dev/null and b/static/images/hpc-ray-workshop/ray-fsx-network-setting.png differ diff --git a/static/images/hpc-ray-workshop/ray-fsx-sg-inbound-rules.png b/static/images/hpc-ray-workshop/ray-fsx-sg-inbound-rules.png new file mode 100755 index 00000000..a83337e3 Binary files /dev/null and b/static/images/hpc-ray-workshop/ray-fsx-sg-inbound-rules.png differ diff --git a/static/images/hpc-ray-workshop/ray-fsx-sg-outbound-rules.png b/static/images/hpc-ray-workshop/ray-fsx-sg-outbound-rules.png new file mode 100755 index 00000000..a36353f6 Binary files /dev/null and b/static/images/hpc-ray-workshop/ray-fsx-sg-outbound-rules.png differ diff --git a/static/images/hpc-ray-workshop/ray_air.png b/static/images/hpc-ray-workshop/ray_air.png new file mode 100755 index 00000000..ec7ef3a3 Binary files /dev/null and b/static/images/hpc-ray-workshop/ray_air.png differ