diff --git a/ray_cluster_launchers/Readme.md b/ray_cluster_launchers/Readme.md
new file mode 100644
index 0000000..4e7ad73
--- /dev/null
+++ b/ray_cluster_launchers/Readme.md
@@ -0,0 +1,155 @@
+# Launching a Ray Cluster on AWS, Azure, and GCP
+
+
+
+## Preparation - install the Ray CLI
+Use pip to install the Ray CLI in your local environment:
+```
+# install ray
+pip install -U "ray[default]"
+```
+
+
+
+
+
+
+
+## Configure the Ray cluster launcher .yaml files for AWS, Azure, and GCP
+
+All launcher template .yaml files are modified from the official Ray cluster config files:
+
+[aws-example-full.yaml](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/aws/example-full.yaml), [azure-example-full.yaml](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/azure/example-full.yaml), and [gcp-example-full.yaml](https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/gcp/example-full.yaml)
+
+
+### A. Configure Ray Cluster on AWS at Emory
+
+
+1. Install and configure the [Emory TKI CLI](https://it.emory.edu/tki/).
+
+2. Go to the AWS Console and log in.
+
+3. Go to `EC2` > `Security Groups`, create a security group for the Ray cluster, and set `GroupName` at [line 50](./aws-ray-cluster-launcher-template.yaml#L50).
+
+4. Go to `EC2` > `Key Pairs`, create a key pair for the Ray cluster, and set `ssh_private_key` at [line 59](./aws-ray-cluster-launcher-template.yaml#L59) and `KeyName` at [line 84](./aws-ray-cluster-launcher-template.yaml#L84) and [line 118](./aws-ray-cluster-launcher-template.yaml#L118).
+
+5. Go to `VPC` > `Subnets`, create a subnet for the cluster, and set `SubnetIds` for the Ray head and worker nodes at [line 77](./aws-ray-cluster-launcher-template.yaml#L77) and [line 111](./aws-ray-cluster-launcher-template.yaml#L111).
+
+6. Log in with the AWS CLI.
+
+### B. Configure Ray Cluster on Azure
+
+1. Install and configure [the Azure CLI](https://learn.microsoft.com/en-us/cli/azure/install-azure-cli).
+
+    ```
+    # Install the Azure CLI and the SDK packages used by the launcher.
+    pip install azure-cli azure-identity azure-mgmt azure-mgmt-network
+
+    # Log in to Azure. This will redirect you to your web browser.
+    az login
+    ```
+
+
+2. Use `ssh-keygen -t rsa -b 4096 -f <key-file-path>` to generate a new SSH key pair for the Ray cluster launcher. The Azure Ray cluster launcher will later use this key to control the head and worker nodes.
+    ```
+    # generate the ssh key pair (replace <key-file-path> with where the key should be saved)
+    ssh-keygen -t rsa -b 4096 -f <key-file-path>
+
+    ```
+
+
+3. Modify and configure the Ray cluster launcher file for Azure:
+    - On [lines 64 and 66](./azure-ray-cluster-launcher-template.yaml#L64), point `ssh_private_key` and `ssh_public_key` to the key pair you generated locally.
+    - On [line 119](./azure-ray-cluster-launcher-template.yaml#L119), mount the SSH public key onto the VMs.
+
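    For reference, after steps 2 and 3 the relevant fields in the Azure template look roughly like this. This is a sketch: `<key-file-path>` and `<remote-path-on-vm>` are placeholders for your own values.

    ```yaml
    auth:
        ssh_user: ubuntu
        # key pair generated in step 2
        ssh_private_key: <key-file-path>
        ssh_public_key: <key-file-path>.pub

    file_mounts: {
        # mount the SSH public key onto the VMs (line 119)
        "<remote-path-on-vm>": "<key-file-path>.pub",
    }
    ```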
+ + +### C. Configure Ray Cluster on GCP + +1. Login and create GCP project and get \ on GCP Console. User need to modify `project_id` by using user's project If on [line 42](./gcp-ray-cluster-launcher-template.yaml#L42). + +
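    The `provider` section of the GCP template that this step fills in looks like the sketch below, where `<your-project-id>` is a placeholder for your project ID:

    ```yaml
    provider:
        type: gcp
        region: us-west1
        availability_zone: us-west1-a
        project_id: <your-project-id>  # globally unique project ID from the GCP Console
    ```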
+
+2. Go to the **APIs and Services** panel in the GCP Console and enable the following APIs:
+    - Cloud Resource Manager API
+    - Compute Engine API
+    - Cloud OS Login API
+    - Identity and Access Management (IAM) API
+
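    If you prefer the CLI (after installing gcloud in step 6), the same APIs can be enabled in one command; the identifiers below are the standard gcloud service names for these APIs:

    ```shell
    # enable the APIs the Ray autoscaler needs on GCP
    gcloud services enable \
        cloudresourcemanager.googleapis.com \
        compute.googleapis.com \
        oslogin.googleapis.com \
        iam.googleapis.com
    ```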
+
+3. Generate an SSH key pair for your GCP project:
+    ```
+    # replace <key-file-path> and <username> with your own values
+    ssh-keygen -t rsa -b 2048 -f <key-file-path> -C <username>
+    ```
+
+
+4. Go to the **Metadata** panel, click the **SSH KEYS** tab, and upload the public SSH key to the GCP project. All instances in the project inherit these SSH keys.
+
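    Alternatively, project-wide SSH keys can be added from the CLI. This is a sketch: `<username>` and `<key-file-path>` are placeholders, and note that writing the `ssh-keys` metadata this way replaces any existing project-wide keys.

    ```shell
    # GCP expects project metadata entries in the form "<username>:<public-key>"
    echo "<username>:$(cat <key-file-path>.pub)" > /tmp/ssh-keys.txt

    # upload; this overwrites the existing project-wide ssh-keys value
    gcloud compute project-info add-metadata \
        --metadata-from-file ssh-keys=/tmp/ssh-keys.txt
    ```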
+
+5. Modify `ssh_private_key` to point to the SSH private key on [line 59](./gcp-ray-cluster-launcher-template.yaml#L59). Set `KeyName` in the head and worker node configs on [line 77](./gcp-ray-cluster-launcher-template.yaml#L77) and [line 113](./gcp-ray-cluster-launcher-template.yaml#L113).
+
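    After this step, the authentication-related fields in the GCP template should look like the sketch below, with the placeholder values replaced by your own:

    ```yaml
    auth:
        ssh_user: <username>          # the -C value used when generating the key
        ssh_private_key: <key-file-path>

    # in available_node_types, for both the head and worker node_config:
    node_config:
        KeyName: <key-file-path>
    ```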
+
+6. Install and Configure [the gcloud CLI](https://cloud.google.com/sdk/docs/install)
+    ```
+    # install pre-requisites
+    sudo apt-get install apt-transport-https ca-certificates gnupg curl
+
+    # add Google Cloud's signing key and apt repository
+    curl https://packages.cloud.google.com/apt/doc/apt-key.gpg | sudo gpg --dearmor -o /usr/share/keyrings/cloud.google.gpg
+    echo "deb [signed-by=/usr/share/keyrings/cloud.google.gpg] https://packages.cloud.google.com/apt cloud-sdk main" | sudo tee /etc/apt/sources.list.d/google-cloud-sdk.list
+
+    # install the gcloud CLI
+    sudo apt-get update && sudo apt-get install google-cloud-cli
+
+    # initialize and configure gcloud
+    gcloud init
+
+    ```
+
+
+GCP References:
+[How to add SSH keys to VMs](https://cloud.google.com/compute/docs/connect/add-ssh-keys) (step 5)
+
+
+
+
+
+
+
+## Start and Test Ray with the Ray cluster launcher
+Run the following commands from your local machine:
+```
+# Create or update the cluster (replace <cluster-config> with your launcher file)
+ray up <cluster-config>.yaml
+
+# Get a remote screen on the head node.
+ray attach <cluster-config>.yaml
+
+# Try running a Ray program.
+python -c 'import ray; ray.init()'
+exit
+
+# Tear down the cluster.
+ray down <cluster-config>.yaml
+```
+
+![Test screenshot](./images/test_screenshot.png)
+
+**After the Ray cluster is up successfully, you should be able to see the running cluster in each platform's console.**
+
+**For AWS at Emory:**
+![AWS screenshot](./images/aws_instances.png)
+
+ + +**For Azure portal:** +![azure screenshot](./images/azure_portal.png) + +
+
+**For GCP Console:**
+![GCP screenshot](./images/gcp_vms.png)
diff --git a/ray_cluster_launchers/aws-ray-cluster-launcher-template.yaml b/ray_cluster_launchers/aws-ray-cluster-launcher-template.yaml
new file mode 100644
index 0000000..3773a84
--- /dev/null
+++ b/ray_cluster_launchers/aws-ray-cluster-launcher-template.yaml
@@ -0,0 +1,199 @@
+# A unique identifier for the head node and workers of this cluster.
+cluster_name: aws-ray-cluster
+
+# The maximum number of worker nodes to launch in addition to the head
+# node.
+max_workers: 2
+
+# The autoscaler will scale up the cluster faster with higher upscaling speed.
+# E.g., if the task requires adding more nodes then autoscaler will gradually
+# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
+# This number should be > 0.
+upscaling_speed: 1.0
+
+# This executes all commands on all nodes in the docker container,
+# and opens all the necessary ports to support the Ray cluster.
+# Empty string means disabled.
+docker:
+    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
+    # image: rayproject/ray:latest-cpu # use this one if you don't need ML dependencies, it's faster to pull
+    container_name: "ray_container"
+    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
+    # if no cached version is present.
+    pull_before_run: True
+    run_options: # Extra options to pass into "docker run"
+        - --ulimit nofile=65536:65536
+
+    # Example of running a GPU head with CPU workers
+    # head_image: "rayproject/ray-ml:latest-gpu"
+    # Allow Ray to automatically detect GPUs
+
+    # worker_image: "rayproject/ray-ml:latest-cpu"
+    # worker_run_options: []
+
+# If a node is idle for this many minutes, it will be removed.
+idle_timeout_minutes: 5
+
+# Cloud-provider specific configuration.
+provider:
+    type: aws
+    region: us-east-1
+    # Availability zone(s), comma-separated, that nodes may be launched in.
+    # Nodes will be launched in the first listed availability zone and will
+    # be tried in the subsequent availability zones if launching fails.
+    # availability_zone: us-east-1a,us-east-1b
+    # Whether to allow node reuse. If set to False, nodes will be terminated
+    # instead of stopped.
+    cache_stopped_nodes: False # If not present, the default is True.
+    use_internal_ips: True
+    security_group:
+        GroupName: <your-security-group-name>
+
+
+# How Ray will authenticate with newly launched nodes.
+auth:
+    ssh_user: <ssh-username>
+# By default Ray creates a new private keypair, but you can also use your own.
+# If you do so, make sure to also set "KeyName" in the head and worker node
+# configurations below.
+    ssh_private_key: <path-to-your-private-key>
+
+# Tell the autoscaler the allowed node types and the resources they provide.
+# The key is the name of the node type, which is just for debugging purposes.
+# The node config specifies the launch config and physical instance type.
+available_node_types:
+    head_node:
+        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
+        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
+        # You can also set custom resources.
+        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
+        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
+        # resources: {}
+        # Provider-specific config for this node type, e.g. instance type. By default
+        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
+        # For more documentation on available fields, see:
+        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
+        node_config:
+            SubnetIds:
+                - <your-subnet-id>
+            InstanceType: m5.large
+            # Default AMI for us-east-1.
+            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
+            # for default images for other zones.
+            ImageId: ami-07caf09b362be10b8
+            KeyName: <your-key-pair-name>
+            # SecurityGroups: [public-ecg-group]
+            # You can provision additional disk space with a conf as follows
+            BlockDeviceMappings:
+                - DeviceName: /dev/xvda
+                  Ebs:
+                      VolumeSize: 150
+                      VolumeType: gp3
+            # Additional options in the boto docs.
+    worker_nodes:
+        # The minimum number of worker nodes of this type to launch.
+        # This number should be >= 0.
+        min_workers: 1
+        # The maximum number of worker nodes of this type to launch.
+        # This takes precedence over min_workers.
+        max_workers: 2
+        # The node type's CPU and GPU resources are auto-detected based on AWS instance type.
+        # If desired, you can override the autodetected CPU and GPU resources advertised to the autoscaler.
+        # You can also set custom resources.
+        # For example, to mark a node type as having 1 CPU, 1 GPU, and 5 units of a resource called "custom", set
+        # resources: {"CPU": 1, "GPU": 1, "custom": 5}
+        # resources: {}
+        # Provider-specific config for this node type, e.g. instance type. By default
+        # Ray will auto-configure unspecified fields such as SubnetId and KeyName.
+        # For more documentation on available fields, see:
+        # http://boto3.readthedocs.io/en/latest/reference/services/ec2.html#EC2.ServiceResource.create_instances
+        node_config:
+            SubnetIds:
+                - <your-subnet-id>
+            InstanceType: m5.large
+            # Default AMI for us-east-1.
+            # Check https://github.com/ray-project/ray/blob/master/python/ray/autoscaler/_private/aws/config.py
+            # for default images for other zones.
+            ImageId: ami-07caf09b362be10b8
+            KeyName: <your-key-pair-name>
+            # SecurityGroups: [public-ecg-group]
+            # - public-ecg-group
+            # Run workers on spot by default. Comment this out to use on-demand.
+            # NOTE: If relying on spot instances, it is best to specify multiple different instance
+            # types to avoid interruption when one instance type is experiencing heightened demand.
+ # Demand information can be found at https://aws.amazon.com/ec2/spot/instance-advisor/ + BlockDeviceMappings: + - DeviceName: /dev/xvda + Ebs: + VolumeSize: 150 + VolumeType: gp3 + # InstanceMarketOptions: + # MarketType: spot + # Additional options can be found in the boto docs, e.g. + # SpotOptions: + # MaxPrice: MAX_HOURLY_PRICE + # Additional options in the boto docs. + +# Specify the node type of the head node (as configured above). +head_node_type: head_node + +# Files or directories to copy to the head and worker nodes. The format is a +# dictionary from REMOTE_PATH: LOCAL_PATH, e.g. +file_mounts: { +# "/path1/on/remote/machine": "/path1/on/local/machine", +# "/path2/on/remote/machine": "/path2/on/local/machine", +} + +# Files or directories to copy from the head node to the worker nodes. The format is a +# list of paths. The same path on the head node will be copied to the worker node. +# This behavior is a subset of the file_mounts behavior. In the vast majority of cases +# you should just use file_mounts. Only use this if you know what you're doing! +cluster_synced_files: [] + +# Whether changes to directories in file_mounts or cluster_synced_files in the head node +# should sync to the worker node continuously +file_mounts_sync_continuously: False + +# Patterns for files to exclude when running rsync up or rsync down +rsync_exclude: + - "**/.git" + - "**/.git/**" + +# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for +# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided +# as a value, the behavior will match git's behavior for finding and using .gitignore files. +rsync_filter: + - ".gitignore" + +# List of commands that will be run before `setup_commands`. If docker is +# enabled, these commands will run outside the container and before docker +# is setup. +initialization_commands: [] + +# List of shell commands to run to set up nodes. 
+setup_commands:
+    - sleep 4
+    - sudo yum install -y python3-pip
+    - pip3 install ray[default] boto3 torch
+    # Note: if you're developing Ray, you probably want to create a Docker image that
+    # has your Ray repo pre-cloned. Then, you can replace the pip installs
+    # below with a git checkout (and possibly a recompile).
+    # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image
+    # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line:
+    # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl"
+
+# Custom commands that will be run on the head node after common setup.
+head_setup_commands: []
+
+# Custom commands that will be run on worker nodes after common setup.
+worker_setup_commands: []
+
+# Command to start ray on the head node. You don't need to change this.
+head_start_ray_commands:
+    - ray stop
+    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml --dashboard-host=0.0.0.0
+
+# Command to start ray on worker nodes. You don't need to change this.
+worker_start_ray_commands:
+    - ray stop
+    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
diff --git a/ray_cluster_launchers/azure-ray-cluster-launcher-template.yaml b/ray_cluster_launchers/azure-ray-cluster-launcher-template.yaml
new file mode 100644
index 0000000..d64f3d4
--- /dev/null
+++ b/ray_cluster_launchers/azure-ray-cluster-launcher-template.yaml
@@ -0,0 +1,182 @@
+# A unique identifier for the head node and workers of this cluster.
+cluster_name: default
+
+# The maximum number of worker nodes to launch in addition to the head
+# node.
+max_workers: 2
+
+# The autoscaler will scale up the cluster faster with higher upscaling speed.
+# E.g., if the task requires adding more nodes then autoscaler will gradually +# scale up the cluster in chunks of upscaling_speed*currently_running_nodes. +# This number should be > 0. +upscaling_speed: 1.0 + +# This executes all commands on all nodes in the docker container, +# and opens all the necessary ports to support the Ray cluster. +# Empty object means disabled. +docker: + image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup + # image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull + container_name: "ray_container" + # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image + # if no cached version is present. + pull_before_run: True + run_options: # Extra options to pass into "docker run" + - --ulimit nofile=65536:65536 + + # Example of running a GPU head with CPU workers + # head_image: "rayproject/ray-ml:latest-gpu" + # Allow Ray to automatically detect GPUs + + # worker_image: "rayproject/ray-ml:latest-cpu" + # worker_run_options: [] + +# If a node is idle for this many minutes, it will be removed. +idle_timeout_minutes: 5 + +# Cloud-provider specific configuration. 
+provider: + type: azure + # https://azure.microsoft.com/en-us/global-infrastructure/locations + location: westus2 + resource_group: ray-cluster + # set subscription id otherwise the default from az cli will be used + # subscription_id: 00000000-0000-0000-0000-000000000000 + # set unique subnet mask or a random mask will be used + # subnet_mask: 10.0.0.0/16 + # set unique id for resources in this cluster + # if not set a default id will be generated based on the resource group and cluster name + # unique_id: RAY1 + # set managed identity name and resource group + # if not set, a default user-assigned identity will be generated in the resource group specified above + # msi_name: ray-cluster-msi + # msi_resource_group: other-rg + # Set provisioning and use of public/private IPs for head and worker nodes. If both options below are true, + # only the head node will have a public IP address provisioned. + # use_internal_ips: True + # use_external_head_ip: True + +# How Ray will authenticate with newly launched nodes. +auth: + ssh_user: ubuntu + # you must specify paths to matching private and public key pair files + # use `ssh-keygen -t rsa -b 4096` to generate a new ssh key pair + ssh_private_key: + # changes to this should match what is specified in file_mounts + ssh_public_key: + +# More specific customization to node configurations can be made using the ARM template azure-vm-template.json file +# See documentation here: https://docs.microsoft.com/en-us/azure/templates/microsoft.compute/2019-03-01/virtualmachines +# Changes to the local file will be used during deployment of the head node, however worker nodes deployment occurs +# on the head node, so changes to the template must be included in the wheel file used in setup_commands section below + +# Tell the autoscaler the allowed node types and the resources they provide. +# The key is the name of the node type, which is just for debugging purposes. 
+# The node config specifies the launch config and physical instance type.
+available_node_types:
+    ray.head.default:
+        # The resources provided by this node type.
+        resources: {"CPU": 2}
+        # Provider-specific config, e.g. instance type.
+        node_config:
+            azure_arm_parameters:
+                vmSize: Standard_D2s_v3
+                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
+                imagePublisher: microsoft-dsvm
+                imageOffer: ubuntu-1804
+                imageSku: 1804-gen2
+                imageVersion: latest
+
+    ray.worker.default:
+        # The minimum number of worker nodes of this type to launch.
+        # This number should be >= 0.
+        min_workers: 0
+        # The maximum number of worker nodes of this type to launch.
+        # This takes precedence over min_workers.
+        max_workers: 2
+        # The resources provided by this node type.
+        resources: {"CPU": 2}
+        # Provider-specific config, e.g. instance type.
+        node_config:
+            azure_arm_parameters:
+                vmSize: Standard_D2s_v3
+                # List images https://docs.microsoft.com/en-us/azure/virtual-machines/linux/cli-ps-findimage
+                imagePublisher: microsoft-dsvm
+                imageOffer: ubuntu-1804
+                imageSku: 1804-gen2
+                imageVersion: latest
+                # optionally set priority to use Spot instances
+                priority: Spot
+                # set a maximum price for spot instances if desired
+                # billingProfile:
+                #     maxPrice: -1
+
+# Specify the node type of the head node (as configured above).
+head_node_type: ray.head.default
+
+# Files or directories to copy to the head and worker nodes. The format is a
+# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
+file_mounts: {
+#    "/path1/on/remote/machine": "/path1/on/local/machine",
+#    "/path2/on/remote/machine": "/path2/on/local/machine",
+    "<remote-path-on-vm>": "<local-path-to-ssh-public-key>"
+}
+
+# Files or directories to copy from the head node to the worker nodes. The format is a
+# list of paths. The same path on the head node will be copied to the worker node.
+# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
+# you should just use file_mounts.
Only use this if you know what you're doing! +cluster_synced_files: [] + +# Whether changes to directories in file_mounts or cluster_synced_files in the head node +# should sync to the worker node continuously +file_mounts_sync_continuously: False + +# Patterns for files to exclude when running rsync up or rsync down +rsync_exclude: + - "**/.git" + - "**/.git/**" + +# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for +# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided +# as a value, the behavior will match git's behavior for finding and using .gitignore files. +rsync_filter: + - ".gitignore" + +# List of commands that will be run before `setup_commands`. If docker is +# enabled, these commands will run outside the container and before docker +# is setup. +initialization_commands: + # enable docker setup + - sudo usermod -aG docker $USER || true + - sleep 10 # delay to avoid docker permission denied errors + # get rid of annoying Ubuntu message + - touch ~/.sudo_as_admin_successful + +# List of shell commands to run to set up nodes. +# NOTE: rayproject/ray-ml:latest has ray latest bundled +setup_commands: [] + # Note: if you're developing Ray, you probably want to create a Docker image that + # has your Ray repo pre-cloned. Then, you can replace the pip installs + # below with a git checkout (and possibly a recompile). + # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image + # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line: + # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp38-cp38-manylinux2014_x86_64.whl" + +# Custom commands that will be run on the head node after common setup. 
+# NOTE: rayproject/ray-ml:latest has azure packages bundled
+head_setup_commands: []
+    # - pip install -U azure-cli-core==2.22.0 azure-mgmt-compute==14.0.0 azure-mgmt-msi==1.0.0 azure-mgmt-network==10.2.0 azure-mgmt-resource==13.0.0
+
+# Custom commands that will be run on worker nodes after common setup.
+worker_setup_commands: []
+
+# Command to start ray on the head node. You don't need to change this.
+head_start_ray_commands:
+    - ray stop
+    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
+
+# Command to start ray on worker nodes. You don't need to change this.
+worker_start_ray_commands:
+    - ray stop
+    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
diff --git a/ray_cluster_launchers/gcp-ray-cluster-launcher-template.yaml b/ray_cluster_launchers/gcp-ray-cluster-launcher-template.yaml
new file mode 100644
index 0000000..4c21ee5
--- /dev/null
+++ b/ray_cluster_launchers/gcp-ray-cluster-launcher-template.yaml
@@ -0,0 +1,205 @@
+# A unique identifier for the head node and workers of this cluster.
+cluster_name: gcp-ray-cluster
+
+# The maximum number of worker nodes to launch in addition to the head
+# node.
+max_workers: 2
+
+# The autoscaler will scale up the cluster faster with higher upscaling speed.
+# E.g., if the task requires adding more nodes then autoscaler will gradually
+# scale up the cluster in chunks of upscaling_speed*currently_running_nodes.
+# This number should be > 0.
+upscaling_speed: 1.0
+
+# This executes all commands on all nodes in the docker container,
+# and opens all the necessary ports to support the Ray cluster.
+# Empty string means disabled.
+docker:
+    image: "rayproject/ray-ml:latest-gpu" # You can change this to latest-cpu if you don't need GPU support and want a faster startup
+    # image: rayproject/ray:latest-gpu # use this one if you don't need ML dependencies, it's faster to pull
+    container_name: "ray_container"
+    # If true, pulls latest version of image. Otherwise, `docker run` will only pull the image
+    # if no cached version is present.
+    pull_before_run: True
+    run_options: # Extra options to pass into "docker run"
+        - --ulimit nofile=65536:65536
+
+    # Example of running a GPU head with CPU workers
+    # head_image: "rayproject/ray-ml:latest-gpu"
+    # Allow Ray to automatically detect GPUs
+
+    # worker_image: "rayproject/ray-ml:latest-cpu"
+    # worker_run_options: []
+
+# If a node is idle for this many minutes, it will be removed.
+idle_timeout_minutes: 5
+
+# Cloud-provider specific configuration.
+provider:
+    type: gcp
+    region: us-west1
+    availability_zone: us-west1-a
+    project_id: <your-project-id> # Globally unique project id
+
+# How Ray will authenticate with newly launched nodes.
+
+###############################################################
+#
+# 1. need to enable the following gcp services & APIs
+#    - Cloud Resource Manager API
+#    - Compute Engine API
+#    - Cloud OS Login API
+#    - Identity and Access Management (IAM) API
+#
+# 2. use `ssh-keygen -t rsa -b 2048 -f <key-file-path> -C <username>` to generate a new ssh key pair
+#
+###############################################################
+auth:
+    ssh_user: <username>
+    ssh_private_key: <key-file-path>
+# If you use your own key pair, make sure to also set "KeyName" in the head and worker node
+# configurations below. This requires that you have added the key into the
+# project-wide metadata.
+# ssh_private_key: /path/to/your/key.pem
+
+# Tell the autoscaler the allowed node types and the resources they provide.
+# The key is the name of the node type, which is just for debugging purposes.
+# The node config specifies the launch config and physical instance type.
+available_node_types:
+    ray_head_default:
+        # The resources provided by this node type.
+        resources: {"CPU": 2}
+        # Provider-specific config for the head node, e.g. instance type. By default
+        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
+        # For more documentation on available fields, see:
+        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
+        node_config:
+            KeyName: <key-file-path>
+            machineType: n1-standard-2
+            disks:
+                - boot: true
+                  autoDelete: true
+                  type: PERSISTENT
+                  initializeParams:
+                      diskSizeGb: 50
+                      # See https://cloud.google.com/compute/docs/images for more images
+                      sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
+
+        # Additional options can be found in the compute docs at
+        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
+
+        # If the network interface is specified as below in both head and worker
+        # nodes, the manual network config is used. Otherwise an existing subnet is
+        # used. To use a shared subnet, ask the subnet owner to grant permission
+        # for 'compute.subnetworks.use' to the ray autoscaler account...
+        # networkInterfaces:
+        #     - kind: compute#networkInterface
+        #       subnetwork: path/to/subnet
+        #       aliasIpRanges: []
+    ray_worker_small:
+        # The minimum number of worker nodes of this type to launch.
+        # This number should be >= 0.
+        min_workers: 1
+        # The maximum number of worker nodes of this type to launch.
+        # This takes precedence over min_workers.
+        max_workers: 2
+        # The resources provided by this node type.
+        resources: {"CPU": 2}
+        # Provider-specific config for this worker node type, e.g. instance type. By default
+        # Ray will auto-configure unspecified fields such as subnets and ssh-keys.
+        # For more documentation on available fields, see:
+        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
+        node_config:
+            KeyName: <key-file-path>
+            machineType: n1-standard-2
+            disks:
+                - boot: true
+                  autoDelete: true
+                  type: PERSISTENT
+                  initializeParams:
+                      diskSizeGb: 50
+                      # See https://cloud.google.com/compute/docs/images for more images
+                      sourceImage: projects/deeplearning-platform-release/global/images/family/common-cpu
+            # Run workers on preemptible instances by default.
+            # Comment this out to use on-demand.
+            scheduling:
+                - preemptible: true
+            # Un-comment this to launch workers with the Service Account of the Head Node
+            # serviceAccounts:
+            #     - email: ray-autoscaler-sa-v1@<project_id>.iam.gserviceaccount.com
+            #       scopes:
+            #           - https://www.googleapis.com/auth/cloud-platform
+
+        # Additional options can be found in the compute docs at
+        # https://cloud.google.com/compute/docs/reference/rest/v1/instances/insert
+
+# Specify the node type of the head node (as configured above).
+head_node_type: ray_head_default
+
+# Files or directories to copy to the head and worker nodes. The format is a
+# dictionary from REMOTE_PATH: LOCAL_PATH, e.g.
+file_mounts: {
+#    "/path1/on/remote/machine": "/path1/on/local/machine",
+#    "/path2/on/remote/machine": "/path2/on/local/machine",
+}
+
+# Files or directories to copy from the head node to the worker nodes. The format is a
+# list of paths. The same path on the head node will be copied to the worker node.
+# This behavior is a subset of the file_mounts behavior. In the vast majority of cases
+# you should just use file_mounts. Only use this if you know what you're doing!
+cluster_synced_files: [] + +# Whether changes to directories in file_mounts or cluster_synced_files in the head node +# should sync to the worker node continuously +file_mounts_sync_continuously: False + +# Patterns for files to exclude when running rsync up or rsync down +rsync_exclude: + - "**/.git" + - "**/.git/**" + +# Pattern files to use for filtering out files when running rsync up or rsync down. The file is searched for +# in the source directory and recursively through all subdirectories. For example, if .gitignore is provided +# as a value, the behavior will match git's behavior for finding and using .gitignore files. +rsync_filter: + - ".gitignore" + +# List of commands that will be run before `setup_commands`. If docker is +# enabled, these commands will run outside the container and before docker +# is setup. +initialization_commands: [] + +# List of shell commands to run to set up nodes. +setup_commands: [] + # Note: if you're developing Ray, you probably want to create a Docker image that + # has your Ray repo pre-cloned. Then, you can replace the pip installs + # below with a git checkout (and possibly a recompile). + # To run the nightly version of ray (as opposed to the latest), either use a rayproject docker image + # that has the "nightly" (e.g. "rayproject/ray-ml:nightly-gpu") or uncomment the following line: + # - pip install -U "ray[default] @ https://s3-us-west-2.amazonaws.com/ray-wheels/latest/ray-3.0.0.dev0-cp37-cp37m-manylinux2014_x86_64.whl" + + +# Custom commands that will be run on the head node after common setup. +head_setup_commands: + - pip install google-api-python-client==1.7.8 + +# Custom commands that will be run on worker nodes after common setup. +worker_setup_commands: [] + +# Command to start ray on the head node. You don't need to change this. 
+head_start_ray_commands: + - ray stop + - >- + ray start + --head + --port=6379 + --object-manager-port=8076 + --autoscaling-config=~/ray_bootstrap_config.yaml + +# Command to start ray on worker nodes. You don't need to change this. +worker_start_ray_commands: + - ray stop + - >- + ray start + --address=$RAY_HEAD_IP:6379 + --object-manager-port=8076 diff --git a/ray_cluster_launchers/images/aws_instances.png b/ray_cluster_launchers/images/aws_instances.png new file mode 100644 index 0000000..790b869 Binary files /dev/null and b/ray_cluster_launchers/images/aws_instances.png differ diff --git a/ray_cluster_launchers/images/azure_portal.png b/ray_cluster_launchers/images/azure_portal.png new file mode 100644 index 0000000..146e5cc Binary files /dev/null and b/ray_cluster_launchers/images/azure_portal.png differ diff --git a/ray_cluster_launchers/images/gcp_vms.png b/ray_cluster_launchers/images/gcp_vms.png new file mode 100644 index 0000000..03cc644 Binary files /dev/null and b/ray_cluster_launchers/images/gcp_vms.png differ diff --git a/ray_cluster_launchers/images/test_screenshot.png b/ray_cluster_launchers/images/test_screenshot.png new file mode 100644 index 0000000..edb26a1 Binary files /dev/null and b/ray_cluster_launchers/images/test_screenshot.png differ