Runs on Ray cluster successfully. Changed resource-limits, fixed ray networking and environment.
flimdejong committed Nov 28, 2024
1 parent 15b0666 commit 9cae1bb
Showing 13 changed files with 432 additions and 196 deletions.
2 changes: 1 addition & 1 deletion Dockerfile.ray
@@ -5,7 +5,7 @@
FROM rayproject/ray:latest-py310

# Install dependencies in a single layer to keep it cached
RUN pip install torch==2.5.1 gymnasium numpy==1.24.3 ray[rllib]==2.38.0 pyzmq
RUN pip install torch==2.5.1 gymnasium numpy==1.24.3 ray[rllib]==2.38.0 pyzmq protobuf websockets

# Copy the entire roboteam root folder (including the roboteam_ai and roboteam_networking folders)
COPY roboteam_ai /roboteam/roboteam_ai
6 changes: 4 additions & 2 deletions README.md
@@ -37,12 +37,14 @@ To enable Tracy
- Information is in the tracy [docs](https://github.com/wolfpld/tracy)
- Run AI


### Use of Ray

Dockerfile.ray builds a Docker image with the Ray project's official Ray image as a base. It adds the necessary libraries and the RoboTeam RL code to the image. Only build it if you want to deploy it to a cluster.

Build the docker image using the following command from the root folder:

- docker build -t roboteamtwente/ray:development -f Dockerfile.ray .

Push it using the following command:
- docker push roboteamtwente/ray:development

21 changes: 13 additions & 8 deletions docker/runner/README.md
@@ -2,20 +2,23 @@
In a Ray or other distributed computing cluster, the terms "head node" and "worker node" refer to the different roles that containers play in the cluster. The head node is the master node of a Ray cluster; you typically have exactly one. Worker nodes are the containers that execute jobs in parallel; you can have as many of them as you want.
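As a rough sketch of that division of labour (assuming a reachable cluster and the standard `ray` Python API; the `rollout` task below is made up for illustration), the driver on the head node submits tasks and the worker nodes execute them in parallel:

```python
import ray

# Attach to the running cluster; "auto" assumes this script runs on the
# head node (or that RAY_ADDRESS points at it).
ray.init(address="auto")

@ray.remote
def rollout(seed: int) -> int:
    # Stand-in for real work; Ray schedules each call on any free worker node.
    return seed * 2

# The head node acts as the driver: it submits tasks, and the worker nodes
# execute them in parallel.
results = ray.get([rollout.remote(i) for i in range(8)])
print(results)  # [0, 2, 4, ..., 14]
```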

-----------------------------------------------------------
## Installing Kuberay:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/

## Installing Kuberay

curl <https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3> | bash
helm repo add kuberay <https://ray-project.github.io/kuberay-helm/>
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --create-namespace

https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html
<https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html>
The above source was used for creating the ray-cluster.yaml

To install Kubernetes and Minikube, you can follow this guide on how to install and run them: https://medium.com/@areesmoon/installing-minikube-on-ubuntu-20-04-lts-focal-fossa-b10fad9d0511
To install Kubernetes and Minikube, you can follow this guide on how to install and run them: <https://medium.com/@areesmoon/installing-minikube-on-ubuntu-20-04-lts-focal-fossa-b10fad9d0511>

Use 'pip install ray' and then 'pip show ray' to get your version of ray.

----------------------------------------------------------------------------------
-----------------------------------------------------------

After you have both Kubernetes and Ray, use the following command to create a cluster: kubectl apply -f ray-cluster.yaml
This cluster launches a Ray head node and one worker node. Launch the external simulator using kubectl apply -f simulator.yaml

@@ -24,11 +24,13 @@ This cluster launches a ray head node and one worker node. Launch the external s
Use the following to forward the needed port to the Ray service: kubectl port-forward svc/<cluster name> 8265:8265
This is the port that will be used inside ray_jobs.py, where we submit jobs to Ray.
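ray_jobs.py itself is not shown in this commit, but with 8265 forwarded, submitting a job through Ray's job submission API would presumably look roughly like this sketch (the entrypoint path and runtime_env are assumptions for illustration):

```python
from ray.job_submission import JobSubmissionClient

# 8265 is the dashboard / job-submission port forwarded above.
client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    # Hypothetical entrypoint; adjust to wherever train.py lives in the image.
    entrypoint="python roboteam_ai/src/RL/RL_Ray/train.py",
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))
```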

-----------------------------------------------------------

-----------------------------------------------------
## Useful commands

kubectl apply -f ray-cluster.yaml
kubectl delete -f ray-cluster.yaml
helm install kuberay-operator ray/kuberay-operator
helm uninstall kuberay-operator
kubectl port-forward svc/roboteam-ray-cluster-head-nodeport 8265:8265 6379:6379 10001:10001 8000:8000 &
minikube start -p ray --nodes 2 --memory 4000 --cpus 3
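The port-forward command above also exposes 10001, the Ray Client port. Assuming the Ray Client extra is installed locally, a quick connectivity check from outside the cluster could look like this sketch:

```python
import ray

# 10001 is the Ray Client port forwarded by the command above.
ray.init(address="ray://127.0.0.1:10001")

# Quick sanity check that the head and worker pods are both visible.
print(ray.cluster_resources())
```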
22 changes: 12 additions & 10 deletions docker/runner/ray-cluster-combined.yaml
@@ -34,11 +34,11 @@ spec:
- containerPort: 8000 # Serve
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "1"
memory: "2Gi"
limits:
cpu: "2"
memory: "4Gi"
env:
- name: POD_IP
valueFrom:
@@ -70,15 +70,17 @@ spec:
# Worker node configuration with integrated simulator
workerGroupSpecs:
- groupName: worker-group
replicas: 1
minReplicas: 1 # Specify minimum number of worker replicas
maxReplicas: 1 # Optional: specify maximum number of replicas
rayStartParams:
num-cpus: "1"
template:
metadata:
labels:
app: ray-worker
spec:
hostNetwork: true
hostNetwork: False
dnsPolicy: ClusterFirstWithHostNet
affinity:
nodeAffinity:
@@ -109,8 +111,8 @@ spec:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 4Gi
cpu: 1500m
memory: 3Gi
env:
- name: LD_LIBRARY_PATH
value: /home/roboteam/build/release/lib
@@ -187,10 +189,10 @@ spec:
resources:
requests:
cpu: 120m
memory: 20Mi
memory: 40Mi
limits:
cpu: 150m
memory: 50Mi
memory: 100Mi

# Simulator
- name: erforce-simulator
@@ -295,4 +297,4 @@ spec:
targetPort: 8081
- name: simulator
port: 5558
targetPort: 5558
80 changes: 60 additions & 20 deletions docker/runner/ray-cluster.yaml
@@ -8,39 +8,46 @@ spec:
headGroupSpec:
rayStartParams:
dashboard-host: "0.0.0.0"
node-ip-address: "$(HOST_IP)"
template:
metadata:
labels:
app: ray-head
spec:
hostNetwork: false
# Add node affinity for head node
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- multinode-demo
containers:
- name: ray-head
image: roboteamtwente/ray:development
imagePullPolicy: Always # Always pull the latest image
imagePullPolicy: Always
ports:
- containerPort: 8265 # dashboard port
- containerPort: 6379 # redis port
- containerPort: 10001 # GCS server port
- containerPort: 8000 # Serve port
- containerPort: 8265
- containerPort: 6379
- containerPort: 10001
- containerPort: 8000
resources:
requests:
cpu: "500m"
memory: "1Gi" # Increased from 256Mi
memory: "1Gi"
limits:
cpu: "1" # Changed from 600 (which was too high)
memory: "2Gi" # Increased from 512Mi

cpu: "1"
memory: "2Gi"
env:
- name: SIMULATION_HOST
value: "127.0.0.1" # Using localhost since we're on host network
- name: VISION_PORT
value: "10020" # Match your simulator's vision port
- name: REFEREE_PORT
value: "10003" # Match your simulator's referee port

- name: HOST_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
command: ["/bin/bash", "-c", "--"]
args: ["ray start --head --port=6379 --dashboard-host=0.0.0.0 --block"]
args: ["ray start --head --port=6379 --bind-address=0.0.0.0 --dashboard-host=0.0.0.0 --node-ip-address=$(HOST_IP) --block"]
livenessProbe:
exec:
command:
@@ -66,7 +73,7 @@
# Worker node configuration
workerGroupSpecs:
- groupName: worker-group
replicas: 1 # Number of worker nodes
replicas: 1
rayStartParams:
num-cpus: "1"
template:
@@ -75,10 +82,20 @@
app: ray-worker
spec:
hostNetwork: true
# Replace pod anti-affinity with node affinity
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- multinode-demo-m02
containers:
- name: ray-worker
image: roboteamtwente/ray:development
imagePullPolicy: Always # Always pull the latest image
imagePullPolicy: Always
resources:
requests:
cpu: 500m
@@ -130,4 +147,27 @@ spec:
- name: serve
port: 8000
targetPort: 8000
nodePort: 30800 # Serve

---
apiVersion: v1
kind: Service
metadata:
name: roboteam-ray-cluster-head-svc
spec:
type: ClusterIP
selector:
app: ray-head
ports:
- name: redis
port: 6379
targetPort: 6379
- name: gcs
port: 10001
targetPort: 10001
- name: dashboard
port: 8265
targetPort: 8265
- name: serve
port: 8000
targetPort: 8000
18 changes: 17 additions & 1 deletion docker/runner/simulator.yaml
@@ -209,4 +209,20 @@ spec:
targetPort: 8080
- name: gc
port: 8081
targetPort: 8081
- name: zmq
port: 5558
targetPort: 5558
protocol: TCP
- name: sim-control
port: 10300
targetPort: 10300
protocol: UDP
- name: vision
port: 10020
targetPort: 10020
protocol: UDP
- name: referee
port: 10003
targetPort: 10003
protocol: UDP
44 changes: 11 additions & 33 deletions roboteam_ai/src/RL/RL_Ray/train.py
@@ -18,14 +18,7 @@
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

def verify_imports():
import numpy
import torch
print(f"Local NumPy version: {numpy.__version__}")
print(f"Local PyTorch version: {torch.__version__}")

def main():
verify_imports()

if not ray.is_initialized():
ray.init(
@@ -34,63 +27,48 @@ def main():
runtime_env={
"env_vars": {
"NUMPY_EXPERIMENTAL_ARRAY_FUNCTION": "0",

},
# "pip": [
# "numpy==1.24.3",
# "pyzmq==26.2.0"
# ]
"py_modules": [
os.path.join(roboteam_ai_root, "roboteam_ai"),
os.path.join(roboteam_ai_root, "roboteam_networking")
]
}
)

# ray.init()

# We can set env_config here
def env_creator(env_config):
return RoboTeamEnv(env_config) # This passes the config to your env

# Register the environment
register_env("RoboTeamEnv", env_creator)

# Create list of callbacks
callbacks = [
JsonLoggerCallback(),
CSVLoggerCallback(),
]

config = (
PPOConfig()
.environment("RoboTeamEnv")
.framework("torch")
.resources(num_gpus=0)
.env_runners(
num_env_runners=1,
num_envs_per_env_runner=1,
sample_timeout_s=None
)
# .api_stack(
# enable_rl_module_and_learner=True,
# enable_env_runner_and_connector_v2=True
# )
num_envs_per_env_runner=1, # If you use vectorized env, otherwise set to 1
rollout_fragment_length=16,
sample_timeout_s=30,
create_env_on_local_worker=False) # This makes sure that we don't run a local environment
.api_stack(
enable_rl_module_and_learner=True,
enable_env_runner_and_connector_v2=True)
.debugging(
log_level="DEBUG",
seed=42
)
#.callbacks(callbacks)
.evaluation(evaluation_interval=10)
)

print("Starting training...")
algo = config.build()

for i in range(10):
print(f"\nStarting iteration {i}")
result = algo.train()
result.pop("config")
print("\nTraining metrics:")
print(f"Episode Reward Mean: {result.get('episode_reward_mean', 'N/A')}")
print(f"Episode Length Mean: {result.get('episode_len_mean', 'N/A')}")
print(f"Total Timesteps: {result.get('timesteps_total', 'N/A')}")
pprint(result)

if __name__ == "__main__":
2 changes: 0 additions & 2 deletions roboteam_ai/src/RL/env.py
@@ -34,7 +34,6 @@ class RoboTeamEnv(gymnasium.Env):
def __init__(self, config=None):
self.config = config or {} # Config placeholder


self.MAX_ROBOTS_US = 10

# Define the number of robots that are present in each grid + ball location
@@ -161,7 +160,6 @@ def get_observation(self):
'ball_position': self.ball_quadrant,
'is_yellow_dribbling' : self.is_yellow_dribbling
}
print("obs: ", observation_space)

return observation_space, self.calculate_reward()
