Runs on Ray cluster successfully. Changed resource-limits, fixed ray networking and environment.
flimdejong committed Nov 28, 2024
1 parent 15b0666 commit 9cae1bb
Showing 13 changed files with 432 additions and 196 deletions.
2 changes: 1 addition & 1 deletion Dockerfile.ray
@@ -5,7 +5,7 @@
FROM rayproject/ray:latest-py310

# Install dependencies in a single layer to keep it cached
RUN pip install torch==2.5.1 gymnasium numpy==1.24.3 ray[rllib]==2.38.0 pyzmq
RUN pip install torch==2.5.1 gymnasium numpy==1.24.3 ray[rllib]==2.38.0 pyzmq protobuf websockets

# Copy the entire roboteam root folder (including the roboteam_ai and roboteam_networking folders)
COPY roboteam_ai /roboteam/roboteam_ai
6 changes: 4 additions & 2 deletions README.md
@@ -37,12 +37,14 @@ To enable Tracy
- Information is in the tracy [docs](https://github.com/wolfpld/tracy)
- Run AI


### Use of Ray

Dockerfile.ray builds a Docker image with the Ray project's official Ray image as a base. It adds the necessary libraries and the RoboTeam RL code to the image. Only build it if you want to deploy it to a cluster.

Build the docker image using the following command from the root folder:

- docker build -t roboteamtwente/ray:development -f Dockerfile.ray .

Push it using the following command:
- docker push roboteamtwente/ray:development

21 changes: 13 additions & 8 deletions docker/runner/README.md
@@ -2,20 +2,23 @@
In a Ray or other distributed computing cluster, the terms "head node" and "worker node" refer to the different roles that containers play in the cluster. The head node is the master node of a Ray cluster; you typically have exactly one. Worker nodes are the containers that execute jobs in parallel; you can have as many of them as you want.
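As a rough sketch of that division of labour (assuming a reachable cluster and the standard `ray` Python API; the `rollout` task below is made up for illustration), the driver on the head node submits tasks and the worker nodes execute them in parallel:

```python
import ray

# Attach to the running cluster; "auto" assumes this script runs on the
# head node (or that RAY_ADDRESS points at it).
ray.init(address="auto")

@ray.remote
def rollout(seed: int) -> int:
    # Stand-in for real work; Ray schedules each call on any free worker node.
    return seed * 2

# The head node acts as the driver: it submits tasks, and the worker nodes
# execute them in parallel.
results = ray.get([rollout.remote(i) for i in range(8)])
print(results)  # [0, 2, 4, ..., 14]
```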

-----------------------------------------------------------
## Installing Kuberay:
curl https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3 | bash
helm repo add kuberay https://ray-project.github.io/kuberay-helm/

## Installing Kuberay

curl <https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3> | bash
helm repo add kuberay <https://ray-project.github.io/kuberay-helm/>
helm repo update
helm install kuberay-operator kuberay/kuberay-operator --namespace ray-system --create-namespace

https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html
<https://docs.ray.io/en/latest/cluster/kubernetes/user-guides/config.html>
The above source was used for creating the ray-cluster.yaml

To install Kubernetes and Minikube, you can follow this guide on how to install and run them: https://medium.com/@areesmoon/installing-minikube-on-ubuntu-20-04-lts-focal-fossa-b10fad9d0511
To install Kubernetes and Minikube, you can follow this guide on how to install and run them: <https://medium.com/@areesmoon/installing-minikube-on-ubuntu-20-04-lts-focal-fossa-b10fad9d0511>

Use 'pip install ray' and then 'pip show ray' to get your version of ray.

----------------------------------------------------------------------------------
-----------------------------------------------------------

After you have both Kubernetes and Ray, use the following command to create a cluster: kubectl apply -f ray-cluster.yaml
This cluster launches a Ray head node and one worker node. Launch the external simulator using kubectl apply -f simulator.yaml

@@ -24,11 +24,13 @@ This cluster launches a ray head node and one worker node. Launch the external s
Use the following to forward the needed port to the Ray service: kubectl port-forward svc/<cluster name> 8265:8265
This is the port that will be used inside ray_jobs.py, where we submit jobs to Ray.
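ray_jobs.py itself is not shown in this commit, but with 8265 forwarded, submitting a job through Ray's job submission API would presumably look roughly like this sketch (the entrypoint path and runtime_env are assumptions for illustration):

```python
from ray.job_submission import JobSubmissionClient

# 8265 is the dashboard / job-submission port forwarded above.
client = JobSubmissionClient("http://127.0.0.1:8265")

job_id = client.submit_job(
    # Hypothetical entrypoint; adjust to wherever train.py lives in the image.
    entrypoint="python roboteam_ai/src/RL/RL_Ray/train.py",
    runtime_env={"working_dir": "."},
)
print(client.get_job_status(job_id))
```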

-----------------------------------------------------------

-----------------------------------------------------
## Useful commands

kubectl apply -f ray-cluster.yaml
kubectl delete -f ray-cluster.yaml
helm install kuberay-operator ray/kuberay-operator
helm uninstall kuberay-operator
kubectl port-forward svc/roboteam-ray-cluster-head-nodeport 8265:8265 6379:6379 10001:10001 8000:8000 &
minikube start -p ray --nodes 2 --memory 4000 --cpus 3
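The port-forward command above also exposes 10001, the Ray Client port. Assuming the Ray Client extra is installed locally, a quick connectivity check from outside the cluster could look like this sketch:

```python
import ray

# 10001 is the Ray Client port forwarded by the command above.
ray.init(address="ray://127.0.0.1:10001")

# Quick sanity check that the head and worker pods are both visible.
print(ray.cluster_resources())
```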
22 changes: 12 additions & 10 deletions docker/runner/ray-cluster-combined.yaml
@@ -34,11 +34,11 @@ spec:
- containerPort: 8000 # Serve
resources:
requests:
cpu: "500m"
memory: "1Gi"
limits:
cpu: "1"
memory: "2Gi"
limits:
cpu: "2"
memory: "4Gi"
env:
- name: POD_IP
valueFrom:
@@ -70,15 +70,17 @@ spec:
# Worker node configuration with integrated simulator
workerGroupSpecs:
- groupName: worker-group
replicas: 1
minReplicas: 1 # Specify minimum number of worker replicas
maxReplicas: 1 # Optional: specify maximum number of replicas
rayStartParams:
num-cpus: "1"
template:
metadata:
labels:
app: ray-worker
spec:
hostNetwork: true
hostNetwork: False
dnsPolicy: ClusterFirstWithHostNet
affinity:
nodeAffinity:
@@ -109,8 +111,8 @@ spec:
cpu: 500m
memory: 1Gi
limits:
cpu: 2000m
memory: 4Gi
cpu: 1500m
memory: 3Gi
env:
- name: LD_LIBRARY_PATH
value: /home/roboteam/build/release/lib
@@ -187,10 +189,10 @@ spec:
resources:
requests:
cpu: 120m
memory: 20Mi
memory: 40Mi
limits:
cpu: 150m
memory: 50Mi
memory: 100Mi

# Simulator
- name: erforce-simulator
@@ -295,4 +297,4 @@ spec:
targetPort: 8081
- name: simulator
port: 5558
targetPort: 5558
80 changes: 60 additions & 20 deletions docker/runner/ray-cluster.yaml
@@ -8,39 +8,46 @@ spec:
headGroupSpec:
rayStartParams:
dashboard-host: "0.0.0.0"
node-ip-address: "$(HOST_IP)"
template:
metadata:
labels:
app: ray-head
spec:
hostNetwork: false
# Add node affinity for head node
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- multinode-demo
containers:
- name: ray-head
image: roboteamtwente/ray:development
imagePullPolicy: Always # Always pull the latest image
imagePullPolicy: Always
ports:
- containerPort: 8265 # dashboard port
- containerPort: 6379 # redis port
- containerPort: 10001 # GCS server port
- containerPort: 8000 # Serve port
- containerPort: 8265
- containerPort: 6379
- containerPort: 10001
- containerPort: 8000
resources:
requests:
cpu: "500m"
memory: "1Gi" # Increased from 256Mi
memory: "1Gi"
limits:
cpu: "1" # Changed from 600 (which was too high)
memory: "2Gi" # Increased from 512Mi

cpu: "1"
memory: "2Gi"
env:
- name: SIMULATION_HOST
value: "127.0.0.1" # Using localhost since we're on host network
- name: VISION_PORT
value: "10020" # Match your simulator's vision port
- name: REFEREE_PORT
value: "10003" # Match your simulator's referee port

- name: HOST_IP
valueFrom:
fieldRef:
fieldPath: status.hostIP
command: ["/bin/bash", "-c", "--"]
args: ["ray start --head --port=6379 --dashboard-host=0.0.0.0 --block"]
args: ["ray start --head --port=6379 --bind-address=0.0.0.0 --dashboard-host=0.0.0.0 --node-ip-address=$(HOST_IP) --block"]
livenessProbe:
exec:
command:
@@ -66,7 +73,7 @@
# Worker node configuration
workerGroupSpecs:
- groupName: worker-group
replicas: 1 # Number of worker nodes
replicas: 1
rayStartParams:
num-cpus: "1"
template:
@@ -75,10 +82,20 @@
app: ray-worker
spec:
hostNetwork: true
# Replace pod anti-affinity with node affinity
affinity:
nodeAffinity:
requiredDuringSchedulingIgnoredDuringExecution:
nodeSelectorTerms:
- matchExpressions:
- key: kubernetes.io/hostname
operator: In
values:
- multinode-demo-m02
containers:
- name: ray-worker
image: roboteamtwente/ray:development
imagePullPolicy: Always # Always pull the latest image
imagePullPolicy: Always
resources:
requests:
cpu: 500m
@@ -130,4 +147,27 @@ spec:
- name: serve
port: 8000
targetPort: 8000
nodePort: 30800 # Serve

---
apiVersion: v1
kind: Service
metadata:
name: roboteam-ray-cluster-head-svc
spec:
type: ClusterIP
selector:
app: ray-head
ports:
- name: redis
port: 6379
targetPort: 6379
- name: gcs
port: 10001
targetPort: 10001
- name: dashboard
port: 8265
targetPort: 8265
- name: serve
port: 8000
targetPort: 8000
18 changes: 17 additions & 1 deletion docker/runner/simulator.yaml
@@ -209,4 +209,20 @@ spec:
targetPort: 8080
- name: gc
port: 8081
targetPort: 8081
- name: zmq
port: 5558
targetPort: 5558
protocol: TCP
- name: sim-control
port: 10300
targetPort: 10300
protocol: UDP
- name: vision
port: 10020
targetPort: 10020
protocol: UDP
- name: referee
port: 10003
targetPort: 10003
protocol: UDP
44 changes: 11 additions & 33 deletions roboteam_ai/src/RL/RL_Ray/train.py
@@ -18,14 +18,7 @@
import warnings
warnings.filterwarnings('ignore', category=DeprecationWarning)

def verify_imports():
import numpy
import torch
print(f"Local NumPy version: {numpy.__version__}")
print(f"Local PyTorch version: {torch.__version__}")

def main():
verify_imports()

if not ray.is_initialized():
ray.init(
@@ -34,63 +27,48 @@ def main():
runtime_env={
"env_vars": {
"NUMPY_EXPERIMENTAL_ARRAY_FUNCTION": "0",

},
# "pip": [
# "numpy==1.24.3",
# "pyzmq==26.2.0"
# ]
"py_modules": [
os.path.join(roboteam_ai_root, "roboteam_ai"),
os.path.join(roboteam_ai_root, "roboteam_networking")
]
}
)

# ray.init()

# We can set env_config here
def env_creator(env_config):
return RoboTeamEnv(env_config) # This passes the config to your env

# Register the environment
register_env("RoboTeamEnv", env_creator)

# Create list of callbacks
callbacks = [
JsonLoggerCallback(),
CSVLoggerCallback(),
]

config = (
PPOConfig()
.environment("RoboTeamEnv")
.framework("torch")
.resources(num_gpus=0)
.env_runners(
num_env_runners=1,
num_envs_per_env_runner=1,
sample_timeout_s=None
)
# .api_stack(
# enable_rl_module_and_learner=True,
# enable_env_runner_and_connector_v2=True
# )
num_envs_per_env_runner=1, # If you use vectorized env, otherwise set to 1
rollout_fragment_length=16,
sample_timeout_s=30,
create_env_on_local_worker=False) # This makes sure that we don't run a local environment
.api_stack(
enable_rl_module_and_learner=True,
enable_env_runner_and_connector_v2=True)
.debugging(
log_level="DEBUG",
seed=42
)
#.callbacks(callbacks)
.evaluation(evaluation_interval=10)
)

print("Starting training...")
algo = config.build()

for i in range(10):
print(f"\nStarting iteration {i}")
result = algo.train()
result.pop("config")
print("\nTraining metrics:")
print(f"Episode Reward Mean: {result.get('episode_reward_mean', 'N/A')}")
print(f"Episode Length Mean: {result.get('episode_len_mean', 'N/A')}")
print(f"Total Timesteps: {result.get('timesteps_total', 'N/A')}")
pprint(result)

if __name__ == "__main__":
2 changes: 0 additions & 2 deletions roboteam_ai/src/RL/env.py
@@ -34,7 +34,6 @@ class RoboTeamEnv(gymnasium.Env):
def __init__(self, config=None):
self.config = config or {} # Config placeholder


self.MAX_ROBOTS_US = 10

# Define the number of robots that are present in each grid + ball location
@@ -161,7 +160,6 @@ def get_observation(self):
'ball_position': self.ball_quadrant,
'is_yellow_dribbling' : self.is_yellow_dribbling
}
print("obs: ", observation_space)

return observation_space, self.calculate_reward()
