This repo contains the instructions and code to access the ISLa mini-cluster and run your processes on it. Please read the instructions below carefully before proceeding. If you have any questions or encounter any errors, contact [email protected].
Our mini-cluster is based on a publisher-subscriber architecture. Below, we report the steps to obtain login credentials and the procedure to launch scripts on our machines.
Contact [email protected] to obtain credentials for one of the cluster nodes. In the email, specify whether you intend to use GPU or CPU only. The administrator, given the availability of nodes, will provide you with a username, password, and the IP address of the node assigned.
| | CPU - RAM | GPU |
|---|---|---|
| Node 1 | 20 cores - 64 GB | Nvidia RTX 4070 Ti 12 GB |
| Node 2 | 8 cores - 48 GB | Nvidia RTX 2070 Super 8 GB |
The second step is to prepare, on your own machine(!), the scripts to run on the cluster. That is, the cluster should be used only to run jobs, not as a development environment. Moreover, the results of the code execution should be written to a specific file (txt, CSV, etc.), as sketched below.
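For example, a script could persist its results to a CSV file as follows (a minimal sketch; the file name and fields are illustrative):

import csv

# Save results to a CSV file so they can be retrieved from the node afterwards
with open('results.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['run_id', 'result'])
    writer.writerow([1, 0.95])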
For the time being, our mini-cluster only supports the execution of Python scripts.
If you need to run any particular script other than Python, send an email to [email protected].
Below are all the instructions for copying your scripts and, if needed, creating your own Python virtual environment.
To access the assigned node, use SSH through a VPN. In particular, once you have connected to the UNIVR network using a VPN, open a terminal on your PC and type:
> ssh username@<IP_Node>
and press Enter (you will be prompted for the password).
Now you are inside the assigned node. As specified before, you should not write code inside the node. Once you have created on your PC all the Python scripts to be executed on the node, you can transfer them using sftp or scp. For example, suppose you have the file example.py on your desktop and you want to copy it from your PC to the assigned node. Just open another terminal and type:
> sftp username@<IP_Node>
(the password will be required). Once the sftp session is open, just type in the terminal:
> put -r Desktop/example.py /home/student/Desktop/
This command will copy your Python script (located on your PC) to the Desktop folder of the assigned node. Similarly, you can retrieve files from the node to your computer using:
> get -r /home/student/example.py
assuming the file is located in the path /home/student/ on the node.
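Alternatively, the same transfers can be done with scp in a single command, without opening an sftp session (paths are illustrative):

> scp Desktop/example.py username@<IP_Node>:/home/student/Desktop/
> scp username@<IP_Node>:/home/student/example.py Desktop/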
You can create Python virtual environments on the node to replicate the package setup of your personal computer and ensure full compatibility. These environments work similarly to Conda environments. To create a virtual environment, run the following commands:
> mkdir venv
> python -m venv venv/name_venv
> source venv/name_venv/bin/activate
Now the name_venv virtual environment is active, and you can install all the required dependencies with pip.
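To replicate the packages installed on your PC, a common approach is to export them to a requirements file locally and install it inside the activated environment on the node (the file name requirements.txt is just a convention). On your PC:

> pip freeze > requirements.txt

Then, after transferring requirements.txt to the node and activating the environment there:

> pip install -r requirements.txt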
To run your scripts on the assigned node, you need to create a txt configuration file. It must be named "username_config.txt". This file should contain one or more commands of the following form:
python /home/student/example.py parameter1 parameter2
where /home/student/example.py should be replaced with the absolute path (on the assigned node) of the Python file to be run, and parameter1, parameter2 with the required parameters (if any). Multiple rows can be specified in the file, and each row can have different parameters (e.g., different seeds); see the sketch below.
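For example, a configuration file that runs a hypothetical train.py script with three different seeds could look like this:

python /home/student/train.py 0
python /home/student/train.py 1
python /home/student/train.py 2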
In order to use the mini-cluster, once you are logged into the node via SSH, you should clone this repo: https://github.com/Isla-lab/node_laucher.
In order to use the GPU, you should follow these steps:
> cd node_laucher
> ./add_paths
To run your jobs on the assigned node of the mini-cluster, just type:
> screen -S name_screen -dm bash -c 'python node.py username /home/username/path_to_your_config/your_config.txt GPU n_parallel'
Screen, or GNU Screen, is a terminal multiplexer. In other words, you can start a screen session and then open any number of windows (virtual terminals) inside it. Processes running in Screen continue to run when their window is not visible, even if you get disconnected. DO NOT use Ctrl+C to stop your script, otherwise you will block the priority queue!
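Some standard Screen commands are useful here:
> screen -ls
lists the active sessions and their names, and
> screen -r name_screen
reattaches to the session called name_screen (as shown in the example below, Ctrl+A followed by Ctrl+D detaches it again).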
Parameters:
- name_screen: name of the screen session, used to reattach to it later.
- username: the username used to log in to SSH.
- /home/username/path_to_your_config/your_config.txt: absolute path to the configuration file to be run inside the node (you can retrieve the absolute path by typing the pwd command in the terminal)
- GPU: boolean (True/False) indicating whether or not you want to use the GPU
- n_parallel: if GPU == False, an integer ≥ 1 indicating how many rows of the username_config.txt file you want to run in parallel.
Regarding the n_parallel parameter, as specified above, it indicates how many lines of the configuration .txt file run in parallel. Please run some preliminary tests on your own machine to understand the load on the CPU cores before launching jobs on the node; a quick check is sketched below. The next section reports best practices for preparing Python scripts to catch runtime and other errors.
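For instance, a minimal sketch to check how many cores your machine exposes before choosing n_parallel:

import os

# Number of CPU cores visible to Python; keep n_parallel well below this value
print(f"Available CPU cores: {os.cpu_count()}")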
Suppose we (username: student) want to run the example.py file created on our PC. Suppose the file looks something like this:
import sys
import time
import logging
from datetime import datetime

# Write all log messages to a file named after the current date and time
logging.basicConfig(filename=f'output_{datetime.now().strftime("%d-%m-%Y_%H:%M:%S")}.log', level=logging.DEBUG)
logger = logging.getLogger(__name__)

try:
    # Read the first command-line parameter, log it, and print it
    arg1 = sys.argv[1]
    logger.info(arg1)
    print(arg1)
    time.sleep(10)
except Exception as e:
    # Any runtime error ends up in the log file instead of being lost
    logger.error(e)
This script creates a log file where any errors will be saved, and it prints to the screen the parameter passed to it. Note that it is good practice to wrap your code in try-except blocks in order to figure out why your script possibly does not work. We then create the .txt configuration file to run our script on the assigned node. Specifically, suppose we want to use only the CPU and run 2 executions in parallel at a time. We write the file as follows:
python /home/student/Desktop/example.py hello_1
python /home/student/Desktop/example.py hello_2
python /home/student/Desktop/example.py hello_3
python /home/student/Desktop/example.py hello_4
python /home/student/Desktop/example.py hello_5
python /home/student/Desktop/example.py hello_6
python /home/student/Desktop/example.py hello_7
Hence, each line specifies the command (python), the absolute path where our script was copied on the assigned node, and, as the last arguments, any parameters that our script requires (if needed).
First of all, we connect via VPN to the UNIVR network. Now that we are connected, let's access the assigned node via SSH (for this example, the node will be called server). Open a terminal and type:
ssh student@<IP_Node_server>
If everything is correct you should see in the terminal:
student@pop-os:~$
Now open another terminal window and navigate to the folder where the files we want to copy are located. Here we again access the assigned node, but this time with sftp:
sftp student@<IP_Node_server>
If everything is correct you should see in the terminal:
Connected to <IP_Node_server>.
sftp>
We now copy the files to the assigned node using the sftp session with the commands:
sftp> put -r example.py /home/student/Desktop/
sftp> put -r student_config.txt /home/student/Desktop/
Now, by typing the ls command in the Desktop folder of the assigned node, we should see our files.
First of all, install paho-mqtt and PyYAML using these commands:
pip install paho-mqtt
pip install pyyaml
Let's now clone this repo on the assigned node. In the SSH terminal, type:
git clone https://github.com/Isla-lab/node_laucher.git
To launch the script on the assigned machine, we type the following command in the SSH terminal:
screen -S name_screen -dm bash -c 'python node_laucher/node.py student /home/student/Desktop/student_config.txt False 2'
This command will run two rows of your student_config.txt file in parallel on the CPU until the end of the file. To check the terminal, use:
screen -r name_screen
To detach from the screen, press Ctrl+A followed by Ctrl+D. The result on your screen should be:
Your job is starting...
hello_2
hello_1
hello_3
hello_4
hello_6
hello_5
hello_7
End all your jobs!
[screen is terminating]
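When all your jobs have ended, you can retrieve the output and log files from the node with sftp, for example (the exact file names and locations depend on your script and on where it was launched):

sftp student@<IP_Node_server>
sftp> get /home/student/Desktop/*.log

(the sftp get command supports wildcards).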