update gpu tutorial

bihealth · Aug 29, 2024 · ac11aab · ac11aab
1 parent a7f18c2
commit ac11aab
Showing 1 changed file with 146 additions and 144 deletions.
diff --git a/bih-cluster/docs/how-to/connect/gpu-nodes.md b/bih-cluster/docs/how-to/connect/gpu-nodes.md
@@ -1,144 +1,146 @@
-# How-To: Connect to GPU Nodes
-
-The cluster has seven nodes with four Tesla V100 GPUs each: `hpc-gpu-{1..7}` and one node with 10 A400 GPUs: `hpc-gpu-8`.
-
-Connecting to a node with GPUs is easy.
-You simply request a GPU using the `--gres=gpu:$CARD:COUNT` (for `CARD=tesla` or `CARD=a40`) argument to `srun` and `batch`.
-This will automatically place your job in the `gpu` partition (which is where the GPU nodes live) and allocate a number of `COUNT` GPUs to your job.
-
-!!! info
-    Fair use rules apply.
-    As GPU nodes are a limited resource, excessive use by single users is prohibited and can lead to mitigating actions.
-    Be nice and cooperative with other users.
-    Tip: `getent passwd USER_NAME` will give you a user's contact details.
-
-!!! warning "Interactive Use of GPU Nodes is Discouraged"
-
-    While interactive computation on the GPU nodes is convenient, it makes it very easy to forget a job after your computation is complete and let it run idle.
-    While your job is allocated, it blocks the **allocated** GPUs and other users cannot use them although you might not be actually using them.
-    Please prefer batch jobs for your GPU jobs over interactive jobs.
-
-    Furthermore, interactive GPU jobs are currently limited to 24 hours.
-    We will monitor the situation and adjust that limit to optimize GPU usage and usability.
-
-    Please also note that allocation of GPUs through Slurm is mandatory, in other words: Using GPUs via SSH sessions is prohibited.
-    The scheduler is not aware of manually allocated GPUs and this interferes with other users' jobs.
-
-## Preparation
-
-We will setup a miniconda installation with `pytorch` testing the GPU.
-If you already have this setup then you can skip this step
-
-```bash
-hpc-login-1:~$ srun --pty bash
-med0703:~$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
-med0703:~$ bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/miniconda3
-med0703:~$ source ~/miniconda3/bin/activate
-med0703:~$ conda create -y -n gpu-test pytorch cudatoolkit=10.2 -c pytorch
-med0703:~$ conda activate gpu-test
-med0703:~$ python -c 'import torch; print(torch.cuda.is_available())'
-False
-med0703:~$ exit
-hpc-login-1:~$
-```
-
-The `False` shows that CUDA is not available on the node but that is to be expected.
-We're only warming up!
-
-## Allocating GPUs
-
-Let us now allocate a GPU.
-The Slurm schedule will properly allocate GPUs for you and setup the environment variable that tell CUDA which devices are available.
-The following dry run shows these environment variables (and that they are not available on the login node).
-
-```bash
-hpc-login-1:~$ export | grep CUDA_VISIBLE_DEVICES
-hpc-login-1:~$ srun --gres=gpu:tesla:1 --pty bash
-med0303:~$ export | grep CUDA_VISIBLE_DEVICES
-declare -x CUDA_VISIBLE_DEVICES="0"
-med0303:~$ exit
-hpc-login-1:~$ srun --gres=gpu:tesla:2 --pty bash
-med0303:~$ export | grep CUDA_VISIBLE_DEVICES
-declare -x CUDA_VISIBLE_DEVICES="0,1"
-```
-
-You can see that you can also reserve multiple GPUs.
-If we were to open two concurrent connections (e.g., in a `screen`) to the same node when allocating one GPU each, the allocated GPUs would be non-overlapping.
-Note that any two jobs are isolated using Linux cgroups ("container" technology 🚀) so you cannot accidentally use a GPU of another job.
-
-Now to the somewhat boring part where we show that CUDA actually works.
-
-```bash
-hpc-login-1:~$ srun --gres=gpu:tesla:1 --pty bash
-hpc-gpu-1:~$ nvcc --version
-nvcc: NVIDIA (R) Cuda compiler driver
-Copyright (c) 2005-2019 NVIDIA Corporation
-Built on Wed_Oct_23_19:24:38_PDT_2019
-Cuda compilation tools, release 10.2, V10.2.89
-hpc-gpu-1:~$ source ~/miniconda3/bin/activate
-hpc-gpu-1:~$ conda activate gpu-test
-hpc-gpu-1:~$ python -c 'import torch; print(torch.cuda.is_available())'
-True
-```
-
-!!! note
-
-    Recently, `--gres=gpu:tesla:COUNT` was often not able to allocate the right partion on it's own.
-    If scheduling a GPU fails, consider additionally indicating the GPU partion explicitely with `--partition gpu` (or `#SBATCH --partition gpu` in batch file).
-
-    Also make sure to read the FAQ entry "[I have problems connecting to the GPU node! What's wrong?](../../help/faq.md#i-have-problems-connecting-to-the-gpu-node-whats-wrong)" if you encounter problems.   
-
-## Bonus #1: Who is using the GPUs?
-
-Use `squeue` to find out about currently queued jobs (the `egrep` only keeps the header and entries in the `gpu` partition).
-
-```bash
-hpc-login-1:~$ squeue | egrep -iw 'JOBID|gpu'
-             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
-                33       gpu     bash holtgrem  R       2:26      1 hpc-gpu-1
-```
-
-## Bonus #2: Is the GPU running?
-
-To find out how active the GPU nodes actually are, you can connect to the nodes (without allocating a GPU you can do this even if the node is full) and then use `nvidia-smi`.
-
-```bash
-hpc-login-1:~$ ssh hpc-gpu-1 bash
-hpc-gpu-1:~$ nvidia-smi
-Fri Mar  6 11:10:08 2020
-+-----------------------------------------------------------------------------+
-| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
-|-------------------------------+----------------------+----------------------+
-| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
-| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
-|===============================+======================+======================|
-|   0  Tesla V100-SXM2...  Off  | 00000000:18:00.0 Off |                    0 |
-| N/A   62C    P0   246W / 300W |  16604MiB / 32510MiB |     99%      Default |
-+-------------------------------+----------------------+----------------------+
-|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
-| N/A   61C    P0   270W / 300W |  16604MiB / 32510MiB |    100%      Default |
-+-------------------------------+----------------------+----------------------+
-|   2  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
-| N/A   39C    P0    55W / 300W |      0MiB / 32510MiB |      0%      Default |
-+-------------------------------+----------------------+----------------------+
-|   3  Tesla V100-SXM2...  Off  | 00000000:AF:00.0 Off |                    0 |
-| N/A   44C    P0    60W / 300W |      0MiB / 32510MiB |      4%      Default |
-+-------------------------------+----------------------+----------------------+
-
-+-----------------------------------------------------------------------------+
-| Processes:                                                       GPU Memory |
-|  GPU       PID   Type   Process name                             Usage      |
-|=============================================================================|
-|    0     43461      C   python                                     16593MiB |
-|    1     43373      C   python                                     16593MiB |
-+-----------------------------------------------------------------------------+
-```
-
-## Fair Share / Fair Use
-
-**Note that allocating a GPU makes it unavailable for everyone else.**
-**There is currently no technical restriction in place on the GPU nodes, so please behave nicely and cooperatively.**
-If you see someone blocking the GPU nodes for long time, then first find out who it is.
-You can type `getent passwd USER_NAME` on any cluster node to see the email address (and work phone number if added) of the user.
-Send a friendly email to the other user, most likely they blocked the node accidentally.
-If you cannot resolve the issue (e.g., the user is not reachable) then please contact [email protected] with your issue.
+# How-To: Connect to GPU Nodes
+
+The cluster has seven nodes with four Tesla V100 GPUs each: `hpc-gpu-{1..7}` and one node with 10 A40 GPUs: `hpc-gpu-8`.
+
+Connecting to a node with GPUs is easy.
+You request one or more GPU cores by adding a generic resources flag to your Slurm job submission via `srun` or `sbatch`.
+- `--gres=gpu:tesla:COUNT` will request NVIDIA V100 cores.
+- `--gres=gpu:tesla:COUNT` will request NVIDIA A40 cores.
+- `--gres=gpu:COUNT` will request any available GPU cores.
+
+Your job will be automatically placed in the Slurm `gpu` partition and allocated a number of `COUNT` GPUs.
+
+!!! info
+    Fair use rules apply.
+    As GPU nodes are a limited resource, excessive use by single users is prohibited and can lead to mitigating actions.
+    Be nice and cooperative with other users.
+    Tip: `getent passwd USER_NAME` will give you a user's contact details.
+
+!!! warning "Interactive Use of GPU Nodes is Discouraged"
+
+    While interactive computation on the GPU nodes is convenient, it makes it very easy to forget a job after your computation is complete and let it run idle.
+    While your job is allocated, it blocks the **allocated** GPUs and other users cannot use them although you might not be actually using them.
+    Please prefer batch jobs for your GPU jobs over interactive jobs.
+
+    Furthermore, interactive GPU jobs are currently limited to 24 hours.
+    We will monitor the situation and adjust that limit to optimize GPU usage and usability.
+
+    Please also note that allocation of GPUs through Slurm is mandatory, in other words: Using GPUs via SSH sessions is prohibited.
+    The scheduler is not aware of manually allocated GPUs and this interferes with other users' jobs.
+
+## Usage example
+### Preparation
+We will setup a miniconda installation with `pytorch` testing the GPU.
+If you already have this setup then you can skip this step
+
+```bash
+hpc-login-1:~$ srun --pty bash
+hpc-cpu-1:~$ wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
+hpc-cpu-1:~$ bash Miniconda3-latest-Linux-x86_64.sh -b -p ~/work/miniconda3
+hpc-cpu-1:~$ source ~/work/miniconda3/bin/activate
+hpc-cpu-1:~$ conda create -y -n gpu-test pytorch cudatoolkit=10.2 -c pytorch
+hpc-cpu-1:~$ conda activate gpu-test
+hpc-cpu-1:~$ python -c 'import torch; print(torch.cuda.is_available())'
+False
+hpc-cpu-1:~$ exit
+hpc-login-1:~$
+```
+
+The `False` shows that CUDA is not available on the node but that is to be expected.
+We're only warming up!
+
+### Allocating GPUs
+
+Let us now allocate a GPU.
+The Slurm schedule will properly allocate GPUs for you and setup the environment variable that tell CUDA which devices are available.
+The following dry run shows these environment variables (and that they are not available on the login node).
+
+```bash
+hpc-login-1:~$ export | grep CUDA_VISIBLE_DEVICES
+hpc-login-1:~$ srun --gres=gpu:tesla:1 --pty bash
+hpc-gpu-1:~$ export | grep CUDA_VISIBLE_DEVICES
+declare -x CUDA_VISIBLE_DEVICES="0"
+hpc-gpu-1:~$ exit
+hpc-login-1:~$ srun --gres=gpu:tesla:2 --pty bash
+hpc-gpu-1:~$ export | grep CUDA_VISIBLE_DEVICES
+declare -x CUDA_VISIBLE_DEVICES="0,1"
+```
+
+As you see, you can also reserve multiple GPUs.
+If we were to open two concurrent connections (e. g. in a `screen`) to the same node when allocating one GPU each, the allocated GPUs would be non-overlapping.
+Note that any two jobs are isolated using Linux cgroups ("container" technology) so you cannot accidentally use a GPU of another job.
+
+Now to the somewhat boring part where we show that CUDA actually works.
+
+```bash
+hpc-login-1:~$ srun --gres=gpu:tesla:1 --pty bash
+hpc-gpu-1:~$ nvcc --version
+nvcc: NVIDIA (R) Cuda compiler driver
+Copyright (c) 2005-2019 NVIDIA Corporation
+Built on Wed_Oct_23_19:24:38_PDT_2019
+Cuda compilation tools, release 10.2, V10.2.89
+hpc-gpu-1:~$ source ~/work/miniconda3/bin/activate
+hpc-gpu-1:~$ conda activate gpu-test
+hpc-gpu-1:~$ python -c 'import torch; print(torch.cuda.is_available())'
+True
+```
+
+!!! note
+    If scheduling a GPU fails, consider explicitely requesting the GPU partion via
+    `--partition gpu` (or `#SBATCH --partition gpu`).
+
+    Also make sure to read the FAQ entry "[I have problems connecting to the GPU node! What's wrong?](../../help/faq.md#i-have-problems-connecting-to-the-gpu-node-whats-wrong)" if you encounter problems.
+
+## Bonus #1: Who is using the GPUs?
+
+Use `squeue` to find out about currently queued jobs (the `egrep` only keeps the header and entries in the `gpu` partition).
+
+```bash
+hpc-login-1:~$ squeue | egrep -iw 'JOBID|gpu'
+             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
+                33       gpu     bash holtgrem  R       2:26      1 hpc-gpu-1
+```
+
+## Bonus #2: Is the GPU running?
+
+To find out how active the GPU nodes actually are, you can connect to the nodes (without allocating a GPU; you can do this even if the node is full) and then use `nvidia-smi`.
+
+```bash
+hpc-login-1:~$ ssh hpc-gpu-1 bash
+hpc-gpu-1:~$ nvidia-smi
+Fri Mar  6 11:10:08 2020
++-----------------------------------------------------------------------------+
+| NVIDIA-SMI 440.33.01    Driver Version: 440.33.01    CUDA Version: 10.2     |
+|-------------------------------+----------------------+----------------------+
+| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
+| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
+|===============================+======================+======================|
+|   0  Tesla V100-SXM2...  Off  | 00000000:18:00.0 Off |                    0 |
+| N/A   62C    P0   246W / 300W |  16604MiB / 32510MiB |     99%      Default |
++-------------------------------+----------------------+----------------------+
+|   1  Tesla V100-SXM2...  Off  | 00000000:3B:00.0 Off |                    0 |
+| N/A   61C    P0   270W / 300W |  16604MiB / 32510MiB |    100%      Default |
++-------------------------------+----------------------+----------------------+
+|   2  Tesla V100-SXM2...  Off  | 00000000:86:00.0 Off |                    0 |
+| N/A   39C    P0    55W / 300W |      0MiB / 32510MiB |      0%      Default |
++-------------------------------+----------------------+----------------------+
+|   3  Tesla V100-SXM2...  Off  | 00000000:AF:00.0 Off |                    0 |
+| N/A   44C    P0    60W / 300W |      0MiB / 32510MiB |      4%      Default |
++-------------------------------+----------------------+----------------------+
+
++-----------------------------------------------------------------------------+
+| Processes:                                                       GPU Memory |
+|  GPU       PID   Type   Process name                             Usage      |
+|=============================================================================|
+|    0     43461      C   python                                     16593MiB |
+|    1     43373      C   python                                     16593MiB |
++-----------------------------------------------------------------------------+
+```
+
+## Fair Share / Fair Use
+
+**Note that allocating a GPU makes it unavailable for everyone else, so please behave nicely and be cooperative.**
+If you see someone blocking the GPU nodes for a long time, first find out who it is.
+You can type `getent passwd USER_NAME` on any cluster node to see their email address (and work phone number if added).
+Send a friendly email, most likely they blocked the node accidentally.
+If you cannot resolve the issue (e. g. the user is not reachable) then please contact [email protected].