# SLURM on aims2
SLURM is a resource manager that provides cooperative parallel computing among several users. SLURM is available on aims2, which has 8 compute nodes named greyworm1, ..., greyworm8; these nodes are the resources. They are only reachable from aims1, aims2, or other aims/pcmdi-managed nodes that have access to the private network; Tony can check this for you. Once your account is set up you can log in to them directly, though you rarely need to do so. There is also an NFS-mounted drive at /opt/nfs, and you'll need a directory there set up for your own use. See Tony for an account.
- Note: the SLURM client tools, e.g. sinfo, squeue, srun, etc., can be installed on other nodes, and in fact are installed on aims1. The examples here assume aims2.
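For example, from aims1 or aims2 you can check the state of the greyworm nodes before submitting anything (the exact partition names reported depend on the local SLURM configuration):
sinfo
sinfo -N -l
The first form summarizes partitions and node states (idle, alloc, down, ...); the second lists one line per node with more detail.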
Once your account is set up, log in to aims2 with
ssh -Y aims2
This passes your X environment to aims2. Note that it is not passed on to the greyworm nodes, which is a problem if you're running an application that creates png files, as the diagnostics do; the offscreen build described below solves this. Once your /opt/nfs/username directory is set up, go there and clone uvcdat into a directory named uvcdat. The goal is to create a version that runs from this NFS directory. Next create a build directory such as build_nfs and move into it. Run
cmake ../uvcdat/ -DCMAKE_INSTALL_PREFIX=../nfs_uvcdat -DCDAT_BUILD_GUI=OFF -DCDAT_BUILD_PARALLEL=ON -DCDAT_BUILD_OFFSCREEN=ON
Once that completes, run
make -j8
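Put together, the checkout and build steps above might look like the following sketch; the repository URL is an assumption (use whichever uvcdat repository you normally clone from) and username is a placeholder for your own directory:
cd /opt/nfs/username
git clone https://github.com/UV-CDAT/uvcdat.git uvcdat
mkdir build_nfs
cd build_nfs
cmake ../uvcdat/ -DCMAKE_INSTALL_PREFIX=../nfs_uvcdat -DCDAT_BUILD_GUI=OFF -DCDAT_BUILD_PARALLEL=ON -DCDAT_BUILD_OFFSCREEN=ON
make -j8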
This version of uvcdat can now be used by running
source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.csh
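A quick check that the environment is active, assuming setup_runtime.csh puts the new install's bin directory first on your PATH:
which python
This should point at /opt/nfs/username/nfs_uvcdat/bin/python.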
Three features are available: parallel processing, offscreen creation of png files, and no GUI; note the last two are related. To view png files you'll need a second session in which to run
gthumb filename
Also, put the following in your .login:
setenv UVCDAT_ANONYMOUS_LOG no
The version of uvcmetrics bundled with uvcdat is whatever version was included at the time of the release. To get a later version, clone a branch of uvcmetrics from github (devel for the latest) into /opt/nfs/username/uvcmetrics, go into that directory, and, with the uvcdat environment above sourced so that the install goes into the uvcdat just created, run
python setup.py install
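A sketch of those steps, assuming the uvcmetrics repository URL below (substitute the one you normally use) and your own username:
cd /opt/nfs/username
git clone -b devel https://github.com/UV-CDAT/uvcmetrics.git uvcmetrics
cd uvcmetrics
source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.csh
python setup.py install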
SLURM has 2 basic interfaces: srun and sbatch. Run srun with
srun -N1 mpirun -n 2 python /opt/nfs/username/uvcmetrics/src/python/mpi_examples/simple.py
Running this way will wait until the necessary resources are available, which could take a long time. The better way is to use sbatch, described below. First, let's pick apart what is happening.
-N1 asks for a single node; for example, -N4 asks for 4 nodes. mpirun -n 2 asks mpirun to run with 2 processes; increase it to 4 to use 4 processes, and so on. The application that actually executes is simple.py, located in the specified directory. Note that it is given as a fully qualified path; anything shorter would probably fail.
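For example, to run the same script with 4 MPI processes on one node:
srun -N1 mpirun -n 4 python /opt/nfs/username/uvcmetrics/src/python/mpi_examples/simple.py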
The better way of interacting with SLURM is with sbatch, which takes a shell script. In this case, create a script named simple.sh with the following content:
#!/bin/bash
source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.sh
mpirun -n 2 python /opt/nfs/username/uvcmetrics/src/python/mpi_examples/simple.py
Then run
sbatch -N1 simple.sh
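The resource options can also be embedded in the script itself using #SBATCH directives, so the command line stays short. A minimal sketch (the job name and output file name are arbitrary choices; %j expands to the job id):
#!/bin/bash
#SBATCH -N 1
#SBATCH -J simple
#SBATCH -o simple.%j.out
source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.sh
mpirun -n 2 python /opt/nfs/username/uvcmetrics/src/python/mpi_examples/simple.py
With the directives in place, the script can be submitted with just
sbatch simple.sh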
Either way, sbatch puts the job in a queue that SLURM manages. To see whether your job is running or waiting, run
squeue
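To restrict the listing to your own jobs, or to cancel one (username and jobid are placeholders):
squeue -u username
scancel jobid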
## Useful links
http://www.nccs.nasa.gov/primer/slurm/slurm.html