
SLURM on aims2


Running with SLURM on aims2

Configuration and Accounts

SLURM is a resource manager that provides cooperative parallel computing among several users. SLURM is available on aims2, which has 8 nodes named greyworm1, ..., greyworm8; these are the resources. These nodes are reachable only from aims1, aims2, or other aims/pcmdi-managed nodes that have access to the private network. Tony can check this for you. Once your account is set up, you can log in to them, though you rarely need to do so. There is also an NFS-mounted drive at /opt/nfs. You'll need a directory there set up for your own use. See Tony for an account.

  • Note: the SLURM client tools, e.g. sinfo, squeue, srun, etc., can be installed on other nodes, and in fact have been installed on aims1. The examples here assume aims2.
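For example, to see the state of the greyworm nodes and the current job queue, the standard SLURM client commands can be run from aims2 (the exact output depends on the site configuration):

# List the nodes SLURM knows about and their state (idle, alloc, down, ...)
sinfo -N -l

# Show the job queue; add -u username to restrict it to your own jobs
squeue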

Setup uvcdat

Once your account is setup, login to aims2 with

ssh -Y aims2

This will pass your X environment to aims2. Note that it does not pass to any greyworm node, which is a problem if you're running an application that creates PNG files; the diagnostics do just that. This issue is solved by the build described next. Once your /opt/nfs/username directory is set up, go there and clone uvcdat into a directory named uvcdat. The goal is to create a version that runs from this NFS directory. Next, create a build directory such as build_nfs and move into it. Run

cmake ../uvcdat/ -DCMAKE_INSTALL_PREFIX=../nfs_uvcdat -DCDAT_BUILD_GUI=OFF -DCDAT_BUILD_PARALLEL=ON -DCDAT_BUILD_OFFSCREEN=ON

Once that completes, run

make -j8

A version of uvcdat is now available by running

source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.csh
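For reference, the whole build sequence looks roughly like the sketch below. Replace username with your own directory; the git repository URL is an assumption (the UV-CDAT project on GitHub), so substitute whatever clone URL you normally use.

cd /opt/nfs/username
git clone https://github.com/UV-CDAT/uvcdat.git uvcdat    # repository URL assumed
mkdir build_nfs
cd build_nfs
cmake ../uvcdat/ -DCMAKE_INSTALL_PREFIX=../nfs_uvcdat -DCDAT_BUILD_GUI=OFF -DCDAT_BUILD_PARALLEL=ON -DCDAT_BUILD_OFFSCREEN=ON
make -j8
source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.csh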

Three features are enabled: parallel processing, offscreen creation of PNG files, and no GUI. Note that the last two features are related. To view PNG files you'll need a second session in which to run

gthumb filename

Also, put the following in your .login:

setenv UVCDAT_ANONYMOUS_LOG no
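If your login shell is bash rather than csh/tcsh, the equivalent line for your .bashrc would be

export UVCDAT_ANONYMOUS_LOG=no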

Setup uvcmetrics

The version of uvcmetrics in uvcdat is whatever version was included at the time of the release. To get a later version, create a uvcmetrics directory in /opt/nfs/username and go into it. Clone a branch from GitHub (devel for the latest) and, to make sure it installs into the version of uvcdat just created, source that uvcdat's setup_runtime script first, then run

python setup.py install
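Putting those steps together, the sequence looks roughly like the sketch below; the repository URL is an assumption (the UV-CDAT project on GitHub).

cd /opt/nfs/username
git clone -b devel https://github.com/UV-CDAT/uvcmetrics.git uvcmetrics    # repository URL assumed
cd uvcmetrics
source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.csh    # so the install goes into the uvcdat just built
python setup.py install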

Interacting with SLURM

SLURM has 2 basic interfaces: srun and sbatch. Run srun with

srun -N1 mpirun -n 2 python /opt/nfs/username/uvcmetrics/src/python/mpi_examples/simple.py

Running this way waits until the necessary resources are available, which could take a long time. The better way is to use sbatch, described below. First, let's pick apart what is happening.

-N1 asks for a single node; for example, -N4 asks for 4 nodes. Next, mpirun -n 2 asks mpirun to run with 2 processes; increase to 4 to use 4 processes, and so on. The application that actually executes is simple.py, which is located in the specified directory. Note that the path is fully qualified; anything shorter would probably fail.
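For example, to run 4 MPI processes on a single node, the same command becomes

srun -N1 mpirun -n 4 python /opt/nfs/username/uvcmetrics/src/python/mpi_examples/simple.py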

The better way of interacting with SLURM is with sbatch, which seems to require a shell script. In this case, create a script named simple.sh with the following content

#!/bin/bash

source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.sh

mpirun -n 2 python /opt/nfs/username/uvcmetrics/src/python/mpi_examples/simple.py

Then run

sbatch -N1 simple.sh
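Alternatively, SLURM lets you embed the options in the script itself as #SBATCH directives, so the sbatch command line stays short. A sketch of simple.sh written that way (the job name and output file name are arbitrary choices):

#!/bin/bash
#SBATCH -N 1                  # one node
#SBATCH -J simple             # job name shown by squeue
#SBATCH -o simple-%j.out      # stdout/stderr go here; %j is the job id

source /opt/nfs/username/nfs_uvcdat/bin/setup_runtime.sh
mpirun -n 2 python /opt/nfs/username/uvcmetrics/src/python/mpi_examples/simple.py

It is then submitted with just sbatch simple.sh.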

Either way, sbatch puts the job in a queue that SLURM manages. To see whether your job is running or waiting, run

squeue
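To narrow the listing to your own jobs, or to cancel one, the standard SLURM commands are

squeue -u username        # only your jobs
scancel jobid             # cancel a job, using the id shown by squeue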

Useful links

http://www.nccs.nasa.gov/primer/slurm/slurm.html
