Here are examples of running R code on CSC's Puhti supercomputer in four different job styles: interactive, simple serial, array and parallel. For the parallel style there are 3 options using different R libraries: snow, parallel and future. The interactive style is best for developing your scripts, usually with limited test data. For computationally more demanding analyses you have to use Puhti's batch system to request resources and run your script.
The contours are calculated from NLS 10 m DEM data in GeoTIFF format with the terra package. The results are saved in GeoPackage format.
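To make the task concrete, here is a minimal sketch of the core computation, assuming the terra package; the input file name and the contour levels are illustrative assumptions, not values from the exercise scripts.

```r
# Minimal sketch of contour calculation with terra; the file name and
# the contour levels are illustrative assumptions.
library(terra)

dem <- rast("dem10m_mapsheet.tif")                  # read one DEM mapsheet
contours <- as.contour(dem, levels = seq(0, 500, by = 10))
writeVector(contours, "contours_mapsheet.gpkg", overwrite = TRUE)
```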
If an R script is ready on your laptop, then to run it in Puhti you normally only need to edit the paths to files and folders. Sometimes it might also be necessary to install new R libraries.
Additional info: Puhti batch job system documentation
Files in this example:
- mapsheets.txt - a list of mapsheets to process. Open the file. How many mapsheets (= files) are there?
- The input NLS 10 m DEM is already available in Puhti's GIS data folder: /appl/data/geo/mml/dem10m/2019/. If you want to preview the files with QGIS, open one or a few DEM files listed in the mapsheets.txt file. To see all of Finland you can open /appl/data/geo/mml/dem10m/dem10m_hierarchical.vrt.
- Each subfolder contains the files for one job type or parallel library. Each subfolder has 2 files:
- An .R file for defining the tasks to be done.
- A batch job .sh file for submitting the job to Puhti SLURM.
Important
In these scripts project_20xxxxx is used as an example project name; change it to your own CSC project name. cscusername is an example username; replace it with your own username.
- Open the Puhti web interface and log in with your CSC user account.
- Start an interactive session and start RStudio, from the front page or Apps -> RStudio:
  - Reservation: geocomputing_thu (only during the course)
  - Project: project_20xxxxx
  - Partition: interactive (small during the course)
  - CPU cores: 1
  - Memory: 4 GB
  - Local disk: 2 GB
  - Time: 2:00:00
  - R version: r-env/4.4.0
- Get the exercise materials: clone the geocomputing GitHub repository. In RStudio: File -> New project -> Version control -> Git
  - Repository URL: https://github.com/csc-training/geocomputing.git
  - Project directory name: geocomputing
  - Create project as subdirectory of -> Browse -> ... (in upper right corner) -> Path to folder: /scratch/project_20xxxxx/students/cscusername (If you do not yet have such a directory, move to /scratch/project_20xxxxx or /scratch/project_20xxxxx/students/ as the path to folder, create a new directory, and enter it.)
- In the Files window in the lower right, move to the folder R/puhti/01_serial.
- Set the working directory: Session -> Set working directory -> To Files Pane location
- Open 01_serial/Contours_simple.R. This is a basic R script, which uses a for loop to go through all 3 files.
- Check that the needed R libraries are available in Puhti. Which libraries are used in this script? Run the library-loading part of the script in RStudio.
- Run the rest of the commands from RStudio.
- Check that there are 3 GeoPackage files with contours in your working directory in RStudio.
- Optional: check your results with QGIS.
For a simple 1-core batch job, use the same R script as for interactive working.
- Open 01_serial/serial_batch_job.sh.
  - Fix the project name in the beginning of the file to match your CSC project name.
  - Where are the output and error messages written?
  - How many cores are reserved, and for how long?
  - How much memory is reserved?
  - Which partition is used?
  - Which module is used?
- Open another web browser tab with the Puhti web interface. Open a Puhti shell (Tools -> Login node shell) and submit the batch job. (Use Shift-Insert or Ctrl+V to paste.)
cd /scratch/project_20xxxxx/students/cscusername/geocomputing/R/puhti/01_serial
sbatch serial_batch_job.sh
sbatch prints out a job id; use it to check the state and the efficiency of the batch job. Did you reserve a good amount of memory?
seff <job_id>
- See the output in slurm-<job_id>.out and slurm-<job_id>.err for any possible errors and other output.
- To see the files, use RStudio or the Linux command less <filename>.
- With tail -f <filename> it is also possible to follow how the output files are written during the job.
- Check that you have 3 new GeoPackage files in the working folder.
- Check the resources used in another way.
sacct -j <job_id> -o elapsed,TotalCPU,reqmem,maxrss,AllocCPUS
- elapsed – time used by the job
- TotalCPU – time used by all cores together
- reqmem – amount of requested memory
- maxrss – maximum resident set size of all tasks in job
- AllocCPUS – how many CPUs were reserved
In this case the R code takes care of dividing the work between parallel processes, one for each input file. R has several packages for code parallelization; here examples for snow, parallel and future are provided. The future package is likely the easiest to use. future also has two internal options: multicore and cluster. parallel and future with multicore can be used within one node, so at most 40 cores. snow and future with cluster can be used on several nodes.
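As a rough illustration of the file-per-worker pattern (not the exercise script itself), here is a minimal sketch assuming the furrr package for future_map(); process_file() and the input file names are hypothetical placeholders.

```r
# Minimal sketch, assuming the furrr package for future_map();
# process_file() and the input file names are hypothetical.
library(future)
library(furrr)

input_files <- c("file1.tif", "file2.tif", "file3.tif")

process_file <- function(f) {
  # ... calculate contours for one file and write one GeoPackage ...
  f
}

# On a laptop, multisession is an easy stand-in; in the exercise a
# cluster plan is used so that the workers can run on several nodes.
plan(multisession, workers = 3)
results <- future_map(input_files, process_file)
plan(sequential)   # stop the workers
```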
- Open 02_parallel_future/parallel_batch_job_future_cluster.sh, the batch job file for future with cluster.
  - Fix the project name in the beginning of the file to match your CSC project name.
  - --ntasks=4 reserves 4 cores: snow and future with the cluster option require one additional process as the master process, so if there are 3 mapsheets to process, 4 cores have to be reserved.
  - --mem-per-cpu=1000 reserves memory per core.
  - srun apptainer_wrapper exec RMPISNOW --no-save --slave -f Calc_contours_future_cluster.R starts RMPISNOW, which enables using several nodes. RMPISNOW cannot be tested from RStudio.
- Open 02_parallel_future/Calc_contours_future_cluster.R.
  - Note at the end of the script how the cluster is started, how the work is divided between the workers with future_map(), and how the cluster is stopped.
  - The for loop has been removed; each worker calculates one file.
  - Optional: compare to 03_parallel_snow/Calc_contours_snow.R. The future package takes care of exporting variables and libraries to the workers itself; in snow and parallel it is the user's responsibility, as the sketch below illustrates.
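For comparison, a minimal snow sketch of the same pattern, showing the explicit exports; process_file(), the file names and the exported variable are again hypothetical placeholders.

```r
# Minimal sketch, assuming the snow package; names are illustrative.
library(snow)

input_files <- c("file1.tif", "file2.tif", "file3.tif")
contour_interval <- 10   # a variable the workers need

process_file <- function(f) {
  # ... calculate contours for one file using contour_interval ...
  f
}

cl <- makeCluster(3, type = "SOCK")    # in the exercise, RMPISNOW provides the cluster
clusterEvalQ(cl, library(terra))       # libraries must be loaded on every worker
clusterExport(cl, "contour_interval")  # variables must be exported explicitly
results <- clusterApply(cl, input_files, process_file)
stopCluster(cl)
```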
- Submit the parallel job to Puhti:
cd ../02_parallel_future
sbatch parallel_batch_job_future_cluster.sh
- Check with seff and sacct how much time and which resources you used.
Array jobs are an easy way of taking advantage of Puhti's parallel processing capabilities. Array jobs are useful when the same code is executed many times for different datasets or with different parameters. In a GIS context, a typical use case is running some model on a study area that is split into multiple files, where the output from one file has no impact on the result of another area.
In the array job example the idea is that the R script runs one process for every given input file, as opposed to running a for loop within the script. That means that the R script has to read the file to be processed from a command-line argument, as sketched below.
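A minimal sketch of the R side of this pattern (the variable names are illustrative):

```r
# Minimal sketch: the batch job file passes one file name from
# mapsheets.txt as a command-line argument to the R script.
args <- commandArgs(trailingOnly = TRUE)
input_file <- args[1]   # the one mapsheet this array task should process

library(terra)
dem <- rast(input_file)
# ... calculate contours and write one GeoPackage, as in the serial script ...
```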
- Open 05_array/array_batch_job.sh, the array job batch file. Changes compared to the simple serial job:
  - Fix the project name in the beginning of the file to match your CSC project name.
  - The --array parameter tells how many jobs to start. The value 1-3 here means that the $SLURM_ARRAY_TASK_ID variable runs from 1 to 3.
  - sed is used to read the lines from the mapsheets.txt file and make them available as bash script variables.
  - The R script is started with the input file name as an argument, provided by the batch job script.
  - Output from each job is written to slurm-<job_id>_<array_id>.out and slurm-<job_id>_<array_id>.err files.
  - Memory and time allocations are per job.
- The array job R script:
  - The R script reads the input DEM file from the argument, which is set inside the batch job file.
  - The for loop has been removed; each job calculates only one file.
- Submit the array job:
cd ../05_array
sbatch array_batch_job.sh
- Check with seff and sacct how much time and which resources you used.