R Puhti examples, calculating contours

Here are examples for running R code on CSC's Puhti supercomputer as four different job styles: interactive, simple serial, array and parallel. For the parallel style there are 3 options with different R libraries: snow, parallel and future. The interactive style is best for developing your scripts, usually with limited test data. For computationally more demanding analyses you have to use Puhti's batch system to request the resources and run your script.

The contours are calculated from NLS 10m DEM data in GeoTIFF format with the terra package. The results are saved in GeoPackage format.
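As a rough sketch, the serial version of the workflow looks like this (illustrative code with assumed file and output names; the actual script is 01_serial/Contours_simple.R):

library(terra)

# Read the list of mapsheets, one DEM file path per line.
mapsheets <- readLines("mapsheets.txt")

for (file in mapsheets) {
  dem <- rast(file)                # open the GeoTIFF DEM
  contours <- as.contour(dem)      # calculate the contour lines
  out_file <- paste0(tools::file_path_sans_ext(basename(file)), ".gpkg")
  writeVector(contours, out_file)  # save in GeoPackage format
}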

If an R script is ready on your laptop, then to run it on Puhti you normally need to edit only the paths to files and folders. Sometimes it might also be necessary to install new R libraries.

Additional info: Puhti batch job system documentation

Files in this example:

  • mapsheets.txt file - the list of mapsheets to process. Open the file. How many mapsheets (= files) are there?
  • The input NLS 10m DEM is already available in Puhti's GIS data folder: /appl/data/geo/mml/dem10m/2019/. If you want to preview the files with QGIS, open one or a few of the DEM files listed in the mapsheets.txt file. To see all of Finland, you can open /appl/data/geo/mml/dem10m/dem10m_hierarchical.vrt.
  • In each of the subfolders there are files for one job type or parallel library. Each subfolder has 2 files:
    • An .R file for defining the tasks to be done.
    • A batch job .sh file for submitting the job to Puhti SLURM.

Important

In these scripts project_20xxxxx is used as an example project name; change it to your own CSC project name. cscusername is an example username; replace it with your own username.

Interactive working

  • Open Puhti web interface and log in with your CSC user account.

  • Start an interactive session with RStudio, from the front page or Apps -> RStudio

    • (Reservation: geocomputing_thu, only during course)
    • Project: project_20xxxxx
    • Partition: interactive (small during course)
    • CPU cores: 1
    • Memory: 4 GB
    • Local disk: 2 GB
    • Time: 2:00:00
    • R version: r-env/4.4.0
  • Get the exercise materials by cloning the geocomputing GitHub repository. In RStudio: File -> New project -> Version control -> Git

    • Repository URL: https://github.com/csc-training/geocomputing.git
    • Project directory name: geocomputing
    • Create project as subdirectory of -> Browse -> ... (in upper right corner) -> Path to folder: /scratch/project_20xxxxx/students/cscusername
      • If you do not yet have such a directory, set /scratch/project_20xxxxx or /scratch/project_20xxxxx/students/ as the path to folder, then create a new directory and enter it.
  • In the Files window in the lower right, move to the folder R/puhti/01_serial.

  • Set the working directory. Session -> Set working directory -> To Files Pane location

  • Open 01_serial/Contours_simple.R. This is a basic R script, which uses a for loop to go through all 3 files.

  • Check that the needed R libraries are available on Puhti. Which libraries are used in this script? Run the library loading part in RStudio.

  • Run the rest of the commands from RStudio.

  • Check that there are 3 GeoPackage files with contours in your working directory in RStudio.

  • Optional: check your results with QGIS.

Simple batch job

For a simple 1-core batch job, use the same R script as for interactive working.

  • 01_serial/serial_batch_job.sh. A sketch of such a batch file is shown after these questions.

    • Fix the project name in the beginning of the file to match your CSC project name.
    • Where are output and error messages written?
    • How many cores are reserved, and for how long?
    • How much memory is reserved?
    • Which partition is used?
    • Which module is used?
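The batch file follows roughly this pattern (a sketch only: the resource values here are illustrative, the repository file is the authoritative version):

#!/bin/bash
#SBATCH --account=project_20xxxxx   # your CSC project name
#SBATCH --output=slurm-%j.out       # output messages, %j = job ID
#SBATCH --error=slurm-%j.err        # error messages
#SBATCH --partition=small           # partition to use
#SBATCH --time=00:15:00             # time reservation
#SBATCH --ntasks=1                  # 1 core
#SBATCH --mem-per-cpu=2000          # memory reservation in MB

module load r-env/4.4.0             # R module
srun apptainer_wrapper exec Rscript --no-save Contours_simple.R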
  • Open another web tab with the Puhti web interface, open a Puhti shell (Tools -> Login node shell) and submit the batch job. (Use Shift+Insert or Ctrl+V to paste.)

cd /scratch/project_20xxxxx/students/cscusername/geocomputing/R/puhti/01_serial
sbatch serial_batch_job.sh
  • sbatch prints out a job ID; use it to check the state and the efficiency of the batch job. Did you reserve a good amount of memory?
seff <job_id>
  • See output of slurm-<job_id>.out and slurm-<job_id>.err for any possible errors and other outputs.
    • To view the files, use RStudio or the Linux command less <filename>.
    • With tail -f <filename> you can also follow the output files while they are being written during the job.
  • Check that you have 3 new GeoPackage files in the working folder.
  • Check the resources used in another way.
sacct -j <job_id> -o elapsed,TotalCPU,reqmem,maxrss,AllocCPUS
- elapsed – time used by the job
- TotalCPU – time used by all cores together
- reqmem – amount of requested memory
- maxrss – maximum resident set size of all tasks in job
- AllocCPUS – how many CPUs were reserved

Parallel job

In this case the R code takes care of dividing the work between parallel processes, one for each input file. R has several packages for code parallelization; here examples for snow, parallel and future are provided. The future package is likely the easiest to use. future additionally has two internal options: multicore and cluster. parallel, and future with multicore, can be used within one node, so at most 40 cores. snow, and future with cluster, can be used across several nodes.

  • 02_parallel_future/parallel_batch_job_future_cluster.sh, the batch job file for future with cluster. A sketch of the batch file is shown after this list.

    • Fix the project name in the beginning of the file to match your CSC project name.
    • --ntasks=4 reserves 4 cores: snow, and future with the cluster option, require one additional process for the master, so if there are 3 mapsheets to process, 4 cores have to be reserved.
    • --mem-per-cpu=1000 reserves memory per core
    • srun apptainer_wrapper exec RMPISNOW --no-save --slave -f Calc_contours_future_cluster.R starts RMPISNOW, which enables using several nodes. RMPISNOW cannot be tested from RStudio.
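Putting these together, the batch file looks roughly like this (a sketch: the --ntasks, --mem-per-cpu and srun lines come from the points above, the remaining values are illustrative):

#!/bin/bash
#SBATCH --account=project_20xxxxx
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err
#SBATCH --partition=small
#SBATCH --time=00:15:00
#SBATCH --ntasks=4            # 3 workers + 1 master process
#SBATCH --mem-per-cpu=1000    # memory per core in MB

module load r-env/4.4.0
srun apptainer_wrapper exec RMPISNOW --no-save --slave -f Calc_contours_future_cluster.R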
  • 02_parallel_future/Calc_contours_future_cluster.R. A minimal sketch of the pattern is shown after this list.

    • Note at the end of the script how the cluster is started, how the work is divided between the workers with future_map(), and how the cluster is stopped.
    • The for loop has been removed; each worker calculates one file.
    • Optional: compare to 03_parallel_snow/Calc_contours_snow.R. The future package takes care of exporting variables and libraries to the workers itself; with snow and parallel this is the user's responsibility.
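As a sketch of the pattern (illustrative code, assuming furrr's future_map() and a multisession plan for local testing; the batch job version starts the workers through RMPISNOW instead):

library(terra)
library(future)
library(furrr)

mapsheets <- readLines("mapsheets.txt")

# Start the workers, one per mapsheet.
plan(multisession, workers = 3)

# future_map() replaces the serial for loop: each worker handles one file.
future_map(mapsheets, function(file) {
  dem <- rast(file)
  contours <- as.contour(dem)
  out_file <- paste0(tools::file_path_sans_ext(basename(file)), ".gpkg")
  writeVector(contours, out_file)
  out_file  # return the file name, not the terra object
})

# Stop the workers.
plan(sequential)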
  • Submit the parallel job to Puhti

cd ../02_parallel_future
sbatch parallel_batch_job_future_cluster.sh
  • Check with seff and sacct how much time and resources you used.

Array job

Array jobs are an easy way to take advantage of Puhti's parallel processing capabilities. They are useful when the same code is executed many times for different datasets or with different parameters. In a GIS context, a typical use case is running a model on a study area split into multiple files, where the output from one file has no impact on the result of another area.

In the array job example the idea is that the R script runs once for every given input file, as opposed to running a for loop within the script. This means that the R script has to read the name of the file to be processed from a command-line argument.
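Reading the argument in R looks roughly like this (a sketch; see 05_array/Contours_array.R for the actual script):

library(terra)

# The batch job script passes the DEM file name as the first argument.
args <- commandArgs(trailingOnly = TRUE)
file <- args[1]

dem <- rast(file)
contours <- as.contour(dem)
writeVector(contours, paste0(tools::file_path_sans_ext(basename(file)), ".gpkg"))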

  • 05_array/array_batch_job.sh, the array job batch file. Changes compared to the simple serial job (a sketch of the batch file is shown after this list):

    • Fix the project name in the beginning of the file to match your CSC project name.
    • The --array parameter tells how many jobs to start. The value 1-3 here means that the $SLURM_ARRAY_TASK_ID variable runs from 1 to 3.
      • sed is used to read the corresponding line from the mapsheets.txt file and make it available as a bash variable.
      • The R script is started with the file name as an argument.
    • Output from each job is written to its own slurm-<job_id>_<array_id>.out and slurm-<job_id>_<array_id>.err files.
    • Memory and time allocations are per individual array job.
    • The input file name is provided as an argument in the batch job script to the R script.
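The key additions follow this pattern (a sketch with illustrative resource values; the --array, sed and argument lines reflect the points above):

#!/bin/bash
#SBATCH --account=project_20xxxxx
#SBATCH --output=slurm-%A_%a.out   # %A = job ID, %a = array task ID
#SBATCH --error=slurm-%A_%a.err
#SBATCH --partition=small
#SBATCH --time=00:10:00
#SBATCH --ntasks=1
#SBATCH --mem-per-cpu=1000
#SBATCH --array=1-3                # SLURM_ARRAY_TASK_ID runs from 1 to 3

module load r-env/4.4.0

# Read this task's mapsheet from the corresponding line of mapsheets.txt.
mapsheet=$(sed -n "${SLURM_ARRAY_TASK_ID}p" mapsheets.txt)

# Start the R script with the file name as an argument.
srun apptainer_wrapper exec Rscript --no-save Contours_array.R "$mapsheet"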
  • 05_array/Contours_array.R.

    • The R script reads the input DEM file name from the argument, which is set inside the batch job file.
    • The for loop has been removed; each job calculates only one file.
  • Submit the array job

cd ../05_array
sbatch array_batch_job.sh
  • Check with seff and sacct how much time and resources you used.