GDAL is presented here as a command-line tool used via Linux bash scripts. The example includes:
- Using GDAL tools interactively.
- Using GDAL tools via batch jobs:
- A basic serial batch job, where several files are handled in a bash script for loop, one after the other. Only 1 core is used.
- A parallel batch job, where different files are handled in parallel with GNU parallel. At most one node can be used; in Puhti that is up to 40 cores.
GDAL includes many useful command-line tools, which are usually very efficient. In this example, we reproject multiple files in a folder to another coordinate system and save them in Cloud Optimized GeoTIFF (COG) format. A Linux bash script is used for starting the GDAL commands.
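As a rough sketch, the per-file operation combines reprojection and COG output in a single gdalwarp call. The input/output names and the target EPSG code below are illustrative assumptions; the exact commands are in the exercise scripts.

```shell
# Reproject to ETRS-TM35FIN (EPSG:3067) and write a Cloud Optimized
# GeoTIFF in one step. The COG output driver requires GDAL >= 3.1.
# File names are placeholders, not the exercise data.
gdalwarp -t_srs EPSG:3067 -of COG input.tif output_cog.tif
```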
Important
In these scripts, project_20xxxxx is used as the example project name; change it to your own CSC project name. cscusername is the example username; replace it with your own username.
- Make a folder for the exercise materials and make it your working directory
- Change the project name and username.
mkdir -p /scratch/project_20xxxxx/students/cscusername
cd /scratch/project_20xxxxx/students/cscusername
- Copy the example scripts to Puhti.
git clone https://github.com/csc-training/geocomputing.git
- Move to the GDAL exercise folder.
cd geocomputing/gdal
It is often helpful to check files with gdalinfo and ogrinfo. This is a light-weight task, so it can be done on the login node without an interactive session.
- Open the Puhti web interface and log in with your CSC user account.
- Open login node shell:
Tools -> Login node shell
- To have the GDAL tools available, load the geoconda module. Several other modules also include GDAL tools; see the CSC Docs GDAL page for details.
module load geoconda
- Check a file with gdalinfo. What is the coordinate system? Is the file internally tiled? Does it have overviews?
gdalinfo /appl/data/geo/mml/dem10m/2019/W3/W33/W3333.tif
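The answers can be picked out of the gdalinfo report, for example by filtering it. This is a convenience sketch; the exact report layout may vary between GDAL versions.

```shell
# Look for the CRS (EPSG code), the internal block (tile) size,
# and any overview levels in the gdalinfo report.
gdalinfo /appl/data/geo/mml/dem10m/2019/W3/W33/W3333.tif | grep -iE 'epsg|block=|overviews'
```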
Important
If you want to run more computationally heavy GDAL commands, use an interactive session on a compute node; the easiest way is a Compute node shell in the web interface.
We will use the Puhti web interface's simple file editor for editing the files in this exercise.
- Open another tab in your web browser to Puhti web interface.
- Open Files -> /scratch/project_20xxxxx
- Open the folders: students -> cscusername -> geocomputing -> gdal
- Open the files with Edit, found under the menu on the right of the file name.
- gdal_serial.sh - the bash script; it includes the GDAL commands to be executed. A for loop is used for handling several files.
- Change the project name and username to yours in the GDAL command. Save.
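The serial pattern used in the script can be sketched like this. The process function below is a stand-in for the real GDAL command, and the file names are illustrative.

```shell
#!/bin/bash
# Stand-in for the real per-file GDAL command (e.g. gdalwarp on "$1").
process () {
    echo "processed $1"
}

# Handle the files one after the other, on a single core.
# The real script loops over a file glob such as *.tif.
for f in fileA.tif fileB.tif; do
    process "$f"
done
```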
- gdal_batch_job_serial.sh - the batch job script. Where are the output and error messages written? How many cores are reserved, and for how long? How much memory? Which partition is used? Which modules are used?
- Change the project name in the #SBATCH --account setting. Save.
- In the Login node shell window, submit the script as a batch job:
sbatch gdal_batch_job_serial.sh
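A serial batch job script of this kind typically looks roughly like the following sketch. The resource values are illustrative assumptions, not the exercise's exact settings.

```shell
#!/bin/bash
#SBATCH --account=project_20xxxxx  # your CSC project
#SBATCH --partition=small          # illustrative partition
#SBATCH --time=00:15:00            # illustrative run-time reservation
#SBATCH --ntasks=1                 # serial job: a single core
#SBATCH --mem=4G                   # illustrative memory reservation

# Make the GDAL tools available, then run the serial bash script.
module load geoconda
bash gdal_serial.sh
```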
- See slurm-<job_id>.out and slurm-<job_id>.err for any possible errors and other output.
- To view the files, use the Puhti web interface or Linux less <filename>.
- With tail -f slurm-<job_id>.out you can also follow how the output is written during the job.
- Check that you have new GeoTIFF files in the working folder. Check a result file with gdalinfo. What is the coordinate system? Are the files tiled? Do they have overviews?
- sbatch prints out a job id; use it to check the state and efficiency of the batch job. Did you reserve a good amount of memory? How long did the script run?
seff <job_id>
- Once the job is finished, see slurm-<job_id>.out and slurm-<job_id>.err for any possible errors and other output.
- Check the used resources in another way:
sacct -j <job_id> -o elapsed,TotalCPU,reqmem,maxrss,AllocCPUS
- elapsed – time used by the job
- TotalCPU – time used by all cores together
- reqmem – amount of requested memory
- maxrss – maximum resident set size (peak memory use) of all tasks in the job
- AllocCPUS – how many CPUs were reserved
Important
Most GDAL tools support only a single core, therefore in this example only 1 core is reserved and used. gdalwarp can benefit from multiple cores: if you use gdalwarp, reserve more cores in the batch job file and add the -multi option to the command in the bash script.
GNU parallel is used for handling several files in parallel. In this way, at most one node (= 40 cores in Puhti) can be used. If even more is needed, see the "Workflow for many small, independent runs" tutorial for how to combine this with array jobs.
Open the files with Edit:
- gdal_parallel.sh - the bash script; it includes the GDAL commands to be executed for one file. The for loop is removed.
- No edits are needed; to only view the file, click the file name on the Files page.
- gdal_batch_job_parallel.sh - the batch job script. How many cores are reserved? How much memory? Which modules are used?
- Change the project name in the #SBATCH --account setting. Save.
- Submit the script as a batch job:
sbatch gdal_batch_job_parallel.sh
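The parallel batch job follows roughly this pattern: GNU parallel starts one run of the per-file script per input file, with at most one run per reserved core at a time. This is a sketch, not the exact exercise script; the resource values, module names, and file glob are illustrative assumptions.

```shell
#!/bin/bash
#SBATCH --account=project_20xxxxx  # your CSC project
#SBATCH --partition=small          # illustrative partition
#SBATCH --time=00:15:00            # illustrative run-time reservation
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=4          # illustrative; up to 40 on one Puhti node
#SBATCH --mem=8G                   # illustrative memory reservation

module load parallel geoconda

# Run gdal_parallel.sh once per input file; parallel substitutes each
# file name for {} and keeps at most $SLURM_CPUS_PER_TASK runs going.
find . -name '*.tif' | parallel -j "$SLURM_CPUS_PER_TASK" bash gdal_parallel.sh {}
```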
- Check the parallel batch job results with seff. Did you reserve a good amount of memory? What was the CPU efficiency? How long did the script run?