I'm not sure if it is an issue or a feature, but I've noticed a problem with jobs submitted on a cluster running Slurm 17.02.7.
A job is actually an R script that launches system commands in parallel (each command runs the blastn executable, which is used to find homologies between DNA sequences).
So the R function would look like:
```r
runBlast <- function(query, db, temp, done) {
  # note: "in" is a reserved word in R, so the query file argument is named "query"
  # blastn writes its results to the file name stored in "temp"
  system(paste("blastn -query", query, "-db", db, "-out", temp))
  # the result file is renamed once the blastn command returns
  file.rename(temp, done)
}
```
With mcMap(), I launch a dozen runBlast() calls in parallel from R. For each runBlast() call there is an R process (using 0% CPU) and a child blastn process that does the actual work.
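For context, a minimal sketch of that parallel launch (the file names, database name, and core count below are made up for illustration):

```r
library(parallel)

# Hypothetical input/output paths: one blastn search per query file
queries <- sprintf("query_%02d.fasta", 1:12)
temps   <- sprintf("hits_%02d.tmp",   1:12)
dones   <- sprintf("hits_%02d.tsv",   1:12)

options(mc.cores = 12)  # one forked R worker per runBlast() call

# each forked R process blocks in system() while its blastn child runs
res <- mcMap(runBlast, query = queries, db = "refdb", temp = temps, done = dones)
```

Each forked worker blocks inside system() until its blastn child exits, which matches the 0% CPU R processes described above.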
I have noticed that some result files were incomplete, even though they had been renamed by the file.rename() call and blastn reported no issue.
This seems to happen when Slurm reports:

```
slurmstepd: Exceeded step memory limit at some point
```

But the job wasn't interrupted and is reported as "completed".
So I think that when memory usage exceeds the requested amount, under some conditions, Slurm kills a blastn process but not (at least not immediately) the parent R process, which then simply goes on to rename the file, since the file.rename() call is performed after the system() call.
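If that is indeed what happens, the kill should at least be detectable on the R side, since system() (with the default wait = TRUE) returns the exit status of the command, and a child killed by SIGKILL typically comes back from the shell as 128 + 9 = 137. A hypothetical variant of runBlast() that only renames on a zero status, purely as an illustration:

```r
runBlastChecked <- function(query, db, temp, done) {
  status <- system(paste("blastn -query", query, "-db", db, "-out", temp))
  # non-zero status: blastn failed or was killed (e.g. 137 for SIGKILL),
  # so keep the temporary name rather than marking the result as done
  if (status != 0) {
    warning("blastn exited with status ", status, " for query ", query)
    return(FALSE)
  }
  file.rename(temp, done)
}
```

The FALSE return values would then show up in the list returned by mcMap(), pointing at the affected tasks.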
Anyhow, this is highly problematic for me, as there is no easy way to tell for which blastn task this occurred: there is no indication of which process was killed, and everything looks as if it finished properly.
Note that when Slurm kills a whole job (and reports it as "cancelled"), this doesn't seem to happen.
So I guess my question is: does Slurm kill only "parts" of jobs, i.e., kill some child processes without (immediately) killing the parent process?