Bugfix slurm daemon hang (#99)
* Change scr_srun to kill daemon process explicitly

Other resource managers are left unchanged: we have not seen errors there, and I have not evaluated this method for them.
becker33 authored Nov 2, 2017
1 parent c77b9ea commit 8da72a0
Showing 1 changed file with 11 additions and 1 deletion.
12 changes: 11 additions & 1 deletion scripts/TLCC/scr_run.in
@@ -126,8 +126,11 @@ fi

 # start background scr_transfer processes (1 per node) if async flush is enabled
 if [ "$SCR_FLUSH_ASYNC" == "1" ] ; then
+  redirect=""
+  if [ -z "$SCR_DEBUG" ]; then redirect="2> /dev/null"; fi
   nnodes=`$bindir/scr_glob_hosts --count --hosts $SCR_NODELIST`
-  srun -W 0 -n${nnodes} -N${nnodes} $bindir/scr_transfer $cntldir/transfer.scrinfo &
+  srun -q -Q --disable-status -W 0 -n${nnodes} -N${nnodes} $bindir/scr_transfer $cntldir/transfer.scrinfo $redirect &
+  daemon_pid=$!
 fi

# enter the run loop
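
A note on the `$redirect` variable in this hunk: shells recognize redirection operators when the command line is parsed, before variables are expanded, so a redirection stored in a variable is passed along as ordinary arguments. The sketch below illustrates this with `fake_transfer`, a hypothetical stand-in for `scr_transfer`. The `run_maybe_quiet` wrapper is one possible alternative and is an assumption, not part of this commit.

```shell
# Hypothetical stand-in for scr_transfer: reports its argument count on
# stdout and writes one diagnostic line to stderr.
fake_transfer() {
    echo "argc=$#"
    echo "diagnostic" >&2
}

# Pattern from the diff: the redirection text arrives via variable expansion.
# The shell splits it into the two words "2>" and "/dev/null" and passes them
# as arguments; stderr is NOT redirected.
redirect="2> /dev/null"
from_variable=$(fake_transfer transfer.scrinfo $redirect 2>&1)

# Alternative (an assumption, not part of the commit): pick the redirection
# inside a wrapper so the shell parses "2>" as a real operator.
run_maybe_quiet() {
    if [ -z "$SCR_DEBUG" ]; then
        "$@" 2> /dev/null
    else
        "$@"
    fi
}

SCR_DEBUG=""
from_wrapper=$(run_maybe_quiet fake_transfer transfer.scrinfo 2>&1)

printf 'variable: %s\n' "$from_variable"
printf 'wrapper:  %s\n' "$from_wrapper"
```

With the variable form the stand-in sees three arguments and its stderr line still appears. With the wrapper it sees one argument and stderr is discarded.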
@@ -275,6 +278,13 @@ while [ 1 ] ; do
   fi
 done

+# Stop the transfer daemon
+if [ $daemon_pid ]; then
+  echo "Killing the transfer daemon process"
+  echo "This may result in an error message from slurmstepd"
+  kill -s SIGINT $daemon_pid
+fi
+
 # stop scr_transfer processes before we attempt to scavenge
 if [ "$SCR_FLUSH_ASYNC" == "1" ] ; then
   # TODO: this doesn't currently do anything
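
The start-then-kill pattern this commit introduces (capture `$!` after backgrounding, signal that PID once the run loop ends) can be sketched in isolation as follows. This is a minimal illustration under stated assumptions, not the script itself: `sleep 30` is a hypothetical stand-in for the backgrounded `srun` step, and `TERM` replaces `INT` because a plain background command started from a non-interactive shell begins with `SIGINT` ignored, whereas the commit relies on `srun` handling `SIGINT` itself. The quoted `-n` test and the `wait` call are additions for robustness and are not in the diff.

```shell
# Hypothetical stand-in for the backgrounded srun step.
sleep 30 &
daemon_pid=$!

# ... the run loop would execute here ...

status=0
if [ -n "$daemon_pid" ]; then
    echo "Killing the transfer daemon process"
    # TERM is used only because the stand-in ignores INT (see the note above).
    kill -s TERM "$daemon_pid"
    # Reap the child so it does not linger; for a signaled child, wait
    # reports 128 plus the signal number.
    wait "$daemon_pid" || status=$?
fi
echo "daemon exit status: $status"
```

Quoting `"$daemon_pid"` keeps the guard well-formed when the variable is unset, which is the case whenever `SCR_FLUSH_ASYNC` is not `1`.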
