Update GPU process cleanup logic in SLURM epilog script
Remove the redundant 'tail' command from the GPU process cleanup checks so that residual GPU processes are detected and terminated more reliably. The script now filters comment lines out of the nvidia-smi output directly with grep, so it no longer depends on how many comment lines nvidia-smi prints.
ilya-da committed Sep 14, 2024
1 parent 6186cf1 commit 5efe4a5
Showing 1 changed file with 3 additions and 3 deletions.
6 changes: 3 additions & 3 deletions roles/slurm/templates/etc/slurm/epilog.d/50-exclusive-gpu
@@ -4,16 +4,16 @@ set -ex
 command -v nvidia-smi || exit 0

 # Clean up processes still running. If processes don't exit node is drained.
-if nvidia-smi pmon -c 1 | tail -n+3 | awk '{print $2}' | grep -v - > /dev/null
+if nvidia-smi pmon -c 1 | grep -v \# | awk '{print $2}' | grep -v - > /dev/null
 then
-for i in $(nvidia-smi pmon -c 1 | tail -n+3 | awk '{print $2}' | grep -v -)
+for i in $(nvidia-smi pmon -c 1 | grep -v \# | awk '{print $2}' | grep -v -)
 do
 logger -s -t slurm-epilog "Killing residual GPU process $i ..."
 kill -9 "$i"
 done
 fi
 sleep 5
-if nvidia-smi pmon -c 1 | tail -n+3 | awk '{print $2}' | grep -v - > /dev/null
+if nvidia-smi pmon -c 1 | grep -v \# | awk '{print $2}' | grep -v - > /dev/null
 then
 logger -s -t slurm-epilog 'Failed to kill residual GPU processes. Draining node ...'
 scontrol update nodename="$HOSTNAME" state=drain reason='Residual GPU processes found'
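
To illustrate why the grep-based filter is less fragile than skipping a fixed number of lines, here is a minimal sketch that runs both pipelines from this commit over a canned sample instead of a live `nvidia-smi`. The function names `old_filter` and `new_filter`, the sample text, and its third header line are assumptions made up for this illustration; real `nvidia-smi pmon` output may differ in columns and header count.

```shell
# Hypothetical wrappers around the two pipelines in the diff; both read
# `nvidia-smi pmon -c 1`-style text on stdin and print the PID column.
old_filter() { tail -n+3 | awk '{print $2}' | grep -v -; }
new_filter() { grep -v '#' | awk '{print $2}' | grep -v -; }

# Made-up sample: three '#' header lines, one busy GPU, one idle GPU.
sample='# gpu        pid  type    sm   mem   command
# Idx          #   C/G     %     %   name
# extra header line
    0       4242     C    10     5   python
    1          -     -     -     -   -'

# The old pipeline skips exactly two lines, so the third header leaks through.
printf '%s\n' "$sample" | old_filter   # prints: extra, then 4242

# The new pipeline drops every line containing '#', however many there are.
printf '%s\n' "$sample" | new_filter   # prints: 4242
```

In the sketch, the old pipeline would hand the word `extra` to `kill -9`, while the new one yields only the real PID. The quoted `'#'` is equivalent to the `\#` used in the script.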