[AutoTuner] Update multi-node scene (#136)
In multi-node scenarios, if some nodes fail, torchrun does not quit
immediately; the task has to wait for the timeout before it is forcibly killed.
This PR kills the task automatically by polling the task status and stopping
the task once its status changes from running to transitional.
Caozhou1995 authored Jun 7, 2024
2 parents 9838ede + 095d1e2 commit 39a20b3
Showing 2 changed files with 8 additions and 2 deletions.
2 changes: 1 addition & 1 deletion flagscale/auto_tuner/record/recorder.py
@@ -309,4 +309,4 @@ def save(self, history):
df = df.reindex(columns=cols)
if "stopped_by_tuner" in df.columns:
df = df.drop(columns=["stopped_by_tuner"])
- df.to_csv(self.path, index=False)
+ df.to_csv(self.path, index=False, escapechar='\\')
8 changes: 7 additions & 1 deletion flagscale/auto_tuner/tuner.py
@@ -242,7 +242,7 @@ def monitor(self):
"""Monitor the task until task timeout or completed."""
# Sleep 3s to ensure the task is started
time.sleep(3)
-
+ running = False
while True:
# If the task timeout, stop monitoring
end_time = time.time()
@@ -262,6 +262,12 @@ def monitor(self):
f"task_{self.cur_strategy['idx']} status: {status.name}")
if status == JobStatus.COMPLETED_OR_IDLE:
break
+ if status == JobStatus.RUNNING:
+ running = True
+ if status == JobStatus.TRANSITIONAL:
+ if running:
+ self.runner.stop()
+ break
except Exception as e:
self.logger.info(e)
time.sleep(self.interval)
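
The status handling added to monitor() can be read as a small state machine: a TRANSITIONAL status is ignored while the task is still starting up, but once RUNNING has been observed, a later TRANSITIONAL status is treated as a failure and the task is stopped. The sketch below illustrates that logic in isolation. It is not the FlagScale implementation: `FakeRunner`, `query_status`, `max_polls`, and the string return values are hypothetical, the `JobStatus` enum here only reuses the member names that appear in the diff, and sleeping and timeout handling are left out. Because indentation was lost in the diff above, the sketch also assumes that `break` belongs inside the `if running:` branch.

```python
from enum import Enum, auto


class JobStatus(Enum):
    # Member names mirror those used in the diff; this enum is a local stand-in.
    RUNNING = auto()
    TRANSITIONAL = auto()
    COMPLETED_OR_IDLE = auto()


class FakeRunner:
    """Hypothetical runner that replays a fixed sequence of statuses."""

    def __init__(self, statuses):
        self._statuses = iter(statuses)
        self.stopped = False

    def query_status(self):
        return next(self._statuses)

    def stop(self):
        self.stopped = True


def monitor(runner, max_polls=100):
    """Poll the runner; stop the task if it leaves RUNNING for TRANSITIONAL."""
    running = False
    for _ in range(max_polls):
        status = runner.query_status()
        if status == JobStatus.COMPLETED_OR_IDLE:
            return "completed"
        if status == JobStatus.RUNNING:
            # Remember that the task got past its startup phase.
            running = True
        if status == JobStatus.TRANSITIONAL and running:
            # RUNNING -> TRANSITIONAL: assume a node failed and stop early
            # instead of waiting for the timeout.
            runner.stop()
            return "stopped"
    return "timeout"
```

Tracking `running` separately is what keeps the startup phase from being mistaken for a failure: a task that has never reported RUNNING is left alone even if it reports TRANSITIONAL several times in a row.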