torch checkpoint creation should use storage class methods (#126)
Checkpoint folders and files should be created relative to the
storage_root of the storage class used for the run. The code in
torch_framework.py that creates the checkpoint folder and the
checkpoint file was hard-coded with POSIX calls, which may not be
appropriate for future storage classes, and the checkpoint folder
was not created relative to the storage_root. The checkpoint()
method now uses the storage class methods instead.
krehm authored Dec 8, 2023
1 parent 80ea7d6 commit 2e324cf
Showing 1 changed file with 4 additions and 5 deletions.
9 changes: 4 additions & 5 deletions dlio_benchmark/framework/torch_framework.py
```diff
@@ -101,15 +101,14 @@ def checkpoint(self, epoch, step_number):
         """
         Performs Checkpointing for a specific step number. It writes different file of different sizes.
         """
-        if not os.path.exists(self.checkpoint_folder):
-            os.makedirs(self.checkpoint_folder)
         my_rank = self.rank()
+        if not self.storage.get_node(self.checkpoint_folder):
+            self.storage.create_node(self.checkpoint_folder)

         model_file = os.path.join(self.checkpoint_folder, f"model-{epoch}-{step_number}.bin")

-        f = open(model_file, "w")
         string_val = "x" * self.args.model_size
-        f.write(string_val)
-        f.close()
+        self.storage.put_data(model_file, string_val)

     @dlp.log
     def compute(self, x, epoch_number, step, computation_time):
```
