torch checkpoint creation should use storage class methods (#126)
Checkpoint folders and files should be created relative to the
storage_root of the storage class used for the run. The code in
torch_framework.py that creates the checkpoint folder and the
checkpoint file was hard-coded with POSIX calls, which may not be
appropriate for future storage classes, and the checkpoint folder
was not created relative to the storage_root. The checkpoint()
method now uses the storage class methods instead.
krehm authored Dec 8, 2023
1 parent 80ea7d6 commit 2e324cf
Showing 1 changed file with 4 additions and 5 deletions.
9 changes: 4 additions & 5 deletions dlio_benchmark/framework/torch_framework.py
```diff
@@ -101,15 +101,14 @@ def checkpoint(self, epoch, step_number):
         """
         Performs Checkpointing for a specific step number. It writes different file of different sizes.
         """
-        if not os.path.exists(self.checkpoint_folder):
-            os.makedirs(self.checkpoint_folder)
         my_rank = self.rank()
+        if not self.storage.get_node(self.checkpoint_folder):
+            self.storage.create_node(self.checkpoint_folder)

         model_file = os.path.join(self.checkpoint_folder, f"model-{epoch}-{step_number}.bin")

-        f = open(model_file, "w")
         string_val = "x" * self.args.model_size
-        f.write(string_val)
-        f.close()
+        self.storage.put_data(model_file, string_val)

     @dlp.log
     def compute(self, x, epoch_number, step, computation_time):
```
