improving the doc
zhenghh04 committed Feb 7, 2025
1 parent 49d3f47 commit 843cabb
Showing 3 changed files with 44 additions and 30 deletions.
4 changes: 1 addition & 3 deletions dlio_benchmark/utils/statscounter.py
@@ -389,6 +389,4 @@ def save_data(self):
         json.dump(self.output, outfile, indent=4)
         outfile.flush()
     if self.my_rank == 0:
-        logging.info(f"{utcnow()} outputs saved in RANKID_output.json")
-
-
+        logging.info(f"{utcnow()} outputs saved in RANKID_output.json")
66 changes: 39 additions & 27 deletions docs/source/config.rst
@@ -60,42 +60,50 @@ One can specify the name of the model as
.. list-table::
:widths: 15 10 30
:header-rows: 1

* - name
- default
- The name of the model
* - type
- default
- A string that specifies the type of the model, such as transformer, CNN, etc.
* - model_size
- 10240
- The size of the model parameters per GPU, in bytes
* - optimization_groups
- []
- List of optimization group tensors. Use array notation in YAML.
* - num_layers
- 1
- -1
- Number of layers to checkpoint. Each layer would be checkpointed separately.
* - layer_parameters
- []
- List of parameters per layer. This is used to perform I/O per layer.

In the model section, one can define ``parallelism``, which has three variables: tensor, pipeline, and zero_stage.
By default, zero_stage = -1, in which case there is no sharding at all. If zero_stage = 3, all the model and optimizer states will be sharded across
the data parallel group.
- List of parameters per layer. This is used to perform I/O per layer.
* - parallelism
- {tensor: 1, pipeline: 1, zero_stage: -1}
- Parallelism configuration for the model.
* - transformer
- {hidden_size: 2048, ffn_hidden_size: 8196, vocab_size: 32000}
- Transformer layer configuration for the model.

The model information is used to determine the checkpoint files.
The user can specify the model architecture either through ``optimization_groups`` and ``layer_parameters``, or by specifying the transformer configuration.

``optimization_groups`` is a list of tensor sizes that are grouped together for optimization. Suppose ``optimization_groups`` is specified as [1024, 528];
each rank will then write the following tensors to the checkpoint file: {"0": {"a": array of 1024, "b": array of 1024}, "1": {"a": array of 528, "b": array of 528}}, for a total size of 1024*2 + 528*2 elements. ``layer_parameters`` is a list of parameter counts per layer, used to perform I/O per layer, and ``num_layers`` specifies the number of layers to checkpoint; each layer is checkpointed separately.
Suppose ``layer_parameters`` is [1024, 2048]; each rank in the tensor parallelism group will then write the following tensors to the checkpoint file:
{"0": array of 1024/TP, "1": array of 2048/TP}. Note the difference in how the two are treated internally: optimization group tensors are written in full by each rank, while layer parameter tensors are sharded across the tensor parallel (TP) group.
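
A minimal sketch of such a configuration, using only the keys documented in the table above (the model name is a placeholder and the values are illustrative, not defaults):

.. code-block:: yaml

   model:
     name: custom_model
     type: transformer
     num_layers: 4
     optimization_groups: [1024, 528]
     layer_parameters: [1024, 2048]
     parallelism:
       tensor: 2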

We do not recommend that users specify the model architecture in this way. Instead, we suggest specifying the transformer configuration directly, which is more intuitive.
The ``transformer`` configuration specifies the hidden size, FFN hidden size, and vocab size of the transformer layer, which together determine the
optimization_groups and layer_parameters.
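
For example, a sketch of an equivalent transformer-based specification, using the default transformer values listed below (the model name is a placeholder):

.. code-block:: yaml

   model:
     name: llama_like
     type: transformer
     num_layers: 32
     parallelism:
       tensor: 4
       pipeline: 1
       zero_stage: 3
     transformer:
       hidden_size: 2048
       ffn_hidden_size: 8196
       vocab_size: 32000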

.. list-table::
:widths: 15 10 30
:header-rows: 1
.. attention::

* - tensor
- 1
- Tensor parallelism for model. Used to determine the number of layer model files.
* - pipeline
- 1
- Pipeline parallelism for model.
* - zero_stage
- -1
- Zero stage [-1|1|2|3]. default: -1
   Please note that if ``optimization_groups`` and ``layer_parameters`` are specified, the transformer configuration will be ignored. However, we
   always suggest specifying the transformer configuration, for better readability.

For the transformer architecture, one can define ``transformer`` under ``model``, in which three parameters can be specified:
   Please also note that ZeRO stage 3 is not compatible with pipeline parallelism (``parallelism.pipeline > 1``).

.. list-table::
:widths: 15 10 30
@@ -110,13 +118,8 @@ In which three paramters
* - vocab_size
- 32000
- vocab size for the embedding layer

.. code-block:: yaml

   model: unet3d

No other parameters under this section.


In the future, we will support more non-transformer types of layers.

framework
-------------------
@@ -372,6 +375,15 @@ checkpoint
* - steps_between_checkpoints
- -1
- perform one checkpoint every specified number of steps
* - fsync
- False
- whether to perform fsync after writing the checkpoint
* - model_datatype
- fp16
- the datatype of the model parameters. Available options are fp16, fp32, int8, uint8, bf16.
* - optimizer_datatype
- fp32
- the datatype of the optimizer parameters. Available options are fp16, fp32, int8, uint8, bf16.
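
A hedged sketch combining the checkpoint options above (only keys listed in this table are used; the values are illustrative, not defaults):

.. code-block:: yaml

   checkpoint:
     steps_between_checkpoints: 100
     fsync: True
     model_datatype: bf16
     optimizer_datatype: fp32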

.. note::

4 changes: 4 additions & 0 deletions docs/source/examples.rst
@@ -323,6 +323,10 @@ different ZeRO stages for the model and optimizer.
* For Stage 3, both the model and optimizer are sharded across all the data parallel instances.
* For Stage 1 and 2 the optimizer is sharded across all the data parallel instances, but the model is sharded only across the first data parallel instance.
* Pipeline parallelism and ZeRO 3 are not compatible with each other.

One can also specify the datatypes in which the model and optimizer are saved. By default, the model is saved in fp16 and the optimizer in fp32.
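
For instance, a minimal sketch selecting ZeRO stage 3 together with explicit datatypes (keys as documented in the configuration section; values are illustrative):

.. code-block:: yaml

   model:
     parallelism:
       zero_stage: 3
   checkpoint:
     model_datatype: fp16
     optimizer_datatype: fp32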

The output log will contain the checkpoint duration and throughput. In the final ``summary.json``, ``checkpoint_duration`` and ``checkpoint_io`` will be reported.

.. code-block:: yaml
