improving the doc
zhenghh04 committed Feb 7, 2025
1 parent 49d3f47 commit 843cabb
Showing 3 changed files with 44 additions and 30 deletions.
4 changes: 1 addition & 3 deletions dlio_benchmark/utils/statscounter.py
@@ -389,6 +389,4 @@ def save_data(self):
         json.dump(self.output, outfile, indent=4)
         outfile.flush()
     if self.my_rank == 0:
-        logging.info(f"{utcnow()} outputs saved in RANKID_output.json")
-
-
+        logging.info(f"{utcnow()} outputs saved in RANKID_output.json")
66 changes: 39 additions & 27 deletions docs/source/config.rst
@@ -60,42 +60,50 @@ One can specify the name of the model as
.. list-table::
:widths: 15 10 30
:header-rows: 1

* - name
- default
- The name of the model
* - type
- default
- A string that specifies the type of the model, such as transformer, CNN, etc.
* - model_size
- 10240
- The size of the model parameters per GPU, in bytes
* - optimization_groups
- []
- List of optimization group tensors. Use array notation in YAML.
* - num_layers
- 1
- -1
- Number of layers to checkpoint. Each layer would be checkpointed separately.
* - layer_parameters
- []
- List of parameters per layer. This is used to perform I/O per layer.

In the model section, one can define ``parallelism``, which has three variables: tensor, pipeline, and zero_stage.
By default, zero_stage = -1, in which case there is no sharding at all. If zero_stage = 3, all the model and optimizer states will be sharded across
the data parallel group.
- List of parameters per layer. This is used to perform I/O per layer.
* - parallelism
- {tensor: 1, pipeline: 1, zero_stage: -1}
- Parallelism configuration for the model.
* - transformer
- {hidden_size: 2048, ffn_hidden_size: 8196, vocab_size: 32000}
- Transformer layer configuration for the model.

The model information is used to determine the checkpoint files.
The user can specify the model architecture either through ``optimization_groups`` and ``layer_parameters``, or by specifying the transformer configuration.

``optimization_groups`` is a list of tensor sizes that are grouped together for optimization. Suppose ``optimization_groups`` is specified as [1024, 528];
each rank will then write the following tensors to the checkpoint file: {"0": {"a": array of 1024, "b": array of 1024}, "1": {"a": array of 528, "b": array of 528}}, for a total size of 1024*2 + 528*2 elements. ``layer_parameters`` is a list of parameter counts per layer, used to perform I/O per layer, and ``num_layers`` specifies the number of layers to checkpoint; each layer is checkpointed separately.
Suppose ``layer_parameters`` is [1024, 2048]; each rank in the tensor parallelism group will then write the following tensors to the checkpoint file:
{"0": array of 1024/TP, "1": array of 2048/TP}. Note the difference in how the two are treated internally: optimization group tensors are written in full by each rank, while layer parameter tensors are sharded across the tensor parallel (TP) group.
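
A minimal sketch of such a configuration, using only the keys documented in the table above (the model name is a placeholder and the values are illustrative, not defaults):

.. code-block:: yaml

   model:
     name: custom_model
     type: transformer
     num_layers: 4
     optimization_groups: [1024, 528]
     layer_parameters: [1024, 2048]
     parallelism:
       tensor: 2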

We do not recommend that users specify the model architecture in this way. Instead, we suggest specifying the transformer configuration directly, which is more intuitive.
The ``transformer`` configuration specifies the hidden size, FFN hidden size, and vocab size of the transformer layer, which together determine the
optimization_groups and layer_parameters.
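
For example, a sketch of an equivalent transformer-based specification, using the default transformer values listed below (the model name is a placeholder):

.. code-block:: yaml

   model:
     name: llama_like
     type: transformer
     num_layers: 32
     parallelism:
       tensor: 4
       pipeline: 1
       zero_stage: 3
     transformer:
       hidden_size: 2048
       ffn_hidden_size: 8196
       vocab_size: 32000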

.. list-table::
:widths: 15 10 30
:header-rows: 1
.. attention::

* - tensor
- 1
- Tensor parallelism for model. Used to determine the number of layer model files.
* - pipeline
- 1
- Pipeline parallelism for model.
* - zero_stage
- -1
- Zero stage [-1|1|2|3]. default: -1
   Please note that if ``optimization_groups`` and ``layer_parameters`` are specified, the transformer configuration will be ignored. However, we
   always suggest specifying the transformer configuration, for better readability.

For the transformer architecture, one can define ``transformer`` under ``model``, in which three parameters can be specified:
   Please also note that ZeRO stage 3 is not compatible with pipeline parallelism (``parallelism.pipeline > 1``).

.. list-table::
:widths: 15 10 30
@@ -110,13 +118,8 @@ In which three paramters
* - vocab_size
- 32000
- vocab size for the embedding layer

.. code-block:: yaml

   model: unet3d

No other parameters under this section.


In the future, we will support more non-transformer types of layers.

framework
-------------------
@@ -372,6 +375,15 @@ checkpoint
* - steps_between_checkpoints
- -1
- perform one checkpoint every specified number of steps
* - fsync
- False
- whether to perform fsync after writing the checkpoint
* - model_datatype
- fp16
- the datatype of the model parameters. Available options are fp16, fp32, int8, uint8, bf16.
* - optimizer_datatype
- fp32
- the datatype of the optimizer parameters. Available options are fp16, fp32, int8, uint8, bf16.
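
A hedged sketch combining the checkpoint options above (only keys listed in this table are used; the values are illustrative, not defaults):

.. code-block:: yaml

   checkpoint:
     steps_between_checkpoints: 100
     fsync: True
     model_datatype: bf16
     optimizer_datatype: fp32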

.. note::

4 changes: 4 additions & 0 deletions docs/source/examples.rst
@@ -323,6 +323,10 @@ different ZeRO stages for the model and optimizer.
* For Stage 3, both the model and optimizer are sharded across all the data parallel instances.
* For Stage 1 and 2 the optimizer is sharded across all the data parallel instances, but the model is sharded only across the first data parallel instance.
* Pipeline parallelism and ZeRO 3 are not compatible with each other.

One can also specify the datatypes in which the model and optimizer are saved. By default, the model is saved in fp16 and the optimizer in fp32.
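
For instance, a minimal sketch selecting ZeRO stage 3 together with explicit datatypes (keys as documented in the configuration section; values are illustrative):

.. code-block:: yaml

   model:
     parallelism:
       zero_stage: 3
   checkpoint:
     model_datatype: fp16
     optimizer_datatype: fp32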

The output log will contain the checkpoint duration and throughput. In the final ``summary.json``, ``checkpoint_duration`` and ``checkpoint_io`` will be reported.

.. code-block:: yaml
