Fix tp mem cache #203
Conversation
…rentiable distributed operations
dtype=tensor.dtype,
requires_grad=tensor.requires_grad,
)
unsharded_tensor = MemoryBuffer().get("dist", (unsharded_batch_size, *rest_size), dtype=tensor.dtype)
pass device=tensor.device and requires_grad=tensor.requires_grad as well?
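To make the suggestion concrete, here is a minimal, hypothetical buffer sketch (not nanotron's actual MemoryBuffer, whose signature may differ): it keys its cached flat allocation by name, dtype, and device, which is why the device has to be passed through explicitly.

```python
import torch

class SimpleMemoryBuffer:
    """Hypothetical sketch of a reusable buffer cache, keyed by (name, dtype, device)."""

    _buffers = {}  # shared across instances, like a process-wide cache

    def get(self, name, shape, dtype, device):
        numel = 1
        for dim in shape:
            numel *= dim
        key = (name, dtype, device)
        buffer = self._buffers.get(key)
        if buffer is None or buffer.numel() < numel:
            # Gathered activations never need requires_grad here: the custom
            # backward of the distributed op produces the gradient explicitly.
            buffer = torch.empty(numel, dtype=dtype, device=device, requires_grad=False)
            self._buffers[key] = buffer
        return buffer[:numel].view(shape)

# Hypothetical usage mirroring the suggestion above:
# unsharded_tensor = SimpleMemoryBuffer().get(
#     "dist", (unsharded_batch_size, *rest_size), dtype=tensor.dtype, device=tensor.device
# )
```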
dtype=tensor.dtype,
requires_grad=tensor.requires_grad,
sharded_tensor = MemoryBuffer().get(
    "dist", (unsharded_batch_size // group.size(), *rest_size), dtype=tensor.dtype
)
same as above
nice catch! lgtm
Thanks for the comments! I have one question regarding requires_grad: in principle we shouldn't require gradients on the gathered tensors, right? The custom backward handles the gradient computation for the parameters anyway. At least, training runs seamlessly without setting requires_grad on those tensors.
Yeah, I think we can set it to False here (it seems like Megatron does the same as well).
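For context, this is the pattern being discussed, sketched under the assumption that the gather is wrapped in a custom autograd.Function the way Megatron does it: the gathered buffer can stay requires_grad=False because the gradient is produced explicitly in backward() rather than traced through the buffer.

```python
import torch
import torch.distributed as dist

class _GatherAlongFirstDim(torch.autograd.Function):
    """Sketch only (not the PR's exact code): differentiable all-gather whose
    output buffer does not require grad."""

    @staticmethod
    def forward(ctx, tensor, group):
        ctx.group = group
        out_shape = (tensor.shape[0] * group.size(), *tensor.shape[1:])
        # Plain buffer: no autograd tracking is needed on the gathered activation.
        gathered = torch.empty(out_shape, dtype=tensor.dtype, device=tensor.device)
        dist.all_gather_into_tensor(gathered, tensor.contiguous(), group=group)
        return gathered

    @staticmethod
    def backward(ctx, grad_output):
        group = ctx.group
        shard_shape = (grad_output.shape[0] // group.size(), *grad_output.shape[1:])
        grad_input = torch.empty(shard_shape, dtype=grad_output.dtype, device=grad_output.device)
        # Each rank's shard gradient is the sum of contributions from all ranks.
        dist.reduce_scatter_tensor(grad_input, grad_output.contiguous(), group=group, op=dist.ReduceOp.SUM)
        return grad_input, None
```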
Hi, thanks for the PR, it's really nice! I tested it by training for 100 steps on the Tiny Story dataset and compared the loss with our current code, and I found an abnormal difference. Could you check whether you observe the same thing on your side? This is my config file; you may have to change it a little, but the idea is to compare the loss before and after the change with the same hyperparameters. Thanks a lot for the work.
That's interesting. Thanks for letting me know, I will investigate further and come back with some results soon :)
I was able to reproduce the error. To fix the issue, I followed Megatron's design and fused the all-gather and linear operations into a single module. The loss progression now matches the main branch.

I added two configurations of the optimization. Since the all-gather and the linear live in the same module, we can control whether to recompute the all-gather during the backward pass or cache it instead. As expected, recomputing yields larger memory savings at the cost of throughput, but both modes are more memory-efficient than the current implementation, and both provide at least comparable tok/sec to the current main. The behavior is controlled by a configuration flag.

I attach wandb logs of four runs on two different configurations that validate these claims. In both, blue is the baseline (main branch) implementation, red is the incorrect first version of this PR, green is the no-recompute mode (moderate memory savings and slightly faster than the baseline), and purple is the recompute mode (large memory savings and, on average, as fast as the baseline). The first plots correspond to the tiny llama configuration you shared earlier; the second plot corresponds to a llama8b run. Except for the incorrect version, all lines are pretty much identical in the loss curves.

Let me know if you have any suggestions.
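As a rough illustration of the design described above, a fused all-gather + linear autograd function with a recompute switch could look like the following; the class name, flag name, and caching details are illustrative assumptions, not the PR's actual API.

```python
import torch
import torch.distributed as dist

class _AllGatherLinear(torch.autograd.Function):
    """Sketch of the fused all-gather + linear idea (illustrative, not the PR's code)."""

    @staticmethod
    def forward(ctx, input_, weight, group, recompute_allgather):
        gathered_shape = (input_.shape[0] * group.size(), *input_.shape[1:])
        total_input = torch.empty(gathered_shape, dtype=input_.dtype, device=input_.device)
        dist.all_gather_into_tensor(total_input, input_.contiguous(), group=group)

        ctx.group = group
        ctx.recompute_allgather = recompute_allgather
        if recompute_allgather:
            # Keep only the small sharded input and re-gather it in backward:
            # more communication, much lower activation memory.
            ctx.save_for_backward(input_, weight)
        else:
            # Cache the full gathered input: faster backward, higher memory.
            ctx.save_for_backward(total_input, weight)
        return total_input @ weight.t()

    @staticmethod
    def backward(ctx, grad_output):
        saved_input, weight = ctx.saved_tensors
        group = ctx.group

        if ctx.recompute_allgather:
            gathered_shape = (saved_input.shape[0] * group.size(), *saved_input.shape[1:])
            total_input = torch.empty(gathered_shape, dtype=saved_input.dtype, device=saved_input.device)
            dist.all_gather_into_tensor(total_input, saved_input.contiguous(), group=group)
        else:
            total_input = saved_input

        grad_input = (grad_output @ weight).contiguous()
        grad_weight = (
            grad_output.reshape(-1, grad_output.shape[-1]).t()
            @ total_input.reshape(-1, total_input.shape[-1])
        )

        # Scatter the input gradient back to the sharded layout, summing over ranks.
        shard_shape = (grad_input.shape[0] // group.size(), *grad_input.shape[1:])
        sub_grad_input = torch.empty(shard_shape, dtype=grad_input.dtype, device=grad_input.device)
        dist.reduce_scatter_tensor(sub_grad_input, grad_input, group=group, op=dist.ReduceOp.SUM)
        return sub_grad_input, grad_weight, None, None
```

The flag only changes what gets saved for the backward pass, which is exactly where the memory-vs-throughput trade-off described above comes from.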
sub_grad_input = torch.empty(
    input_size, dtype=total_input.dtype, device=total_input.device, requires_grad=False
)
dist.reduce_scatter_tensor(sub_grad_input, grad_input, group=group, op=dist.ReduceOp.SUM)
Seems like dist.reduce_scatter needs grad_input to be contiguous (cf. https://github.com/pytorch/pytorch/blob/2b267fa7f28e18ca6ea1de4201d2541a40411457/torch/distributed/nn/functional.py#L305). I am not sure that grad_input = grad_output @ weight is contiguous (although you do have grad_output = grad_output.contiguous()). Maybe, to be sure, we should do grad_input = grad_input.contiguous() before running the reduce_scatter? What do you think?
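A minimal sketch of the suggested guard, reusing the names from the snippet above (so not self-contained on its own):

```python
# The reduce-scatter collective expects contiguous tensors, so be explicit here.
grad_input = grad_output @ weight
grad_input = grad_input.contiguous()  # no-op if the matmul output is already contiguous
sub_grad_input = torch.empty(
    input_size, dtype=total_input.dtype, device=total_input.device, requires_grad=False
)
dist.reduce_scatter_tensor(sub_grad_input, grad_input, group=group, op=dist.ReduceOp.SUM)
```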
Updated the PR with the suggestions mentioned! Let me know if I'm missing something.
All points were addressed, LGTM!
Nanotron seems to consume disproportionately more memory for its activations than Megatron. This is due to at least the following factors:
Attached: Memory traces of the default nanotron implementation (which OOMs), the current PR implementation, and Megatron. The traces are from the first rank of a TP=8, PP=4, DP=1 llama70b run of 5 iterations (sequence length 8k, micro-batch size 1, gradient accumulation 4, synchronous TP and reduce_scatter mode).
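For anyone who wants to reproduce the comparison, traces like these can be captured with PyTorch's built-in memory-snapshot tooling (standard PyTorch, not part of this PR); a minimal sketch:

```python
import torch

# Start recording CUDA allocator events (available in recent PyTorch releases).
torch.cuda.memory._record_memory_history(max_entries=100_000)

# ... run a handful of training iterations here ...

# Dump a snapshot that can be inspected at https://pytorch.org/memory_viz
torch.cuda.memory._dump_snapshot("memory_trace.pickle")
torch.cuda.memory._record_memory_history(enabled=None)  # stop recording
```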
I think these changes are important, as they allow training larger models with significantly lower memory requirements.
Let me know if you have any suggestions, and I'd be happy to make adjustments to upstream this feature! :)