-
@timpal0l, since you are doing data-parallel training, global_batch_size >= gpu_count: each GPU must receive at least one sample.
Yes, this approach provides the most flexibility when varying GPU count.
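For concreteness, here is a minimal sketch with a hypothetical helper (numbers picked for illustration) of the identity DeepSpeed enforces, `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`. With the global and per-GPU sizes pinned, the inferred accumulation steps shrink as GPUs are added, bottoming out at one sample per GPU:

```python
# Sketch of the DeepSpeed batch-size identity:
#   train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
# With the global and per-GPU sizes pinned, accumulation absorbs the extra GPUs.

def inferred_grad_accum(train_batch_size: int, micro_batch_per_gpu: int, world_size: int) -> int:
    """Gradient accumulation steps implied by a fixed global and per-GPU batch size."""
    steps, remainder = divmod(train_batch_size, micro_batch_per_gpu * world_size)
    if remainder != 0 or steps < 1:
        raise ValueError(
            f"global batch {train_batch_size} is not divisible by "
            f"{micro_batch_per_gpu} * {world_size} (need at least 1 sample per GPU)"
        )
    return steps

# Illustrative numbers: global batch 512, micro batch 1, 4 GPUs per node.
for gpus in (64, 128, 256, 512):  # 16, 32, 64, 128 nodes
    print(gpus, "GPUs ->", inferred_grad_accum(512, 1, gpus), "accumulation steps")
# 64 -> 8, 128 -> 4, 256 -> 2, 512 -> 1; past 512 GPUs the identity can no longer hold.
```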
-
OK, but not defining gradient accumulation won't make it smaller than one, right? How could I otherwise benefit from using more compute without increasing my batch size?
-
For context, I am doing a full finetune of an LLM (meta-llama/Llama-3.1-8B) on an HPC cluster with A100 (40 GB) GPUs, on a rather large corpus of text.
The training setup uses the SFTTrainer from Hugging Face, and distributed training is handled with accelerate + DeepSpeed ZeRO-2.
The "issue" I am facing is that when I increase the number of nodes is my SLURM config, the global batch sizes increases, since, it seems to be a function of the number of total gpus.
Currently I have the config below, and a single node has 4x A100 GPUs.
So, e.g., using 16 nodes I get global batch size = per_device_train_batch_size * gradient_accumulation_steps * nodes * gpus_per_node = 1 * 8 * 16 * 4 = 512.
If I launch the same training with 32 nodes, the global batch size becomes 1024, which means I get half as many gradient updates. That prevents the model from converging, since the training completes in half the time with the larger batch size (about 1M tokens per batch).
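To make the trade-off concrete, a back-of-the-envelope sketch (the dataset size below is a placeholder, not my actual corpus):

```python
# How node count changes the global batch and the number of optimizer steps
# when gradient_accumulation_steps stays fixed at 8.

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
gpus_per_node = 4
total_samples = 1_000_000  # placeholder dataset size

for nodes in (16, 32):
    world_size = nodes * gpus_per_node
    global_batch = per_device_train_batch_size * gradient_accumulation_steps * world_size
    steps_per_epoch = total_samples // global_batch
    print(f"{nodes} nodes -> global batch {global_batch}, ~{steps_per_epoch} optimizer steps per epoch")

# 16 nodes -> global batch 512,  ~1953 optimizer steps per epoch
# 32 nodes -> global batch 1024, ~976 optimizer steps per epoch (half as many updates)
```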
I can of course lower `gradient_accumulation_steps`, but ideally I would like to lock the global batch size in case I launch a training run on hundreds of nodes.
trainer.yaml:
deepspeed_zero2.json:
Relevant parts of the SLURM config:
Is it possible to lock the global batch size? I read in the documentation that:
Does it make sense to only set `train_micro_batch_size_per_gpu` and `train_batch_size`, and leave `gradient_accumulation_steps` empty?
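To make the question concrete, here is a hypothetical sketch of what I have in mind, based on my reading that DeepSpeed enforces train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size and infers the omitted field. The values are just the ones from my 16-node example, written as a Python dict rather than the actual JSON:

```python
# Hypothetical deepspeed_zero2.json contents, shown as a Python dict.
# Only the global and per-GPU batch sizes are pinned; gradient_accumulation_steps
# is omitted so it can be inferred from the world size at launch time.
import json

ds_config = {
    "train_batch_size": 512,              # locked global batch size (assumed target)
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
}

def check(world_size: int) -> None:
    """Verify the locked global batch is reachable for a given GPU count."""
    micro = ds_config["train_micro_batch_size_per_gpu"]
    global_bs = ds_config["train_batch_size"]
    if global_bs % (micro * world_size):
        raise ValueError(f"{global_bs} is not divisible by {micro} * {world_size}")
    print(f"world_size={world_size}: gradient_accumulation_steps would be "
          f"{global_bs // (micro * world_size)}")

check(16 * 4)  # 16 nodes x 4 GPUs -> 8
check(32 * 4)  # 32 nodes x 4 GPUs -> 4
print(json.dumps(ds_config, indent=2))
```

I am not sure how this interacts with per_device_train_batch_size and gradient_accumulation_steps coming from the trainer config via accelerate, so those may still need to be kept consistent with the JSON.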