-
@timpal0l, since you are doing data-parallel training, global_batch_size >= gpu_count: each GPU must receive at least one sample.
Yes, this approach provides the most flexibility when varying GPU count.
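For concreteness, here is a minimal sketch with a hypothetical helper (numbers picked for illustration) of the identity DeepSpeed enforces, `train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size`. With the global and per-GPU sizes pinned, the inferred accumulation steps shrink as GPUs are added, bottoming out at one sample per GPU:

```python
# Sketch of the DeepSpeed batch-size identity:
#   train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size
# With the global and per-GPU sizes pinned, accumulation absorbs the extra GPUs.

def inferred_grad_accum(train_batch_size: int, micro_batch_per_gpu: int, world_size: int) -> int:
    """Gradient accumulation steps implied by a fixed global and per-GPU batch size."""
    steps, remainder = divmod(train_batch_size, micro_batch_per_gpu * world_size)
    if remainder != 0 or steps < 1:
        raise ValueError(
            f"global batch {train_batch_size} is not divisible by "
            f"{micro_batch_per_gpu} * {world_size} (need at least 1 sample per GPU)"
        )
    return steps

# Illustrative numbers: global batch 512, micro batch 1, 4 GPUs per node.
for gpus in (64, 128, 256, 512):  # 16, 32, 64, 128 nodes
    print(gpus, "GPUs ->", inferred_grad_accum(512, 1, gpus), "accumulation steps")
# 64 -> 8, 128 -> 4, 256 -> 2, 512 -> 1; past 512 GPUs the identity can no longer hold.
```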
-
OK, but not defining gradient accumulation won't make it smaller than one, right? How could I otherwise benefit from using more compute without increasing my batch size?
-
For context, I am doing a full finetune of an LLM (meta-llama/Llama-3.1-8B) on an HPC cluster with A100 (40 GB) GPUs, on a rather large corpus of text.
The training setup uses the SFTTrainer from Hugging Face, and distributed training is handled with accelerate + DeepSpeed ZeRO-2.
The "issue" I am facing is that when I increase the number of nodes is my SLURM config, the global batch sizes increases, since, it seems to be a function of the number of total gpus.
Currently I have the config below, and a single node has 4x A100 GPUs.
So, e.g., using 16 nodes I get global batch size = per_device_train_batch_size * gradient_accumulation_steps * nodes * gpus_per_node = 1 * 8 * 16 * 4 = 512.
If I launch the same training with 32 nodes, the global batch size becomes 1024, which means I get half as many gradient updates. That prevents the model from converging, since the training completes in half the time with the larger batch size (about 1M tokens per batch).
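To make the trade-off concrete, a back-of-the-envelope sketch (the dataset size below is a placeholder, not my actual corpus):

```python
# How node count changes the global batch and the number of optimizer steps
# when gradient_accumulation_steps stays fixed at 8.

per_device_train_batch_size = 1
gradient_accumulation_steps = 8
gpus_per_node = 4
total_samples = 1_000_000  # placeholder dataset size

for nodes in (16, 32):
    world_size = nodes * gpus_per_node
    global_batch = per_device_train_batch_size * gradient_accumulation_steps * world_size
    steps_per_epoch = total_samples // global_batch
    print(f"{nodes} nodes -> global batch {global_batch}, ~{steps_per_epoch} optimizer steps per epoch")

# 16 nodes -> global batch 512,  ~1953 optimizer steps per epoch
# 32 nodes -> global batch 1024, ~976 optimizer steps per epoch (half as many updates)
```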
I can of course lower `gradient_accumulation_steps`, but ideally I would like to lock the global batch size in case I launch a training run on hundreds of nodes.
trainer.yaml:
deepspeed_zero2.json:
Relevant parts of the SLURM config:
Is it possible to lock the global batch size? I read in the documentation that:
Does it make sense to only set `train_micro_batch_size_per_gpu` and `train_batch_size`, and leave `gradient_accumulation_steps` empty?
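To make the question concrete, here is a hypothetical sketch of what I have in mind, based on my reading that DeepSpeed enforces train_batch_size = train_micro_batch_size_per_gpu * gradient_accumulation_steps * world_size and infers the omitted field. The values are just the ones from my 16-node example, written as a Python dict rather than the actual JSON:

```python
# Hypothetical deepspeed_zero2.json contents, shown as a Python dict.
# Only the global and per-GPU batch sizes are pinned; gradient_accumulation_steps
# is omitted so it can be inferred from the world size at launch time.
import json

ds_config = {
    "train_batch_size": 512,              # locked global batch size (assumed target)
    "train_micro_batch_size_per_gpu": 1,
    "zero_optimization": {"stage": 2},
}

def check(world_size: int) -> None:
    """Verify the locked global batch is reachable for a given GPU count."""
    micro = ds_config["train_micro_batch_size_per_gpu"]
    global_bs = ds_config["train_batch_size"]
    if global_bs % (micro * world_size):
        raise ValueError(f"{global_bs} is not divisible by {micro} * {world_size}")
    print(f"world_size={world_size}: gradient_accumulation_steps would be "
          f"{global_bs // (micro * world_size)}")

check(16 * 4)  # 16 nodes x 4 GPUs -> 8
check(32 * 4)  # 32 nodes x 4 GPUs -> 4
print(json.dumps(ds_config, indent=2))
```

I am not sure how this interacts with per_device_train_batch_size and gradient_accumulation_steps coming from the trainer config via accelerate, so those may still need to be kept consistent with the JSON.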