First, thank you for the excellent work on this project!
I encountered an issue while fine-tuning the model across multiple nodes. Specifically, I observed the following behavior:
When using 4 nodes with 1 gradient accumulation step, the total number of training steps is 4 times greater than when using 1 node with 4 gradient accumulation steps, even though the effective batch size is identical in both configurations.
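For context, here is a quick back-of-the-envelope sketch of what I expected (the dataset size and GPU counts below are made up purely for illustration): the number of optimizer steps should depend only on the effective batch size, so 4 nodes × 1 accumulation step and 1 node × 4 accumulation steps should give the same step count.

```python
import math

# Rough sketch with made-up numbers: steps per epoch should depend only on the
# effective batch size, i.e. per_device_batch * gpus_per_node * num_nodes * grad_accum.
def expected_steps(num_samples, per_device_batch, gpus_per_node, num_nodes, grad_accum):
    effective_batch = per_device_batch * gpus_per_node * num_nodes * grad_accum
    return math.ceil(num_samples / effective_batch)

print(expected_steps(100_000, 4, 8, num_nodes=4, grad_accum=1))  # 782
print(expected_steps(100_000, 4, 8, num_nodes=1, grad_accum=4))  # 782, same effective batch
```

Instead, the 4-node run reports roughly 4× as many steps, which is what prompted me to look at the sampler.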
Upon examining the sampler code in the LLaVA-NeXT Trainer implementation (LLaVA-NeXT/llava/train/llava_trainer.py, line 273 at commit 79ef45a), I noticed something that seems off. Here's the relevant snippet:
```python
if self.args.group_by_length:
    lengths = self.train_dataset.lengths
    return LengthGroupedSampler(
        # self.args.train_batch_size * self.args.gradient_accumulation_steps,  # TODO: seems that we should not have gradient_accumulation_steps
        self.args.train_batch_size,
        # world_size=self.args.world_size,
        world_size=self.args.world_size * self.args.gradient_accumulation_steps,  # TODO: seems that this may work?
        lengths=lengths,
    )
```
It appears that the world_size is being multiplied by gradient_accumulation_steps, which might be causing unintended behavior. This seems to increase the number of steps when using multiple nodes, as each node's contribution is scaled incorrectly.
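For clarity, this is roughly the behavior I would have expected, i.e. the variant already hinted at by the commented-out lines in the snippet above. I'm sketching it here only to make my reading concrete, not as a confirmed fix:

```python
if self.args.group_by_length:
    lengths = self.train_dataset.lengths
    return LengthGroupedSampler(
        self.args.train_batch_size,
        # scale only by the number of participating processes; gradient
        # accumulation would then affect how many forward passes make up one
        # optimizer step, not how the data is grouped/sharded
        world_size=self.args.world_size,
        lengths=lengths,
    )
```

If the multiplication is intentional (for example, so that each length-grouped megabatch spans a full accumulation cycle), a short comment in the code explaining that would also clear this up.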
Could you confirm whether this logic is intentional or suggest a solution to address this discrepancy?
Thank you for your time and help!