First, thank you for the excellent work on this project!
I encountered an issue while fine-tuning the model across multiple nodes. Specifically, I observed the following behavior:
When using 4 nodes with 1 gradient accumulation step, the total number of training steps is 4 times greater than when using 1 node with 4 gradient accumulation steps, even though the effective batch size is identical in both configurations.
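For context, here is a quick back-of-the-envelope sketch of what I expected (the dataset size and GPU counts below are made up purely for illustration): the number of optimizer steps should depend only on the effective batch size, so 4 nodes × 1 accumulation step and 1 node × 4 accumulation steps should give the same step count.

```python
import math

# Rough sketch with made-up numbers: steps per epoch should depend only on the
# effective batch size, i.e. per_device_batch * gpus_per_node * num_nodes * grad_accum.
def expected_steps(num_samples, per_device_batch, gpus_per_node, num_nodes, grad_accum):
    effective_batch = per_device_batch * gpus_per_node * num_nodes * grad_accum
    return math.ceil(num_samples / effective_batch)

print(expected_steps(100_000, 4, 8, num_nodes=4, grad_accum=1))  # 782
print(expected_steps(100_000, 4, 8, num_nodes=1, grad_accum=4))  # 782, same effective batch
```

Instead, the 4-node run reports roughly 4× as many steps, which is what prompted me to look at the sampler.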
Upon examining the sampler code in the LLaVA-NeXT Trainer implementation (LLaVA-NeXT/llava/train/llava_trainer.py, line 273 at commit 79ef45a), I noticed something that seems off. Here's the relevant snippet:
```python
if self.args.group_by_length:
    lengths = self.train_dataset.lengths
    return LengthGroupedSampler(
        # self.args.train_batch_size * self.args.gradient_accumulation_steps,  # TODO: seems that we should not have gradient_accumulation_steps
        self.args.train_batch_size,
        # world_size=self.args.world_size,
        world_size=self.args.world_size * self.args.gradient_accumulation_steps,  # TODO: seems that this may work?
        lengths=lengths,
    )
```
It appears that the world_size is being multiplied by gradient_accumulation_steps, which might be causing unintended behavior. This seems to increase the number of steps when using multiple nodes, as each node's contribution is scaled incorrectly.
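For clarity, this is roughly the behavior I would have expected, i.e. the variant already hinted at by the commented-out lines in the snippet above. I'm sketching it here only to make my reading concrete, not as a confirmed fix:

```python
if self.args.group_by_length:
    lengths = self.train_dataset.lengths
    return LengthGroupedSampler(
        self.args.train_batch_size,
        # scale only by the number of participating processes; gradient
        # accumulation would then affect how many forward passes make up one
        # optimizer step, not how the data is grouped/sharded
        world_size=self.args.world_size,
        lengths=lengths,
    )
```

If the multiplication is intentional (for example, so that each length-grouped megabatch spans a full accumulation cycle), a short comment in the code explaining that would also clear this up.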
Could you confirm whether this logic is intentional or suggest a solution to address this discrepancy?
Thank you for your time and help!