Failed to reproduce LongVA-7B after training from scratch #37

Open
nanocm opened this issue Dec 12, 2024 · 3 comments
nanocm commented Dec 12, 2024

I tried to reproduce the model. Below are the steps I followed:

  1. pretrain
    First, I ran scripts/pretrain.sh, which produces the projector. The pretraining data comes from https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain. I also added some lines to model/llava_arch.py (prepare_inputs_labels_for_multimodal) because of an incorrect input dimension when directly using liuhaotian/LLaVA-Pretrain. Specifically, I unsqueeze the image tensor to match the requested “5-dimension input” (see the sketch after this list) and use a batch size of 1 in case the modification causes other errors.
  2. finetune
    Then, I executed scripts/finetune.sh, using the projector from step 1 and the Qwen-224k LLM from Hugging Face. https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data is the dataset I used.
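
For reference, here is a minimal sketch of the change described in step 1, assuming the images passed into prepare_inputs_labels_for_multimodal arrive as one stacked tensor of shape [batch, 3, 336, 336]; the helper name ensure_five_dims is only for illustration, since in practice I patched llava_arch.py directly:

    import torch

    def ensure_five_dims(images):
        # If the dataloader stacked single images into one 4-dim tensor,
        # add a frame/tile axis so the downstream 5-dim check passes:
        # [B, 3, 336, 336] -> [B, 1, 3, 336, 336]
        if torch.is_tensor(images) and images.ndim == 4:
            images = images.unsqueeze(1)
        return images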

Then, I obtained my “LongVA-7B”, I think (I did not run dpo.sh).
However, the test results differ substantially from those in the paper (possibly due to lmms-eval) and from the checkpoints released on Hugging Face.
[screenshot of evaluation results]

I noticed that some private data used for LLaVA-NeXT is not included in LLaVA-NeXT-Data, as mentioned in #10 and in the Hugging Face dataset repo.
Could the absence of that private data during training account for the difference?

nanocm commented Dec 16, 2024

@jzhang38

jzhang38 (Collaborator) commented

Could the absence of that private data during training account for the difference?

Your model has a significant drop on DocVQA and InfoVQA, yet the private dataset contains very little OCR data, so I do not think the gap is caused by the absence of the private data. Maybe it is caused by the batch size of 1?

nanocm commented Jan 9, 2025

Could the absence of that private data during training account for the difference?

Your model has a significant drop on DocVQA and InfoVQA, yet the private dataset contains very little OCR data, so I do not think the gap is caused by the absence of the private data. Maybe it is caused by the batch size of 1?

Thank you for your reply.
I am about to rerun the training process. As mentioned above, there were some problems when running the pretrain script. Here is the traceback:

Rank 0:  line 203: <class 'torch.Tensor'> 16 torch.Size([16, 3, 336, 336])
Traceback (most recent call last):
  File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/train/train_mem.py", line 4, in <module>
    train()
  File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/train/train.py", line 1646, in train
    trainer.train()
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 2124, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 3042, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 3065, in compute_loss
    outputs = model(**inputs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1890, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/model/language_model/llava_qwen.py", line 83, in forward
    (input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
  File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/model/llava_arch.py", line 279, in prepare_inputs_labels_for_multimodal
    raise ValueError(error_message)
ValueError:
            Something is wrong with the input shape. Most likely, you did not wrap the video input in a list:
            This is correct:
                model.generate(input_ids, images=[video_tensor],  modalities=["video"], **gen_kwargs)
            This is wrong:
                model.generate(input_ids, images=video_tensor,  modalities=["video"], **gen_kwargs)

The traceback shows that the images input is a Tensor with shape [16, 3, 336, 336] (4 dims), which is not what the code expects. I noticed there is some logic that deals with this case:

        # when input is a list and has 4 dimensions, unsqueeze to 5 dimensions
        if type(images) is list or images.ndim == 5:
            if type(images) is list:
                images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]

But given a plain Tensor input, the above logic fails, so I made the following modifications:

        if type(images) is list or images.ndim == 5 or images.ndim == 4:
            if type(images) is not list and images.ndim == 4:
                # lift a plain batched tensor to 5 dims:
                # [B, 3, 336, 336] -> [B, 1, 3, 336, 336]
                images = images.unsqueeze(1)
            if type(images) is list:
                images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
                rank0_print("line 207:", type(images[0]), images[0].shape)

Is it right?
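
As a possible cross-check, the existing list branch could also be exercised without adding a new ndim == 4 case, by splitting the stacked batch into per-sample tensors before it reaches prepare_inputs_labels_for_multimodal. This is only a sketch; I have not verified that the downstream feature handling treats it the same way as the unsqueeze above:

    import torch

    batch = torch.randn(16, 3, 336, 336)   # stacked image batch, as logged at rank 0
    images = list(batch.unbind(0))          # 16 tensors of shape [3, 336, 336]
    # the original list branch then lifts each 3-dim entry to [1, 3, 336, 336]
    images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]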
