Failed to reproduce LongVA-7B after training from scratch #37

Open
nanocm opened this issue Dec 12, 2024 · 3 comments
nanocm commented Dec 12, 2024

I tried to reproduce the model. Below are the steps I followed:

  1. pretrain
    First, I ran scripts/pretrain.sh, which produces the projector. The pretraining data comes from https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain. I also added some lines to model/llava_arch.py (prepare_inputs_labels_for_multimodal) because of an incorrect input dimension when directly using liuhaotian/LLaVA-Pretrain. Specifically, I unsqueeze the image tensor to match the requested “5-dimension input” (see the sketch after this list) and use a batch size of 1 in case the modification causes other errors.
  2. finetune
    Then, I executed scripts/finetune.sh, using the projector from step 1 and the Qwen-224k LLM from Hugging Face. https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data is the dataset I used.
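
For reference, here is a minimal sketch of the change described in step 1, assuming the images passed into prepare_inputs_labels_for_multimodal arrive as one stacked tensor of shape [batch, 3, 336, 336]; the helper name ensure_five_dims is only for illustration, since in practice I patched llava_arch.py directly:

    import torch

    def ensure_five_dims(images):
        # If the dataloader stacked single images into one 4-dim tensor,
        # add a frame/tile axis so the downstream 5-dim check passes:
        # [B, 3, 336, 336] -> [B, 1, 3, 336, 336]
        if torch.is_tensor(images) and images.ndim == 4:
            images = images.unsqueeze(1)
        return images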

Then, I obtained my “LongVA-7B”, I think (I did not run dpo.sh).
However, the test results differ substantially from those in the paper (possibly due to lmms-eval) and from the checkpoints released on Hugging Face.
[screenshot of evaluation results]

I noticed that some private data used for LLaVA-NeXT is not included in LLaVA-NeXT-Data, as mentioned in #10 and in the Hugging Face dataset repo.
Could the absence of that private data during training account for the difference?

nanocm commented Dec 16, 2024

@jzhang38

jzhang38 (Collaborator) commented

Could the absence of that private data during training account for the difference?

Your model has a significant drop on DocVQA and InfoVQA, yet the private dataset contains very little OCR data, so I do not think the gap is caused by the absence of the private data. Maybe it is caused by the batch size of 1?

nanocm commented Jan 9, 2025

Could the absence of that private data during training account for the difference?

Your model has a significant drop on DocVQA and InfoVQA, yet the private dataset contains very little OCR data, so I do not think the gap is caused by the absence of the private data. Maybe it is caused by the batch size of 1?

Thank you for your reply.
I am about to rerun the training process. As mentioned above, there were some problems when running the pretrain script. Here is the traceback:

Rank 0:  line 203: <class 'torch.Tensor'> 16 torch.Size([16, 3, 336, 336])
Traceback (most recent call last):
  File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/train/train_mem.py", line 4, in <module>
    train()
  File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/train/train.py", line 1646, in train
    trainer.train()
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
    return inner_training_loop(
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 2124, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 3042, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 3065, in compute_loss
    outputs = model(**inputs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
    ret_val = func(*args, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1890, in forward
    loss = self.module(*inputs, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
    result = forward_call(*args, **kwargs)
  File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/model/language_model/llava_qwen.py", line 83, in forward
    (input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
  File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/model/llava_arch.py", line 279, in prepare_inputs_labels_for_multimodal
    raise ValueError(error_message)
ValueError:
            Something is wrong with the input shape. Most likely, you did not wrap the video input in a list:
            This is correct:
                model.generate(input_ids, images=[video_tensor],  modalities=["video"], **gen_kwargs)
            This is wrong:
                model.generate(input_ids, images=video_tensor,  modalities=["video"], **gen_kwargs)

The traceback shows that the images input is a Tensor with shape [16, 3, 336, 336] (4 dims), which is not what the code expects. I noticed there is some logic that deals with this case:

        # when input is a list and has 4 dimensions, unsqueeze to 5 dimensions
        if type(images) is list or images.ndim == 5:
            if type(images) is list:
                images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]

But given a plain Tensor input, the above logic fails, so I made the following modifications:

        if type(images) is list or images.ndim == 5 or images.ndim == 4:
            if type(images) is not list and images.ndim == 4:
                # lift a plain batched tensor to 5 dims:
                # [B, 3, 336, 336] -> [B, 1, 3, 336, 336]
                images = images.unsqueeze(1)
            if type(images) is list:
                images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
                rank0_print("line 207:", type(images[0]), images[0].shape)

Is it right?
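
As a possible cross-check, the existing list branch could also be exercised without adding a new ndim == 4 case, by splitting the stacked batch into per-sample tensors before it reaches prepare_inputs_labels_for_multimodal. This is only a sketch; I have not verified that the downstream feature handling treats it the same way as the unsqueeze above:

    import torch

    batch = torch.randn(16, 3, 336, 336)   # stacked image batch, as logged at rank 0
    images = list(batch.unbind(0))          # 16 tensors of shape [3, 336, 336]
    # the original list branch then lifts each 3-dim entry to [1, 3, 336, 336]
    images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]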
