Failed to reproduce LongVA-7B after training from scratch #37
Your model has a significant drop on DocVQA and InfoVQA, yet the private dataset contains very little OCR data, so I do not think the drop is caused by the absence of the private data. Maybe it is caused by training with batch size 1?
Thank you for your reply. For context, here is the error I ran into:

```
Rank 0: line 203: <class 'torch.Tensor'> 16 torch.Size([16, 3, 336, 336])
Traceback (most recent call last):
File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/train/train_mem.py", line 4, in <module>
train()
File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/train/train.py", line 1646, in train
trainer.train()
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 1780, in train
return inner_training_loop(
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 2124, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 3042, in training_step
loss = self.compute_loss(model, inputs)
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/transformers/trainer.py", line 3065, in compute_loss
outputs = model(**inputs)
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1527, in _call_impl
return forward_call(*args, **kwargs)
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/deepspeed/utils/nvtx.py", line 18, in wrapped_fn
ret_val = func(*args, **kwargs)
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/deepspeed/runtime/engine.py", line 1890, in forward
loss = self.module(*inputs, **kwargs)
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1518, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/home/chenmingshuo/miniconda3/envs/longva/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1568, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/model/language_model/llava_qwen.py", line 83, in forward
(input_ids, position_ids, attention_mask, past_key_values, inputs_embeds, labels) = self.prepare_inputs_labels_for_multimodal(input_ids, position_ids, attention_mask, past_key_values, labels, images, modalities, image_sizes)
File "/home/chenmingshuo/projects/longia/LongVA/longva/longva/model/llava_arch.py", line 279, in prepare_inputs_labels_for_multimodal
raise ValueError(error_message)
ValueError:
Something is wrong with the input shape. Most likely, you did not wrap the video input in a list:
This is correct:
model.generate(input_ids, images=[video_tensor], modalities=["video"], **gen_kwargs)
This is wrong:
model.generate(input_ids, images=video_tensor, modalities=["video"], **gen_kwargs)
```

The traceback shows that the input `images` is a Tensor with shape [16, 3, 336, 336] (4 dimensions), which is not expected. I noticed there is some logic dealing with this case:

```python
# when input is a list and has 4 dimensions, unsqueeze to 5 dimensions
if type(images) is list or images.ndim == 5:
    if type(images) is list:
        images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
```

But given the Tensor input, the logic above fails, so I made the following modifications:

```python
if type(images) is list or images.ndim == 5 or images.ndim == 4:
    if type(images) is not list and images.ndim == 4:
        images = images.unsqueeze(1)
    if type(images) is list:
        images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
    rank0_print("line 207:", type(images[0]), images[0].shape)
```

Is this right?
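To sanity-check the reshaping, here is a minimal, self-contained sketch; `normalize_images` is a hypothetical stand-alone wrapper around the patched branch above, with `print` standing in for `rank0_print`:

```python
import torch

def normalize_images(images):
    # Hypothetical stand-alone version of the patched branch.
    if type(images) is list or images.ndim == 5 or images.ndim == 4:
        if type(images) is not list and images.ndim == 4:
            # New case: a stacked image batch [B, 3, 336, 336] becomes
            # [B, 1, 3, 336, 336], i.e. each image is a one-frame "video".
            images = images.unsqueeze(1)
        if type(images) is list:
            # Original case: bare 3-D images in a list get a batch dimension.
            images = [x.unsqueeze(0) if x.ndim == 3 else x for x in images]
    return images

batch = torch.randn(16, 3, 336, 336)  # the exact shape from the traceback
print(normalize_images(batch).shape)  # torch.Size([16, 1, 3, 336, 336])
print(normalize_images([torch.randn(3, 336, 336)])[0].shape)  # torch.Size([1, 3, 336, 336])
```

Shape-wise this matches the 5-D (batch, frames, channels, height, width) layout the error message asks for; whether the downstream code then handles each image correctly as a single frame is what I am unsure about.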
I tried to reproduce the model. Below are the steps I followed:
First, I ran scripts/pretrain.sh, which produces the projector. The pretraining data comes from https://huggingface.co/datasets/liuhaotian/LLaVA-Pretrain. I also added some lines to model/llava_arch.py (prepare_inputs_labels_for_multimodal) because of the incorrect input dimension when directly using liuhaotian/LLaVA-Pretrain: specifically, I unsqueeze the image tensor to match the requested "5-dimension input" (a sketch of this change follows these steps), and I use a batch size of 1 in case the modification causes unwanted errors.
Then, I executed scripts/finetune.sh, using the projector from step 1 and the Qwen-224K LLM from Hugging Face. https://huggingface.co/datasets/lmms-lab/LLaVA-NeXT-Data is the dataset I used.
Then I obtained, I believe, the “LongVA-7B”. (I did not run dpo.sh.)
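For reference, here is a minimal sketch of the pretraining change mentioned in the first step; the function name is illustrative, and the real edit lives inside prepare_inputs_labels_for_multimodal in model/llava_arch.py:

```python
import torch

# Illustrative only: LLaVA-Pretrain yields plain single images, so with
# batch size 1 the stacked input arrives as [1, 3, 336, 336] while the
# code path expects a 5-D (batch, frames, channels, height, width) layout.
def unsqueeze_pretrain_images(images: torch.Tensor) -> torch.Tensor:
    if images.ndim == 4:
        images = images.unsqueeze(1)  # [1, 3, 336, 336] -> [1, 1, 3, 336, 336]
    return images

assert unsqueeze_pretrain_images(torch.randn(1, 3, 336, 336)).shape == (1, 1, 3, 336, 336)
```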
However, the test results differ considerably both from those in the paper (possibly due to lmms-eval) and from the released checkpoints on Hugging Face.
I noticed there is some private data in LLaVA-NeXT-Data, which was mentioned in #10 and in the HF dataset repo.
Could the private data used during training account for the difference?