V100 run video understanding #29

Open
gehong-coder opened this issue Oct 15, 2024 · 10 comments

@gehong-coder

The V100 cannot use flash attention, so I changed to computing attention with eager:

self.self_attn = IDEFICS_VISION_ATTENTION_CLASSES["eager"]

but the following error occurred:

File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward
encoder_outputs = self.encoder(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward
layer_outputs = encoder_layer(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
hidden_states, attn_weights = self.self_attn(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward
raise ValueError(
ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])

@aria-hacker
Collaborator

aria-hacker commented Oct 15, 2024

We've implemented support for eager attention. Could you please test the following code and let me know if you encounter any issues? @gehong-coder

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Corrected 'true' to 'True'
    attn_implementation="eager",
)

@gehong-coder
Author

gehong-coder commented Oct 16, 2024

We've implemented support for eager attention. Could you please test the following code and let me know if you encounter any issues? @gehong-coder

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Corrected 'true' to 'True'
    attn_implementation="eager",
)

Hello, this problem occurs after I use the settings above. It seems that setting attn_implementation="eager" here does not make the model use eager attention internally.

File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
hidden_states, attn_weights = self.self_attn(
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 619, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

So I went into modeling_idefics2 and changed line 442, self.self_attn = IDEFICS_VISION_ATTENTION_CLASSES[config._attn_implementation], replacing config._attn_implementation with "eager".
Then the following appears:

"/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/modeling_aria.py", line 376, in forward
image_outputs, image_attn_mask = self.vision_tower(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/vision_encoder.py", line 120, in forward
vit_oup = self.vision_model(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward
encoder_outputs = self.encoder(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward
layer_outputs = encoder_layer(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
hidden_states, attn_weights = self.self_attn(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward
raise ValueError(
ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])
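For reference, a quick way to check which attention implementation each part actually picked up after loading (the attribute names below are guesses based on the traceback, so treat this as a sketch):

print(model.config._attn_implementation)               # top-level / language model
print(model.vision_tower.config._attn_implementation)  # vision encoder (attribute name assumed)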

@aria-hacker
Collaborator

@gehong-coder Is your local model updated to the latest rhymes-ai/Aria repo? We updated it yesterday.
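If not, one way to force the cached copy to refresh (just a sketch; force_download is a standard from_pretrained argument that re-downloads the remote code and weights) is:

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    force_download=True,  # re-fetch code and weights so the cache matches the latest repo
)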

@gehong-coder
Author

I have updated the model, but the error still appears. Is it because grouped_gemm is not installed?

grouped_gemm is not installed, using sequential GEMM, which is slower. AriaMoELMForCausalLM has generative capabilities, as prepare_inputs_for_generation is explicitly overwritten. However, it doesn't directly inherit from GenerationMixin. From 👉v4.50👈 onwards, PreTrainedModel will NOT inherit from GenerationMixin, and this model will lose the ability to call generate and other related functions.

  • If you're using trust_remote_code=True, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  • If you are the owner of the model architecture code, please modify your model class such that it inherits from GenerationMixin (after PreTrainedModel, otherwise you'll get an exception).
  • If you are not the owner of the model architecture class, please contact the model code owner to update it.
    Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:16<00:00, 1.37s/it]
    Already cached 128/128 frames for video /mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/data/test_data/test_caption/video/飞机.mp4, enjoy speed!
    /mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/github/Aria/inference/notebooks/video_in.py:149: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
    with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    /mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:601: UserWarning: do_sample is set to False. However, temperature is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
    warnings.warn(
    The seen_tokens attribute is deprecated and will be removed in v4.41. Use the cache_position model input instead.
    Traceback (most recent call last):
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/github/Aria/inference/notebooks/video_in.py", line 166, in
    infer(contents)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/github/Aria/inference/notebooks/video_in.py", line 150, in infer
    output = model.generate(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/generation/utils.py", line 2048, in generate
    result = self._sample(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/generation/utils.py", line 3008, in _sample
    outputs = self(**model_inputs, return_dict=True)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/modeling_aria.py", line 376, in forward
    image_outputs, image_attn_mask = self.vision_tower(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/vision_encoder.py", line 120, in forward
    vit_oup = self.vision_model(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward
    encoder_outputs = self.encoder(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward
    layer_outputs = encoder_layer(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
    hidden_states, attn_weights = self.self_attn(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward
    raise ValueError(
    ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])

@saeedkhaki92

Eager attention is not working, and we are not able to run the model on V100s. Could you please help with this?

@aria-hacker
Collaborator

aria-hacker commented Oct 18, 2024

@gehong-coder I can't reproduce this error on my local machine. Could you provide some minimal code to reproduce this bug? Also, which version of transformers are you using?

@gehong-coder
Author

gehong-coder commented Oct 22, 2024

@gehong-coder I can't reproduce this error on my local machine. Could you provide some minimal code to reproduce this bug? Also, which version of transformers are you using?

python 3.10
tokenizers 0.20.1
torch 2.4.0
torchvision 0.19.0
tqdm 4.66.5
transformers 4.45.0
triton 3.0.0

This is my code:

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from decord import VideoReader
from PIL import Image
from tqdm import tqdm
from typing import List
import os

def load_model():
    model_id_or_path = "/home/hong.ge/.cache/torch/hub/models--rhymes-ai--Aria/snapshots/5cc2703b3afd585f232ec5027e9c039a2001bcec"
    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # Corrected 'true' to 'True'
        attn_implementation="eager",
    )
    # model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
    return model, processor

model, processor = load_model()
def load_video(video_file, num_frames=128, cache_dir="cached_video_frames", verbosity="DEBUG"):
    # Create cache directory if it doesn't exist
    os.makedirs(cache_dir, exist_ok=True)

    video_basename = os.path.basename(video_file)
    cache_subdir = os.path.join(cache_dir, f"{video_basename}_{num_frames}")
    os.makedirs(cache_subdir, exist_ok=True)

    cached_frames = []
    missing_frames = []
    frame_indices = []

    for i in range(num_frames):
        frame_path = os.path.join(cache_subdir, f"frame_{i}.jpg")
        if os.path.exists(frame_path):
            cached_frames.append(frame_path)
        else:
            missing_frames.append(i)
            frame_indices.append(i)

    vr = VideoReader(video_file)
    duration = len(vr)
    fps = vr.get_avg_fps()

    frame_timestamps = [int(duration / num_frames * (i + 0.5)) / fps for i in range(num_frames)]

    if verbosity == "DEBUG":
        print("Already cached {}/{} frames for video {}, enjoy speed!".format(len(cached_frames), num_frames, video_file))

    # If all frames are cached, load them directly
    if not missing_frames:
        return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps

    actual_frame_indices = [int(duration / num_frames * (i + 0.5)) for i in missing_frames]

    missing_frames_data = vr.get_batch(actual_frame_indices).asnumpy()

    for idx, frame_index in enumerate(tqdm(missing_frames, desc="Caching rest frames")):
        img = Image.fromarray(missing_frames_data[idx]).convert("RGB")
        frame_path = os.path.join(cache_subdir, f"frame_{frame_index}.jpg")
        img.save(frame_path)
        cached_frames.append(frame_path)

    cached_frames.sort(key=lambda x: int(os.path.basename(x).split('_')[1].split('.')[0]))
    return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps

def create_image_gallery(images, columns=3, spacing=20, bg_color=(200, 200, 200)):
"""
Combine multiple images into a single larger image in a grid format.

Parameters:
    image_paths (list of str): List of file paths to the images to display.
    columns (int): Number of columns in the gallery.
    spacing (int): Space (in pixels) between the images in the gallery.
    bg_color (tuple): Background color of the gallery (R, G, B).

Returns:
    PIL.Image: A single combined image.
"""
# Open all images and get their sizes
img_width, img_height = images[0].size  # Assuming all images are of the same size

# Calculate rows needed for the gallery
rows = (len(images) + columns - 1) // columns

# Calculate the size of the final gallery image
gallery_width = columns * img_width + (columns - 1) * spacing
gallery_height = rows * img_height + (rows - 1) * spacing

# Create a new image with the calculated size and background color
gallery_image = Image.new('RGB', (gallery_width, gallery_height), bg_color)

# Paste each image into the gallery
for index, img in enumerate(images):
    row = index // columns
    col = index % columns

    x = col * (img_width + spacing)
    y = row * (img_height + spacing)

    gallery_image.paste(img, (x, y))

return gallery_image

def get_placeholders_for_videos(frames: List, timestamps=[]):
    contents = []
    if not timestamps:
        for i, _ in enumerate(frames):
            contents.append({"text": None, "type": "image"})
            contents.append({"text": "\n", "type": "text"})
    else:
        for i, (_, ts) in enumerate(zip(frames, timestamps)):
            contents.extend(
                [
                    {"text": f"[{int(ts)//60:02d}:{int(ts)%60:02d}]", "type": "text"},
                    {"text": None, "type": "image"},
                    {"text": "\n", "type": "text"}
                ]
            )
    return contents

def infer(contents):
    torch.cuda.empty_cache()

    messages = [
        {
            "role": "user",
            "content": [
                *contents,
                {"text": "Please list the burgers that appear in this video, and how they are made.", "type": "text"},
            ],
        }
    ]

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=text, images=frames, return_tensors="pt", max_image_size=490)
    inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        output = model.generate(
            **inputs,
            max_new_tokens=2048,
            stop_strings=["<|im_end|>"],
            tokenizer=processor.tokenizer,
            do_sample=False,
            temperature=0.,
        )
        output_ids = output[0][inputs["input_ids"].shape[1]:]
        result = processor.decode(output_ids, skip_special_tokens=True)

    print(result)

frames, frame_timestamps = load_video("/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/data/test_data/test_caption/video/飞机.mp4", num_frames=128)
contents = get_placeholders_for_videos(frames, frame_timestamps)
infer(contents)

@aria-hacker
Collaborator

@gehong-coder
It seems that you are using the code and model weights from the Hugging Face cache dir /home/hong.ge/.cache/torch/hub/models--rhymes-ai--Aria/snapshots/5cc2703b3afd585f232ec5027e9c039a2001bcec. Please make sure all of the .py and .json files there are aligned with the latest configuration.

The recommended way to load the latest Aria is to load it from the official online repo: model = AutoModelForCausalLM.from_pretrained("rhymes-ai/Aria", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True). This will automatically check whether those files are up to date.
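On a V100, combining that with the eager attention setting from earlier in this thread would look roughly like this (a sketch built only from the snippets above):

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load code and weights from the online repo so stale cached files are detected,
# and use eager attention since the V100 cannot run flash-attn.
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria", trust_remote_code=True)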

@gehong-coder
Author

gehong-coder commented Oct 23, 2024

@aria-hacker
I have downloaded the latest version of the model and am using the script Aria/inference/notebooks/04_video_understanding.ipynb on a V100 machine. The following comes up:
[screenshot]
So I changed this again:
[screenshot]
But it's still giving me the same problem... are you sure it will work on a V100?
[screenshot]

@aria-hacker
Copy link
Collaborator

@gehong-coder In most cases, you should not edit code inside transformers if you don't understand its whole context. I looked into it: the modification was made in the wrong place, which caused the error. The attention mask is built based on the attention implementation name in the config. You swapped the attention class directly, but the configuration still says flash_attention_2, so the mask is prepared for FA2 and then fed to the eager attention module, which expects a 4-D mask; that mismatch produces the size error above.
[screenshot]

You should only modify the config for the vision encoder and the model; that's how attn_implementation is passed through in the latest code.
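Roughly, the idea is to set the implementation on the configs rather than swapping the module class; something like the sketch below (the vision_config attribute name is an assumption here, and passing attn_implementation="eager" to from_pretrained with the latest code already does this for you):

from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Sketch only: put "eager" in the configs so the matching attention mask gets built.
config = AutoConfig.from_pretrained("rhymes-ai/Aria", trust_remote_code=True)
config._attn_implementation = "eager"                # language model / top level
config.vision_config._attn_implementation = "eager"  # vision encoder (attribute name assumed)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)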
