V100 run video understanding #29

Open
gehong-coder opened this issue Oct 15, 2024 · 10 comments

@gehong-coder

The V100 cannot use flash attention, so I changed to computing attention with eager:

self.self_attn = IDEFICS_VISION_ATTENTION_CLASSES["eager"]

but the following error occurred:

File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward
encoder_outputs = self.encoder(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward
layer_outputs = encoder_layer(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
hidden_states, attn_weights = self.self_attn(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward
raise ValueError(
ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])

@aria-hacker
Collaborator

aria-hacker commented Oct 15, 2024

We've implemented support for eager attention. Could you please test the following code and let me know if you encounter any issues? @gehong-coder

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Corrected 'true' to 'True'
    attn_implementation="eager",
)

@gehong-coder
Author

gehong-coder commented Oct 16, 2024

We've implemented support for eager attention. Could you please test the following code and let me know if you encounter any issues? @gehong-coder

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,  # Corrected 'true' to 'True'
    attn_implementation="eager",
)

Hello, this problem occurs after I use the settings above. It seems that setting attn_implementation="eager" here does not make the model use eager attention internally.

File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
hidden_states, attn_weights = self.self_attn(
return super().apply(*args, **kwargs) # type: ignore[misc]
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 619, in forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = _flash_attn_varlen_forward(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/flash_attn/flash_attn_interface.py", line 88, in _flash_attn_varlen_forward
out, q, k, v, out_padded, softmax_lse, S_dmask, rng_state = flash_attn_cuda.varlen_fwd(
RuntimeError: FlashAttention only supports Ampere GPUs or newer.

So I went into modeling_idefics2 and changed line 442, self.self_attn = IDEFICS_VISION_ATTENTION_CLASSES[config._attn_implementation], replacing config._attn_implementation with "eager".
Then the following appears:

"/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/modeling_aria.py", line 376, in forward
image_outputs, image_attn_mask = self.vision_tower(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/vision_encoder.py", line 120, in forward
vit_oup = self.vision_model(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward
encoder_outputs = self.encoder(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward
layer_outputs = encoder_layer(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
hidden_states, attn_weights = self.self_attn(
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
output = module._old_forward(*args, **kwargs)
File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward
raise ValueError(
ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])
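For reference, a quick way to check which attention implementation each part actually picked up after loading (the attribute names below are guesses based on the traceback, so treat this as a sketch):

print(model.config._attn_implementation)               # top-level / language model
print(model.vision_tower.config._attn_implementation)  # vision encoder (attribute name assumed)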

@aria-hacker
Collaborator

@gehong-coder Is your local model updated to the latest rhymes-ai/Aria repo? We updated it yesterday.
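If not, one way to force the cached copy to refresh (just a sketch; force_download is a standard from_pretrained argument that re-downloads the remote code and weights) is:

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
    force_download=True,  # re-fetch code and weights so the cache matches the latest repo
)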

@gehong-coder
Author

I have updated the model, but the error still appears. Is it because grouped_gemm is not installed?

grouped_gemm is not installed, using sequential GEMM, which is slower. AriaMoELMForCausalLM has generative capabilities, as prepare_inputs_for_generation is explicitly overwritten. However, it doesn't directly inherit from GenerationMixin. From 👉v4.50👈 onwards, PreTrainedModel will NOT inherit from GenerationMixin, and this model will lose the ability to call generate and other related functions.

  • If you're using trust_remote_code=True, you can get rid of this warning by loading the model with an auto class. See https://huggingface.co/docs/transformers/en/model_doc/auto#auto-classes
  • If you are the owner of the model architecture code, please modify your model class such that it inherits from GenerationMixin (after PreTrainedModel, otherwise you'll get an exception).
  • If you are not the owner of the model architecture class, please contact the model code owner to update it.
    Loading checkpoint shards: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 12/12 [00:16<00:00, 1.37s/it]
    Already cached 128/128 frames for video /mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/data/test_data/test_caption/video/飞机.mp4, enjoy speed!
    /mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/github/Aria/inference/notebooks/video_in.py:149: FutureWarning: torch.cuda.amp.autocast(args...) is deprecated. Please use torch.amp.autocast('cuda', args...) instead.
    with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    /mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/generation/configuration_utils.py:601: UserWarning: do_sample is set to False. However, temperature is set to 0.0 -- this flag is only used in sample-based generation modes. You should set do_sample=True or unset temperature.
    warnings.warn(
    The seen_tokens attribute is deprecated and will be removed in v4.41. Use the cache_position model input instead.
    Traceback (most recent call last):
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/github/Aria/inference/notebooks/video_in.py", line 166, in
    infer(contents)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/github/Aria/inference/notebooks/video_in.py", line 150, in infer
    output = model.generate(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/generation/utils.py", line 2048, in generate
    result = self._sample(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/generation/utils.py", line 3008, in _sample
    outputs = self(**model_inputs, return_dict=True)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/modeling_aria.py", line 376, in forward
    image_outputs, image_attn_mask = self.vision_tower(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/home/hong.ge/.cache/huggingface/modules/transformers_modules/5cc2703b3afd585f232ec5027e9c039a2001bcec/vision_encoder.py", line 120, in forward
    vit_oup = self.vision_model(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 630, in forward
    encoder_outputs = self.encoder(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 555, in forward
    layer_outputs = encoder_layer(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 467, in forward
    hidden_states, attn_weights = self.self_attn(
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
    return forward_call(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/accelerate/hooks.py", line 170, in new_forward
    output = module._old_forward(*args, **kwargs)
    File "/mnt/nfs/bj4-v100-1/data1/hong.ge/miniconda3/envs/aria/lib/python3.10/site-packages/transformers/models/idefics2/modeling_idefics2.py", line 245, in forward
    raise ValueError(
    ValueError: Attention mask should be of size (128, 1, 1225, 1225), but is torch.Size([128, 1225])

@saeedkhaki92

Eager attention is not working, and we are not able to run the model on V100s. Could you please help with this?

@aria-hacker
Collaborator

aria-hacker commented Oct 18, 2024

@gehong-coder I can't reproduce this error on my local machine. Could you provide some minimal code to reproduce this bug? Also, which version of transformers are you using?

@gehong-coder
Author

gehong-coder commented Oct 22, 2024

@gehong-coder I can't reproduce this error on my local machine. Could you provide some minimal code to reproduce this bug? Also, which version of transformers are you using?

python 3.10
tokenizers 0.20.1
torch 2.4.0
torchvision 0.19.0
tqdm 4.66.5
transformers 4.45.0
triton 3.0.0

This is my code:

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor
from decord import VideoReader
from PIL import Image
from tqdm import tqdm
from typing import List
import os

def load_model():
    model_id_or_path = "/home/hong.ge/.cache/torch/hub/models--rhymes-ai--Aria/snapshots/5cc2703b3afd585f232ec5027e9c039a2001bcec"
    model = AutoModelForCausalLM.from_pretrained(
        model_id_or_path,
        device_map="auto",
        torch_dtype=torch.bfloat16,
        trust_remote_code=True,  # Corrected 'true' to 'True'
        attn_implementation="eager",
    )
    # model = AutoModelForCausalLM.from_pretrained(model_id_or_path, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)
    processor = AutoProcessor.from_pretrained(model_id_or_path, trust_remote_code=True)
    return model, processor

model, processor = load_model()
def load_video(video_file, num_frames=128, cache_dir="cached_video_frames", verbosity="DEBUG"):
    # Create cache directory if it doesn't exist
    os.makedirs(cache_dir, exist_ok=True)

    video_basename = os.path.basename(video_file)
    cache_subdir = os.path.join(cache_dir, f"{video_basename}_{num_frames}")
    os.makedirs(cache_subdir, exist_ok=True)

    cached_frames = []
    missing_frames = []
    frame_indices = []

    for i in range(num_frames):
        frame_path = os.path.join(cache_subdir, f"frame_{i}.jpg")
        if os.path.exists(frame_path):
            cached_frames.append(frame_path)
        else:
            missing_frames.append(i)
            frame_indices.append(i)

    vr = VideoReader(video_file)
    duration = len(vr)
    fps = vr.get_avg_fps()

    frame_timestamps = [int(duration / num_frames * (i + 0.5)) / fps for i in range(num_frames)]

    if verbosity == "DEBUG":
        print("Already cached {}/{} frames for video {}, enjoy speed!".format(len(cached_frames), num_frames, video_file))

    # If all frames are cached, load them directly
    if not missing_frames:
        return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps

    actual_frame_indices = [int(duration / num_frames * (i + 0.5)) for i in missing_frames]

    missing_frames_data = vr.get_batch(actual_frame_indices).asnumpy()

    for idx, frame_index in enumerate(tqdm(missing_frames, desc="Caching rest frames")):
        img = Image.fromarray(missing_frames_data[idx]).convert("RGB")
        frame_path = os.path.join(cache_subdir, f"frame_{frame_index}.jpg")
        img.save(frame_path)
        cached_frames.append(frame_path)

    cached_frames.sort(key=lambda x: int(os.path.basename(x).split('_')[1].split('.')[0]))
    return [Image.open(frame_path).convert("RGB") for frame_path in cached_frames], frame_timestamps

def create_image_gallery(images, columns=3, spacing=20, bg_color=(200, 200, 200)):
"""
Combine multiple images into a single larger image in a grid format.

Parameters:
    image_paths (list of str): List of file paths to the images to display.
    columns (int): Number of columns in the gallery.
    spacing (int): Space (in pixels) between the images in the gallery.
    bg_color (tuple): Background color of the gallery (R, G, B).

Returns:
    PIL.Image: A single combined image.
"""
# Open all images and get their sizes
img_width, img_height = images[0].size  # Assuming all images are of the same size

# Calculate rows needed for the gallery
rows = (len(images) + columns - 1) // columns

# Calculate the size of the final gallery image
gallery_width = columns * img_width + (columns - 1) * spacing
gallery_height = rows * img_height + (rows - 1) * spacing

# Create a new image with the calculated size and background color
gallery_image = Image.new('RGB', (gallery_width, gallery_height), bg_color)

# Paste each image into the gallery
for index, img in enumerate(images):
    row = index // columns
    col = index % columns

    x = col * (img_width + spacing)
    y = row * (img_height + spacing)

    gallery_image.paste(img, (x, y))

return gallery_image

def get_placeholders_for_videos(frames: List, timestamps=[]):
    contents = []
    if not timestamps:
        for i, _ in enumerate(frames):
            contents.append({"text": None, "type": "image"})
            contents.append({"text": "\n", "type": "text"})
    else:
        for i, (_, ts) in enumerate(zip(frames, timestamps)):
            contents.extend(
                [
                    {"text": f"[{int(ts)//60:02d}:{int(ts)%60:02d}]", "type": "text"},
                    {"text": None, "type": "image"},
                    {"text": "\n", "type": "text"}
                ]
            )
    return contents

def infer(contents):
    torch.cuda.empty_cache()

    messages = [
        {
            "role": "user",
            "content": [
                *contents,
                {"text": "Please list the burgers that appear in this video, and how they are made.", "type": "text"},
            ],
        }
    ]

    text = processor.apply_chat_template(messages, add_generation_prompt=True)
    inputs = processor(text=text, images=frames, return_tensors="pt", max_image_size=490)
    inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
        output = model.generate(
            **inputs,
            max_new_tokens=2048,
            stop_strings=["<|im_end|>"],
            tokenizer=processor.tokenizer,
            do_sample=False,
            temperature=0.,
        )
        output_ids = output[0][inputs["input_ids"].shape[1]:]
        result = processor.decode(output_ids, skip_special_tokens=True)

    print(result)

frames, frame_timestamps = load_video("/mnt/nfs/bj4-v100-1/data1/hong.ge/workspace/data/test_data/test_caption/video/飞机.mp4", num_frames=128)
contents = get_placeholders_for_videos(frames, frame_timestamps)
infer(contents)

@aria-hacker
Collaborator

@gehong-coder
It seems that you are using the code and model weights from the Hugging Face cache dir /home/hong.ge/.cache/torch/hub/models--rhymes-ai--Aria/snapshots/5cc2703b3afd585f232ec5027e9c039a2001bcec. Please make sure all of the .py and .json files there are aligned with the latest configuration.

The recommended way to load the latest Aria is to load it from the official online repo: model = AutoModelForCausalLM.from_pretrained("rhymes-ai/Aria", device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True). This will automatically check whether those files are up to date.
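On a V100, combining that with the eager attention setting from earlier in this thread would look roughly like this (a sketch built only from the snippets above):

import torch
from transformers import AutoModelForCausalLM, AutoProcessor

# Load code and weights from the online repo so stale cached files are detected,
# and use eager attention since the V100 cannot run flash-attn.
model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
    attn_implementation="eager",
)
processor = AutoProcessor.from_pretrained("rhymes-ai/Aria", trust_remote_code=True)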

@gehong-coder
Author

gehong-coder commented Oct 23, 2024

@aria-hacker
I have downloaded the latest version of the model and am using the script Aria/inference/notebooks/04_video_understanding.ipynb on a V100 machine. The following comes up:
[screenshot]
So I changed this again:
[screenshot]
But it's still giving me the same problem... are you sure it will work on a V100?
[screenshot]

@aria-hacker
Copy link
Collaborator

@gehong-coder In most cases, you should not edit code inside transformers if you don't understand its whole context. I looked into it: the modification was made in the wrong place, which caused the error. The attention mask is built based on the attention implementation name in the config. You swapped the attention class directly, but the configuration still says flash_attention_2, so the mask is prepared for FA2 and then fed to the eager attention module, which expects a 4-D mask; that mismatch produces the size error above.
[screenshot]

You should only modify the config for the vision encoder and the model; that's how attn_implementation is passed through in the latest code.
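Roughly, the idea is to set the implementation on the configs rather than swapping the module class; something like the sketch below (the vision_config attribute name is an assumption here, and passing attn_implementation="eager" to from_pretrained with the latest code already does this for you):

from transformers import AutoConfig, AutoModelForCausalLM
import torch

# Sketch only: put "eager" in the configs so the matching attention mask gets built.
config = AutoConfig.from_pretrained("rhymes-ai/Aria", trust_remote_code=True)
config._attn_implementation = "eager"                # language model / top level
config.vision_config._attn_implementation = "eager"  # vision encoder (attribute name assumed)

model = AutoModelForCausalLM.from_pretrained(
    "rhymes-ai/Aria",
    config=config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)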
