Aria

😊 Hugging Face | 📄 Paper | 📚 Blog | 🌐 WebDemo | 🟣 Discord

Introduction

Aria is a multimodal native MoE model. It features:

State-of-the-art performance on various multimodal and language tasks, superior in video and document understanding;
Long multimodal context window of 64K tokens;
3.9B activated parameters per token, enabling fast inference speed and low fine-tuning cost.

News

[Dec 1, 2024] We release the base models for Aria (Aria-Base-8K and Aria-Base-64K)! They are fully compatible with this inference & fine-tuning codebase.
[Oct 10, 2024] We release Aria!

Quick Start

Installation

pip install -e .
# or install with dev dependencies if you want to contribute to the project
pip install -e .[dev] 

pip install grouped_gemm
pip install flash-attn --no-build-isolation

Inference

Aria has 25.3B total parameters, it can be loaded in one A100 (80GB) GPU with bfloat16 precision.

Here is a code snippet to show you how to use Aria with Hugging Face Transformers.

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

model_id_or_path = "rhymes-ai/Aria"
revision = "4844f0b5ff678e768236889df5accbe4967ec845"

model = AutoModelForCausalLM.from_pretrained(model_id_or_path, revision=revision, device_map="auto", torch_dtype=torch.bfloat16, trust_remote_code=True)

processor = AutoProcessor.from_pretrained(model_id_or_path, revision=revision, trust_remote_code=True)

image_path = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/cat.png"

image = Image.open(requests.get(image_path, stream=True).raw)

messages = [
    {
        "role": "user",
        "content": [
            {"text": None, "type": "image"},
            {"text": "what is the image?", "type": "text"},
        ],
    }
]

text = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(text=text, images=image, return_tensors="pt")
inputs["pixel_values"] = inputs["pixel_values"].to(model.dtype)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.inference_mode(), torch.cuda.amp.autocast(dtype=torch.bfloat16):
    output = model.generate(
        **inputs,
        max_new_tokens=500,
        stop_strings=["<|im_end|>"],
        tokenizer=processor.tokenizer,
        do_sample=True,
        temperature=0.9,
    )
    output_ids = output[0][inputs["input_ids"].shape[1]:]
    result = processor.decode(output_ids, skip_special_tokens=True)

print(result)

We offer additional inference methods, such as utilizing vLLM for enhanced performance. For comprehensive details, please refer to docs/inference.md.

Cookbook

Checkout these inference examples that demonstrate how to use Aria on various applications such as chart understanding, PDF reading, video understanding, etc, available with both Hugging Face Transformers and vLLM backends.

Fine-tuning

Note: For optimal fine-tuning performance, install the optional grouped_gemm dependency:
pip install grouped_gemm

We offer both LoRA fine-tuning and full parameter tuning, using various dataset types:

Single-image datasets
Multi-image datasets
Video datasets
Code datasets

For a quick try, visit the examples folder and choose one of the fine-tuning examples. If you would like to fine-tune from base models (recommended when you have a large database), please change the following model paths in the configs (full or lora)

model_name_or_path: rhymes-ai/Aria
tokenizer_path: rhymes-ai/Aria

to the ones corresponding to one of the base models:

model_name_or_path: rhymes-ai/Aria-Base-64K # rhymes-ai/Aria-Base-8K
tokenizer_path: rhymes-ai/Aria-Base-64K # rhymes-ai/Aria-Base-8K

Prepare dataset

Please refer to custom_dataset.md for how to prepare your dataset.

Fine-tune with LoRA

After preparing your dataset, follow these steps to fine-tune Aria using LoRA:

Open the configuration file recipes/config_lora.yaml. Locate the dataset_mixer section and update it with your dataset paths:

dataset_mixer:
  "path/to/dataset1": 1
  "path/to/dataset2": 0.5
  "path/to/dataset3": 2

Note on dataset mixing: Aria supports combining multiple datasets with different sampling rates. In the example above:

dataset1 will be used entirely (weight 1)

dataset2 will use 50% of its data (weight 0.5)

dataset3 will be used twice (weight 2)

Start the fine-tuning process by running the following command on one A100 (80GB) or H100 (80GB) GPU:

python aria/train.py --config recipes/config_lora.yaml

For multi-GPU training, use the accelerate library:

accelerate launch --config_file recipes/accelerate_configs/zero2.yaml aria/train.py --config recipes/config_lora.yaml --num_processes [number_of_gpus]

Choose from pre-configured accelerate settings in recipes/accelerate_configs/
Adjust the --num_processes argument to match your available GPUs
For custom configurations, refer to the accelerate documentation

Inference with the fine-tuned model:

See inference with LoRA support for how to inference with the fine-tuned model.

Full parameter fine-tuning

Everything is the same as the LoRA fine-tuning process, except for the configuration file recipes/config_full.yaml.

Full parameter tuning consumes more GPU memory, thus multiple GPUs are required. The following command has been tested on 8 A100 (80GB) GPUs.

accelerate launch --config_file recipes/accelerate_configs/zero2.yaml aria/train.py --config recipes/config_full.yaml

If you encounter out-of-memory errors, try reducing the per_device_train_batch_size in the config file. Adjust the gradient_accumulation_steps accordingly to maintain the effective training batch size.

per_device_train_batch_size: 8
gradient_accumulation_steps: 2

Memory consumption varies across datasets. Generally, more memory is required for multi-image and video datasets. Adjust the deepspeed_config parameters to optimize memory consumption, such as using zero_stage 3 and offloading parameters and optimizer to the CPU.

deepspeed_config:
  gradient_accumulation_steps: auto
  gradient_clipping: auto
  offload_optimizer_device: cpu
  offload_param_device: cpu
  zero3_init_flag: true
  zero_stage: 3

Inference with Your Trained Model

First, you need to extract the FP32 consolidated weights from ZeRO 1, 2, or 3 DeepSpeed checkpoints:

cd /path/to/your/output/dir
python zero_to_fp32.py . pytorch_model.bin

See inference.md for instructions on how to perform inference with the fine-tuned model.

Citation

If you find our work helpful, please consider citing.

@article{aria,
  title={Aria: An Open Multimodal Native Mixture-of-Experts Model}, 
  author={Dongxu Li and Yudong Liu and Haoning Wu and Yue Wang and Zhiqi Shen and Bowen Qu and Xinyao Niu and Guoyin Wang and Bei Chen and Junnan Li},
  year={2024},
  journal={arXiv preprint arXiv:2410.05993},
}

Name		Name	Last commit message	Last commit date
Latest commit History 201 Commits
.github/workflows		.github/workflows
aria		aria
assets		assets
docs		docs
examples		examples
gptfast		gptfast
inference/notebooks		inference/notebooks
recipes		recipes
tests		tests
.gitignore		.gitignore
.isort.cfg		.isort.cfg
LICENSE		LICENSE
Makefile		Makefile
README.md		README.md
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Aria

Introduction

News

Quick Start

Installation

Inference

Cookbook

Fine-tuning

Prepare dataset

Fine-tune with LoRA

Full parameter fine-tuning

Inference with Your Trained Model

Citation

About

Releases

Packages

Languages

License

ai-jz/aria

Folders and files

Latest commit

History

Repository files navigation

Aria

Introduction

News

Quick Start

Installation

Inference

Cookbook

Fine-tuning

Prepare dataset

Fine-tune with LoRA

Full parameter fine-tuning

Inference with Your Trained Model

Citation

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages