AnyModal is a modular and extensible framework for integrating diverse input modalities (e.g., images, audio) into large language models (LLMs). It simplifies the process of combining different input types with LLMs by providing seamless tokenization, encoding, and generation pipelines.
- Introduction and Fundamentals
- What is AnyModal?
- Features
- Getting Started
- Example Usage
- Current Demos
- Contributions
Introduction and Fundamentals
Multimodal large language models (LLMs) extend the capabilities of traditional text-based models by incorporating inputs from diverse modalities such as images, audio, and structured data. These models transform non-textual inputs into representations that can be processed alongside text, enabling tasks like image captioning, visual question answering, and audio transcription. The key to this integration lies in preparing non-textual data in a way that aligns with the LLM’s text-based token processing pipeline.
A central part of this process is the input tokenizer, which consists of two main components. The first is a feature encoder, which extracts meaningful features from the raw input. When processing images, for example, a vision model like ViT (Vision Transformer) encodes the image into a sequence of latent feature vectors. These vectors capture information, such as object presence, spatial relationships, and textures, that is essential for tasks like image captioning or object detection.
Since the output of the feature encoder may not naturally align with the token space of the LLM, a projection layer is applied to transform these features into a compatible format. The projection layer maps the encoded features into the same embedding space as the LLM’s input tokens, ensuring seamless integration. For instance, image embeddings generated by ViT can be projected into a token space where they can be concatenated with text tokens. This allows the LLM to process multimodal data as a single sequence, enabling it to generate coherent outputs that reflect both textual and visual contexts.
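To make the encode-then-project step concrete, here is a minimal, framework-agnostic sketch in plain PyTorch and Hugging Face Transformers. The model choices (google/vit-base-patch16-224 and gpt2) are purely illustrative, and the snippet shows the general pattern that AnyModal packages into reusable components; it is not AnyModal's own code.
from PIL import Image
import torch
import torch.nn as nn
from transformers import ViTImageProcessor, ViTModel, AutoTokenizer, AutoModelForCausalLM
# Illustrative vision encoder / LLM pair; any combination works the same way.
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vit = ViTModel.from_pretrained('google/vit-base-patch16-224')
tokenizer = AutoTokenizer.from_pretrained('gpt2')
llm = AutoModelForCausalLM.from_pretrained('gpt2')
# Projection layer: maps ViT features into the LLM's embedding space (768 -> 768 here).
projection = nn.Linear(vit.config.hidden_size, llm.config.hidden_size)
image = Image.new('RGB', (224, 224))  # placeholder image
pixel_values = processor(images=image, return_tensors='pt').pixel_values
image_features = vit(pixel_values).last_hidden_state   # (1, 197, 768) patch features
image_tokens = projection(image_features)              # projected "image tokens"
text_ids = tokenizer("The interpretation of the given image is: ", return_tensors='pt').input_ids
text_embeds = llm.get_input_embeddings()(text_ids)     # (1, T, 768) text token embeddings
# One multimodal sequence: image tokens followed by text tokens.
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
outputs = llm(inputs_embeds=inputs_embeds)              # logits over the combined sequence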
What is AnyModal?
AnyModal leverages this approach by abstracting the complexities of integrating multiple modalities. By decoupling the feature encoding and projection steps, it provides a flexible and modular framework that allows users to mix and match components for various data types. This makes it easier to experiment with different modalities, expand system capabilities, and adapt to a wide range of use cases.
AnyModal fills a gap in existing tools for building multimodal systems. Most frameworks focus on specific tasks, but AnyModal is designed to be flexible and general-purpose. Whether you're working on vision-language models (VLMs), audio-based tasks, or even custom modalities, AnyModal lets you integrate them with pre-trained LLMs easily.
Features
- Modular Design: Plug and play different input modalities (e.g., vision, audio, or custom data types).
- Ease of Use: Minimal setup; implement your modality-specific tokenization and pass it to the framework.
- Extensibility: Add support for new modalities with just a few lines of code (see the sketch below).
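For example, here is a hypothetical sketch of wiring in an audio modality. It assumes, mirroring the vision example below, that the feature encoder is any module returning a sequence of feature vectors of shape (batch, time, hidden) and that the Projector maps them into the LLM embedding space; the exact contract expected by MultiModalModel is defined in anymodal.py.
import torch.nn as nn
from transformers import Wav2Vec2Model
from vision import Projector  # same projector class used in the vision example below
class AudioEncoder(nn.Module):
    # Hypothetical encoder: wraps a pretrained wav2vec 2.0 model and returns its
    # last hidden states as a (batch, time, hidden) feature sequence.
    def __init__(self, model_name='facebook/wav2vec2-base-960h'):
        super().__init__()
        self.model = Wav2Vec2Model.from_pretrained(model_name)
    def forward(self, input_values):
        return self.model(input_values).last_hidden_state
audio_encoder = AudioEncoder()
audio_tokenizer = Projector(
    in_features=audio_encoder.model.config.hidden_size,  # 768 for wav2vec2-base
    out_features=768,                                     # must match the LLM embedding size
)
# audio_encoder and audio_tokenizer can then be passed to MultiModalModel as
# input_encoder and input_tokenizer, just like the vision components below.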
Getting Started
- Python 3.8+
- Install dependencies:
pip install torch transformers datasets torchvision tqdm
- Copy the anymodal.py file to your project directory.
Example Usage
Here’s a basic example of using AnyModal with an image input:
from transformers import ViTImageProcessor, ViTForImageClassification
from anymodal import MultiModalModel
from vision import VisionEncoder, Projector
# Initialize vision components
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vision_model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
vision_encoder = VisionEncoder(vision_model)
vision_tokenizer = Projector(in_features=vision_model.config.hidden_size, out_features=768)
# Load LLM components
from transformers import AutoTokenizer, AutoModelForCausalLM
llm_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
llm_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# Initialize AnyModal
multimodal_model = MultiModalModel(
input_processor=None,
input_encoder=vision_encoder,
input_tokenizer=vision_tokenizer,
language_tokenizer=llm_tokenizer,
language_model=llm_model,
input_start_token='<|imstart|>',
input_end_token='<|imend|>',
prompt_text="The interpretation of the given image is: "
)
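Once assembled, the model behaves like a standard PyTorch module. The short sketch below is hypothetical: the generate call, its arguments, and the expected input format are assumptions rather than a documented API, so refer to anymodal.py and the demo projects for the actual training and inference interface.
import torch
from PIL import Image
# Hypothetical inference call; the generate() signature here is an assumption.
image = Image.new('RGB', (224, 224))  # placeholder image
pixel_values = processor(images=image, return_tensors='pt').pixel_values
multimodal_model.eval()
with torch.no_grad():
    caption = multimodal_model.generate(pixel_values, max_new_tokens=120)
print(caption)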
Current Demos
- LaTeX OCR
- Chest X-Ray Captioning (in progress)
- Image Captioning
- Visual Question Answering (planned)
- Audio Captioning (planned)
Contributions
Contributions are welcome! Here's how you can help:
- Fork the repository and clone it to your local machine.
- Create a new branch for your feature or bug fix.
- Submit a pull request describing your changes.
Let’s make AnyModal better together!
For more discussions and updates, visit the AnyModal subreddit.
This project is licensed under the MIT License. See the LICENSE file for details.
If you have any questions or feedback, feel free to open an issue or start a discussion in the GitHub repository.