AnyModal is a modular and extensible framework for integrating diverse input modalities (e.g., images, audio) into large language models (LLMs). It simplifies the process of combining different input types with LLMs by providing seamless tokenization, encoding, and generation pipelines.
- Introduction and Fundamentals
- What is AnyModal?
- Features
- Getting Started
- Example Usage
- Current Demos
- Contributions
Introduction and Fundamentals
Multimodal large language models (LLMs) extend the capabilities of traditional text-based models by incorporating inputs from diverse modalities such as images, audio, and structured data. These models transform non-textual inputs into representations that can be processed alongside text, enabling tasks like image captioning, visual question answering, and audio transcription. The key to this integration lies in preparing non-textual data in a way that aligns with the LLM’s text-based token processing pipeline.
A central part of this process is the input tokenizer, which consists of two main components. The first is a feature encoder, which extracts meaningful features from the raw input. When processing images, for example, a vision model like ViT (Vision Transformer) encodes the image into a sequence of latent feature vectors. These vectors capture information, such as object presence, spatial relationships, and textures, that is essential for tasks like image captioning or object detection.
Since the output of the feature encoder may not naturally align with the token space of the LLM, a projection layer is applied to transform these features into a compatible format. The projection layer maps the encoded features into the same embedding space as the LLM’s input tokens, ensuring seamless integration. For instance, image embeddings generated by ViT can be projected into a token space where they can be concatenated with text tokens. This allows the LLM to process multimodal data as a single sequence, enabling it to generate coherent outputs that reflect both textual and visual contexts.
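To make the encode-then-project step concrete, here is a minimal, framework-agnostic sketch in plain PyTorch and Hugging Face Transformers. The model choices (google/vit-base-patch16-224 and gpt2) are purely illustrative, and the snippet shows the general pattern that AnyModal packages into reusable components; it is not AnyModal's own code.
from PIL import Image
import torch
import torch.nn as nn
from transformers import ViTImageProcessor, ViTModel, AutoTokenizer, AutoModelForCausalLM
# Illustrative vision encoder / LLM pair; any combination works the same way.
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vit = ViTModel.from_pretrained('google/vit-base-patch16-224')
tokenizer = AutoTokenizer.from_pretrained('gpt2')
llm = AutoModelForCausalLM.from_pretrained('gpt2')
# Projection layer: maps ViT features into the LLM's embedding space (768 -> 768 here).
projection = nn.Linear(vit.config.hidden_size, llm.config.hidden_size)
image = Image.new('RGB', (224, 224))  # placeholder image
pixel_values = processor(images=image, return_tensors='pt').pixel_values
image_features = vit(pixel_values).last_hidden_state   # (1, 197, 768) patch features
image_tokens = projection(image_features)              # projected "image tokens"
text_ids = tokenizer("The interpretation of the given image is: ", return_tensors='pt').input_ids
text_embeds = llm.get_input_embeddings()(text_ids)     # (1, T, 768) text token embeddings
# One multimodal sequence: image tokens followed by text tokens.
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
outputs = llm(inputs_embeds=inputs_embeds)              # logits over the combined sequence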
What is AnyModal?
AnyModal leverages this approach by abstracting the complexities of integrating multiple modalities. By decoupling the feature encoding and projection steps, it provides a flexible and modular framework that allows users to mix and match components for various data types. This makes it easier to experiment with different modalities, expand system capabilities, and adapt to a wide range of use cases.
AnyModal fills a gap in existing tools for building multimodal systems. Most frameworks focus on specific tasks, but AnyModal is designed to be flexible and general-purpose. Whether you're working on vision-language models (VLMs), audio-based tasks, or even custom modalities, AnyModal lets you integrate them with pre-trained LLMs easily.
Features
- Modular Design: Plug and play different input modalities (e.g., vision, audio, or custom data types).
- Ease of Use: Minimal setup; implement your modality-specific tokenization and pass it to the framework.
- Extensibility: Add support for new modalities with just a few lines of code (see the sketch below).
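For example, here is a hypothetical sketch of wiring in an audio modality. It assumes, mirroring the vision example below, that the feature encoder is any module returning a sequence of feature vectors of shape (batch, time, hidden) and that the Projector maps them into the LLM embedding space; the exact contract expected by MultiModalModel is defined in anymodal.py.
import torch.nn as nn
from transformers import Wav2Vec2Model
from vision import Projector  # same projector class used in the vision example below
class AudioEncoder(nn.Module):
    # Hypothetical encoder: wraps a pretrained wav2vec 2.0 model and returns its
    # last hidden states as a (batch, time, hidden) feature sequence.
    def __init__(self, model_name='facebook/wav2vec2-base-960h'):
        super().__init__()
        self.model = Wav2Vec2Model.from_pretrained(model_name)
    def forward(self, input_values):
        return self.model(input_values).last_hidden_state
audio_encoder = AudioEncoder()
audio_tokenizer = Projector(
    in_features=audio_encoder.model.config.hidden_size,  # 768 for wav2vec2-base
    out_features=768,                                     # must match the LLM embedding size
)
# audio_encoder and audio_tokenizer can then be passed to MultiModalModel as
# input_encoder and input_tokenizer, just like the vision components below.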
Getting Started
- Python 3.8+
- Install dependencies:
pip install torch transformers datasets torchvision tqdm
- Copy the anymodal.py file to your project directory.
Example Usage
Here’s a basic example of using AnyModal with an image input:
from transformers import ViTImageProcessor, ViTForImageClassification
from anymodal import MultiModalModel
from vision import VisionEncoder, Projector
# Initialize vision components
processor = ViTImageProcessor.from_pretrained('google/vit-base-patch16-224')
vision_model = ViTForImageClassification.from_pretrained('google/vit-base-patch16-224')
vision_encoder = VisionEncoder(vision_model)
vision_tokenizer = Projector(in_features=vision_model.config.hidden_size, out_features=768)
# Load LLM components
from transformers import AutoTokenizer, AutoModelForCausalLM
llm_tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.2-1B")
llm_model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.2-1B")
# Initialize AnyModal
multimodal_model = MultiModalModel(
input_processor=None,
input_encoder=vision_encoder,
input_tokenizer=vision_tokenizer,
language_tokenizer=llm_tokenizer,
language_model=llm_model,
input_start_token='<|imstart|>',
input_end_token='<|imend|>',
prompt_text="The interpretation of the given image is: "
)
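Once assembled, the model behaves like a standard PyTorch module. The short sketch below is hypothetical: the generate call, its arguments, and the expected input format are assumptions rather than a documented API, so refer to anymodal.py and the demo projects for the actual training and inference interface.
import torch
from PIL import Image
# Hypothetical inference call; the generate() signature here is an assumption.
image = Image.new('RGB', (224, 224))  # placeholder image
pixel_values = processor(images=image, return_tensors='pt').pixel_values
multimodal_model.eval()
with torch.no_grad():
    caption = multimodal_model.generate(pixel_values, max_new_tokens=120)
print(caption)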
Current Demos
- LaTeX OCR
- Chest X-Ray Captioning (in progress)
- Image Captioning
- Visual Question Answering (planned)
- Audio Captioning (planned)
Contributions
Contributions are welcome! Here's how you can help:
- Fork the repository and clone it to your local machine.
- Create a new branch for your feature or bug fix.
- Submit a pull request describing your changes.
Let’s make AnyModal better together!
For more discussions and updates, visit the AnyModal subreddit.
This project is licensed under the MIT License. See the LICENSE file for details.
If you have any questions or feedback, feel free to open an issue or start a discussion in the GitHub repository.