- 💌 Table of Contents
- 📰 News
- 🌈 Introduction
- 📣 Latest Developments
- ✨ Key Features
- 🔍 Installation
- 🔥 Tutorials
- 🤔 FAQ
- 📱 Model Library
- 📝 License
- 📌 Community
🔥 PaddleMIX Live Class on October 22, 2024
- 🎉 Registration for the PaddleMIX Multimodal Model Suite Development Competition is now open, with cash prizes and project certificates on offer. On Tuesday, October 22 at 19:00 🔑, a senior Baidu R&D engineer will explain in depth how to improve data quality with PaddleMIX multimodal models and will walk through the competition rules. Scan the QR code to follow the group announcements. 🚀 Registration Link:
PaddleMIX is a multimodal large model development suite based on PaddlePaddle that integrates modalities such as images, text, and video. It covers a wide range of multimodal tasks, including vision-language pre-training, fine-tuning, text-to-image, text-to-video, and multimodal understanding. It offers an out-of-the-box development experience while supporting flexible customization to meet diverse needs, supporting the exploration of artificial general intelligence.
The PaddleMIX toolchain includes data processing, model development, pre-training, fine-tuning, and inference deployment, supporting mainstream multimodal models such as EVA-CLIP, BLIP-2, and Stable Diffusion. With cross-modal task pipelines like AppFlow and text-to-image application pipelines, developers can quickly build multimodal applications.
Multimodal understanding 🤝 integrates visual 👀 and linguistic 💬 processing capabilities. It includes functions such as basic perception, fine-grained image understanding, and complex visual reasoning 🧠. Our Model Library offers practical applications for single-image, multi-image, and video inference. Features include natural image summarization 📝, question answering 🤔, OCR 🔍, sentiment recognition ❤️😢, specialized image analysis 🔬, and code interpretation 💻. These technologies can be applied in various fields such as education 📚, healthcare 🏥, industry 🏭, and more, enabling comprehensive intelligent analysis from static images 🖼️ to dynamic videos 🎥. We invite you to experience and explore these capabilities!
Multimodal generation ✍️ combines the creative power of text 💬 and visuals 👀. It includes various technologies ranging from text-to-image 🖼️ to text-to-video 🎥, featuring advanced models like Stable Diffusion 3 and Open-Sora. We provide practical applications for single-image generation, multi-image synthesis, and video generation in ppdiffusers. These features cover areas such as artistic creation 🎨, animation production 📽️, and content generation 📝. With these technologies, creative generation from static images to dynamic videos can be applied in fields like education 📚, entertainment 🎮, advertising 📺, and more. We invite you to experience and explore these innovations!
ComfyUI Creative Workflow | Art Style QR Code Model | Mix Image Overlay |
---|---|---|
Anime Text-to-Image | AI Art: 50+ LoRA Style Overlays | ControlNet: Partial Image Repainting |
🔥 PaddleMIX v2.1 Released on 2024.10.11
- Supports the PaddleNLP 3.0 beta, giving early access to its latest features.
- Added cutting-edge models such as Qwen2-VL, InternVL2, and Stable Diffusion 3 (SD3).
- Released PP-InsCapTagger, our self-developed capability tagging model for multimodal data, which can be used for data analysis and filtering. Experimental cases show that it can reduce data volume by 50% while maintaining model performance, significantly improving training efficiency.
- The multimodal large models InternVL2, LLaVA, SD3, and SDXL are now adapted to the Ascend 910B, offering training and inference on domestic (Chinese-made) computing chips.
PaddleMIX v2.0 Released on 2024.07.25
- Multimodal Understanding: Added LLaVA series, Qwen-VL, etc.; introduced Auto module to unify the SFT training process; introduced Mixtoken training strategy, increasing SFT throughput by 5.6 times.
- Multimodal Generation: Released PPDiffusers 0.24.1, supporting video generation capabilities, and added LCM to the text-to-image model. Also added a PaddlePaddle version of PEFT and the Accelerate backend. Provided a ComfyUI plugin developed with PaddlePaddle.
- Multimodal Data Processing Toolbox DataCopilot: Supports custom data structures, data transformation, and offline format checks. Includes basic statistical information and data visualization functionality.
PaddleMIX v1.0 Released on 2023.10.7
- Added distributed training capabilities for vision-language pre-training models, and BLIP-2 now supports trillion-scale training.
- Introduced the cross-modal application pipeline AppFlow, which supports 11 cross-modal applications such as automatic annotation, image editing, and audio-to-image with one click.
- PPDiffusers released version 0.19.3, adding SDXL and related tasks.
PaddleMIX supports a wide range of the latest mainstream algorithm benchmarks and pre-trained models, covering vision-language pre-training, text-to-image, and cross-modal visual tasks, and enabling diverse functionalities such as image editing, image description, and data annotation. Gateway: 📱 Model Library
PaddleMIX provides a unified model development interface, allowing developers to quickly integrate and customize models. With the Auto module, users can efficiently load pre-trained models, perform tokenization, and easily complete model training, fine-tuning (SFT), inference, and deployment through a simplified API. Additionally, the Auto module supports developers in customizing automated model integration, ensuring flexibility and scalability while enhancing development efficiency.
PaddleMIX offers high-performance distributed training and inference, integrating acceleration operators such as ✨Fused Linear✨ and ✨Flash Attention✨. It supports 🌀BF16 mixed-precision training and 4D hybrid-parallel strategies. Inference is optimized through convolution layout tuning, GroupNorm fusion, and rotary position embedding optimization. Together these significantly improve large-scale pre-training and inference performance.
The multimodal data processing toolbox, DataCopilot, accelerates model iteration and upgrades. It allows developers to perform basic data operations with low code based on specific tasks. Gateway: 🏆 Featured Models | Tools
git clone https://github.com/PaddlePaddle/PaddleMIX
cd PaddleMIX
conda create -n paddlemix python=3.10 -y
conda activate paddlemix
- CUDA 11.x or 12.3
- PaddlePaddle 3.0.0b1
sh build_paddle_env.sh
For detailed instructions on installing PaddlePaddle, please refer to the Installation Guide.
PaddleMIX currently supports the Ascend 910B chip (support for more models is in progress; if you need other models, please submit an issue to let us know). The supported Ascend driver version is 23.0.3. Because environments vary, we recommend preparing yours from the standard image provided by PaddlePaddle.
- Start the container with the command below; ASCEND_RT_VISIBLE_DEVICES specifies the IDs of the NPU cards that are visible inside the container.
docker run -it --name paddle-npu-dev -v $(pwd):/work \
--privileged --network=host --shm-size=128G -w=/work \
-v /usr/local/Ascend/driver:/usr/local/Ascend/driver \
-v /usr/local/bin/npu-smi:/usr/local/bin/npu-smi \
-v /usr/local/dcmi:/usr/local/dcmi \
-e ASCEND_RT_VISIBLE_DEVICES="0,1,2,3,4,5,6,7" \
registry.baidubce.com/device/paddle-npu:cann80T13-ubuntu20-$(uname -m)-gcc84-py39 /bin/bash
- Install PaddlePaddle inside the container
# Note: You need to install the CPU version of PaddlePaddle first. Currently, only Python 3.9 is supported.
python -m pip install --pre paddlepaddle -i https://www.paddlepaddle.org.cn/packages/nightly/cpu/
python -m pip install --pre paddle-custom-npu -i https://www.paddlepaddle.org.cn/packages/nightly/npu/
Run the following command to automatically install all necessary dependencies:
sh build_env.sh
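After the install scripts finish, a quick sanity check can confirm that the interpreter and the PaddlePaddle wheel match the versions listed above. The following is a minimal sketch, not part of PaddleMIX: `check_environment` is a hypothetical helper, and it assumes the wheel is published under the distribution name `paddlepaddle-gpu` or `paddlepaddle`.

```python
import sys
from importlib import metadata


def check_environment(expected_python=(3, 10),
                      candidates=("paddlepaddle-gpu", "paddlepaddle")):
    """Report the Python version and the installed PaddlePaddle wheel, if any.

    Hypothetical helper: `candidates` lists the distribution names that are
    assumed to provide PaddlePaddle.
    """
    report = {
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
        "python_ok": tuple(sys.version_info[:2]) == tuple(expected_python),
        "paddle_package": None,
        "paddle_version": None,
    }
    for name in candidates:
        try:
            report["paddle_version"] = metadata.version(name)
            report["paddle_package"] = name
            break
        except metadata.PackageNotFoundError:
            continue  # Try the next candidate distribution name.
    return report


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

On the Ascend path, where only Python 3.9 is supported, call `check_environment(expected_python=(3, 9))` instead. If `paddle_version` comes back as `None`, the wheel was most likely installed into a different environment than the active one.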
Quick Start
- Multimodal Understanding: Beginner's Experience
- Multimodal Generation: Zero-Basics Getting Started Guide
- Cross-Modal Task Pipeline: End-to-End Process Demonstration
Hands-On Practice & Examples
- LLaVA Model: Full Process Practice from Training to Inference
- SDXL Application: Create Your Own Olympic Poster Generator
Multi-Hardware Usage
- For the model list and usage supported by Ascend 910B, please refer to Ascend Hardware Usage
Data Preparation & Fine-Tuning
Inference Deployment
Multimodal Understanding | Multimodal Generation |
For more model capabilities, please refer to the Model Capability Matrix
AppFlow is PaddleMIX's cross-modal application pipeline. By integrating algorithms such as LLaVA and Stable Diffusion, it covers images, text, audio, and video. Its flexible pipeline design underpins more than ten multimodal applications, including text-to-image, text-to-video, and text-to-audio generation and image understanding, each with a demo example. AppFlow's highlight is one-click prediction: users run model inference with simple commands, without training or extensive coding, which significantly lowers the barrier to use. AppFlow also takes advantage of PaddlePaddle's unified dynamic and static graph modes: by setting a few parameters, users can export a dynamic model to a static graph and run high-performance inference, which makes it a one-stop route to application deployment.
Gateway: Application Documentation Example.
In real-world applications there is substantial demand for fine-tuning multimodal large models on proprietary data to improve performance, which puts data at the core of the process. For this reason, PaddleMIX provides the DataCopilot tool for data processing and analysis, giving developers an end-to-end development experience within the PaddleMIX suite.
PP-InsCapTagger (Instance Capability Tagger) is a dataset capability tagging model implemented by DataCopilot based on PaddleMIX. It is used to label the capabilities of multimodal data instances. By optimizing the dataset through instance capability distribution, it can improve model training efficiency and provide an efficient solution for dataset analysis and evaluation. Combining the model inference labeling results with the LLaVA SFT dataset optimization can improve LLaVA model training efficiency by 50% during the SFT phase.
Gateway: Application Documentation Example.
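To make the idea of optimizing a dataset by its capability distribution concrete, here is a minimal sketch of tag-stratified subsampling. It is an illustration only, not PP-InsCapTagger's actual selection method: the `balanced_subsample` function and the `"tag"` field are hypothetical, and each sample is assumed to carry one capability tag already produced by a tagging model.

```python
import random
from collections import defaultdict


def balanced_subsample(samples, keep_ratio=0.5, seed=0):
    """Keep about `keep_ratio` of the samples, stratified by capability tag.

    `samples` is a list of dicts with a hypothetical "tag" field. Each tag
    keeps at least one sample so that rare capabilities are not dropped.
    """
    rng = random.Random(seed)
    by_tag = defaultdict(list)
    for sample in samples:
        by_tag[sample["tag"]].append(sample)

    kept = []
    for tag in sorted(by_tag):
        group = by_tag[tag]
        n_keep = min(len(group), max(1, round(len(group) * keep_ratio)))
        kept.extend(rng.sample(group, n_keep))
    return kept


# Toy data: 20 "ocr" samples and 80 "vqa" samples.
data = [{"id": i, "tag": "ocr" if i % 5 == 0 else "vqa"} for i in range(100)]
subset = balanced_subsample(data, keep_ratio=0.5)
print(len(subset), "of", len(data), "samples kept")
```

Sampling within each tag, rather than over the whole dataset, keeps the reduced set's capability mix close to the original one. A real pipeline would more likely weight tags by how useful they are for the target model.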
PP-InsCapTagger
Model | ScienceQA | TextVQA | VQAv2 | GQA | MMMU | MME (perception / cognition) |
---|---|---|---|---|---|---|
llava-1.5-7b (origin) | 66.8 | 58.2 | 78.5 | 62 | - | - |
llava-1.5-7b (rerun) | 69.01 | 57.6 | 79 | 62.95 | 36.89 | 1521 / 323 |
llava-1.5-7b (random 50%) | 67.31 | 55.6 | 76.89 | 61.01 | 34.67 | 1421 / 286 |
llava-1.5-7b (our 50%) | 70.24 (+2.93) | 57.12 (+1.52) | 78.32 (+1.43) | 62.14 (+1.13) | 37.11 (+2.44) | 1476 (+55) / 338 (+52) |

Gateway: Application Documentation Example.
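The gains in parentheses in the table are measured against the random 50% row, not against the full-data rerun. The short check below recomputes them from the table values; it assumes the two MME numbers are the perception and cognition sub-scores, and the dictionary keys are labels chosen for this example.

```python
# Scores copied from the "random 50%" and "our 50%" rows of the table.
random_50 = {
    "ScienceQA": 67.31, "TextVQA": 55.6, "VQAv2": 76.89, "GQA": 61.01,
    "MMMU": 34.67, "MME perception": 1421, "MME cognition": 286,
}
ours_50 = {
    "ScienceQA": 70.24, "TextVQA": 57.12, "VQAv2": 78.32, "GQA": 62.14,
    "MMMU": 37.11, "MME perception": 1476, "MME cognition": 338,
}


def deltas(ours, baseline):
    """Return the per-benchmark difference, rounded to two decimals."""
    return {name: round(ours[name] - baseline[name], 2) for name in ours}


gains = deltas(ours_50, random_50)
for name, gain in gains.items():
    print(f"{name}: {gain:+}")
```

Every difference matches the value printed in the table, which confirms that random 50% is the baseline for the reported gains.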
For answers to common questions about the project, please refer to the FAQ. If your question is not covered there, feel free to open an issue.
This project is released under the Apache 2.0 license.
- Scan the QR code and fill out the questionnaire to join the communication group and engage deeply with numerous community developers and the official team.