MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It combines a set of mobile-oriented architectural designs and techniques: language models with 1.4B and 2.7B parameters trained from scratch (MobileLLaMA), a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector.
The MobileVLM architecture utilizes MobileLLaMA as its language model; it takes an image and a language instruction as inputs and produces a language response as output.
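To make the cross-modality interaction concrete, here is a minimal toy sketch of the data flow described above: a CLIP-style vision encoder produces patch features, an efficient projector maps them into the language model's embedding space, and the projected visual tokens are concatenated with the text token embeddings before the language model decodes a response. Every module and dimension below is a placeholder chosen for illustration; this is not the actual MobileVLM implementation.

```python
import torch
import torch.nn as nn

# Toy dimensions, chosen only to keep the sketch light; the real model uses
# a CLIP ViT encoder and a 1.4B/2.7B-parameter MobileLLaMA.
vision_dim, llm_dim, vocab_size = 64, 128, 1000

vision_encoder = nn.Linear(vision_dim, vision_dim)   # stand-in for the CLIP-style vision encoder
projector = nn.Sequential(                           # stand-in for the efficient projector
    nn.Linear(vision_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
)
token_embedding = nn.Embedding(vocab_size, llm_dim)  # stand-in for MobileLLaMA token embeddings
language_model = nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True)  # stand-in for the LLM

image_patches = torch.randn(1, 144, vision_dim)                       # [batch, patches, features]
visual_tokens = projector(vision_encoder(image_patches))              # project into the LLM embedding space
text_tokens = token_embedding(torch.randint(0, vocab_size, (1, 16)))  # embedded prompt tokens
sequence = torch.cat([visual_tokens, text_tokens], dim=1)             # visual tokens prepended to the text
hidden_states = language_model(sequence)                              # the LLM decodes the response from here
print(hidden_states.shape)  # torch.Size([1, 160, 128])
```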
See the official GitHub project page for more information.
In this tutorial, we consider how to use the MobileVLM model to build a multimodal language assistant with the help of OpenVINO.
The tutorial consists of the following steps:
- Install requirements
- Clone MobileVLM repository
- Import required packages
- Load the model
- Convert model to OpenVINO Intermediate Representation (IR) (see the sketch after this list)
- Inference
  - Load OpenVINO model
  - Prepare input data
  - Run generation process
- Interactive inference
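The conversion and inference steps follow the standard OpenVINO workflow: convert the PyTorch model to IR, save it, compile it for a target device, and run inference requests. The hedged sketch below uses a tiny stand-in torch module rather than the real MobileVLM components, purely to illustrate that convert-save-compile-infer pattern under those assumptions; in the notebook the same pattern is applied to the MobileVLM sub-models.

```python
import numpy as np
import torch
import openvino as ov

class TinyStandIn(torch.nn.Module):
    """Placeholder module; in the notebook, MobileVLM components are converted instead."""
    def forward(self, x):
        return torch.nn.functional.relu(x)

example_input = torch.randn(1, 8)
ov_model = ov.convert_model(TinyStandIn(), example_input=example_input)  # PyTorch -> OpenVINO model
ov.save_model(ov_model, "tiny_stand_in.xml")                             # serialize the IR (.xml + .bin)

core = ov.Core()
compiled = core.compile_model("tiny_stand_in.xml", "CPU")                # load the IR and compile for a device
result = compiled(np.random.randn(1, 8).astype(np.float32))[0]           # run one inference request
print(result.shape)  # (1, 8)
```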
This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start.
For details, please refer to the Installation Guide.