Mobile language assistant with MobileVLM and OpenVINO

MobileVLM is a competent multimodal vision language model (MMVLM) targeted to run on mobile devices. It combines a set of mobile-oriented architectural designs and techniques: language models at the 1.4B and 2.7B parameter scale trained from scratch, a multimodal vision model pre-trained in the CLIP fashion, and cross-modality interaction via an efficient projector.

The MobileVLM architecture uses MobileLLaMA as its language model. It takes an image $\mathbf{X}_v$ and a language instruction $\mathbf{X}_q$ as inputs and produces a language response $\mathbf{Y}_a$ as output. LDP refers to the lightweight downsample projector that connects the vision model to the language model.

See more information on the official GitHub project page.

In this tutorial we consider how to use the MobileVLM model to build a multimodal language assistant with the help of OpenVINO.

Notebook contents

The tutorial consists of the following steps:

  • Install requirements
  • Clone MobileVLM repository
  • Import required packages
  • Load the model
  • Convert model to OpenVINO Intermediate Representation (IR) (see the sketch after this list)
  • Inference
    • Load OpenVINO model
    • Prepare input data
    • Run generation process
  • Interactive inference
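
The conversion and inference steps follow the standard OpenVINO workflow: export the PyTorch modules to IR with ov.convert_model, save them, then compile and run them with ov.Core. The sketch below is a minimal illustration of that workflow, not the notebook's actual code; a tiny dummy module stands in for MobileVLM so the snippet is runnable, and the 336x336 input shape is only an assumption. The real notebook converts the vision encoder and the language model separately and wires them together for text generation.

    # Minimal sketch of the convert -> save -> compile -> infer workflow.
    # A tiny dummy module stands in for MobileVLM so the snippet runs end to end;
    # the actual notebook converts the vision encoder and the language model
    # separately and wires them together for text generation.
    import torch
    import openvino as ov

    # Stand-in for a MobileVLM submodule (e.g. the CLIP-style vision encoder).
    dummy_model = torch.nn.Sequential(torch.nn.Conv2d(3, 8, kernel_size=3), torch.nn.Flatten())
    dummy_model.eval()

    # Convert the PyTorch module to OpenVINO IR by tracing it with an example input.
    example_input = torch.zeros(1, 3, 336, 336)
    ov_model = ov.convert_model(dummy_model, example_input=example_input)
    ov.save_model(ov_model, "model.xml")

    # Load the IR, compile it for a device, and run inference.
    core = ov.Core()
    compiled_model = core.compile_model("model.xml", device_name="CPU")
    result = compiled_model(example_input.numpy())[0]  # first output tensor
    print(result.shape)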

Installation instructions

This is a self-contained example that relies solely on its own code.
We recommend running the notebook in a virtual environment. You only need a Jupyter server to start. For details, please refer to the Installation Guide.
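
As a rough illustration of the setup step, a typical first cell in OpenVINO notebooks installs the dependencies with pip inside the active environment. The package list below is an assumption shown for illustration only; the authoritative requirements are defined in the notebook's own install cell and in the MobileVLM repository.

    # Example install cell. The package list is an assumption; the authoritative
    # requirements are defined in the notebook itself and the MobileVLM repository.
    %pip install -q "openvino>=2023.2.0" "torch>=2.1.0" "transformers" "gradio" --extra-index-url https://download.pytorch.org/whl/cpu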