Title MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices

Authors Xiangxiang Chu¹, Limeng Qiao¹, Xinyang Lin¹, Shuang Xu¹, Yang Yang¹³, Yiming Hu¹, Fei Wei¹, Xinyu Zhang¹, Bo Zhang¹, Xiaolin Wei¹, Chunhua Shen²*

Reference Chu, Xiangxiang, et al. "MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices." arXiv preprint arXiv:2312.16886 (2023).

Github

https://github.com/Meituan-AutoML/MobileVLM

PDF

mobilevlm_v1.pdf


Note: The summarization was assisted by ChatGPT, except for the figure.

Summary

  1. MobileVLM is a lightweight multimodal vision–language model designed to run efficiently on mobile devices.
  2. It combines a CLIP-style vision encoder (ViT-L/14) with a compact language model (MobileLLaMA) and connects them through an efficient projector (LDP) for cross-modal interaction.
  3. The model is built at parameter scales of 1.4B and 2.7B to balance performance and computational efficiency (see Tables 6 and 7 in the paper).
  4. Despite its relatively small scale, MobileVLM achieves performance comparable to much larger vision–language models on multiple benchmarks.
  5. In addition, it demonstrates fast inference speeds on mobile hardware, making it suitable for real-time multimodal applications on edge devices.
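As a rough sanity check on the visual token budget implied by the components above, the arithmetic below assumes a 336×336 input resolution for the ViT-L/14 encoder and a projector that downsamples by 2× per spatial side; treat these exact numbers as assumptions of this summary rather than verified paper details:

```python
# Visual token count for a CLIP ViT-L/14 encoder at an assumed 336x336
# input, before and after a 2x-per-side downsampling projector (LDP).
image_size = 336          # assumed input resolution
patch_size = 14           # ViT-L/14 patch size
downsample = 2            # assumed per-side reduction by the projector

patches_per_side = image_size // patch_size          # 24 patches per side
encoder_tokens = patches_per_side ** 2               # tokens out of the encoder
llm_tokens = (patches_per_side // downsample) ** 2   # tokens fed to the LLM

print(encoder_tokens, llm_tokens)  # 576 144
```

Cutting the visual tokens from 576 to 144 shrinks the LLM's prompt length, which is a large part of why the projector choice matters for on-device latency.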

그림1.png (Figure 1)

0. Abstract

MobileVLM comprises

  1. a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch,
  2. a multimodal vision model that is pre-trained in the CLIP fashion,
  3. cross-modality interaction via an efficient projector.

Authors obtain inference speeds of

  1. 21.5 tokens per second on a mobile CPU and
  2. 65.3 tokens per second on an NVIDIA Jetson Orin GPU.
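To put those decoding rates in perspective, a back-of-the-envelope latency estimate follows; the 256-token response length is an illustrative assumption, not a figure from the paper:

```python
# Approximate wall-clock time to decode a response at the reported rates.
cpu_rate = 21.5        # tokens/s on the mobile CPU
gpu_rate = 65.3        # tokens/s on the Jetson Orin GPU
response_tokens = 256  # assumed response length for illustration

cpu_seconds = response_tokens / cpu_rate
gpu_seconds = response_tokens / gpu_rate
print(f"{cpu_seconds:.1f}s on CPU, {gpu_seconds:.1f}s on GPU")
# roughly 11.9s on CPU vs 3.9s on GPU
```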

1. Introduction