Title
MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices
Authors
Xiangxiang Chu¹ , Limeng Qiao¹, Xinyang Lin¹, Shuang Xu¹, Yang Yang¹³, Yiming Hu¹, Fei Wei¹, Xinyu Zhang¹, Bo Zhang¹, Xiaolin Wei¹, Chunhua Shen²*
- Affiliations:
¹ Meituan Inc. ² Zhejiang University, China ³ Dalian University of Technology, China
- Work done as an intern at Meituan.
Reference
Chu, Xiangxiang, et al. "MobileVLM: A Fast, Strong and Open Vision Language Assistant for Mobile Devices." arXiv preprint arXiv:2312.16886 (2023).
Github
https://github.com/Meituan-AutoML/MobileVLM
PDF
mobilevlm_v1.pdf
Note: The summarization was assisted by ChatGPT, except for the figure.
Summarization
- MobileVLM is a lightweight multimodal vision–language model designed to run efficiently on mobile devices.
- It combines a CLIP-style vision encoder (ViT-L/14) with a compact language model (MobileLLaMA) and connects them through an efficient projector (LDP) for cross-modal interaction.
- The model is built at parameter scales of 1.4B and 2.7B to balance performance and computational efficiency. (See Tables 6 and 7 in the paper.)
- Despite its relatively small scale, MobileVLM achieves performance comparable to much larger vision–language models on multiple benchmarks.
- In addition, it demonstrates fast inference speeds on mobile hardware, making it suitable for real-time multimodal applications on edge devices.
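As a rough illustration of the pipeline described above (not the paper's implementation), the sketch below wires CLIP-style visual features into an LLM embedding space through a downsampling projector in the spirit of LDP. The class name, the dimensions (1024-d ViT features, 2048-d LLM embeddings), and the simplified single depthwise-conv design are assumptions; the paper's actual LDP includes additional layers and skip connections.

```python
import torch
import torch.nn as nn

class LDPSketch(nn.Module):
    """Hypothetical sketch of a lightweight downsample projector.

    Pointwise (linear) layers remap vision-encoder features to the LLM
    width; a stride-2 depthwise conv shrinks the 24x24 visual token grid
    (576 tokens from ViT-L/14 at 336px) to 12x12 (144 tokens), cutting
    the visual tokens fed to the LLM by 75%.
    """

    def __init__(self, vision_dim=1024, llm_dim=2048):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(vision_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )
        # Depthwise conv (groups=channels) does the spatial downsampling.
        self.dw = nn.Conv2d(llm_dim, llm_dim, 3, stride=2, padding=1, groups=llm_dim)
        self.pw = nn.Conv2d(llm_dim, llm_dim, 1)

    def forward(self, x):              # x: (B, 576, vision_dim)
        x = self.mlp(x)                # project to LLM width
        b, n, c = x.shape
        h = int(n ** 0.5)              # 24x24 token grid
        x = x.transpose(1, 2).reshape(b, c, h, h)
        x = self.pw(self.dw(x))        # (B, c, 12, 12)
        return x.flatten(2).transpose(1, 2)  # (B, 144, llm_dim)
```

The downsampled tokens would then be concatenated with the text token embeddings before being fed to the language model.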

0. Abstract
MobileVLM comprises
- a set of language models at the scale of 1.4B and 2.7B parameters, trained from scratch,
- a multimodal vision model that is pre-trained in the CLIP fashion,
- cross-modality interaction via an efficient projector.
The authors obtain state-of-the-art inference speeds of
- 21.5 tokens per second on a Snapdragon 888 mobile CPU and
- 65.3 tokens per second on an NVIDIA Jetson Orin GPU.
1. Introduction