+
+ Multi-modal Large Language Models (MLLMs) are at the forefront of artificial intelligence research, aiming
+ to create models capable of understanding, learning from, and generating multiple data types, including
+ text, images, and sound. Despite their potential, significant challenges persist, including the integration
+ of suitable vision encoders with LLMs, the scarcity of comprehensive multi-modal datasets, and the need to
+ improve performance without prohibitive compute costs.
+
+ A performance gap currently exists between closed-source models, often developed by resource-rich
+ technology companies, and their open-source counterparts. However, the open-source community is making
+ substantial strides, driven by collaboration and the growing availability of open data and model weights.
+
+ Our work on the Bumblebee model, an open-source MLLM, exemplifies this progress. By implementing token
+ shrinkage through an efficient projector, the Scalable Token Shrinkage Resampler (STSR), Bumblebee has
+ surpassed the closed-source QwenVL Max on MMBench-Test-CN with a score of 75.9, using only open-source
+ data and a 14-billion-parameter LLM. It also exceeds the current open-source state of the art, Yi-34B-VL,
+ by 5.9 points on the same benchmark, despite using fewer parameters. This achievement underscores the
+ potential of open-source models to compete with, and even surpass, their closed-source counterparts,
+ signaling a promising future for open-source multi-modal learning.
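+
+ STSR's internals are not spelled out in this overview, but its name suggests a resampler-style projector
+ that compresses the vision encoder's token sequence before it reaches the LLM. The PyTorch sketch below
+ illustrates that general idea under assumptions, following the common learnable-query cross-attention
+ design; the class name, dimensions (a 1024-dim vision encoder, a 5120-dim hidden size typical of 14B
+ LLMs), and query count are hypothetical, not the actual STSR implementation.
+
+ ```python
+ # Hypothetical sketch of a token-shrinkage resampler; not the published STSR.
+ # A small set of learnable queries cross-attends to the full sequence of
+ # vision tokens, compressing it to a fixed, much shorter length.
+ import torch
+ import torch.nn as nn
+
+ class TokenShrinkageResampler(nn.Module):
+     def __init__(self, vision_dim=1024, llm_dim=5120, num_queries=64, num_heads=8):
+         super().__init__()
+         # Learnable queries set the (shrunken) output token count.
+         self.queries = nn.Parameter(torch.randn(num_queries, vision_dim) * 0.02)
+         self.attn = nn.MultiheadAttention(vision_dim, num_heads, batch_first=True)
+         self.norm = nn.LayerNorm(vision_dim)
+         # Project the compressed tokens into the LLM's embedding space.
+         self.proj = nn.Linear(vision_dim, llm_dim)
+
+     def forward(self, vision_tokens):  # vision_tokens: (batch, n_patches, vision_dim)
+         b = vision_tokens.size(0)
+         q = self.queries.unsqueeze(0).expand(b, -1, -1)
+         # Cross-attention: queries attend to all vision tokens.
+         out, _ = self.attn(q, vision_tokens, vision_tokens)
+         return self.proj(self.norm(out))  # (batch, num_queries, llm_dim)
+
+ # Example: 576 ViT patch tokens shrink to 64 LLM-ready tokens.
+ resampler = TokenShrinkageResampler()
+ llm_inputs = resampler(torch.randn(2, 576, 1024))  # -> (2, 64, 5120)
+ ```
+
+ A fixed query count decouples the LLM's input length from image resolution, which is presumably where a
+ design of this kind gets its efficiency.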
+
+