RoboMamba: Multimodal State Space Model for Efficient Robot Reasoning and Manipulation

Liu, Jiaming; Liu, Mengzhen; Wang, Zhenyu; Lee, Lily; Zhou, Kaichen; An, Pengju; Yang, Senqiao; Zhang, Renrui; Guo, Yandong; Zhang, Shanghang

arXiv:2406.04339v1 (cs)

[Submitted on 6 Jun 2024 (this version), latest version 14 Dec 2024 (v2)]

Abstract: A fundamental objective in robot manipulation is to enable models to comprehend visual scenes and execute actions. Although existing robot Multimodal Large Language Models (MLLMs) can handle a range of basic tasks, they still face challenges in two areas: 1) inadequate reasoning ability to tackle complex tasks, and 2) high computational costs for MLLM fine-tuning and inference. The recently proposed state space model (SSM) known as Mamba demonstrates promising capabilities in non-trivial sequence modeling with linear inference complexity. Inspired by this, we introduce RoboMamba, an end-to-end robotic MLLM that leverages the Mamba model to deliver both robotic reasoning and action capabilities, while maintaining efficient fine-tuning and inference. Specifically, we first integrate the vision encoder with Mamba, aligning visual data with language embedding through co-training, empowering our model with visual common sense and robot-related reasoning. To further equip RoboMamba with action pose prediction abilities, we explore an efficient fine-tuning strategy with a simple policy head. We find that once RoboMamba possesses sufficient reasoning capability, it can acquire manipulation skills with minimal fine-tuning parameters (0.1% of the model) and time (20 minutes). In experiments, RoboMamba demonstrates outstanding reasoning capabilities on general and robotic evaluation benchmarks. Meanwhile, our model showcases impressive pose prediction results in both simulation and real-world experiments, achieving inference speeds 7 times faster than existing robot MLLMs. Our project web page: this https URL
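The abstract's point about linear inference complexity can be illustrated with a toy example. The sketch below is not RoboMamba or Mamba itself: it is a plain scalar linear state space recurrence, h_t = a*h_{t-1} + b*x_t with output y_t = c*h_t, and the function name `ssm_scan` and the parameters `a`, `b`, `c` are hypothetical names chosen for illustration. It only shows why a recurrent state lets the model process a sequence in a single pass, with work proportional to sequence length.

```python
from typing import List


def ssm_scan(xs: List[float], a: float, b: float, c: float) -> List[float]:
    """Run a scalar linear state space recurrence over a sequence.

    h_t = a * h_{t-1} + b * x_t   (state update)
    y_t = c * h_t                 (readout)

    Illustrative only: Mamba uses input-dependent (selective) parameters
    and vector-valued states, which this toy version omits.
    """
    h = 0.0          # hidden state, initialised to zero
    ys = []
    for x in xs:     # one constant-cost step per input: O(len(xs)) overall
        h = a * h + b * x
        ys.append(c * h)
    return ys


if __name__ == "__main__":
    # An impulse input decays geometrically through the state.
    print(ssm_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0))
```

Each step touches only the fixed-size state, so the cost per token does not grow with the length of the preceding context, in contrast to attention over all earlier tokens.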

Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2406.04339 [cs.CV]
	(or arXiv:2406.04339v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2406.04339

Submission history

From: Jiaming Liu
[v1] Thu, 6 Jun 2024 17:59:47 UTC (4,198 KB)
[v2] Sat, 14 Dec 2024 18:41:03 UTC (4,388 KB)