MAVIS: Mathematical Visual Instruction Tuning

Zhang, Renrui; Wei, Xinyu; Jiang, Dongzhi; Zhang, Yichi; Guo, Ziyu; Tong, Chengzhuo; Liu, Jiaming; Zhou, Aojun; Wei, Bin; Zhang, Shanghang; Gao, Peng; Li, Hongsheng

arXiv:2407.08739v1 (cs)

[Submitted on 11 Jul 2024 (this version), latest version 1 Nov 2024 (v2)]

Abstract: Multi-modal Large Language Models (MLLMs) have recently emerged as a significant focus in academia and industry. Despite their proficiency in general multi-modal scenarios, their mathematical problem-solving capabilities in visual contexts remain insufficiently explored. We identify three key areas within MLLMs that need to be improved: visual encoding of math diagrams, diagram-language alignment, and mathematical reasoning skills. This creates an urgent demand for large-scale, high-quality data and training pipelines in visual mathematics. In this paper, we propose MAVIS, the first MAthematical VISual instruction tuning paradigm for MLLMs, involving a series of mathematical visual datasets and specialized MLLMs. Targeting the three issues, MAVIS contains three progressive training stages from scratch. First, we curate MAVIS-Caption, consisting of 558K diagram-caption pairs, to fine-tune a math-specific vision encoder (CLIP-Math) through contrastive learning, tailored for improved diagram visual encoding. Second, we utilize MAVIS-Caption to align CLIP-Math with a large language model (LLM) via a projection layer, enhancing vision-language alignment in mathematical domains. Third, we introduce MAVIS-Instruct, comprising 900K meticulously collected and annotated visual math problems, which we use to instruction-tune the MLLM for robust mathematical reasoning skills. In MAVIS-Instruct, we incorporate complete chain-of-thought (CoT) rationales for each problem and minimize textual redundancy, thereby focusing the model on the visual elements. Data and Models are released at this https URL
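
The abstract states only that CLIP-Math is fine-tuned on diagram-caption pairs "through contrastive learning" and does not give the loss. As a rough illustration of what such an objective could look like, the sketch below computes a CLIP-style symmetric contrastive loss over a small batch of paired embeddings. The function names, the temperature value of 0.07, and the symmetric form itself are assumptions made for illustration, not details taken from the paper.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cross_entropy_row(logits, target):
    """Cross-entropy of one row of logits against a target index (stable log-sum-exp)."""
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(z - peak) for z in logits))
    return log_sum - logits[target]


def symmetric_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: pair i (diagram i, caption i) is the positive, all others are negatives.

    temperature=0.07 is an assumed value, not one reported in the abstract.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j] = cosine similarity of diagram i and caption j, scaled by temperature
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Diagram-to-caption direction: each row should pick its own caption.
    loss_i2t = sum(cross_entropy_row(logits[i], i) for i in range(n)) / n
    # Caption-to-diagram direction: each column should pick its own diagram.
    loss_t2i = sum(
        cross_entropy_row([logits[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_i2t + loss_t2i)
```

Under these assumptions, a batch whose diagram and caption embeddings line up pairwise yields a loss near zero, while a batch with the captions shuffled yields a large loss, which is the signal that would push the encoder toward math-specific diagram features.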

Comments:	Work in progress. Data and Models are released at this https URL
Subjects:	Computer Vision and Pattern Recognition (cs.CV)
Cite as:	arXiv:2407.08739 [cs.CV]
	(or arXiv:2407.08739v1 [cs.CV] for this version)
	https://doi.org/10.48550/arXiv.2407.08739

Submission history

From: Renrui Zhang
[v1] Thu, 11 Jul 2024 17:59:47 UTC (6,634 KB)
[v2] Fri, 1 Nov 2024 22:14:24 UTC (9,606 KB)
