mR²AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

Tao Zhang^1,2,4, Ziqi Zhang^1,6, Zongyang Ma^1,2,4, Yuxin Chen², Zhongang Qi⁵, Chunfeng Yuan¹,
Bing Li^1,6,†, Junfu Pu², Yuxuan Zhao³, Zehua Xie³, Jin Ma³, Ying Shan², Weiming Hu^1,4,7
¹State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; ²PCG ARC Lab, ³Tencent;
⁴School of Artificial Intelligence, University of Chinese Academy of Sciences; ⁵Huawei Noah’s Ark Lab;
⁶PeopleAI Inc; ⁷School of Information Science and Technology, ShanghaiTech University
{zhangtao2023, mazongyang2020}@ia.ac.cn, {ziqi.zhang, cfyuan, bli, wmhu}@nlpr.ia.ac.cn
{uasonchen, jevinpu, zehuaxie, martyzhao, daniellwang, yingsshan}@tencent.com
[email protected]

Abstract

Advanced Multimodal Large Language Models (MLLMs) struggle with recent Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA, due to their limited and frozen knowledge scope, often leading to ambiguous and inaccurate responses. Thus, multimodal Retrieval-Augmented Generation (mRAG) is naturally introduced to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding the knowledge scope. However, current mRAG methods have inherent drawbacks, including: 1) performing retrieval even when external knowledge is not needed; 2) failing to identify the evidence that supports the query; 3) increasing model complexity due to additional information filtering modules or rules. To address these shortcomings, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation (mR²AG), which achieves adaptive retrieval and useful information localization through two easy-to-implement reflection operations, avoiding high model complexity. In mR²AG, Retrieval-Reflection is designed to distinguish different user queries and avoid redundant retrieval calls, and Relevance-Reflection is introduced to guide the MLLM in locating beneficial evidence in the retrieved content and generating answers accordingly. In addition, mR²AG can be integrated into any well-trained MLLM with efficient fine-tuning on the proposed mR²AG Instruction-Tuning dataset (mR²AG-IT). mR²AG significantly outperforms state-of-the-art MLLMs (e.g., GPT-4v/o) and RAG-based MLLMs on INFOSEEK and Encyclopedic-VQA, while maintaining the exceptional capabilities of base MLLMs across a wide range of Visual-dependent tasks.

† Corresponding author.
Work done during Zhongang Qi’s tenure at Tencent PCG ARC Lab.

1 Introduction

Figure 1: Comparisons of different methods on Visual-dependent and Knowledge-based VQA tasks: 1) Typical MLLMs use the image $I$ and question $Q$ as inputs, offering limited support for Knowledge-based questions. 2) Naive mRAG uses $I$, $Q$, and retrieved content $P_{1,2,3}$ as inputs in all cases, inevitably introducing irrelevant noise. 3) mR²AG adaptively determines the necessity of retrieval and effectively locates the useful context, i.e., $P_{3}$ for $Q_{2}$.

The rapid development of Multimodal Large Language Models (MLLMs) [2, 26, 30, 29, 1, 40] enables them to excel in Visual-dependent VQA tasks that rely solely on visual content or common sense, such as VQAv2 [17], GQA [19] and TextVQA [42]. However, recently proposed Knowledge-based VQA tasks, such as INFOSEEK [9] and Encyclopedic-VQA [35], introduce fine-grained visual entities and focus on encyclopedic knowledge, posing significant new challenges for existing MLLMs. As shown for $Q_{2}$ in Figure 1, when queried about the first flight date of an airplane, typical MLLMs tend to provide inaccurate and overly general responses due to their limited and frozen knowledge scope.

To obtain accurate and specific answers, some works [18, 7, 45] introduce multimodal Retrieval-Augmented Generation (mRAG) that leverages external knowledge bases. They first use retrievers to find query-related information, then input them into the MLLM for answer generation. Nevertheless, several challenges remain: 1) Indiscriminate use of retrieval. Visual-dependent questions usually do not require external knowledge, and conducting retrieval for them may introduce noise and lead to wrong answers, such as the response of Naive mRAG to $Q_{1}$ in Figure 1. 2) Lack of explicit evidence localization. When encountering questions beyond one’s knowledge, humans first search for relevant information and then obtain reliable answers by locating direct evidence, e.g., $P_{3}$ of $P$ in Figure 1. However, current methods input retrieved text into the model to directly generate answers, lacking an explicit evidence localization process, making it difficult to determine whether the model can effectively utilize the useful retrieved information. 3) High model complexity. To improve performance, some methods introduce complex rules or even external models to calculate query-passage correlations to filter retrieved content.

To address the above challenges, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation (mR²AG), designed to seamlessly leverage the inherent instruction-following, multimodal understanding and reasoning abilities of MLLMs to enhance Knowledge-based question answering. To decide whether retrieval is necessary, mR²AG introduces Retrieval-Reflection, which determines whether the user’s query is Knowledge-based or Visual-dependent and retrieves adaptively, thus expanding MLLMs’ knowledge scope while maintaining the original performance. Furthermore, mR²AG imitates human behavior to implement Relevance-Reflection, which guides the model to explicitly identify the evidence within all retrieved information and to generate accurate responses based on the identified evidence. Implementing the proposed reflection operations involves only modifications to the MLLMs’ vocabulary, without introducing any additional modules or computational strategies that would alter the original structure of the models, and therefore couples effectively with the MLLMs’ existing capabilities.

To quickly integrate mR²AG into pre-trained MLLMs, we provide a corresponding mR²AG Instruction Tuning dataset (mR²AG-IT) specifically constructed for Knowledge-based VQA tasks through an automated annotation pipeline. Specifically, mR²AG-IT annotates the evidence paragraphs within Wikipedia articles that explicitly support user queries. Comprehensive comparisons on Knowledge-based VQA datasets show that our approach significantly outperforms existing methods. When using LLaVA-v1.5-7B [30] as the base MLLM, applying mR²AG not only achieves performance gains of 10.6% and 15.5% over the previous SOTAs on the INFOSEEK ${}_{\text{Human}}$ and INFOSEEK ${}_{\text{Wikidata}}$ test sets, but also surpasses SOTAs on the Encyclopedic-VQA (Enc-VQA) test set by 2.5% on single-hop questions and 18.2% on multi-answer questions. Moreover, our method retains the excellent capabilities of the base MLLM on Visual-dependent VQA tasks, demonstrating performance comparable to LLaVA-v1.5-7B across various benchmarks.

The contributions of this work are summarized as follows:

• We propose an advanced multimodal RAG framework mR²AG, which only uses two reflection operations to stimulate MLLMs to implement retrieval invocation, evidence content identification, and answer generation.

• We provide the mR²AG-IT dataset, which aims to quickly adapt MLLMs to Knowledge-based VQA and serves as a supplement to general visual instruction tuning datasets.

• mR²AG, when combined with common MLLMs, significantly outperforms existing mRAG methods in answering Knowledge-based queries and maintains the ability of the base MLLM on Visual-dependent tasks. Moreover, mR²AG exhibits conciseness and effectiveness.

2 Related Work

Knowledge-based VQA. Several works [35, 9] focus on Knowledge-based VQA and construct corresponding datasets. Early OK-VQA [33] and its variant A-OKVQA [41] emphasize the significance of knowledge in VQA but primarily focus on commonsense. ViQuAE [23] introduces a wide range of entity types and tests fine-grained knowledge related to named entities. However, certain questions in ViQuAE can be answered without viewing the visual content. INFOSEEK [9] and Enc-VQA [35] address this by designing questions that force the model to examine the image for the correct answer. These datasets cover a broad range of Wikipedia entities and focus on fine-grained knowledge related to these entities. They differ in emphasis: INFOSEEK explicitly defines two splits, Unseen Entity and Unseen Question, whereas Enc-VQA introduces more diverse question types, including single-hop, multi-answer, and two-hop questions.

Multimodal Large Language Models. The rapid development of LLMs [5, 38, 11, 43, 10, 12] drives the progress of MLLMs [2, 26, 30, 29, 1, 40, 48]. Typical MLLMs, e.g., LLaVA, widely adopt a two-stage pre-training and fine-tuning paradigm, achieving impressive capabilities across various multimodal tasks [42, 25, 46, 27]. MLLMs perform well in understanding human queries [30], handling purely visual VQA tasks [17, 19], and addressing commonsense VQA tasks [33]. However, they struggle with Knowledge-based VQA tasks that involve fine-grained knowledge of specific visual entities, as shown by their performance on INFOSEEK and Enc-VQA [9, 35].

Retrieval-Augmented Generation. RAG is widely used in LLMs to address challenges like hallucinations [47], non-renewable knowledge, and opaque reasoning [21, 14]. It combines LLMs’ inherent knowledge with dynamic external knowledge, offering a solution for knowledge-intensive tasks [24, 32]. This technique is gaining traction within the MLLMs domain. For example, Wiki-LLaVA [7] retrieves Wikipedia articles from the input image and uses Contriever [15] to select relevant passages. EchoSight [45] introduces a fine-tuned Q-Former [26] architecture as a reranker, filtering the retrieved content based on both the input image and question. These methods rely on external models to filter the retrieved information, while overlooking the role of MLLMs. Inspired by SELF-RAG [4], we propose the mR²AG framework, which leverages MLLMs to independently localize the query-relevant evidence within the retrieved content, eliminating the need for additional modules or complex strategies.

3 Methodology

Knowledge-based VQA takes an image-question pair $(I,Q)$ as input and is supported by an accessible knowledge base. As shown in Figure 2, naive mRAG first retrieves the top-$N$ articles most relevant to the input from the knowledge base, denoted as $\hat{P}=\{\hat{P_{i}}\}_{i=1}^{N}$, and then feeds the concatenation of $\hat{P}$ and $(I,Q)$ into the MLLM to directly generate the response $y^{\mathrm{ans}}$. In contrast, mR²AG proposes two novel reflection operations to decouple the generation process into three steps: (1) Performing Retrieval-Reflection to determine whether retrieval is needed. (2) Performing Relevance-Reflection to identify evidence passages and generate answers accordingly. (3) Post-processing multiple potential answers. The presentation of mR²AG is organized as follows: Section 3.1 introduces the mR²AG method. Section 3.2 describes the training of mR²AG. Section 3.3 introduces the mR²AG-IT dataset for fine-tuning.

3.1 mR²AG Method

3.1.1 Retrieval-Reflection

User queries can be divided into Visual-dependent and Knowledge-based according to the input $(I,Q)$ . As shown in Figure 2, the question in case $(a1)$ requires external knowledge for a confident answer, while the question in case $(a2)$ can be answered entirely relying on visual content. Introducing external knowledge in the latter case may bring undesirable noise. To guide the model in distinguishing between different queries, we define two special tokens: [Retrieval]/[No Retrieval], to perform Retrieval-Reflection. First, the model generates retrieval-reflection predictions $y^{\mathrm{ret}}$ based on the input $(I,Q)$ :

$$y^{\mathrm{ret}}=\mathrm{MLLM}(I,Q). \tag{1}$$

Depending on the value of $y^{\mathrm{ret}}$, one of the following branches is executed:

• $y^{\mathrm{ret}}$ = [No Retrieval]: The model determines that the question can be answered without external knowledge, and conditions on this token along with $(I,Q)$ to generate the answer $y^{\mathrm{ans}}$:

$$y^{\mathrm{ans}}=\mathrm{MLLM}(I,Q,y^{\mathrm{ret}}=\text{[No Retrieval]}). \tag{2}$$

• $y^{\mathrm{ret}}$ = [Retrieval]: The model recognizes the need for external knowledge to answer the question and invokes retrievers to assist the subsequent generation process.

We use English Wikipedia entries as the knowledge base, where the $k$-th entry consists of a candidate image $\hat{I_{k}}$, title $\hat{T_{k}}$, and article $\hat{P_{k}}$. mR²AG combines cross-modal and uni-modal retrieval to select the Wikipedia entries most relevant to the query image $I$. CLIP [39] is used to encode $I$, $\hat{I_{k}}$ and $\hat{T_{k}}$, from which the cosine similarities $\mathrm{sim}(I,\hat{I_{k}})$ and $\mathrm{sim}(I,\hat{T_{k}})$ are computed. The overall retrieval score $S^{\mathrm{ret}}_{k}$ for the $k$-th entry is the average of the two cosine similarities:

$$S^{\mathrm{ret}}_{k}=\left(\mathrm{sim}(I,\hat{I_{k}})+\mathrm{sim}(I,\hat{T_{k}})\right)/2. \tag{3}$$

The retrieval result $\hat{P}=\{\hat{P_{i}}\}_{i=1}^{N}$ consists of the articles of the top-$N$ entries with the highest retrieval scores.
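To make Equation 3 concrete, here is a minimal sketch of the scoring and top-$N$ selection. It assumes the CLIP embeddings of the query image, the candidate images, and the candidate titles have already been computed and are passed in as plain lists of floats; the function names `cosine`, `retrieval_scores`, and `top_n_entries` are illustrative and not taken from the paper.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_scores(
    query_img: Sequence[float],
    entry_imgs: List[Sequence[float]],
    entry_titles: List[Sequence[float]],
) -> List[float]:
    """Eq. (3): average of sim(I, I_k) and sim(I, T_k) for every entry k."""
    return [
        (cosine(query_img, img) + cosine(query_img, title)) / 2.0
        for img, title in zip(entry_imgs, entry_titles)
    ]


def top_n_entries(scores: List[float], n: int) -> List[int]:
    """Indices of the n highest-scoring entries, best first."""
    return sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)[:n]


# Toy 2-d embeddings standing in for CLIP features (hypothetical values).
query = [1.0, 0.0]
images = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
titles = [[0.0, 1.0], [0.0, 1.0], [1.0, 0.0]]
scores = retrieval_scores(query, images, titles)
best = top_n_entries(scores, n=2)
```

In this toy case the third entry ranks first because both its image and its title embedding are close to the query, while the first entry matches only on the image.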

3.1.2 Relevance-Reflection

We divide each retrieved article $\hat{P}_{i}$ into multiple natural paragraphs. To enable the model to determine whether each segmented paragraph $\hat{P}_{ij}$ contains evidence relevant to the question $Q$ , we introduce two relevance-reflection tokens: [Relevant]/[Irrelevant]. The model conditioned on the combination of $\hat{P}_{ij}$ and the query $(I,Q)$ generates the relevance-reflection prediction $y^{\mathrm{rel}}_{ij}$ :

$$y^{\mathrm{rel}}_{ij}=\mathrm{MLLM}(I,Q,\hat{P}_{ij}). \tag{4}$$

According to the result of $y^{\mathrm{rel}}_{ij}$ , mR²AG selects one of the following processes to perform:

• $y^{\mathrm{rel}}_{ij}$ = [Irrelevant]: This indicates that the model perceives $\hat{P}_{ij}$ as irrelevant to the query and lacking sufficient evidence, prompting the model to terminate the generation process and avoid producing unreliable answers.

• $y^{\mathrm{rel}}_{ij}$ = [Relevant]: The model considers $\hat{P}_{ij}$ relevant to the query, containing evidence beneficial for answer generation, and thus proceeds to generate the answer $y^{\mathrm{ans}}_{ij}$ based on $(I,Q,\hat{P}_{ij},y^{\mathrm{rel}}_{ij})$:

$$y^{\mathrm{ans}}_{ij}=\mathrm{MLLM}(I,Q,\hat{P}_{ij},y^{\mathrm{rel}}_{ij}=\text{[Relevant]}). \tag{5}$$
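The control flow of the two reflection steps (Equations 1, 2, 4, and 5) can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: the MLLM and the retriever are represented by caller-supplied functions (`predict_retrieval`, `retrieve`, `predict_relevance`, `generate_answer`, all hypothetical names), and paragraphs are assumed to be separated by blank lines.

```python
from typing import Callable, List, Optional, Tuple

RETRIEVAL, NO_RETRIEVAL = "[Retrieval]", "[No Retrieval]"
RELEVANT, IRRELEVANT = "[Relevant]", "[Irrelevant]"


def split_paragraphs(article: str) -> List[str]:
    """Split an article into paragraphs on blank lines (assumed delimiter)."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]


def generate_candidates(
    image: object,
    question: str,
    predict_retrieval: Callable[[object, str], str],
    retrieve: Callable[[object], List[str]],
    predict_relevance: Callable[[object, str, str], str],
    generate_answer: Callable[[object, str, Optional[str]], str],
) -> List[Tuple[Optional[str], str]]:
    """Return (evidence paragraph or None, answer) candidates for one query."""
    # Retrieval-Reflection, Eq. (1): decide whether external knowledge is needed.
    if predict_retrieval(image, question) == NO_RETRIEVAL:
        # Eq. (2): answer directly from the image and the question.
        return [(None, generate_answer(image, question, None))]

    candidates: List[Tuple[Optional[str], str]] = []
    for article in retrieve(image):  # articles of the top-N entries
        for paragraph in split_paragraphs(article):
            # Relevance-Reflection, Eq. (4): judge each paragraph separately.
            if predict_relevance(image, question, paragraph) == RELEVANT:
                # Eq. (5): generate an answer grounded in this paragraph.
                answer = generate_answer(image, question, paragraph)
                candidates.append((paragraph, answer))
            # [Irrelevant]: generation stops here, so no answer is produced.
    return candidates
```

Because several paragraphs may be judged [Relevant], the function returns a list of candidates, which is what motivates the post-processing step.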

3.1.3 Answer Post-Processing

An article may contain multiple evidence passages, which leads to multiple candidate answers. Therefore, post-processing is necessary to arrive at a single final answer. Based on the retrieval-reflection-augmented generation process, we apply hierarchical post-processing that ranks the candidate answers by integrating scores at three levels:

• Entry-Level. The retrieval score in Equation 3 measures the similarity between the query image $I$ and the candidate Wikipedia entry, and serves as the Retrieval-Reflection score $S^{\mathrm{ret}}_{i}$ for the $i$-th retrieved entry.

• Passage-Level. The probability of generating the [Relevant] Relevance-Reflection token quantifies the model’s confidence in judging $\hat{P}_{ij}$ as evidence, which can be defined as the Relevance-Reflection score $S^{\mathrm{rel}}_{ij}$:

$$S^{\mathrm{rel}}_{ij}=p_{\theta}\left(y^{\mathrm{rel}}_{ij}=\text{[Relevant]}\mid I,Q,\hat{P}_{ij}\right), \tag{6}$$

where $\theta$ represents the parameters of the MLLM.

• Answer-Level. We calculate the probability of each token in the generated answer sequence and use the geometric mean to normalize the influence of sequence length variation, resulting in the answer confidence score $S^{\mathrm{ans}}_{ij}$:

$$S^{\mathrm{ans}}_{ij}=\sqrt[n]{p_{\theta}(y^{\mathrm{ans}}_{ij})}, \tag{7}$$

where $n$ represents the sequence length. This score reflects the model’s confidence in generating answers based on the retrieved content and reflection tokens.

Post-processing. The three levels of scores comprehensively consider each step in the answer generation process, evaluating the reliability of the candidate answers at the entry, passage, and answer levels, respectively. The effects of the three scores are integrated by calculating their product, which serves as the final criterion for ranking candidate answers. The model outputs the answer with the highest score based on this criterion.
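A small sketch of this ranking, assuming each candidate carries its entry-level score, its passage-level score, and the log-probabilities of its answer tokens. The dictionary keys (`s_ret`, `s_rel`, `token_logprobs`) are hypothetical. The geometric mean in Equation 7 is computed as the exponential of the mean token log-probability, which is the same quantity in a numerically safer form.

```python
import math
from typing import Dict, List


def answer_confidence(token_logprobs: List[float]) -> float:
    """Eq. (7): n-th root of the sequence probability, via mean log-probability."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def final_score(candidate: Dict) -> float:
    """Product of the entry-, passage-, and answer-level scores."""
    return (
        candidate["s_ret"]
        * candidate["s_rel"]
        * answer_confidence(candidate["token_logprobs"])
    )


def rank_candidates(candidates: List[Dict]) -> List[Dict]:
    """Sort candidates by the combined score, highest first."""
    return sorted(candidates, key=final_score, reverse=True)


# Hypothetical candidates for one query.
candidates = [
    {"answer": "1969", "s_ret": 0.8, "s_rel": 0.9,
     "token_logprobs": [math.log(0.5), math.log(0.5)]},
    {"answer": "1971", "s_ret": 0.9, "s_rel": 0.3,
     "token_logprobs": [math.log(0.9)]},
]
best = rank_candidates(candidates)[0]["answer"]
```

Here the first candidate wins (0.8 × 0.9 × 0.5 = 0.36 against 0.9 × 0.3 × 0.9 = 0.243): a confident relevance judgment outweighs a slightly better retrieval score and a more fluent answer.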

3.2 Training with Reflection Mechanism

During instruction tuning, we combine a common visual instruction tuning dataset, such as LLaVA-IT [30], with the specifically designed mR²AG-IT dataset:

• For each sample in LLaVA-IT, we set the Retrieval-Reflection token to [No Retrieval]. The model is trained to answer the questions depending only on the visual content, and the training loss is formulated as:

$$\mathcal{L}_{1}=-\mathbb{E}_{(I,Q,y^{\mathrm{ret}},y^{\mathrm{ans}})\sim\mathcal{D}_{\text{LLaVA-IT}}}\log p_{\theta}(y^{\mathrm{ret}}=[\text{No Retrieval}],y^{\mathrm{ans}}\mid I,Q). \tag{8}$$

• For each sample in mR²AG-IT, we set the Retrieval-Reflection token to [Retrieval]. The model is trained to invoke retrievers, identify evidence passages, and generate accurate responses. The training loss can be defined as:

$$
\begin{aligned}
\mathcal{L}_{2}=-\mathbb{E}_{(I,Q,y^{\mathrm{ret}},\hat{P}_{ij},y^{\mathrm{rel}}_{ij},y^{\mathrm{ans}})\sim\mathcal{D}_{\text{mR}^{2}\text{AG-IT}}}\Big(&\log p_{\theta}(y^{\mathrm{ret}}=[\text{Retrieval}]\mid I,Q)\\
&+\mathbbm{1}\big[y^{\mathrm{rel}}_{ij}=[\text{Relevant}]\big]\log p_{\theta}(y^{\mathrm{rel}}_{ij},y^{\mathrm{ans}}\mid I,Q,\hat{P}_{ij})\\
&+\mathbbm{1}\big[y^{\mathrm{rel}}_{ij}=[\text{Irrelevant}]\big]\log p_{\theta}(y^{\mathrm{rel}}_{ij}\mid I,Q,\hat{P}_{ij})\Big),
\end{aligned}
\tag{9}
$$

where the indicator function $\mathbbm{1}[\cdot]$ equals 1 when the condition inside the brackets is satisfied and 0 otherwise.
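The structure of the two losses can be read as a rule for which tokens are supervised under which context. The sketch below spells that rule out. The sample fields (`source`, `relevant`, `answer`) and the symbolic context tuples are illustrative assumptions; a real implementation would work on token ids and obtain log-probabilities from the model, which is replaced here by a caller-supplied `logprob_fn`.

```python
from typing import Callable, Dict, List, Tuple

Context = Tuple[str, ...]
Pair = Tuple[Context, List[str]]


def supervised_targets(sample: Dict) -> List[Pair]:
    """Return (context, target tokens) pairs whose log-likelihood enters the loss."""
    if sample["source"] == "llava_it":
        # Eq. (8): [No Retrieval] followed by the answer, given (I, Q).
        return [(("I", "Q"), ["[No Retrieval]", sample["answer"]])]

    # Eq. (9), first term: [Retrieval] given (I, Q).
    pairs: List[Pair] = [(("I", "Q"), ["[Retrieval]"])]
    if sample["relevant"]:
        # Second term: [Relevant] and the answer, given (I, Q, P_ij).
        pairs.append((("I", "Q", "P_ij"), ["[Relevant]", sample["answer"]]))
    else:
        # Third term: only the [Irrelevant] token is supervised.
        pairs.append((("I", "Q", "P_ij"), ["[Irrelevant]"]))
    return pairs


def sample_loss(sample: Dict, logprob_fn: Callable[[Context, List[str]], float]) -> float:
    """Negative log-likelihood of all supervised targets for one sample."""
    return -sum(logprob_fn(ctx, tgt) for ctx, tgt in supervised_targets(sample))
```

Note that an [Irrelevant] paragraph contributes no answer tokens, which matches the inference-time behavior of stopping generation after that token.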

3.3 mR²AG-IT Dataset

For the Knowledge-based VQA tasks involving INFOSEEK [9] and Enc-VQA [35], we propose an automated pipeline for annotating training data. Each sample in these training datasets includes an $(I,Q,P_{i},A)$ quadruple, where $P_{i}$ is the ground truth Wikipedia article containing the answer source.

For INFOSEEK [9], the annotation process includes:

1. The Wikipedia article is segmented into natural paragraphs, with each paragraph $P_{ij}$ preserving semantic independence and forming a new quadruple $(I,Q,P_{ij},A)$.

2. Each paragraph $P_{ij}$ is evaluated by GPT-4 [1] with the question $Q$ and answer $A$ to determine if it serves as evidence. If $P_{ij}$ is relevant to the query and supports answer generation, the Relevance-Reflection token is labeled as $y^{\mathrm{rel}}_{ij}$ = [Relevant]; otherwise, $y^{\mathrm{rel}}_{ij}$ = [Irrelevant].

3. Due to the limited quantity of Wikipedia articles in the INFOSEEK training set, we additionally incorporate the Natural Questions (NQ) dataset [22] as a supplement. This dataset comprises real queries from the Google search engine, each with a long answer (when available) and a short answer as the response to the query. We consider the long answer as evidence for the query and the short answer as the response.

For Enc-VQA [35], each sample is annotated with the evidence paragraphs from the ground truth Wikipedia article $P_{i}$ that support the answer, eliminating additional steps to identify the evidence paragraphs. To ensure the precision of all training samples, we conduct strict string searches for filtering, ensuring answers appear only in evidence paragraphs.
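One possible reading of the strict string filter is sketched below: a sample is kept only if the answer string occurs in every annotated evidence paragraph and in no other paragraph of the article. Both that reading and the case-insensitive matching are assumptions, as are the names `contains` and `keep_sample`.

```python
from typing import List, Set


def contains(paragraph: str, answer: str) -> bool:
    """Case-insensitive substring match (matching rule is an assumption)."""
    return answer.lower() in paragraph.lower()


def keep_sample(answer: str, paragraphs: List[str], evidence_idx: Set[int]) -> bool:
    """Keep a sample only if the answer appears exclusively in evidence paragraphs."""
    if not evidence_idx:
        return False
    in_evidence = all(contains(paragraphs[i], answer) for i in evidence_idx)
    in_others = any(
        contains(p, answer)
        for i, p in enumerate(paragraphs)
        if i not in evidence_idx
    )
    return in_evidence and not in_others


# Hypothetical article split into paragraphs.
paragraphs = [
    "The bridge opened in 1937.",
    "It is painted orange.",
    "Tolls began in 1937 too.",
]
```

With these paragraphs, `keep_sample("1937", paragraphs, {0})` rejects the sample because the answer also occurs in a non-evidence paragraph, so a model could be rewarded for answering from a paragraph labeled [Irrelevant].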

4 Experiments

Table 1: Main results of models with external knowledge on INFOSEEK. † denotes our method and its variants with alternative designs. – marks values that are not available.

Model	LLM	#Params	INFOSEEK ${}_{\text{Wikidata}}$			INFOSEEK ${}_{\text{Human}}$			INFOSEEK ${}_{\text{Validation}}$		
Model	LLM	#Params	Unseen Question	Unseen Entity	Overall	Unseen Question	Unseen Entity	Overall	Unseen Question	Unseen Entity	Overall
Retrieved Knowledge
CLIP→PaLM [9]	PaLM	540B	21.9	18.6	20.1	15.6	14.9	15.2	22.7	18.5	20.4
CLIP→FiD [9]	T5large	660M	20.7	18.1	19.3	18.9	17.6	18.2	23.3	19.1	20.9
Wiki-LLaVA [7]	Vicuna	–	–	–	–	–	–	–	30.1	27.8	28.9
LLM-RA [20]	–	–	26.1	20.9	23.1	–	–	–	–	–	–
EchoSight [45]	LLaMA3	–	–	–	–	–	–	–	–	–	31.3
LLaVA-mRAG †	Vicuna	–	30.3	29.2	29.8	17.6	15.9	16.7	–	–	–
LLaVA-SFR †	Vicuna	–	20.8	19.1	19.9	18.5	17.2	17.9	–	–	–
LLaVA-mR²AG †	Vicuna	–	39.1	38.0	38.6	30.2	27.5	28.8	40.6	39.8	40.2
Oracle Knowledge
Oracle→FiD [9]	T5large	660M	–	–	52.0	–	–	45.6	52.1	53.0	52.5
Wiki-LLaVA [7]	Vicuna	–	–	–	–	–	–	–	52.7	50.3	51.5
AVIS [18]	–	–	56.4	50.7	53.4	–	–	–	–	–	–
LLaVA-mRAG †	Vicuna	–	55.3	56.1	55.7	32.8	28.2	30.3	–	–	–
LLaVA-SFR †	Vicuna	–	56.6	55.6	56.1	46.9	43.3	45.0	–	–	–
LLaVA-mR²AG †	Vicuna	–	58.3	57.9	58.1	50.4	47.2	48.7	60.8	59.3	60.0

4.1 Datasets

INFOSEEK [9] contains a training set and three evaluation sets: INFOSEEK ${}_{\text{Validation}}$, INFOSEEK ${}_{\text{Wikidata}}$ and INFOSEEK ${}_{\text{Human}}$. The training set, along with INFOSEEK ${}_{\text{Validation}}$ and INFOSEEK ${}_{\text{Wikidata}}$, are all derived from 1.3M samples automatically constructed from Wikipedia to support large-scale training and evaluation. INFOSEEK ${}_{\text{Human}}$ consists of 8.9K samples annotated by humans to simulate real information-seeking intentions. To prevent overfitting, each evaluation set is divided into two subsets: Unseen Entity and Unseen Question. The evaluation samples can be divided into three categories, i.e., STRING, TIME, and NUMERICAL. The STRING and TIME categories use VQA Accuracy [17] as the evaluation metric, while the NUMERICAL category is assessed with Relaxed Accuracy [36]. To calculate the overall accuracy, the average score over questions is first computed separately for each test split, and the harmonic mean of these two averages is then taken.
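A quick arithmetic check against Table 1 shows how the Overall column relates to the two splits. For the rows sampled below, the harmonic mean of the Unseen Question and Unseen Entity scores reproduces the reported Overall value to one decimal place, whereas the geometric mean comes out 0.1 higher. The helper names are illustrative.

```python
import math


def harmonic_mean(a: float, b: float) -> float:
    """Harmonic mean of two positive scores."""
    return 2.0 * a * b / (a + b)


def geometric_mean(a: float, b: float) -> float:
    """Geometric mean of two positive scores."""
    return math.sqrt(a * b)


# (row, Unseen Question, Unseen Entity, reported Overall), copied from Table 1.
rows = [
    ("CLIP->PaLM, Wikidata", 21.9, 18.6, 20.1),
    ("CLIP->PaLM, Validation", 22.7, 18.5, 20.4),
    ("AVIS, Wikidata (oracle)", 56.4, 50.7, 53.4),
]

for name, unseen_q, unseen_e, reported in rows:
    h = round(harmonic_mean(unseen_q, unseen_e), 1)
    g = round(geometric_mean(unseen_q, unseen_e), 1)
    print(f"{name}: harmonic={h}, geometric={g}, reported={reported}")
```

Because the harmonic mean is pulled toward the lower of the two scores, a model cannot obtain a high Overall value by doing well on only one split.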

INFOSEEK uses a Wikipedia knowledge base containing 100K articles and infobox images as external knowledge sources for the With-KB protocol. Since this external knowledge base is not publicly available, we construct one of the same scale and perform comprehensive evaluations across all available datasets. The evaluation metrics strictly follow INFOSEEK’s standards, ensuring a fair comparison.

Encyclopedic VQA [35] contains 1M {I, Q, A} triples, covering 16.7K entities and totaling 221K unique Q+A pairs. Each Q+A pair is associated with up to 5 images, showing various instances of the same entity. The dataset includes single-hop questions generated using templated and automatic methods, along with multi-answer and two-hop questions. Multi-answer questions require a list of possible answers, while two-hop questions involve two consecutive retrieval and reasoning steps. For evaluation, accuracy is measured as the percentage of matches between the predicted and ground-truth answers on the test split, using the BERT Matching (BEM) [6] standard for correctness assessment. For multi-answer questions, the model’s output is first converted into a set of strings, and the intersection-over-the-union (IoU) with the ground-truth set is calculated. If $\text{IoU}\geq 0.5$ , the prediction is considered correct; otherwise, BEM is used to determine the equivalence between the predicted and ground-truth lists.
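The multi-answer rule can be sketched as follows. How the model output is converted into a set of strings is not specified here, so the comma splitting and lowercasing in `to_answer_set` are assumptions, and BEM is represented by a caller-supplied function `bem_equivalent` rather than the actual BEM model.

```python
from typing import Callable, List, Set


def to_answer_set(text: str) -> Set[str]:
    """Assumed parsing: comma-separated answers, lowercased and stripped."""
    return {part.strip().lower() for part in text.split(",") if part.strip()}


def iou(pred: Set[str], gold: Set[str]) -> float:
    """Intersection-over-union of two string sets (0.0 if both are empty)."""
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0


def multi_answer_correct(
    prediction: str,
    gold_answers: List[str],
    bem_equivalent: Callable[[str, List[str]], bool],
) -> bool:
    """Correct if IoU >= 0.5; otherwise defer to the BEM equivalence check."""
    pred_set = to_answer_set(prediction)
    gold_set = {g.strip().lower() for g in gold_answers}
    if iou(pred_set, gold_set) >= 0.5:
        return True
    return bem_equivalent(prediction, gold_answers)


gold = ["Paris", "Lyon", "Nice"]
never = lambda pred, golds: False  # stand-in for a BEM model that rejects everything
```

For example, predicting "Paris, Lyon" against the three gold answers gives an IoU of 2/3 and is accepted without consulting BEM, while "Rome" has an IoU of 0 and is decided entirely by the fallback check.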

Enc-VQA provides a controlled knowledge base comprising 2M Wikipedia pages, with each Q+A pair annotated with the corresponding Wikipedia articles and evidence paragraphs supporting the answers. In the retrieval-augmented setting, Enc-VQA employs a Google Lens-based [16] retriever to predict entities from the input image $I$ . We focus on single-hop and multi-answer questions in Enc-VQA, evaluating on the test split with a consistent knowledge base, retrieval approach and unified evaluation metrics.

4.2 Implementation details

We perform instruction tuning using the combination of the LLaVA instruction tuning dataset (LLaVA-IT) [30] and the mR²AG-IT dataset. Training continues from the stage-1 checkpoint of the LLaVA-v1.5-7B, with a learning rate of $2\times 10^{-5}$ and a batch size of $8\times 16$ , and lasts for one epoch. The main experiments and ablation studies are conducted based on the LLaVA-v1.5-7B [30], while mR²AG is also applicable to other MLLMs, such as Mini-Gemini [28] and Mipha [48]. We use CLIP-ViT-L/14@336px [39] as the retriever for INFOSEEK [9], and directly use the retrieval results based on Google Lens for Enc-VQA [35]. By default, we utilize the top-5 Wikipedia entries from the retrieval results for both benchmarks.

4.3 Comparisons with SOTA

4.3.1 INFOSEEK

Without Knowledge. In the setting without external knowledge, the model predicts the answer based solely on the input image and question, relying on the knowledge stored in its parameters during training. To explore the performance of MLLMs under this protocol, we fine-tune LLaVA-v1.5-7B [30] using {I, Q, A} triples from the INFOSEEK training set. After fine-tuning, the model’s performance on INFOSEEK ${}_{\text{Human}}$ improves from 9.5 to 12.0, and on INFOSEEK ${}_{\text{Wikidata}}$ from 9.1 to 20.5. Additionally, the strongest models, GPT-4v [1] and GPT-4o [37], achieve scores of 12.1 and 21.3 on INFOSEEK ${}_{\text{Human}}$, respectively. These results suggest that while fine-tuning for specific datasets or using stronger models helps improve performance, current models still fall short in Knowledge-based VQA tasks, underscoring the need for external knowledge.

Table 2: Main results of various methods on Enc-VQA [35]. By default, these methods use Google Lens [16] as the retriever, while methods marked with ^∗ use a custom retrieval scheme.

Model	Retrieved		Oracle
Model	Single-hop	Multi-answer	Single-hop
PaLI [35]	28.1%	9.2%	48.8%
PaLM [35]	48.8%	33.6%	87.0%
GPT-3 [35]	44.9%	32.1%	82.1%
Wiki-LLaVA^∗ [7]	21.8%	–	39.2%
EchoSight^∗ [45]	41.8%	–	–
HAMMR-BLIP-2 [8]	45.0%	–	–
HAMMR-PaLI-X [8]	47.8%	–	–
Cascade w/o vanilla MLLMs [3]	53.4%	–	–
LLaVA-mR²AG	55.9%	51.8%	88.2%

Retrieved Knowledge. Table 1 presents the comparisons of models utilizing an external knowledge base on the INFOSEEK [9] benchmark. When leveraging the articles of the retrieved Wikipedia entries for answer generation, our mR²AG significantly outperforms the best existing models on all three test sets. Specifically, mR²AG surpasses LLM-RA [20], CLIP→FiD [9], and EchoSight [45] by 15.5%, 10.6%, and 8.9% in overall accuracy on INFOSEEK ${}_{\text{Wikidata}}$, INFOSEEK ${}_{\text{Human}}$, and INFOSEEK ${}_{\text{Validation}}$, respectively.

To further verify that our improvement stems from the proposed mR²AG framework rather than simply the improved retrieval method or the powerful MLLM, we implement two comparative models: LLaVA-mRAG and LLaVA-SFR. Both models use the same retrieval results as ours. LLaVA-mRAG directly uses the articles of the retrieved entries to augment generation, while LLaVA-SFR employs an off-the-shelf model, SFR [34], to identify question-related paragraphs in the articles for augmenting generation. Our results significantly outperform these two models, demonstrating that our method can effectively utilize noisy retrieval content, accurately pinpoint the relevant information, and extract the knowledge needed to answer the questions.

Oracle Knowledge. Even when the ground-truth Wikipedia entries for the entities involved in the question are available and used to augment answer generation, our method still surpasses other methods across all test sets. This further highlights the advantage of mR²AG in refining useful evidence, and indicates that explicitly outputting the relevance evaluation before generating answers improves the model’s ability to perform knowledge reasoning.

4.3.2 Encyclopedic VQA

As Table 2 illustrates, when using the retrieved knowledge, mR²AG achieves superior performance, raising the best single-hop result from 53.4% to 55.9% and the best result on the often overlooked multi-answer questions from 33.6% to 51.8%. This indicates that mR²AG can effectively analyze noisy retrieved knowledge, accurately locate relevant information, and generate high-quality responses. Under the Oracle setting, when provided with more reliable knowledge sources, the performance of mR²AG improves dramatically from 55.9% to 88.2%, surpassing the previous SOTA of 87.0%. This suggests that: 1) Improving retrieval precision effectively enhances the performance of mR²AG, demonstrating the method’s high potential. 2) With the same retrieval content, the superior performance underscores that the Relevance-Reflection mechanism introduced by mR²AG further strengthens the model’s information extraction and reasoning capabilities. Overall, the state-of-the-art performance across multiple Knowledge-based VQA tasks showcases the broad applicability of mR²AG.

Table 3: Main results on common Visual-dependent benchmarks.

Model	MME		LLaVA ${}^{\text{W}}$	MMB	POPE
Model	cog	perc	LLaVA ${}^{\text{W}}$	MMB	POPE
LLaVA-v1.5-7B [29]	355.7	1507.9	64.9	64.7	85.9
Wiki-LLaVA [7]	341.3	1438.9	-	71.1	84.2
LLaVA-mR²AG	325.7	1476.2	65.8	66.2	86.1

Table 4: Results on MLLMs with different architectures and scales.

Model	LLM	#Params	INFOSEEK Wikidata	INFOSEEK Human
Without Knowledge
Mipha [48]	Phi2	3B	7.0	6.4
MGM [28]	Vicuna	7B	10.7	10.2
LLaVA [30]	Vicuna	13B	9.8	9.9
Retrieved Knowledge
Mipha-mRAG	Phi2	3B	26.6	14.4
MGM-mRAG	Vicuna	7B	29.8	18.4
LLaVA-mRAG	Vicuna	13B	31.3	18.1
Mipha-mR²AG	Phi2	3B	34.4	26.3
MGM-mR²AG	Vicuna	7B	38.4	28.8
LLaVA-mR²AG	Vicuna	13B	36.7	29.4

4.4 Comparisons on the Visual-dependent Tasks

We conduct comprehensive evaluations on widely used Visual-dependent benchmarks to demonstrate mR²AG’s capabilities in handling questions about visual content or common sense. As shown in Table 3, our model performs comparably to the base MLLM, i.e., LLaVA-v1.5-7B [29], on the MME [13] benchmark and surpasses it on the LLaVA^W [29], MMB [31], and POPE [27] benchmarks. These comparisons highlight the effectiveness of the mR²AG design, which strategically separates Visual-dependent tasks from Knowledge-based VQA tasks, allowing for targeted optimization in Knowledge-based domains while maintaining excellent performance on Visual-dependent tasks.

4.5 Generalizability and Scalability

We validate the effectiveness of the mR²AG framework on three model architectures (Mipha [48], Mini-Gemini [28], and LLaVA [30]), covering three different scales of language models (3B, 7B, and 13B). In our experiments, we fine-tune these MLLMs from their stage-1 checkpoints using the same instruction-tuning data, while maintaining consistent hyperparameters such as learning rate, batch size and number of epochs. The results in Table 4 show that these base models generally perform poorly on the INFOSEEK [9] benchmark. However, by introducing the naive mRAG approach or the mR²AG framework, we observe significant improvements in performance on Knowledge-based VQA tasks. Notably, the mR²AG framework consistently outperforms the naive mRAG approach, demonstrating its generalizability across different model architectures and scalability across varying language model sizes.

Table 5: Comparisons of different answer post-processing methods.

Score Type	INFOSEEK ${}_{\text{Validation}}$
Score Type	Unseen Question	Unseen Entity	Overall
Random	32.7	32.7	32.7
$S^{\mathrm{ans}}$	36.4	37.0	36.7
$S^{\mathrm{ret}}$	37.6	37.4	37.5
$S^{\mathrm{rel}}$	39.7	38.7	39.2
$S^{\mathrm{ret}}\cdot S^{\mathrm{ans}}$	38.3	38.7	38.5
$S^{\mathrm{rel}}\cdot S^{\mathrm{ans}}$	39.5	39.0	39.2
$S^{\mathrm{ret}}\cdot S^{\mathrm{rel}}$	40.5	39.3	39.9
$S^{\mathrm{ret}}\cdot S^{\mathrm{rel}}\cdot S^{\mathrm{ans}}$	40.6	39.8	40.2

Table 6: Effect of retrieving different numbers of Wikipedia entries.

Retrieved Articles	R@K	INFOSEEK ${}_{\text{Validation}}$
1	0.38	29.7
3	0.53	37.6
5	0.59	40.2
10	0.65	42.8

4.6 Ablation Studies

In this section, we conduct ablation studies to verify the effectiveness of our model’s design choices. Most comparisons are performed on the INFOSEEK ${}_{\text{Validation}}$ dataset.

Effect of $S^{\mathrm{ret}}$ , $S^{\mathrm{rel}}$ and $S^{\mathrm{ans}}$ . To evaluate the effectiveness of using the product of the Retrieval-Reflection score $S^{\mathrm{ret}}$ , Relevance-Reflection score $S^{\mathrm{rel}}$ , and answer confidence score $S^{\mathrm{ans}}$ in post-processing, we compare it with all alternative ranking methods, as shown in Table 5. Randomly selecting from the answer candidates yields the lowest performance (32.7 overall). Among the score-based methods, the product of all three scores achieves the best overall accuracy (40.2), ahead of every single score and every pairwise product, demonstrating that these scores assess the reliability of the answer at different, complementary levels.
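The ranking rule compared in Table 5 can be illustrated with a short sketch. It assumes each answer candidate carries the three scores as plain floats; the `Candidate` class, the `select_answer` helper, and the example values are hypothetical and are not taken from the authors' implementation.

```python
from dataclasses import dataclass
from math import prod


@dataclass
class Candidate:
    """One answer candidate with its three scores (illustrative container)."""
    answer: str
    s_ret: float  # Retrieval-Reflection score
    s_rel: float  # Relevance-Reflection score
    s_ans: float  # answer confidence score


def select_answer(candidates, use=("s_ret", "s_rel", "s_ans")):
    """Return the candidate whose product of the selected scores is largest."""
    if not candidates:
        raise ValueError("no answer candidates to rank")
    return max(candidates, key=lambda c: prod(getattr(c, name) for name in use))


# Made-up scores, chosen only to show how the ranking can change.
candidates = [
    Candidate("1937", s_ret=0.9, s_rel=0.8, s_ans=0.6),  # product 0.432
    Candidate("1933", s_ret=0.9, s_rel=0.3, s_ans=0.9),  # product 0.243
]
best = select_answer(candidates)                          # full product
best_ans_only = select_answer(candidates, use=("s_ans",))  # S^ans alone
```

In this toy case, ranking by $S^{\mathrm{ans}}$ alone prefers the second candidate, whereas the full product prefers the first because its evidence is judged more relevant. The `use` argument makes it easy to reproduce the other rows of Table 5 (single scores and pairwise products) with the same function.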

Effect of retrieving different numbers of Wikipedia entries. To determine a suitable number of retrieved Wikipedia entries for augmenting generation, we explore how varying the number of entries affects performance, as shown in Table 6. Increasing the number of retrieved entries from 1 to 5 raises the overall accuracy on INFOSEEK ${}_{\text{Validation}}$ from 29.7 to 40.2, as the higher recall (R@K rising from 0.38 to 0.59) increases the likelihood of capturing relevant information. Doubling the number of entries from 5 to 10 yields a smaller gain (40.2 to 42.8), because the additional entries introduce more irrelevant content and noise while increasing inference cost. Considering the trade-off between performance and efficiency, we retrieve 5 Wikipedia entries for each question.
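For reference, the R@K column in Table 6 is a recall-at-K measure. A minimal sketch of how such a value can be computed is given below, assuming one ground-truth entity per question and a ranked list of retrieved entity identifiers; the function name and the toy identifiers are hypothetical.

```python
def recall_at_k(ranked_ids, gold_ids, k):
    """Fraction of questions whose gold entity appears in the top-k retrieved entries."""
    if len(ranked_ids) != len(gold_ids):
        raise ValueError("ranked_ids and gold_ids must have the same length")
    if not gold_ids:
        raise ValueError("at least one question is required")
    hits = sum(gold in ranked[:k] for ranked, gold in zip(ranked_ids, gold_ids))
    return hits / len(gold_ids)


# Toy retrieval results for four questions (identifiers are placeholders).
ranked = [
    ["Q1", "Q7", "Q3"],
    ["Q9", "Q2", "Q4"],
    ["Q5", "Q6", "Q8"],
    ["Q4", "Q1", "Q2"],
]
gold = ["Q1", "Q4", "Q0", "Q1"]

r_at_1 = recall_at_k(ranked, gold, 1)
r_at_2 = recall_at_k(ranked, gold, 2)
r_at_3 = recall_at_k(ranked, gold, 3)
```

Recall can only stay the same or grow as K increases, which matches the monotonic R@K column in Table 6. It says nothing about how much irrelevant text the extra entries add.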

Contribution of the NQ samples to the mR²AG-IT dataset. NQ samples supplement the mR²AG-IT dataset with additional samples that contain annotated evidence paragraphs. As shown in Table 7, removing this portion of the data lowers the overall accuracy from 40.2 to 39.4, with most of the drop on Unseen Question (40.6 to 39.1). This finding highlights the importance of including NQ data for enhancing the model’s ability to accurately identify evidence paragraphs.

Table 7: The importance of NQ dataset.

	INFOSEEK ${}_{\text{Validation}}$
	Unseen Question	Unseen Entity	Overall
w/o NQ	39.1	39.7	39.4
w/ NQ	40.6	39.8	40.2

Benefits of combining cross-modal and uni-modal retrievals. Table 8 compares the performance of different methods for retrieving the ground-truth Wikipedia entities on INFOSEEK ${}_{\text{Validation}}$ . Combining cross-modal and uni-modal retrieval improves R@1/10/20 from 0.31/0.58/0.65 (cross-modal alone) to 0.38/0.65/0.71, thereby supplying more knowledge that is beneficial for answering the questions.
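This section does not restate how the two retrieval results are merged, so the following is only an illustrative sketch of one simple option: interleave the two ranked lists and drop duplicates. The `merge_rankings` function and the entity names are hypothetical, and the actual combination rule used for Table 8 may differ.

```python
from itertools import zip_longest


def merge_rankings(cross_modal, uni_modal, k):
    """Interleave two ranked entity lists, keep first occurrences, return the top k."""
    merged, seen = [], set()
    for pair in zip_longest(cross_modal, uni_modal):
        for entity in pair:
            # zip_longest pads the shorter list with None; skip padding and repeats.
            if entity is not None and entity not in seen:
                seen.add(entity)
                merged.append(entity)
    return merged[:k]


cross = ["A", "B", "C"]  # e.g. image-to-text retrieval results
uni = ["B", "D", "A"]    # e.g. image-to-image retrieval results
top3 = merge_rankings(cross, uni, 3)
```

Interleaving gives both retrievers equal priority at each rank, so an entity found by only one of them can still enter the top-K list.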

4.7 Qualitative analysis

In Figure 3, we present a qualitative comparison of mR²AG with GPT-4o. As illustrated in Figure 3 (a) and (b), our method retrieves relevant knowledge and accurately answers the Knowledge-based questions, producing more precise results than GPT-4o. Figure 3 also shows two failure cases:

• Inaccurate retrieval: As illustrated in Figure 3 (c), when the subject in the image is difficult to identify, the retriever struggles to find relevant information, making it challenging for our method to answer the question.
• Knowledge interference: In Figure 3 (d), the retriever finds the correct entity, but our method provides the wrong answer due to conflicting knowledge in the text, specifically “a suspended span of 564 feet.”

Table 8: Comparisons of retrieval performance across different retrieval methods on the INFOSEEK knowledge base.

Retrieval Method	R@1	R@10	R@20
Cross-modal	0.31	0.58	0.65
Uni-modal	0.29	0.55	0.62
Cross-modal + Uni-modal	0.38	0.65	0.71

5 Conclusion

This paper proposes an advanced multimodal RAG framework that optimizes Knowledge-based VQA tasks while maintaining the model’s capabilities as a general-purpose MLLM. The approach is based on existing MLLMs, guiding the model to explicitly distinguish the type of user query and evaluate retrieved information to refine the naive multimodal RAG process. It improves inference efficiency by adaptively avoiding unnecessary retrievals. Furthermore, by explicitly evaluating the retrieved content, it can identify the evidence passages relevant to the query, while filtering out noise from the retrieved content, thereby enhancing the credibility of the generated responses. Future work will explore knowledge graph-based retrieval-augmented systems and broader application scenarios.

6 Limitation

Because the entity recognition capabilities of existing MLLMs are limited, our method depends heavily on the retriever, which assumes by default that the visual entity of interest occupies the major part of the image. In the future, we plan to conduct specialized training for visual entity recognition and guide the model to better discriminate the categories and locations of visual entities.

References

Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
Alazraki et al. [2023] Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, and Thomas Mensink. How (not) to ensemble lvlms for vqa. In Proceedings on, pages 1–20. PMLR, 2023.
Asai et al. [2023] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
Bulian et al. [2022] Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Boerschinger, and Tal Schuster. Tomayto, tomahto. beyond token-level answer equivalence for question answering evaluation. arXiv preprint arXiv:2202.07654, 2022.
Caffagni et al. [2024] Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. arXiv preprint arXiv:2404.15406, 2024.
Castrejon et al. [2024] Lluis Castrejon, Thomas Mensink, Howard Zhou, Vittorio Ferrari, Andre Araujo, and Jasper Uijlings. Hammr: Hierarchical multimodal react agents for generic vqa. arXiv preprint arXiv:2404.05465, 2024.
Chen et al. [2023] Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14948–14968, 2023.
Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
Chowdhery et al. [2023] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
Chung et al. [2024] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
Fu et al. [2024] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024.
Gao et al. [2023] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
Gautier et al. [2022] Izacard Gautier, Caron Mathilde, Hosseini Lucas, Riedel Sebastian, Bojanowski Piotr, Joulin Armand, and Grave Edouard. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022.
Google [2017] Google. Google Lens. https://lens.google.com - Web interface available at https://images.google.com, 2017.
Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
Hu et al. [2024] Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David Ross, Cordelia Schmid, and Alireza Fathi. Avis: Autonomous visual information seeking with large language model agent. Advances in Neural Information Processing Systems, 36, 2024.
Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
Jian et al. [2024] Pu Jian, Donglei Yu, and Jiajun Zhang. Large language models know what is key visual entity: An llm-assisted multimodal retrieval for vqa. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10939–10956, 2024.
Kandpal et al. [2023] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pages 15696–15707. PMLR, 2023.
Kwiatkowski et al. [2019] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
Lerner et al. [2022] Paul Lerner, Olivier Ferret, Camille Guinaudeau, Hervé Le Borgne, Romaric Besançon, José G Moreno, and Jesús Lovón Melgarejo. Viquae, a dataset for knowledge-based visual question answering about named entities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3108–3120, 2022.
Lewis et al. [2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
Li et al. [2023a] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
Li et al. [2023b] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023b.
Li et al. [2023c] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023c.
Li et al. [2024] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.
Liu et al. [2023] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
Liu et al. [2024a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024a.
Liu et al. [2024b] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2024b.
Mao et al. [2020] Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553, 2020.
Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
Meng et al. [2024] Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfrembedding-mistral: enhance text retrieval with transfer learning. Salesforce AI Research Blog, 3, 2024.
Mensink et al. [2023] Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3113–3124, 2023.
Methani et al. [2020] Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.
OpenAI [2024] OpenAI. Introducing gpt-4o: Openai’s new flagship multimodal model now in preview on azure. Microsoft Azure Blog, 2024. Available at: https://azure.microsoft.com/en-us/blog/introducing-gpt-4o-openais-new-flagship-multimodal-model-now-in-preview-on-azure/.
Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
Schwenk et al. [2022] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
Yan and Xie [2024] Yibin Yan and Weidi Xie. Echosight: Advancing visual-language models with wiki knowledge. arXiv preprint arXiv:2407.12735, 2024.
Yue et al. [2023] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
Zhang et al. [2023] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. corr abs/2309.01219 (2023), 2023.
Zhu et al. [2024] Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, and Jian Tang. Mipha: A comprehensive overhaul of multimodal assistant with small language models. CoRR, 2024.

Supplementary Material

7 Prompt Engineering

7.1 mR²AG-IT Dataset Annotation

We utilize the GPT-4 [1] model via API to annotate the training dataset and design the following prompt to assess the relevance between retrieved content and the query. Inspired by the chain-of-thought [44] approach, the prompt instructs GPT-4 [1] to extract evidence sentences before generating relevance judgments. This design not only enhances the accuracy of the judgments but also facilitates manual calibration. For each input, the content within curly braces {} is replaced with the corresponding actual input. In the following, “question”, “answer”, and “paragraph” correspond to $Q$ , $A$ , and $P_{ij}$ , respectively, as defined in Section 3.3.

7.2 INFOSEEK

INFOSEEK [9] evaluates generated answers using exact match, requiring the outputs to strictly match the annotated answers, which are typically concise and presented in the form of a single word or phrase. To ensure the outputs align with these requirements, we design the following prompt, guiding the model to focus on retrieved content and produce concise responses:

“Based on the retrieved document, answer the question with a single word or phrase.”
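To illustrate why concise outputs matter under exact-match scoring, the sketch below compares a prediction against reference answers after light normalization. The normalization steps (lowercasing, removing punctuation and English articles) are an assumption borrowed from common QA evaluation practice, not the official INFOSEEK scorer, and the function names are hypothetical.

```python
import re
import string


def normalize(text):
    """Lowercase, drop punctuation and English articles, collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, references):
    """True if the normalized prediction equals any normalized reference answer."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)


refs = ["golden gate bridge"]
concise = exact_match("The Golden Gate Bridge.", refs)       # matches a reference
verbose = exact_match("It is the Golden Gate Bridge", refs)  # extra words break the match
```

A full-sentence response fails the comparison even though it contains the correct entity, which is why the prompt asks for a single word or phrase.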

7.3 Encyclopedic-VQA

The Enc-VQA [35] dataset includes both single-hop and multi-answer questions. For single-hop questions, we adopt the same prompt template as for INFOSEEK. For multi-answer questions, the model needs to generate several possible answers. We adjust the prompt so that the answers comply with the dataset requirements and the answer list can be reliably extracted from the response for evaluation:

“Based on the retrieved documents, answer the question as briefly as possible, using ‘&&’ to connect multiple different answers.”
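A response that follows this prompt can be turned into an answer list with a few lines of code. The sketch below is illustrative only: the `split_answers` helper and its handling of empty or repeated entries are assumptions, not the evaluation code used in the paper.

```python
def split_answers(response, sep="&&"):
    """Split a '&&'-joined response into a list of distinct, non-empty answers."""
    seen, answers = set(), []
    for part in response.split(sep):
        answer = part.strip()
        key = answer.lower()  # compare case-insensitively when removing repeats
        if answer and key not in seen:
            seen.add(key)
            answers.append(answer)
    return answers


multi = split_answers("Europe && Asia&&  europe && ")
single = split_answers("Paris")
```

Stripping whitespace and dropping empty fragments keeps the parser tolerant of small formatting deviations in the generated text, such as a trailing separator.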

8 Additional Experiment Results

Tables 9, 10, and 11 present the complete experimental results on INFOSEEK [9] across the different question types. In the Without External Knowledge setting, the model relies solely on the knowledge encoded in its parameters to answer questions. As shown in Tables 9 and 11, we fine-tune the LLaVA model without integrating external knowledge (LLaVA-FT), and the results indicate that this approach yields only limited performance improvements. Additionally, using APIs, we evaluate the performance of GPT-4v/o [1, 37] and Gemini-1.5-pro [40] on INFOSEEK ${}_{\text{Human}}$ . As shown in Table 9, although GPT-4o outperforms the other models that do not use external knowledge, it remains inferior to the mR²AG framework. Overall, the complete experimental results on the INFOSEEK dataset demonstrate that mR²AG significantly improves accuracy across all question types, with the most notable enhancement observed in the Time category. These findings further underscore the advantage of our approach in addressing Knowledge-based VQA tasks.

Table 9: Complete results by question type on INFOSEEK ${}_{\text{Human}}$ , with LLaVA-FT referring to the fine-tuned model.

Without External Knowledge
Model	LLM	Params	INFOSEEK ${}_{\text{Human}}$
			Unseen Question				Unseen Entity				Overall
			Time	Num	String	Avg	Time	Num	String	Avg	Overall
LLaVA [30]	Vicuna	7B	8.6	12.1	8.6	9.5	7.8	13.1	8.4	9.5	9.5
LLaVA-FT	Vicuna	7B	12.6	17.1	13.6	14.3	8.5	14.3	9.3	10.4	12.0
PaLI-X [9]	UL2 ${}_{\text{32B}}$	55B	–	–	–	12.9	–	–	–	9.3	10.8
Gemini-1.5-pro [40]	–	–	8.1	7.7	15.5	11.3	5.6	5.1	10.3	7.6	9.1
GPT-4v [1]	–	–	15.5	13.3	14.0	14.3	12.4	9.5	9.9	10.5	12.1
GPT4-o [37]	–	–	31.0	29.5	21.7	26.5	20.8	22.0	13.7	17.9	21.3
Retrieved Knowledge
LLaVA-mRAG	Vicuna	7B	20.0	19.1	15.0	17.6	19.3	16.8	13.2	15.9	16.7
LLaVA-SFR	Vicuna	7B	16.2	26.0	15.4	18.5	14.6	27.5	13.0	17.2	17.9
LLaVA-mR²AG	Vicuna	7B	37.8	39.6	19.7	30.2	33.5	39.7	16.8	27.5	28.8
Oracle Knowledge
LLaVA-mRAG	Vicuna	7B	41.9	29.6	29.1	32.8	37.3	28.9	22.2	28.2	30.3
LLaVA-SFR	Vicuna	7B	49.8	64.7	34.0	46.9	44.1	66.6	29.4	43.3	45.0
LLaVA-mR²AG	Vicuna	7B	65.2	66.6	31.2	50.4	59.3	68.6	27.5	47.2	48.7

Table 10: Complete results by question type on INFOSEEK ${}_{\text{Validation}}$ .

Without External Knowledge
Model	LLM	Params	INFOSEEK ${}_{\text{Validation}}$
			Unseen Question				Unseen Entity				Overall
			Time	Num	String	Avg	Time	Num	String	Avg	Overall
InstructBLIP [9]	Flan-T5 ${}_{\text{XXL}}$	12B	7.9	7.5	17.8	15.0	6.6	8.2	16.1	14.0	14.5
BLIP2 [9]	Flan-T5 ${}_{\text{XXL}}$	12B	6.9	5.8	18.5	15.0	5.6	6.0	17.0	14.2	14.6
PaLI-17B [9]	mT5 ${}_{\text{XXL}}$	17B	3.8	18.4	27.4	24.2	1.0	14.8	18.2	16.7	19.7
PaLI-X [9]	UL2 ${}_{\text{32B}}$	55B	7.7	16.1	30.0	25.8	8.1	17.2	24.8	22.4	24.0
LLaVA-FT	Vicuna	7B	10.4	21.0	25.8	24.0	8.2	21.1	20.7	20.2	21.9
Retrieved Knowledge
CLIP → PaLM [9]	PaLM	540B	12.5	27.7	21.7	15.6	17.8	21.3	17.7	14.9	15.2
CLIP → FiD [9]	T5large	660B	12.3	23.4	23.9	18.9	13.8	15.2	20.5	17.6	18.2
LLaVA-mR²AG	Vicuna	7B	40.6	31.8	44.7	40.2	40.3	25.3	43.9	39.8	40.1

Table 11: Complete results by question type on INFOSEEK ${}_{\text{Wikidata}}$ .

Without External Knowledge
Model	LLM	Params	INFOSEEK ${}_{\text{Wikidata}}$
			Unseen Question				Unseen Entity				Overall
			Time	Num	String	Avg	Time	Num	String	Avg	Overall
LLaVA [30]	Vicuna	7B	9.3	12.0	9.1	9.9	6.9	11.5	7.5	8.4	9.1
LLaVA-FT	Vicuna	7B	11.3	14.8	9.3	10.8	7.6	14.1	7.6	9.0	9.8
Retrieved Knowledge
LLaVA-mRAG	Vicuna	7B	29.2	22.4	33.3	30.3	27.5	21.0	31.8	29.2	29.8
LLaVA-SFR	Vicuna	7B	18.9	19.7	21.3	20.8	15.2	18.6	19.6	19.1	19.9
LLaVA-mR²AG	Vicuna	7B	41.6	29.3	42.5	39.1	38.3	25.7	41.7	38.0	38.6
Oracle Knowledge
LLaVA-mRAG	Vicuna	7B	71.6	34.3	61.8	55.3	58.4	35.2	62.2	56.1	55.7
LLaVA-SFR	Vicuna	7B	71.0	45.8	59.5	56.6	59.4	43.2	59.0	55.6	56.1
LLaVA-mR²AG	Vicuna	7B	70.6	43.2	62.9	58.3	67.0	41.2	62.3	57.9	58.1

9 Qualitative Results and Visualizations

Figure 4 qualitatively demonstrates the effectiveness of the mR²AG framework. It highlights the framework’s ability to accurately assess the relevance between retrieved content and user queries, precisely locate evidence paragraphs within the retrieved documents, and generate reliable answers. Figure 5 provides additional visualization results, illustrating that mR²AG effectively handles various types of visual entities and question types, further validating the design’s effectiveness and reliability. The last column presents additional error cases, where the primary issue lies in the failure to retrieve relevant Wikipedia entities for the visual content.
