mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

Tao Zhang1,2,4, Ziqi Zhang1,6, Zongyang Ma1,2,4, Yuxin Chen2, Zhongang Qi5, Chunfeng Yuan1,
Bing Li1,6,†, Junfu Pu2, Yuxuan Zhao3, Zehua Xie3, Jin Ma3, Ying Shan2, Weiming Hu1,4,7
1State Key Laboratory of Multimodal Artificial Intelligence Systems, CASIA; 2PCG ARC Lab, 3Tencent;
4School of Artificial Intelligence, University of Chinese Academy of Sciences; 5Huawei Noah’s Ark Lab;
6PeopleAl Inc; 7School of Information Science and Technology, ShanghaiTech University
{zhangtao2023, mazongyang2020}@ia.ac.cn, {ziqi.zhang, cfyuan, bli, wmhu}@nlpr.ia.ac.cn
{uasonchen, jevinpu, zehuaxie, martyzhao, daniellwang, yingsshan}@tencent.com
[email protected]
Abstract

Advanced Multimodal Large Language Models (MLLMs) struggle with recent Knowledge-based VQA tasks, such as INFOSEEK and Encyclopedic-VQA, because their knowledge scope is limited and frozen, which often leads to ambiguous and inaccurate responses. Multimodal Retrieval-Augmented Generation (mRAG) is therefore a natural way to provide MLLMs with comprehensive and up-to-date knowledge, effectively expanding their knowledge scope. However, current mRAG methods have inherent drawbacks: 1) they perform retrieval even when external knowledge is not needed; 2) they do not identify the evidence that supports the query; 3) they increase model complexity through additional information-filtering modules or rules. To address these shortcomings, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation (mR2AG), which achieves adaptive retrieval and localization of useful information through two easy-to-implement reflection operations, avoiding high model complexity. In mR2AG, Retrieval-Reflection is designed to distinguish different user queries and avoid redundant retrieval calls, and Relevance-Reflection is introduced to guide the MLLM in locating beneficial evidence in the retrieved content and generating answers accordingly. In addition, mR2AG can be integrated into any well-trained MLLM with efficient fine-tuning on the proposed mR2AG Instruction-Tuning dataset (mR2AG-IT). mR2AG significantly outperforms state-of-the-art MLLMs (e.g., GPT-4v/o) and RAG-based MLLMs on INFOSEEK and Encyclopedic-VQA, while maintaining the capabilities of the base MLLMs across a wide range of Visual-dependent tasks.

† Corresponding author.
Work done during Zhongang Qi’s tenure at Tencent PCG ARC Lab.

1 Introduction

Figure 1: Comparisons of different methods on Visual-dependent and Knowledge-based VQA tasks: 1) Typical MLLMs use the image $I$ and question $Q$ as inputs, offering limited support for Knowledge-based questions. 2) Naive mRAG uses $I$, $Q$, and retrieved content $P_{1,2,3}$ as inputs in all cases, inevitably introducing irrelevant noise. 3) mR2AG adaptively determines the necessity of retrieval and effectively locates the useful context, i.e., $P_3$ for $Q_2$.

The rapid development of Multimodal Large Language Models (MLLMs) [2, 26, 30, 29, 1, 40] enables them to excel in Visual-dependent VQA tasks that rely solely on visual content or common sense, such as VQAv2 [17], GQA [19] and TextVQA [42]. However, recently proposed Knowledge-based VQA tasks like INFOSEEK [9] and Encyclopedic-VQA [35] introduce fine-grained visual entities and focus on encyclopedic knowledge, posing significant new challenges for existing MLLMs. As shown for $Q_2$ in Figure 1, when queried about the first flight date of an airplane, typical MLLMs tend to provide inaccurate and overly general responses due to their limited and frozen knowledge scope.

To obtain accurate and specific answers, some works [18, 7, 45] introduce multimodal Retrieval-Augmented Generation (mRAG) that leverages external knowledge bases. They first use retrievers to find query-related information, then input it into the MLLM for answer generation. Nevertheless, several challenges remain: 1) Indiscriminate use of retrieval. Visual-dependent questions usually do not require external knowledge, and conducting retrieval for them may introduce noise and lead to wrong answers, such as the response of Naive mRAG to $Q_1$ in Figure 1. 2) Lack of explicit evidence localization. When encountering questions beyond one’s knowledge, humans first search for relevant information and then obtain reliable answers by locating direct evidence, e.g., $P_3$ of $P$ in Figure 1. However, current methods input retrieved text into the model to directly generate answers, lacking an explicit evidence localization process, which makes it difficult to determine whether the model can effectively utilize the useful retrieved information. 3) High model complexity. To improve performance, some methods introduce complex rules or even external models to calculate query-passage correlations to filter retrieved content.

To address the above challenges, we propose a novel generalized framework called multimodal Retrieval-Reflection-Augmented Generation (mR2AG), designed to seamlessly leverage the inherent instruction-following, multimodal understanding and reasoning abilities of MLLMs to enhance Knowledge-based question answering. To decide whether retrieval is necessary, mR2AG introduces Retrieval-Reflection, which determines whether the user’s query is Knowledge-based or Visual-dependent and retrieves adaptively, thus expanding MLLMs’ knowledge scope while maintaining the original performance. Furthermore, mR2AG imitates human behavior to implement Relevance-Reflection, which guides the model to explicitly identify the evidence within all retrieved information and to generate accurate responses based on the identified evidence. The implementation of the proposed reflection operations involves only modifications to the MLLMs’ vocabulary, without introducing any additional modules or computational strategies that would alter the original structure of the models, thereby coupling effectively with the MLLMs’ existing capabilities.

To quickly integrate mR2AG into pre-trained MLLMs, we provide a corresponding mR2AG Instruction Tuning dataset (mR2AG-IT) specifically constructed for Knowledge-based VQA tasks through an automated annotation pipeline. Specifically, mR2AG-IT annotates the evidence paragraphs within Wikipedia articles that explicitly support user queries. Comprehensive comparisons on Knowledge-based VQA datasets show that our approach significantly outperforms existing methods. When using LLaVA-v1.5-7B [30] as the base MLLM, applying mR2AG not only achieves performance gains of 10.6% and 15.5% over the previous SOTAs on the INFOSEEK$_{\text{Human}}$ and INFOSEEK$_{\text{Wikidata}}$ test sets, but also surpasses SOTAs on the Encyclopedic-VQA (Enc-VQA) test set by 2.5% on single-hop questions and 18.2% on multi-answer questions. Moreover, our method retains the excellent capabilities of the base MLLM on Visual-dependent VQA tasks, demonstrating performance comparable to LLaVA-v1.5-7B across various benchmarks.

The contributions of this work are summarized as follows:

  • We propose an advanced multimodal RAG framework mR2AG, which only uses two reflection operations to stimulate MLLMs to implement retrieval invocation, evidence content identification, and answer generation.

  • We provide the mR2AG-IT dataset, which aims to quickly adapt MLLMs to Knowledge-based VQA and serves as a supplement to general visual instruction tuning datasets.

  • mR2AG, when combined with common MLLMs, significantly outperforms existing mRAG methods in answering Knowledge-based queries and maintains the ability of the base MLLM on Visual-dependent tasks. Moreover, mR2AG exhibits conciseness and effectiveness.

2 Related Work

Knowledge-based VQA. Several works [35, 9] focus on Knowledge-based VQA and construct corresponding datasets. Early OK-VQA [33] and its variant A-OKVQA [41] emphasize the significance of knowledge in VQA but primarily focus on commonsense. ViQuAE [23] introduces a wide range of entity types and tests fine-grained knowledge related to named entities. However, certain questions in ViQuAE can be answered without viewing the visual content. INFOSEEK [9] and Enc-VQA [35] address this by designing questions that force the model to examine the image for the correct answer. Both datasets cover a broad range of Wikipedia entities and focus on fine-grained knowledge related to these entities. They differ in design: INFOSEEK explicitly defines two splits, Unseen Entity and Unseen Question, whereas Enc-VQA introduces more diverse question types, including single-hop, multi-answer, and two-hop questions.

Multimodal Large Language Models. The rapid development of LLMs [5, 38, 11, 43, 10, 12] drives the progress of MLLMs [2, 26, 30, 29, 1, 40, 48]. Typical MLLMs, e.g., LLaVA, widely adopt a two-stage pre-training and fine-tuning paradigm, achieving impressive capabilities across various multimodal tasks [42, 25, 46, 27]. MLLMs perform well in understanding human queries [30], handling purely visual VQA tasks [17, 19], and addressing commonsense VQA tasks [33]. However, they struggle with Knowledge-based VQA tasks that involve fine-grained knowledge of specific visual entities, as shown by their performance on INFOSEEK and Enc-VQA [9, 35].

Figure 2: Overview of the mR2AG framework. (a1) mR2AG w/ Retrieval: This process includes: a) Retrieval-Reflection for determining the necessity of retrieval; b) Relevance-Reflection for identifying evidence passages; c) Post-processing multiple potential answers. (a2) mR2AG w/o Retrieval: The generation process when retrieval is unnecessary. (b) Naïve mRAG: A baseline method without reflection.

Retrieval-Augmented Generation. RAG is widely used in LLMs to address challenges like hallucinations [47], non-renewable knowledge, and opaque reasoning [21, 14]. It combines LLMs’ inherent knowledge with dynamic external knowledge, offering a solution for knowledge-intensive tasks [24, 32]. This technique is gaining traction within the MLLMs domain. For example, Wiki-LLaVA [7] retrieves Wikipedia articles from the input image and uses Contriever [15] to select relevant passages. EchoSight [45] introduces a fine-tuned Q-Former [26] architecture as a reranker, filtering the retrieved content based on both the input image and question. These methods rely on external models to filter the retrieved information, while overlooking the role of MLLMs. Inspired by SELF-RAG [4], we propose the mR2AG framework, which leverages MLLMs to independently localize the query-relevant evidence within the retrieved content, eliminating the need for additional modules or complex strategies.

3 Methodology

Knowledge-based VQA takes an image-question pair $(I,Q)$ as input and is supported by an accessible knowledge base. As shown in Figure 2, naive mRAG first retrieves the top-$N$ articles most relevant to the input from the knowledge base, denoted as $\hat{P}=\{\hat{P}_i\}_{i=1}^{N}$, and then feeds the concatenation of $\hat{P}$ and $(I,Q)$ into the MLLM to directly generate the response $y^{\mathrm{ans}}$. In contrast, mR2AG proposes two novel reflection operations to decouple the generation process into three steps: (1) Performing Retrieval-Reflection to determine whether retrieval is needed. (2) Performing Relevance-Reflection to identify evidence passages and generate answers accordingly. (3) Post-processing multiple potential answers. The presentation of mR2AG is organized as follows: Section 3.1 introduces the mR2AG method. Section 3.2 describes the training of mR2AG. Section 3.3 introduces the mR2AG-IT dataset for fine-tuning.

3.1 mR2AG Method

3.1.1 Retrieval-Reflection

User queries can be divided into Visual-dependent and Knowledge-based according to the input $(I,Q)$. As shown in Figure 2, the question in case (a1) requires external knowledge for a confident answer, while the question in case (a2) can be answered entirely from the visual content. Introducing external knowledge in the latter case may bring undesirable noise. To guide the model in distinguishing between different queries, we define two special tokens, [Retrieval] and [No Retrieval], to perform Retrieval-Reflection. First, the model generates the retrieval-reflection prediction $y^{\mathrm{ret}}$ based on the input $(I,Q)$:

$$y^{\mathrm{ret}} = \mathrm{MLLM}(I,Q). \tag{1}$$

Depending on the value of $y^{\mathrm{ret}}$, one of the following branches is executed:

  • $y^{\mathrm{ret}}$ = [No Retrieval]: The model determines that the question can be answered without external knowledge, and conditions on this token along with $(I,Q)$ to generate the answer $y^{\mathrm{ans}}$:

    $$y^{\mathrm{ans}} = \mathrm{MLLM}(I,Q,y^{\mathrm{ret}}=\text{[No Retrieval]}). \tag{2}$$

  • $y^{\mathrm{ret}}$ = [Retrieval]: The model recognizes the need for external knowledge to answer the question and invokes retrievers to assist the subsequent generation process.

We use English Wikipedia entries as the knowledge base, where the $k$-th entry consists of a candidate image $\hat{I}_k$, title $\hat{T}_k$, and article $\hat{P}_k$. mR2AG combines cross-modal and uni-modal retrieval to select the Wikipedia entries most relevant to the query image $I$. CLIP [39] is used to encode $I$, $\hat{I}_k$, and $\hat{T}_k$, and the cosine similarities $\mathrm{sim}(I,\hat{I}_k)$ and $\mathrm{sim}(I,\hat{T}_k)$ are computed. The overall retrieval score $S^{\mathrm{ret}}_k$ for the $k$-th entry is the average of the two cosine similarities:

$$S^{\mathrm{ret}}_k = \left(\mathrm{sim}(I,\hat{I}_k) + \mathrm{sim}(I,\hat{T}_k)\right)/2. \tag{3}$$

The retrieval result $\hat{P}=\{\hat{P}_i\}_{i=1}^{N}$ consists of the articles of the top-$N$ entries with the highest retrieval scores.
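A minimal sketch of the entry-level scoring in Equation 3, assuming the CLIP embeddings of the query image and of each entry's image and title have already been computed. The helper names (`cosine`, `retrieval_score`, `top_n_entries`) and the dictionary keys are illustrative and not taken from the paper; plain Python lists stand in for embedding vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def retrieval_score(query_img, entry):
    """Average of image-image and image-title similarity (Equation 3)."""
    return (cosine(query_img, entry["image_emb"])
            + cosine(query_img, entry["title_emb"])) / 2


def top_n_entries(query_img, entries, n):
    """Return (index, score) pairs of the n highest-scoring entries."""
    scored = [(k, retrieval_score(query_img, e)) for k, e in enumerate(entries)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

In practice the embeddings would come from a CLIP image and text encoder, and a vector index would likely replace the exhaustive loop for a knowledge base of Wikipedia scale.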

3.1.2 Relevance-Reflection

We divide each retrieved article $\hat{P}_i$ into multiple natural paragraphs. To enable the model to determine whether each segmented paragraph $\hat{P}_{ij}$ contains evidence relevant to the question $Q$, we introduce two relevance-reflection tokens: [Relevant] and [Irrelevant]. Conditioned on the combination of $\hat{P}_{ij}$ and the query $(I,Q)$, the model generates the relevance-reflection prediction $y^{\mathrm{rel}}_{ij}$:

$$y^{\mathrm{rel}}_{ij} = \mathrm{MLLM}(I,Q,\hat{P}_{ij}). \tag{4}$$

According to the value of $y^{\mathrm{rel}}_{ij}$, mR2AG performs one of the following:

  • $y^{\mathrm{rel}}_{ij}$ = [Irrelevant]: The model perceives $\hat{P}_{ij}$ as irrelevant to the query and lacking sufficient evidence, so it terminates the generation process and avoids producing unreliable answers.

  • $y^{\mathrm{rel}}_{ij}$ = [Relevant]: The model considers $\hat{P}_{ij}$ relevant to the query, containing evidence beneficial for answer generation, and thus proceeds to generate the answer $y^{\mathrm{ans}}_{ij}$ based on $(I,Q,\hat{P}_{ij},y^{\mathrm{rel}}_{ij})$:

    $$y^{\mathrm{ans}}_{ij} = \mathrm{MLLM}(I,Q,\hat{P}_{ij},y^{\mathrm{rel}}_{ij}=\text{[Relevant]}). \tag{5}$$
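The control flow of Equations 1, 2, 4 and 5 can be sketched as follows. This is only an illustration under stated assumptions: the four callables stand in for MLLM and retriever calls and are hypothetical, and splitting on blank lines is an assumed stand-in for "natural paragraphs", which the text does not specify further.

```python
RETRIEVAL, NO_RETRIEVAL = "[Retrieval]", "[No Retrieval]"
RELEVANT, IRRELEVANT = "[Relevant]", "[Irrelevant]"


def split_paragraphs(article):
    """Assumed segmentation: paragraphs are separated by blank lines."""
    return [p.strip() for p in article.split("\n\n") if p.strip()]


def mr2ag_generate(image, question,
                   predict_retrieval, retrieve, predict_relevance, answer):
    """Return candidate answers as (article index, paragraph index, answer).

    All four callables are hypothetical stand-ins for model and retriever calls.
    """
    # Retrieval-Reflection (Eq. 1): decide whether external knowledge is needed.
    if predict_retrieval(image, question) == NO_RETRIEVAL:
        # Answer from the image and question alone (Eq. 2).
        return [(None, None, answer(image, question, None))]

    candidates = []
    for i, article in enumerate(retrieve(image)):
        for j, paragraph in enumerate(split_paragraphs(article)):
            # Relevance-Reflection (Eq. 4): skip paragraphs judged irrelevant.
            if predict_relevance(image, question, paragraph) == RELEVANT:
                # Generate an answer grounded in this paragraph (Eq. 5).
                candidates.append((i, j, answer(image, question, paragraph)))
    return candidates
```

Returning every candidate rather than the first hit leaves the choice of a final answer to a separate ranking step.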

3.1.3 Answer Post-Processing

Multiple evidence passages may exist in an article, leading to multiple candidate answers. Therefore, post-processing is necessary to arrive at a single final answer. Based on the retrieval-reflection-augmented generation process, we apply hierarchical post-processing to rank the candidate answers by integrating scores at three levels:

  • Entry-Level. The retrieval score in Equation 3 measures the similarity between the query image $I$ and the candidate Wikipedia entry, and serves as the Retrieval-Reflection score $S^{\mathrm{ret}}_i$ for the $i$-th retrieved entry.

  • Passage-Level. The probability of generating the [Relevant] Relevance-Reflection token quantifies the model’s confidence in judging $\hat{P}_{ij}$ as evidence, and is defined as the Relevance-Reflection score $S^{\mathrm{rel}}_{ij}$:

    $$S^{\mathrm{rel}}_{ij} = p_\theta\left(y^{\mathrm{rel}}_{ij}=\text{[Relevant]} \mid I,Q,\hat{P}_{ij}\right), \tag{6}$$

    where $\theta$ represents the parameters of the MLLM.

  • Answer-Level. We calculate the probability of each token in the generated answer sequence and use the geometric mean to normalize the influence of sequence length variation, resulting in the answer confidence score $S^{\mathrm{ans}}_{ij}$:

    $$S^{\mathrm{ans}}_{ij} = \sqrt[n]{p_\theta(y^{\mathrm{ans}}_{ij})}, \tag{7}$$

    where $n$ represents the sequence length. This score reflects the model’s confidence in generating answers based on the retrieved content and reflection tokens.

Post-processing. The three levels of scores comprehensively consider each step in the answer generation process, evaluating the reliability of the candidate answers at the entry, passage, and answer levels, respectively. The effects of the three scores are integrated by calculating their product, which serves as the final criterion for ranking candidate answers. The model outputs the answer with the highest score based on this criterion.
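As a rough illustration of this ranking, the sketch below multiplies the three scores for each candidate and returns the best one. The candidate dictionary layout and function names are assumptions made for the example; the answer-level score is computed in log space, which is equivalent to the $n$-th root in Equation 7.

```python
import math


def answer_confidence(token_probs):
    """Geometric mean of per-token probabilities (Equation 7)."""
    n = len(token_probs)
    return math.exp(sum(math.log(p) for p in token_probs) / n)


def final_score(candidate):
    """Product of the entry-, passage-, and answer-level scores."""
    return (candidate["s_ret"]
            * candidate["s_rel"]
            * answer_confidence(candidate["token_probs"]))


def select_answer(candidates):
    """Return (answer, score) for the candidate with the highest final score."""
    best = max(candidates, key=final_score)
    return best["answer"], final_score(best)
```

Because the scores are multiplied, a candidate with a weak score at any one level is penalized even if the other two are high.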

3.2 Training with Reflection Mechanism

During instruction tuning, we combine a common visual instruction tuning dataset, such as LLaVA-IT [30], with the specifically designed mR2AG-IT dataset:

  • For each sample in LLaVA-IT, we set the Retrieval-Reflection token to [No Retrieval]. The model is trained to answer the questions depending only on the visual content, and the training loss is formulated as:

    $$\mathcal{L}_1 = -\mathbb{E}_{(I,Q,y^{\mathrm{ret}},y^{\mathrm{ans}})\sim\mathcal{D}_{\text{LLaVA-IT}}} \log p_\theta\left(y^{\mathrm{ret}}=\text{[No Retrieval]},\, y^{\mathrm{ans}} \mid I,Q\right). \tag{8}$$

  • For each sample in mR2AG-IT, we set the Retrieval-Reflection token to [Retrieval]. The model is trained to invoke retrievers, identify evidence passages, and generate accurate responses. The training loss is defined as:

    $$\mathcal{L}_2 = -\mathbb{E}_{(I,Q,y^{\mathrm{ret}},\hat{P}_{ij},y^{\mathrm{rel}}_{ij},y^{\mathrm{ans}})\sim\mathcal{D}_{\text{mR$^{2}$AG-IT}}} \Big( \log p_\theta\left(y^{\mathrm{ret}}=\text{[Retrieval]} \mid I,Q\right) + \mathbbm{1}\big[y^{\mathrm{rel}}_{ij}=\text{[Relevant]}\big] \log p_\theta\left(y^{\mathrm{rel}}_{ij}, y^{\mathrm{ans}} \mid I,Q,\hat{P}_{ij}\right) + \mathbbm{1}\big[y^{\mathrm{rel}}_{ij}=\text{[Irrelevant]}\big] \log p_\theta\left(y^{\mathrm{rel}}_{ij} \mid I,Q,\hat{P}_{ij}\right) \Big), \tag{9}$$

    where the indicator function $\mathbbm{1}[\cdot]$ equals 1 when the condition inside the brackets is satisfied and 0 otherwise.
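To make the two objectives concrete, the sketch below evaluates the per-sample terms of Equations 8 and 9 from log-probabilities. It assumes an autoregressive model, so the joint term $\log p_\theta(y^{\mathrm{rel}}_{ij}, y^{\mathrm{ans}} \mid \cdot)$ is split by the chain rule into the log-probability of the reflection token plus that of the answer given the token; the function and argument names are illustrative.

```python
import math


def llava_it_sample_loss(logp_no_retrieval, logp_answer):
    """Per-sample term of Eq. 8: NLL of [No Retrieval] followed by the answer."""
    return -(logp_no_retrieval + logp_answer)


def mr2ag_it_sample_loss(logp_retrieval, logp_rel_token, logp_answer, relevant):
    """Per-sample term of Eq. 9.

    logp_rel_token is the log-probability of the labeled reflection token
    ([Relevant] or [Irrelevant]); logp_answer is the log-probability of the
    answer given that token and is used only for [Relevant] paragraphs.
    """
    loss = -(logp_retrieval + logp_rel_token)
    if relevant:
        # Answer tokens are supervised only when the paragraph is evidence.
        loss -= logp_answer
    return loss
```

For an [Irrelevant] paragraph the answer term is dropped, so the model is only supervised to emit the reflection token for such paragraphs.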

3.3 mR2AG-IT Dataset

For the Knowledge-based VQA tasks involving INFOSEEK [9] and Enc-VQA [35], we propose an automated pipeline for annotating training data. Each sample in these training datasets includes an $(I,Q,P_i,A)$ quadruple, where $P_i$ is the ground truth Wikipedia article containing the answer source.

For INFOSEEK [9], the annotation process includes:

  1. The Wikipedia article is segmented into natural paragraphs, with each paragraph $P_{ij}$ preserving semantic independence and forming a new quadruple $(I,Q,P_{ij},A)$.

  2. Each paragraph $P_{ij}$ is evaluated by GPT-4 [1] together with the question $Q$ and answer $A$ to determine whether it serves as evidence. If $P_{ij}$ is relevant to the query and supports answer generation, the Relevance-Reflection token is labeled as $y^{\mathrm{rel}}_{ij}$ = [Relevant]; otherwise, $y^{\mathrm{rel}}_{ij}$ = [Irrelevant].

  3. Due to the limited quantity of Wikipedia articles in the INFOSEEK training set, we additionally incorporate the Natural Questions (NQ) dataset [22] as a supplement. This dataset comprises real queries from the Google search engine, each with a long answer (when available) and a short answer as the response to the query. We consider the long answer as evidence for the query and the short answer as the response.

For Enc-VQA [35], each sample is annotated with the evidence paragraphs from the ground truth Wikipedia article Pisubscript𝑃𝑖P_{i}italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT that support the answer, eliminating additional steps to identify the evidence paragraphs. To ensure the precision of all training samples, we conduct strict string searches for filtering, ensuring answers appear only in evidence paragraphs.
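One way to read the strict string-search filter is sketched below. It rests on assumptions the text does not spell out: matching is case-insensitive substring search, and a sample is kept only if the answer occurs in at least one annotated evidence paragraph and in no other paragraph of the article. The function name and arguments are hypothetical.

```python
def keep_sample(answer, paragraphs, evidence_indices):
    """Keep a sample only if the answer string appears solely in evidence.

    paragraphs: list of paragraph strings from the ground-truth article.
    evidence_indices: indices of the paragraphs annotated as evidence.
    """
    needle = answer.strip().lower()
    evidence = set(evidence_indices)
    in_evidence = any(needle in paragraphs[k].lower() for k in evidence)
    elsewhere = any(needle in p.lower()
                    for k, p in enumerate(paragraphs) if k not in evidence)
    return in_evidence and not elsewhere
```

Under this reading, samples whose answer string also shows up in a non-evidence paragraph are discarded, since such a paragraph would otherwise be labeled [Irrelevant] despite containing the answer.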

4 Experiments

Table 1: Main results of models with external knowledge on INFOSEEK. † denotes our method and its variants with alternative designs.
Columns: Model | LLM | #Params | INFOSEEK$_{\text{Wikidata}}$ (Unseen Question, Unseen Entity, Overall) | INFOSEEK$_{\text{Human}}$ (Unseen Question, Unseen Entity, Overall) | INFOSEEK$_{\text{Validation}}$ (Unseen Question, Unseen Entity, Overall)
Retrieved Knowledge
CLIP→PaLM [9] PaLM 540B 21.9 18.6 20.1 15.6 14.9 15.2 22.7 18.5 20.4
CLIP→FiD [9] T5-large 660M 20.7 18.1 19.3 18.9 17.6 18.2 23.3 19.1 20.9
Wiki-LLaVA [7] Vicuna 7B 30.1 27.8 28.9
LLM-RA [20] 26.1 20.9 23.1
EchoSight [45] LLaMA3 8B 31.3
LLaVA-mRAG† Vicuna 7B 30.3 29.2 29.8 17.6 15.9 16.7
LLaVA-SFR† Vicuna 7B 20.8 19.1 19.9 18.5 17.2 17.9
LLaVA-mR2AG† Vicuna 7B 39.1 38.0 38.6 30.2 27.5 28.8 40.6 39.8 40.2
Oracle Knowledge
Oracle→FiD [9] T5-large 660M 52.0 45.6 52.1 53.0 52.5
Wiki-LLaVA [7] Vicuna 7B 52.7 50.3 51.5
AVIS [18] 56.4 50.7 53.4
LLaVA-mRAG† Vicuna 7B 55.3 56.1 55.7 32.8 28.2 30.3
LLaVA-SFR† Vicuna 7B 56.6 55.6 56.1 46.9 43.3 45.0
LLaVA-mR2AG† Vicuna 7B 58.3 57.9 58.1 50.4 47.2 48.7 60.8 59.3 60.0

4.1 Datasets

INFOSEEK [9] contains a training set and three evaluation sets: INFOSEEK$_{\text{Validation}}$, INFOSEEK$_{\text{Wikidata}}$, and INFOSEEK$_{\text{Human}}$. The training set, INFOSEEK$_{\text{Validation}}$, and INFOSEEK$_{\text{Wikidata}}$ are all derived from 1.3M samples automatically constructed from Wikipedia to support large-scale training and evaluation. INFOSEEK$_{\text{Human}}$ consists of 8.9K samples annotated by humans to simulate real information-seeking intentions. To prevent overfitting, each evaluation set is divided into two subsets: Unseen Entity and Unseen Question. The evaluation samples fall into three categories: STRING, TIME, and NUMERICAL. The STRING and TIME categories use VQA Accuracy [17] as the evaluation metric, while the NUMERICAL category is assessed with Relaxed Accuracy [36]. To calculate the overall accuracy, the average score over questions is first computed separately for each split (Unseen Question and Unseen Entity), and the overall score is the harmonic mean of these two averages.
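As a quick sanity check on the overall score, the sketch below combines the two per-split averages with a harmonic mean, which is the aggregation that reproduces the Overall columns reported in Table 1. The function name `overall_accuracy` is an illustrative choice, and the inputs are copied from the CLIP→PaLM row of that table.

```python
def overall_accuracy(unseen_question: float, unseen_entity: float) -> float:
    """Harmonic mean of the Unseen Question and Unseen Entity accuracies."""
    total = unseen_question + unseen_entity
    if total == 0:
        return 0.0
    return 2 * unseen_question * unseen_entity / total


# CLIP→PaLM on the Wikidata split: 21.9 (Unseen Question), 18.6 (Unseen Entity).
print(round(overall_accuracy(21.9, 18.6), 1))  # 20.1, matching the reported Overall
```

The harmonic mean penalizes a large gap between the two splits more than an arithmetic mean would, so a model cannot score well overall by doing well on only one split.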

INFOSEEK uses a Wikipedia knowledge base containing 100K articles and infobox images as external knowledge sources for the With-KB protocol. Since this external knowledge base is not publicly available, we construct one of the same scale and perform comprehensive evaluations across all available datasets. The evaluation metrics strictly follow INFOSEEK’s standards, ensuring a fair comparison.

Encyclopedic VQA [35] contains 1M {I, Q, A} triples, covering 16.7K entities and totaling 221K unique Q+A pairs. Each Q+A pair is associated with up to 5 images showing various instances of the same entity. The dataset includes single-hop questions generated using templated and automatic methods, along with multi-answer and two-hop questions. Multi-answer questions require a list of possible answers, while two-hop questions involve two consecutive retrieval and reasoning steps. For evaluation, accuracy is measured as the percentage of predictions that match the ground-truth answers on the test split, using BERT Matching (BEM) [6] to assess correctness. For multi-answer questions, the model’s output is first converted into a set of strings, and the intersection-over-union (IoU) with the ground-truth set is calculated. If $\text{IoU} \geq 0.5$, the prediction is considered correct; otherwise, BEM is used to determine the equivalence between the predicted and ground-truth lists.
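The multi-answer rule can be illustrated with a minimal sketch. Two parts are assumptions rather than details from the benchmark: the way a free-form output is split into a set of strings (`to_answer_set`, splitting on commas, semicolons, and the word "and"), and the `bem_equivalent` callable, a hypothetical stand-in for a BEM-style judge supplied by the caller.

```python
import re
from typing import Callable, Sequence, Set


def to_answer_set(text: str) -> Set[str]:
    # Assumption: answers are separated by commas, semicolons, or the word "and".
    parts = re.split(r",|;|\band\b", text.lower())
    return {p.strip() for p in parts if p.strip()}


def iou(pred: Set[str], gold: Set[str]) -> float:
    union = pred | gold
    return len(pred & gold) / len(union) if union else 0.0


def multi_answer_correct(
    prediction: str,
    gold_answers: Sequence[str],
    bem_equivalent: Callable[[str, str], bool],
    threshold: float = 0.5,
) -> bool:
    pred_set = to_answer_set(prediction)
    gold_set = {g.strip().lower() for g in gold_answers}
    # Rule 1: accept when the set overlap reaches the IoU threshold.
    if iou(pred_set, gold_set) >= threshold:
        return True
    # Rule 2: otherwise defer to the (hypothetical) BEM-style equivalence judge.
    return bem_equivalent(prediction, ", ".join(gold_answers))


never = lambda pred, gold: False  # placeholder judge that never accepts
print(multi_answer_correct("red, green", ["red", "green", "blue"], never))
```

Here the prediction covers two of three gold answers, so IoU is 2/3 and the fallback judge is never consulted.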

Enc-VQA provides a controlled knowledge base comprising 2M Wikipedia pages, with each Q+A pair annotated with the corresponding Wikipedia articles and evidence paragraphs supporting the answers. In the retrieval-augmented setting, Enc-VQA employs a Google Lens-based [16] retriever to predict entities from the input image $I$. We focus on single-hop and multi-answer questions in Enc-VQA, evaluating on the test split with a consistent knowledge base, retrieval approach, and unified evaluation metrics.

4.2 Implementation details

We perform instruction tuning using the combination of the LLaVA instruction tuning dataset (LLaVA-IT) [30] and the mR2AG-IT dataset. Training continues from the stage-1 checkpoint of LLaVA-v1.5-7B, with a learning rate of $2\times 10^{-5}$ and a batch size of $8\times 16$, and lasts for one epoch. The main experiments and ablation studies are conducted based on LLaVA-v1.5-7B [30], while mR2AG is also applicable to other MLLMs, such as Mini-Gemini [28] and Mipha [48]. We use CLIP-ViT-L/14@336px [39] as the retriever for INFOSEEK [9], and directly use the retrieval results based on Google Lens for Enc-VQA [35]. By default, we utilize the top-5 Wikipedia entries from the retrieval results for both benchmarks.
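To make the top-5 selection concrete, here is a minimal sketch of embedding-based retrieval by cosine similarity. It assumes the query image and the knowledge-base entries have already been embedded by some encoder (for example a CLIP model); the function names, the entry names, and the toy two-dimensional vectors are all hypothetical, and this is not the authors' retrieval pipeline.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k_entries(
    query_vec: Sequence[float],
    entry_vecs: Dict[str, Sequence[float]],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank knowledge-base entries by similarity to the query and keep the best k."""
    scored = [(name, cosine(query_vec, vec)) for name, vec in entry_vecs.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Toy embeddings standing in for precomputed entry vectors.
kb = {"Entry A": [1.0, 0.0], "Entry B": [0.0, 1.0], "Entry C": [1.0, 1.0]}
for name, score in top_k_entries([1.0, 0.1], kb, k=2):
    print(name, round(score, 3))
```

At the scale of a 100K-article knowledge base, an approximate nearest-neighbor index would normally replace the exhaustive loop shown here.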

4.3 Comparisons with SOTA

4.3.1 INFOSEEK

Without Knowledge. In the setting without external knowledge, the model predicts the answer based solely on the input image and question, relying on the knowledge stored in its parameters during training. To explore the performance of MLLMs under this protocol, we fine-tune LLaVA-v1.5-7B [30] using {V, Q, A} triples from the INFOSEEK training set. After fine-tuning, the model’s performance on INFOSEEK$_{\text{Human}}$ improves from 9.5 to 12.0, and on INFOSEEK$_{\text{Wikidata}}$ from 9.1 to 20.5. Additionally, the strongest models, GPT-4v [1] and GPT-4o [37], achieve scores of 12.1 and 21.3 on INFOSEEK$_{\text{Human}}$, respectively. These results suggest that while fine-tuning for specific datasets or using stronger models helps improve performance, current models still fall short in knowledge-based VQA tasks, underscoring the need for external knowledge.

Table 2: Main results of various methods on Enc-VQA[35]. By default, these methods use Google Lens [16] as the retriever, while methods marked with use a custom retrieval scheme.

| Model | Retrieved: Single-hop | Retrieved: Multi-answer | Oracle: Single-hop |
|---|---|---|---|
| PaLI [35] | 28.1% | 9.2% | 48.8% |
| PaLM [35] | 48.8% | 33.6% | 87.0% |
| GPT-3 [35] | 44.9% | 32.1% | 82.1% |
| Wiki-LLaVA [7] | 21.8% | - | 39.2% |
| EchoSight [45] | 41.8% | - | - |
| HAMMR-BLIP-2 [8] | 45.0% | - | - |
| HAMMR-PaLI-X [8] | 47.8% | - | - |
| Cascade w/o vanilla MLLMs [3] | 53.4% | - | - |
| LLaVA-mR2AG | 55.9% | 51.8% | 88.2% |

Retrieved Knowledge. Table 1 compares models that use an external knowledge base on the INFOSEEK [9] benchmark. When leveraging the articles of the retrieved Wikipedia entries for answer generation, mR2AG significantly outperforms the best existing models on all three test sets. Specifically, mR2AG surpasses LLM-RA [20], CLIP→FiD [9], and EchoSight [45] by 15.5, 10.6, and 8.9 points in overall accuracy on INFOSEEK$_{\text{Wikidata}}$, INFOSEEK$_{\text{Human}}$, and INFOSEEK$_{\text{Validation}}$, respectively.

To further verify that our improvement stems from the proposed mR2AG framework rather than simply the improved retrieval method or the powerful MLLM, we implement two comparative models: LLaVA-mRAG and LLaVA-SFR. Both models use the same retrieval results as ours. LLaVA-mRAG directly uses the articles of the retrieved entries to augment generation, while LLaVA-SFR employs an off-the-shelf model, SFR [34], to identify question-related paragraphs in the articles for augmenting generation. Our results significantly outperform these two models, demonstrating that our method can effectively utilize noisy retrieval content, accurately pinpoint the relevant information, and extract the knowledge needed to answer the questions.

Oracle Knowledge. Even when it is possible to obtain the ground-truth Wikipedia entries corresponding to the involved entities in the question, and use them to augment the answer generation, our method still surpasses other methods across all test sets. This further highlights the advantage of mR2AG in refining useful evidence, and explicitly outputting evaluation results before generating answers improves the model’s ability to perform knowledge reasoning.

4.3.2 Encyclopedic VQA

As Table 2 shows, when using the retrieved knowledge, mR2AG achieves superior performance, improving the best result from 53.4% to 55.9% on single-hop questions and from 33.6% to 51.8% on the often overlooked multi-answer questions. This indicates that mR2AG can effectively analyze noisy retrieved knowledge, accurately locate relevant information, and generate high-quality responses. Under the Oracle setting, when provided with more reliable knowledge sources, the performance of mR2AG improves dramatically from 55.9% to 88.2%, surpassing the previous SOTA of 87.0%. This suggests that: 1) Improving retrieval precision effectively enhances the performance of mR2AG, demonstrating the method’s high potential. 2) With the same retrieval content, the superior performance underscores that the Relevance-Reflection mechanism introduced by mR2AG further strengthens the model’s information extraction and reasoning capabilities. Overall, the state-of-the-art performance across multiple Knowledge-based VQA tasks showcases the broad applicability of mR2AG.

Table 3: Main results on common Visual-dependent benchmarks.

| Model | MME (cognition) | MME (perception) | LLaVA$^{\text{W}}$ | MMB | POPE |
|---|---|---|---|---|---|
| LLaVA-v1.5-7B [29] | 355.7 | 1507.9 | 64.9 | 64.7 | 85.9 |
| Wiki-LLaVA [7] | 341.3 | 1438.9 | - | 71.1 | 84.2 |
| LLaVA-mR2AG | 325.7 | 1476.2 | 65.8 | 66.2 | 86.1 |

Table 4: Results on MLLMs with different architectures and scales.

| Model | LLM | #Params | INFOSEEK$_{\text{Wikidata}}$ | INFOSEEK$_{\text{Human}}$ |
|---|---|---|---|---|
| Without Knowledge | | | | |
| Mipha [48] | Phi2 | 3B | 7.0 | 6.4 |
| MGM [28] | Vicuna | 7B | 10.7 | 10.2 |
| LLaVA [30] | Vicuna | 13B | 9.8 | 9.9 |
| Retrieved Knowledge | | | | |
| Mipha-mRAG | Phi2 | 3B | 26.6 | 14.4 |
| MGM-mRAG | Vicuna | 7B | 29.8 | 18.4 |
| LLaVA-mRAG | Vicuna | 13B | 31.3 | 18.1 |
| Mipha-mR2AG | Phi2 | 3B | 34.4 | 26.3 |
| MGM-mR2AG | Vicuna | 7B | 38.4 | 28.8 |
| LLaVA-mR2AG | Vicuna | 13B | 36.7 | 29.4 |

4.4 Comparisons on the Visual-dependent Tasks

We conduct comprehensive evaluations on widely used Visual-dependent benchmarks to demonstrate mR2AG’s capabilities in handling questions about visual content or common sense. As shown in Table 3, our model performs comparably to the base MLLM, i.e., LLaVA-v1.5-7B [29], on the MME [13] benchmark and surpasses it on the LLaVA$^{\text{W}}$ [29], MMB [31], and POPE [27] benchmarks. These comparisons highlight the effectiveness of the mR2AG design, which strategically separates Visual-dependent tasks from Knowledge-based VQA tasks, allowing for targeted optimization in Knowledge-based domains while maintaining excellent performance on Visual-dependent tasks.

4.5 Generalizability and Scalability

We validate the effectiveness of the mR2AG framework on three model architectures (Mipha [48], Mini-Gemini [28], and LLaVA [30]), covering three different scales of language models (3B, 7B, and 13B). In our experiments, we fine-tune these MLLMs from their stage-1 checkpoints using the same instruction-tuning data, while maintaining consistent hyperparameters such as learning rate, batch size, and number of epochs. The results in Table 4 show that these base models generally perform poorly on the INFOSEEK [9] benchmark. However, by introducing the naive mRAG approach or the mR2AG framework, we observe significant improvements in performance on Knowledge-based VQA tasks. Notably, the mR2AG framework consistently outperforms the naive mRAG approach, demonstrating its generalizability across different model architectures and scalability across varying language model sizes.

Figure 3: Qualitative comparison of GPT-4o and mR2AG on the INFOSEEK dataset. Two failure cases are shown in (c) and (d).

Table 5: Comparisons of different answer post-processing methods on INFOSEEK$_{\text{Validation}}$.

| Score Type | Unseen Question | Unseen Entity | Overall |
|---|---|---|---|
| Random | 32.7 | 32.7 | 32.7 |
| $S^{\mathrm{ans}}$ | 36.4 | 37.0 | 36.7 |
| $S^{\mathrm{ret}}$ | 37.6 | 37.4 | 37.5 |
| $S^{\mathrm{rel}}$ | 39.7 | 38.7 | 39.2 |
| $S^{\mathrm{ret}}\cdot S^{\mathrm{ans}}$ | 38.3 | 38.7 | 38.5 |
| $S^{\mathrm{rel}}\cdot S^{\mathrm{ans}}$ | 39.5 | 39.0 | 39.2 |
| $S^{\mathrm{ret}}\cdot S^{\mathrm{rel}}$ | 40.5 | 39.3 | 39.9 |
| $S^{\mathrm{ret}}\cdot S^{\mathrm{rel}}\cdot S^{\mathrm{ans}}$ | 40.6 | 39.8 | 40.2 |

Table 6: Effect of retrieving different numbers of Wikipedia entries.

| Retrieved Articles | R@K | INFOSEEK$_{\text{Validation}}$ |
|---|---|---|
| 1 | 0.38 | 29.7 |
| 3 | 0.53 | 37.6 |
| 5 | 0.59 | 40.2 |
| 10 | 0.65 | 42.8 |

4.6 Ablation Studies

In this section, we conduct ablation studies to verify the effectiveness of our model’s design choices. Most comparisons are performed on the INFOSEEK$_{\text{Validation}}$ dataset.

Effect of $S^{\mathrm{ret}}$, $S^{\mathrm{rel}}$ and $S^{\mathrm{ans}}$. To evaluate the effectiveness of using the product of the Retrieval-Reflection score $S^{\mathrm{ret}}$, the Relevance-Reflection score $S^{\mathrm{rel}}$, and the answer confidence score $S^{\mathrm{ans}}$ in post-processing, we compare it against all alternative ranking methods, as shown in Table 5. Randomly selecting from the answer candidates yields the lowest performance. When scores are used for post-processing, the combination of all three scores outperforms every other score combination, demonstrating that these scores assess the reliability of an answer at different levels.
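The post-processing step compared in Table 5 amounts to ranking answer candidates by a product of scores. The sketch below is illustrative only: it assumes each candidate already carries its three scores as values in [0, 1], and the `Candidate` class, the `select_answer` function, and the example numbers are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    answer: str
    s_ret: float  # Retrieval-Reflection score
    s_rel: float  # Relevance-Reflection score
    s_ans: float  # answer confidence score


def select_answer(candidates: List[Candidate]) -> Candidate:
    """Pick the candidate with the highest product of the three scores."""
    if not candidates:
        raise ValueError("at least one candidate is required")
    return max(candidates, key=lambda c: c.s_ret * c.s_rel * c.s_ans)


candidates = [
    Candidate("1937", s_ret=0.9, s_rel=0.8, s_ans=0.7),
    Candidate("1933", s_ret=0.9, s_rel=0.3, s_ans=0.95),
]
print(select_answer(candidates).answer)
```

In this toy case the second candidate has the more confident answer score, but its low relevance score pulls the product down, which is the behavior the ablation attributes to combining the scores.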

Effect of retrieving different numbers of Wikipedia entries. To provide insights into the optimal number of retrieved Wikipedia entries for augmenting generation, we explore how varying the number of entries affects performance, as shown in Table 6. Increasing the number of retrieved entries from 1 to 5 significantly improves the overall accuracy on INFOSEEK$_{\text{Validation}}$, as higher recall rates increase the likelihood of capturing relevant information. However, further increasing the number of retrieved entries introduces more irrelevant content and noise, leading to limited performance improvement and higher inference costs. Considering the trade-off between performance and efficiency, we select 5 retrieved Wikipedia entries for each question.

Contribution of the NQ samples to the mR2AG-IT dataset. NQ samples serve as a supplement to the mR2AG-IT dataset, which contains samples with evidence paragraphs. As shown in Table 7, removing this portion of data leads to a performance decline. This finding highlights the importance of including NQ data in enhancing the model’s ability to accurately identify evidence paragraphs.

Table 7: The importance of the NQ dataset, evaluated on INFOSEEK$_{\text{Validation}}$.

| Setting | Unseen Question | Unseen Entity | Overall |
|---|---|---|---|
| w/o NQ | 39.1 | 39.7 | 39.4 |
| w/ NQ | 40.6 | 39.8 | 40.2 |

Benefits of combining cross-modal and uni-modal retrievals. Table 8 compares the performance of different methods for retrieving the ground-truth Wikipedia entities on INFOSEEK$_{\text{Validation}}$. Combining cross-modal and uni-modal retrievals improves R@1/10/20 over either method alone (e.g., R@1 rises from 0.31 with cross-modal retrieval to 0.38), thereby incorporating more knowledge beneficial for answering the questions.
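For reference, the R@K metric used in these retrieval comparisons can be computed as in the sketch below. It assumes each question has a single ground-truth entity and a ranked list of retrieved entity names; the function name `recall_at_k` and the toy data are illustrative.

```python
from typing import List, Sequence


def recall_at_k(ranked_lists: Sequence[List[str]], gold_entities: Sequence[str], k: int) -> float:
    """Fraction of questions whose gold entity appears among the top-k retrieved entries."""
    if not gold_entities:
        return 0.0
    hits = sum(
        1 for ranked, gold in zip(ranked_lists, gold_entities) if gold in ranked[:k]
    )
    return hits / len(gold_entities)


ranked = [["a", "b", "c"], ["x", "y", "z"]]
gold = ["b", "q"]
print(recall_at_k(ranked, gold, k=1), recall_at_k(ranked, gold, k=2))
```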

4.7 Qualitative analysis

In Figure 3, we present a qualitative comparison of mR2AG with GPT-4o. As illustrated in Figure 3 (a) and (b), our method retrieves relevant knowledge and accurately answers the Knowledge-based questions, achieving more precise results than GPT-4o. However, Figure 3 (c) and (d) show two failure cases:

  • Inaccurate retrieval: As illustrated in Figure 3 (c), when the subject in the image is difficult to identify, the retriever struggles to find relevant information, making it challenging for our method to answer the questions.

  • Knowledge interference: In Figure 3 (d), the retriever finds the correct entity, but our method provides the wrong answer due to conflicting knowledge in the text, specifically “a suspended span of 564 feet.”

Table 8: Comparisons of retrieval performance across different retrieval methods on the INFOSEEK knowledge base.

| Retrieval Method | R@1 | R@10 | R@20 |
|---|---|---|---|
| Cross-modal | 0.31 | 0.58 | 0.65 |
| Uni-modal | 0.29 | 0.55 | 0.62 |
| Cross-modal + Uni-modal | 0.38 | 0.65 | 0.71 |

5 Conclusion

This paper proposes an advanced multimodal RAG framework that optimizes Knowledge-based VQA tasks while maintaining the model’s capabilities as a general-purpose MLLM. The approach is based on existing MLLMs, guiding the model to explicitly distinguish the type of user query and evaluate retrieved information to refine the naive multimodal RAG process. It improves inference efficiency by adaptively avoiding unnecessary retrievals. Furthermore, by explicitly evaluating the retrieved content, it can identify the evidence passages relevant to the query, while filtering out noise from the retrieved content, thereby enhancing the credibility of the generated responses. Future work will explore knowledge graph-based retrieval-augmented systems and broader application scenarios.

6 Limitation

Limited by the entity recognition capabilities of existing MLLMs, our method is quite dependent on the retriever, assuming that the default visual entity occupies the major position in the image. In the future, we plan to conduct specialized training for visual entity recognition tasks and guide the model to enhance the discrimination of visual entity recognition categories and locations.

References

  • Achiam et al. [2023] Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Alayrac et al. [2022] Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems, 35:23716–23736, 2022.
  • Alazraki et al. [2023] Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, and Thomas Mensink. How (not) to ensemble lvlms for vqa. In Proceedings on, pages 1–20. PMLR, 2023.
  • Asai et al. [2023] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-rag: Learning to retrieve, generate, and critique through self-reflection. arXiv preprint arXiv:2310.11511, 2023.
  • Brown et al. [2020] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
  • Bulian et al. [2022] Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin Boerschinger, and Tal Schuster. Tomayto, tomahto. beyond token-level answer equivalence for question answering evaluation. arXiv preprint arXiv:2202.07654, 2022.
  • Caffagni et al. [2024] Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. arXiv preprint arXiv:2404.15406, 2024.
  • Castrejon et al. [2024] Lluis Castrejon, Thomas Mensink, Howard Zhou, Vittorio Ferrari, Andre Araujo, and Jasper Uijlings. Hammr: Hierarchical multimodal react agents for generic vqa. arXiv preprint arXiv:2404.05465, 2024.
  • Chen et al. [2023] Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14948–14968, 2023.
  • Chiang et al. [2023] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E Gonzalez, et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.
  • Chowdhery et al. [2023] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. Journal of Machine Learning Research, 24(240):1–113, 2023.
  • Chung et al. [2024] Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, et al. Scaling instruction-finetuned language models. Journal of Machine Learning Research, 25(70):1–53, 2024.
  • Fu et al. [2024] Chaoyou Fu, Peixian Chen, Yunhang Shen, Yulei Qin, Mengdan Zhang, Xu Lin, Jinrui Yang, Xiawu Zheng, Ke Li, Xing Sun, Yunsheng Wu, and Rongrong Ji. Mme: A comprehensive evaluation benchmark for multimodal large language models, 2024.
  • Gao et al. [2023] Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, and Haofen Wang. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997, 2023.
  • Gautier et al. [2022] Izacard Gautier, Caron Mathilde, Hosseini Lucas, Riedel Sebastian, Bojanowski Piotr, Joulin Armand, and Grave Edouard. Unsupervised dense information retrieval with contrastive learning. Transactions on Machine Learning Research, 2022.
  • Google [2017] Google. Google Lens. https://lens.google.com - Web interface available at https://images.google.com, 2017.
  • Goyal et al. [2017] Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017.
  • Hu et al. [2024] Ziniu Hu, Ahmet Iscen, Chen Sun, Kai-Wei Chang, Yizhou Sun, David Ross, Cordelia Schmid, and Alireza Fathi. Avis: Autonomous visual information seeking with large language model agent. Advances in Neural Information Processing Systems, 36, 2024.
  • Hudson and Manning [2019] Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019.
  • Jian et al. [2024] Pu Jian, Donglei Yu, and Jiajun Zhang. Large language models know what is key visual entity: An llm-assisted multimodal retrieval for vqa. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 10939–10956, 2024.
  • Kandpal et al. [2023] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In International Conference on Machine Learning, pages 15696–15707. PMLR, 2023.
  • Kwiatkowski et al. [2019] Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, et al. Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics, 7:453–466, 2019.
  • Lerner et al. [2022] Paul Lerner, Olivier Ferret, Camille Guinaudeau, Hervé Le Borgne, Romaric Besançon, José G Moreno, and Jesús Lovón Melgarejo. Viquae, a dataset for knowledge-based visual question answering about named entities. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 3108–3120, 2022.
  • Lewis et al. [2020] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.
  • Li et al. [2023a] Bohao Li, Rui Wang, Guangzhi Wang, Yuying Ge, Yixiao Ge, and Ying Shan. Seed-bench: Benchmarking multimodal llms with generative comprehension. arXiv preprint arXiv:2307.16125, 2023a.
  • Li et al. [2023b] Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International conference on machine learning, pages 19730–19742. PMLR, 2023b.
  • Li et al. [2023c] Yifan Li, Yifan Du, Kun Zhou, Jinpeng Wang, Wayne Xin Zhao, and Ji-Rong Wen. Evaluating object hallucination in large vision-language models. arXiv preprint arXiv:2305.10355, 2023c.
  • Li et al. [2024] Yanwei Li, Yuechen Zhang, Chengyao Wang, Zhisheng Zhong, Yixin Chen, Ruihang Chu, Shaoteng Liu, and Jiaya Jia. Mini-gemini: Mining the potential of multi-modality vision language models. arXiv preprint arXiv:2403.18814, 2024.
  • Liu et al. [2023] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023.
  • Liu et al. [2024a] Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in neural information processing systems, 36, 2024a.
  • Liu et al. [2024b] Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, Kai Chen, and Dahua Lin. Mmbench: Is your multi-modal model an all-around player?, 2024b.
  • Mao et al. [2020] Yuning Mao, Pengcheng He, Xiaodong Liu, Yelong Shen, Jianfeng Gao, Jiawei Han, and Weizhu Chen. Generation-augmented retrieval for open-domain question answering. arXiv preprint arXiv:2009.08553, 2020.
  • Marino et al. [2019] Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pages 3195–3204, 2019.
  • Meng et al. [2024] Rui Meng, Ye Liu, Shafiq Rayhan Joty, Caiming Xiong, Yingbo Zhou, and Semih Yavuz. Sfrembedding-mistral: enhance text retrieval with transfer learning. Salesforce AI Research Blog, 3, 2024.
  • Mensink et al. [2023] Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, André Araujo, and Vittorio Ferrari. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3113–3124, 2023.
  • Methani et al. [2020] Nitesh Methani, Pritha Ganguly, Mitesh M Khapra, and Pratyush Kumar. Plotqa: Reasoning over scientific plots. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 1527–1536, 2020.
  • OpenAI [2024] OpenAI. Introducing gpt-4o: Openai’s new flagship multimodal model now in preview on azure. Microsoft Azure Blog, 2024. Available at: https://azure.microsoft.com/en-us/blog/introducing-gpt-4o-openais-new-flagship-multimodal-model-now-in-preview-on-azure/.
  • Ouyang et al. [2022] Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744, 2022.
  • Radford et al. [2021] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021.
  • Reid et al. [2024] Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024.
  • Schwenk et al. [2022] Dustin Schwenk, Apoorv Khandelwal, Christopher Clark, Kenneth Marino, and Roozbeh Mottaghi. A-okvqa: A benchmark for visual question answering using world knowledge. In European Conference on Computer Vision, pages 146–162. Springer, 2022.
  • Singh et al. [2019] Amanpreet Singh, Vivek Natarajan, Meet Shah, Yu Jiang, Xinlei Chen, Dhruv Batra, Devi Parikh, and Marcus Rohrbach. Towards vqa models that can read. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8317–8326, 2019.
  • Touvron et al. [2023] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023.
  • Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022.
  • Yan and Xie [2024] Yibin Yan and Weidi Xie. Echosight: Advancing visual-language models with wiki knowledge. arXiv preprint arXiv:2407.12735, 2024.
  • Yue et al. [2023] Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi. arXiv preprint arXiv:2311.16502, 2023.
  • Zhang et al. [2023] Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren’s song in the ai ocean: A survey on hallucination in large language models. corr abs/2309.01219 (2023), 2023.
  • Zhu et al. [2024] Minjie Zhu, Yichen Zhu, Xin Liu, Ning Liu, Zhiyuan Xu, Chaomin Shen, Yaxin Peng, Zhicai Ou, Feifei Feng, and Jian Tang. Mipha: A comprehensive overhaul of multimodal assistant with small language models. CoRR, 2024.

Supplementary Material

7 Prompt Engineering

7.1 mR2AG-IT Dataset Annotation

We utilize the GPT-4 [1] model via API to annotate the training dataset and design the following prompt to assess the relevance between retrieved content and the query. Inspired by the chain-of-thought [44] approach, the prompt instructs GPT-4 [1] to extract evidence sentences before generating relevance judgments. This design not only enhances the accuracy of the judgments but also facilitates manual calibration. For each input, the content within curly braces {} is replaced with the corresponding actual input. In the following, "question", "answer", and "paragraph" correspond to $Q$, $A$, and $P_{ij}$, respectively, as defined in Section 3.3.

Dataset Annotation Prompt for GPT-4

Instruction: Given a question and its corresponding answer, I need your help to verify whether the retrieved document provided below can fully and effectively support the corresponding answer to the question, and then accurately locate the source of the answer within the paragraph. If so, please respond with [Relevant] and find the evidence sentence supporting the answer. If not, please just respond with [Irrelevant]. There are only two formats for your response:
1. [Relevant]
Answer source: source sentence.
2. [Irrelevant]

Input: Question: {question}. Answer: {answer}. Retrieved document: {paragraph}.
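
The two response formats above lend themselves to simple automatic parsing before manual calibration. The following is a minimal sketch of how the input line could be filled and a GPT-4 response parsed; the names `build_input`, `parse_judgment`, and `Judgment` are hypothetical and are not taken from the authors' code, and the API call itself is omitted.

```python
import re
from typing import NamedTuple, Optional

# Input line of the annotation prompt, copied from the prompt above.
INPUT_TEMPLATE = "Question: {question}. Answer: {answer}. Retrieved document: {paragraph}."


class Judgment(NamedTuple):
    relevant: bool
    evidence: Optional[str]  # evidence sentence, only set for [Relevant]


def build_input(question: str, answer: str, paragraph: str) -> str:
    """Fill the curly-brace placeholders with the actual inputs."""
    return INPUT_TEMPLATE.format(question=question, answer=answer, paragraph=paragraph)


def parse_judgment(response: str) -> Optional[Judgment]:
    """Parse a response that follows one of the two allowed formats.

    Returns None when the response matches neither format, so that such
    samples can be routed to manual inspection (an assumption of this sketch).
    """
    text = response.strip()
    if text.startswith("[Irrelevant]"):
        return Judgment(False, None)
    if text.startswith("[Relevant]"):
        match = re.search(r"Answer source:\s*(.+)", text, flags=re.DOTALL)
        evidence = match.group(1).strip() if match else None
        return Judgment(True, evidence)
    return None
```

Returning `None` for malformed responses, rather than guessing a label, keeps ambiguous annotations out of the training data until they have been checked by hand.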

7.2 INFOSEEK

INFOSEEK [9] evaluates generated answers using exact match, requiring the outputs to strictly match the annotated answers, which are typically concise and presented in the form of a single word or phrase. To ensure the outputs align with these requirements, we design the following prompt, guiding the model to focus on retrieved content and produce concise responses:

"Based on the retrieved document, answer the question with a single word or phrase."
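
To illustrate why concise outputs matter under exact-match scoring, the sketch below compares a normalized prediction against a list of annotated answers. The normalization steps (lowercasing, trimming surrounding punctuation, dropping English articles) are assumptions made for illustration; the official INFOSEEK scorer may apply different normalization or tolerance rules.

```python
import re
import string
from typing import Iterable


def normalize(text: str) -> str:
    """Lowercase, trim surrounding punctuation, drop articles, collapse spaces."""
    text = text.lower().strip(string.punctuation + string.whitespace)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, references: Iterable[str]) -> bool:
    """True if the normalized prediction equals any normalized reference."""
    pred = normalize(prediction)
    return any(pred == normalize(ref) for ref in references)
```

Under this kind of scoring, a verbose response such as a full sentence fails even when it contains the correct entity, which is the behavior the prompt is designed to avoid.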

7.3 Encyclopedic-VQA

The Enc-VQA [35] dataset includes both single-hop and multi-answer questions. For single-hop questions, we adopt the same prompt template as for INFOSEEK. For multi-answer questions, the model needs to generate several possible answers. We adjust the prompt so that the answers comply with the dataset requirements and the answer list can be extracted reliably from the response for evaluation:

"Based on the retrieved documents, answer the question as briefly as possible, using '&&' to connect multiple different answers."
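
Extracting the answer list from such a response amounts to splitting on the '&&' separator. A minimal sketch follows; the function name `split_answers` and the choices to drop empty items and case-insensitive duplicates are assumptions, not details given in the paper.

```python
from typing import List


def split_answers(response: str, sep: str = "&&") -> List[str]:
    """Split a multi-answer response into a de-duplicated answer list."""
    seen = set()
    answers: List[str] = []
    for part in response.split(sep):
        item = part.strip().strip(".").strip()  # trim spaces and a trailing period
        key = item.lower()
        if item and key not in seen:  # skip empty items and repeated answers
            seen.add(key)
            answers.append(item)
    return answers
```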

8 Additional Experiment Results

Tables 9, 10, and 11 present the complete experimental results on INFOSEEK [9] across various question types. In the setting without external knowledge, the model relies solely on the knowledge encoded in its parameters to answer questions. As shown in Tables 9 and 11, fine-tuning the LLaVA model without integrating external knowledge yields only limited performance improvements. Additionally, using APIs, we evaluate the performance of GPT-4v/o [1, 37] and Gemini-1.5-pro [40] on INFOSEEK$_{\text{Human}}$. As shown in Table 9, although the GPT series models outperform the other models that do not use external knowledge, they remain inferior to the mR2AG framework. Overall, the complete experimental results on the INFOSEEK dataset demonstrate that mR2AG significantly improves accuracy across all question types, with the most notable enhancement observed in the Time category. These findings further underscore the superiority of our approach in addressing Knowledge-based VQA tasks.

Table 9: Complete results by question type on INFOSEEK$_{\text{Human}}$, with LLaVA-FT referring to the fine-tuned model.

| Model | LLM | Params | Unseen Question: Time | Unseen Question: Num | Unseen Question: String | Unseen Question: Avg | Unseen Entity: Time | Unseen Entity: Num | Unseen Entity: String | Unseen Entity: Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Without External Knowledge | | | | | | | | | | | |
| LLaVA [30] | Vicuna | 7B | 8.6 | 12.1 | 8.6 | 9.5 | 7.8 | 13.1 | 8.4 | 9.5 | 9.5 |
| LLaVA-FT | Vicuna | 7B | 12.6 | 17.1 | 13.6 | 14.3 | 8.5 | 14.3 | 9.3 | 10.4 | 12.0 |
| PaLI-X [9] | UL2$_{\text{32B}}$ | 55B | - | - | - | 12.9 | - | - | - | 9.3 | 10.8 |
| Gemini-1.5-pro [40] | - | - | 8.1 | 7.7 | 15.5 | 11.3 | 5.6 | 5.1 | 10.3 | 7.6 | 9.1 |
| GPT-4v [1] | - | - | 15.5 | 13.3 | 14.0 | 14.3 | 12.4 | 9.5 | 9.9 | 10.5 | 12.1 |
| GPT-4o [37] | - | - | 31.0 | 29.5 | 21.7 | 26.5 | 20.8 | 22.0 | 13.7 | 17.9 | 21.3 |
| Retrieved Knowledge | | | | | | | | | | | |
| LLaVA-mRAG | Vicuna | 7B | 20.0 | 19.1 | 15.0 | 17.6 | 19.3 | 16.8 | 13.2 | 15.9 | 16.7 |
| LLaVA-SFR | Vicuna | 7B | 16.2 | 26.0 | 15.4 | 18.5 | 14.6 | 27.5 | 13.0 | 17.2 | 17.9 |
| LLaVA-mR2AG | Vicuna | 7B | 37.8 | 39.6 | 19.7 | 30.2 | 33.5 | 39.7 | 16.8 | 27.5 | 28.8 |
| Oracle Knowledge | | | | | | | | | | | |
| LLaVA-mRAG | Vicuna | 7B | 41.9 | 29.6 | 29.1 | 32.8 | 37.3 | 28.9 | 22.2 | 28.2 | 30.3 |
| LLaVA-SFR | Vicuna | 7B | 49.8 | 64.7 | 34.0 | 46.9 | 44.1 | 66.6 | 29.4 | 43.3 | 45.0 |
| LLaVA-mR2AG | Vicuna | 7B | 65.2 | 66.6 | 31.2 | 50.4 | 59.3 | 68.6 | 27.5 | 47.2 | 48.7 |

Table 10: Complete results by question type on INFOSEEK$_{\text{Validation}}$.

| Model | LLM | Params | Unseen Question: Time | Unseen Question: Num | Unseen Question: String | Unseen Question: Avg | Unseen Entity: Time | Unseen Entity: Num | Unseen Entity: String | Unseen Entity: Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Without External Knowledge | | | | | | | | | | | |
| InstructBLIP [9] | Flan-T5$_{\text{XXL}}$ | 12B | 7.9 | 7.5 | 17.8 | 15.0 | 6.6 | 8.2 | 16.1 | 14.0 | 14.5 |
| BLIP2 [9] | Flan-T5$_{\text{XXL}}$ | 12B | 6.9 | 5.8 | 18.5 | 15.0 | 5.6 | 6.0 | 17.0 | 14.2 | 14.6 |
| PaLI-17B [9] | mT5$_{\text{XXL}}$ | 17B | 3.8 | 18.4 | 27.4 | 24.2 | 1.0 | 14.8 | 18.2 | 16.7 | 19.7 |
| PaLI-X [9] | UL2$_{\text{32B}}$ | 55B | 7.7 | 16.1 | 30.0 | 25.8 | 8.1 | 17.2 | 24.8 | 22.4 | 24.0 |
| LLaVA-FT | Vicuna | 7B | 10.4 | 21.0 | 25.8 | 24.0 | 8.2 | 21.1 | 20.7 | 20.2 | 21.9 |
| Retrieved Knowledge | | | | | | | | | | | |
| CLIP → PaLM [9] | PaLM | 540B | 12.5 | 27.7 | 21.7 | 15.6 | 17.8 | 21.3 | 17.7 | 14.9 | 15.2 |
| CLIP → FiD [9] | T5$_{\text{large}}$ | 660B | 12.3 | 23.4 | 23.9 | 18.9 | 13.8 | 15.2 | 20.5 | 17.6 | 18.2 |
| LLaVA-mR2AG | Vicuna | 7B | 40.6 | 31.8 | 44.7 | 40.2 | 40.3 | 25.3 | 43.9 | 39.8 | 40.1 |

Table 11: Complete results by question type on INFOSEEK$_{\text{Wikidata}}$.

| Model | LLM | Params | Unseen Question: Time | Unseen Question: Num | Unseen Question: String | Unseen Question: Avg | Unseen Entity: Time | Unseen Entity: Num | Unseen Entity: String | Unseen Entity: Avg | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Without External Knowledge | | | | | | | | | | | |
| LLaVA [30] | Vicuna | 7B | 9.3 | 12.0 | 9.1 | 9.9 | 6.9 | 11.5 | 7.5 | 8.4 | 9.1 |
| LLaVA-FT | Vicuna | 7B | 11.3 | 14.8 | 9.3 | 10.8 | 7.6 | 14.1 | 7.6 | 9.0 | 9.8 |
| Retrieved Knowledge | | | | | | | | | | | |
| LLaVA-mRAG | Vicuna | 7B | 29.2 | 22.4 | 33.3 | 30.3 | 27.5 | 21.0 | 31.8 | 29.2 | 29.8 |
| LLaVA-SFR | Vicuna | 7B | 18.9 | 19.7 | 21.3 | 20.8 | 15.2 | 18.6 | 19.6 | 19.1 | 19.9 |
| LLaVA-mR2AG | Vicuna | 7B | 41.6 | 29.3 | 42.5 | 39.1 | 38.3 | 25.7 | 41.7 | 38.0 | 38.6 |
| Oracle Knowledge | | | | | | | | | | | |
| LLaVA-mRAG | Vicuna | 7B | 71.6 | 34.3 | 61.8 | 55.3 | 58.4 | 35.2 | 62.2 | 56.1 | 55.7 |
| LLaVA-SFR | Vicuna | 7B | 71.0 | 45.8 | 59.5 | 56.6 | 59.4 | 43.2 | 59.0 | 55.6 | 56.1 |
| LLaVA-mR2AG | Vicuna | 7B | 70.6 | 43.2 | 62.9 | 58.3 | 67.0 | 41.2 | 62.3 | 57.9 | 58.1 |
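
The Overall column in Tables 9 to 11 appears consistent with the harmonic mean of the Unseen Question and Unseen Entity averages, which is the aggregate used by the INFOSEEK benchmark. The sketch below reproduces a few Overall values from Table 9 under that assumption; the function name `overall_score` is hypothetical, and small differences in other rows can arise because the reported averages are already rounded.

```python
def overall_score(unseen_question: float, unseen_entity: float) -> float:
    """Harmonic mean of the Unseen Question and Unseen Entity averages."""
    total = unseen_question + unseen_entity
    if total == 0:
        return 0.0  # avoid division by zero when both splits score 0
    return 2 * unseen_question * unseen_entity / total


# Rows taken from Table 9: (Unseen Question Avg, Unseen Entity Avg)
rows = {
    "PaLI-X": (12.9, 9.3),
    "LLaVA-FT": (14.3, 10.4),
    "LLaVA-mR2AG (retrieved)": (30.2, 27.5),
}
for name, (uq, ue) in rows.items():
    print(f"{name}: {overall_score(uq, ue):.1f}")
```

Because the harmonic mean penalizes imbalance between the two splits, a model cannot obtain a high Overall score by performing well on unseen questions alone.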

9 Qualitative Results and Visualizations

Figure 4 qualitatively demonstrates the effectiveness of the mR2AG framework. It highlights the framework’s ability to accurately assess the relevance between retrieved content and user queries, precisely locate evidence paragraphs within the retrieved documents, and generate reliable answers. Figure 5 provides additional visualization results, illustrating that mR2AG effectively handles various types of visual entities and question types, further validating the design’s effectiveness and reliability. The last column of Figure 5 presents error cases, in which the primary issue is the failure to retrieve the relevant Wikipedia entities for the visual content.

Figure 4: Qualitative results showing the effectiveness of the mR2AG framework. The first row shows results from the INFOSEEK dataset, while the second row shows results from Enc-VQA.
Figure 5: Additional visualization results. The first row shows examples from INFOSEEK; the second row shows examples from Enc-VQA, covering single-hop and multi-answer questions. The last column presents incorrect answers.