Query Transformations: Improving Context Retrieval Quality
Preface
This article follows the setup from the previous post: a proxy server is started locally, and OpenAI's ChatOpenAI is used as the chat client, with its requests pointed at our local server.
Abstract
In traditional RAG retrieval, a single query often suffers from insufficient semantic expression, a narrow angle of phrasing, and incomplete information coverage. These limitations can cause the vector retrieval stage to recall inaccurate context, which in turn degrades the quality of the final generated answer. Query Transformations techniques address this by expanding and restructuring the original query from multiple semantic perspectives, improving its expressiveness and semantic coverage, compensating for missing information, and providing the model with richer, more relevant context that steers the generated answer closer to the user's intent.
Multi Query
The core idea of Multi Query is to have a large language model generate several sub-questions that are related to the original query but approach it from different perspectives. Each sub-question is then used for vector retrieval, matching relevant Documents in the vector store by similarity. Finally, the Documents from all sub-questions are deduplicated and merged, and passed to the text generation model as Context to produce the final answer.
Fig. 1 Multi Query framework diagram
Implementation Details
The Multi Query section walks through the complete implementation; the later Query Transformations strategies only cover the context retrieval part.
Index
Import the text data and load it with a Loader, then split it into chunks with RecursiveCharacterTextSplitter.
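Before running the snippets below, a minimal set of imports is assumed for the whole walkthrough; the module paths are a best guess for a recent langchain / langchain_community release, so adjust them to your installed version:
import os
from operator import itemgetter
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import ModelScopeEmbeddings
from langchain_community.vectorstores import Chroma
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, FewShotChatMessagePromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.load import dumps, loads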
file_path = os.path.abspath('../docs/PatchTST.pdf')
loader = PyPDFLoader(file_path=file_path, extract_images=True)
pages = loader.load_and_split()
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=50,
    chunk_overlap=10
)
docs = text_splitter.split_documents(pages)
Define the Embedding and Text Generation models. As mentioned above, an OpenAI-compatible service is deployed with vllm and accessed through the ChatOpenAI chat client.
# General-purpose English text embedding model
embedding = ModelScopeEmbeddings(model_id='iic/nlp_corom_sentence-embedding_english-base')
vectorstore = Chroma.from_documents(documents=docs, embedding=embedding, collection_name='ModelScope')
retriever = vectorstore.as_retriever()
# Deploy an OpenAI-compatible server with vllm, then use ChatOpenAI
os.environ['VLLM_USE_MODELSCOPE'] = 'True'
chat = ChatOpenAI(
    model='Qwen/Qwen3-0.6B',
    openai_api_key="EMPTY",
    openai_api_base='http://localhost:8000/v1',
    stop=['<|im_end|>'],
    temperature=0
)
Prompt
This is one of the key ideas of Multi Query: the prompt asks the large language model to generate several sub-questions from different perspectives in order to enrich the context.
# Multi Query: generate several related questions from different perspectives based on the original query
template = """You are an AI language model assistant. Your task is to generate three
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)
generate_queries = (
    prompt_perspectives
    | chat
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)
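As an optional sanity check (illustrative only, the exact output depends on the model), generate_queries can be invoked on its own to inspect the alternative questions before any retrieval happens:
# Illustrative: peek at the generated sub-questions
sub_questions = generate_queries.invoke({"question": "What is the purpose of time series forecasting?"})
print(sub_questions)  # a list of alternative phrasings, one per line of the LLM output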
Deduplication and Aggregation
After generate_queries produces the sub-questions, each one is used to retrieve its related Documents from the vector store. The Documents are then deduplicated to ensure uniqueness and, together with the original Question, passed to the text generation model as Context to produce the final answer.
def get_unique_union(documents: list[list]):
    # Serialize: convert each Document to a string
    flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
    # Deduplicate
    unique_docs = list(set(flattened_docs))
    # Deserialize: convert the strings back to Documents
    return [loads(doc) for doc in unique_docs]
template = """Answer the following question based on this context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
retrieval_chain = generate_queries | retriever.map() | get_unique_union
chat_chain = {
    'context': retrieval_chain,
    'question': itemgetter('question')
} | prompt | chat | StrOutputParser()
question = "What is the purpose of time series forecasting?"
ans = chat_chain.invoke({'question': question})
print(ans)
RAG Fusion
RAG Fusion also uses an LLM to generate several related sub-questions from different perspectives based on the original query, and then retrieves matching Documents from the vector store. Unlike Multi Query, RAG Fusion applies the Reciprocal Rank Fusion (RRF) algorithm to compute a contribution score for every document and sorts the documents by score in descending order before passing them to the text generation model as Context.
Fig. 2 RAG Fusion framework diagram
Prompt
Use a prompt to have the LLM generate several related sub-questions.
# Fusion: generate several related questions from different perspectives based on the input question
template = """You are a helpful assistant that generates multiple search queries based on a single input query. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""
prompt_rag_fusion = ChatPromptTemplate.from_template(template)
# Suppose the model generates three related questions.
# StrOutputParser converts the LLM output into a single string, and (lambda x: x.split('\n'))
# splits it on newlines into a list of queries; after retriever.map(), each query returns its
# own list of Documents, so the downstream step receives a list[list].
generate_queries = (
    prompt_rag_fusion
    | chat
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)
RRF
The core of this algorithm is to sort documents by their contribution score in descending order. The score is accumulated from a Document's rank in each retrieval result list: a Document that appears in several result lists accumulates a higher score even if its similarity in any single retrieval is modest, because recurrence across queries better reflects a consensus.
Note: each query retrieves the top-k Documents from the vector store by similarity.
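Concretely, the RRF score of a document is the sum of 1 / (k + rank) over every ranked list in which it appears (zero-based rank, k = 60 by default). A small illustrative calculation, with hypothetical ranks, shows why recurrence wins:
# Illustrative only: hypothetical ranks for two documents, k = 60
k = 60
score_a = 1 / (0 + k) + 1 / (2 + k)  # doc A: rank 0 in one list, rank 2 in another
score_b = 1 / (0 + k)                # doc B: rank 0 in a single list only
print(score_a > score_b)             # True: doc A recurs across lists, so it ranks higher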
def reciprocal_rank_fusion(results: list[list], k=60):
    """ Reciprocal_rank_fusion that takes multiple lists of ranked documents
        and an optional parameter k used in the RRF formula """
    fused_scores = {}
    # The more ranked lists a document appears in, the more its score accumulates and the higher it ranks.
    for docs in results:
        for rank, doc in enumerate(docs):
            # Serialize: convert the Document to a string
            doc_str = dumps(doc)
            # Initialize the score
            if doc_str not in fused_scores:
                fused_scores[doc_str] = 0
            # Update the document's score, RRF: 1 / (rank + k)
            fused_scores[doc_str] += 1 / (rank + k)
    # Deserialize the strings back to Documents and sort by score in descending order,
    # so documents with higher contributions come first
    reranked_results = [
        (loads(doc), score)
        for doc, score in sorted(fused_scores.items(), key=lambda x: x[1], reverse=True)
    ]
    return reranked_results
Text Generation
retrieval_chain_rag_fusion = generate_queries | retriever.map() | reciprocal_rank_fusion
question = "What is the purpose of time series forecasting?"
template = """Answer the following question based on this context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
chat_chain = {
    'context': retrieval_chain_rag_fusion,
    'question': itemgetter('question')
} | prompt | chat | StrOutputParser()
ans = chat_chain.invoke({'question': question})
print(ans)
Decomposition
Decomposition likewise generates several sub-questions from the original query. Each sub-question is answered with the previous Questions and Answers passed to the LLM as additional context, so every answer builds on the history of earlier questions and answers and draws on a richer Context. Each answer refines and improves on the previous one, eventually yielding an answer that closely matches the original query.
Fig. 3 Decomposition framework diagram
Prompt & sub-questions
Consistent with the above, several related sub-questions are generated from the original query.
# Decomposition: generate several related questions from different perspectives based on the input question
template = """You are a helpful assistant that generates multiple sub-questions related to an input question. \n
The goal is to break down the input into a set of sub-problems / sub-questions that can be answered in isolation. \n
Generate multiple search queries related to: {question} \n
Output (3 queries):"""
prompt_rag_decomposition = ChatPromptTemplate.from_template(template)
generate_queries = (
    prompt_rag_decomposition
    | chat
    | StrOutputParser()
    | (lambda x: x.split("\n"))
)
question = "What is the purpose of time series forecasting?"
questions = generate_queries.invoke({'question': question})
print(questions)
Answer Recursively Template
Compared with the earlier prompts, the Decomposition template adds a q_a_pairs variable that stores the previous conversation history to enrich the context.
template = """Here is the question you need to answer:
\n --- \n {question} \n --- \n
Here is any available background question + answer pairs:
\n --- \n {q_a_pairs} \n --- \n
Here is additional context relevant to the question:
\n --- \n {context} \n --- \n
Use the above context and any background question + answer pairs to answer the question: \n {question}
"""
prompt = ChatPromptTemplate.from_template(template)
Text Generation
After each answer, the Question and Answer are appended to the q_a_pairs variable, which is then passed to the LLM as additional context for the next sub-question.
def format_qa_pair(question, answer):
    """Format Q and A pair"""
    formatted_string = ""
    formatted_string += f"Question: {question}\nAnswer: {answer}\n\n"
    # Strip leading/trailing whitespace
    return formatted_string.strip()
q_a_pairs = ""
for q in questions:
    rag_chain = (
        {"context": itemgetter("question") | retriever,
         "question": itemgetter("question"),
         "q_a_pairs": itemgetter("q_a_pairs")}
        | prompt
        | chat
        | StrOutputParser())
    answer = rag_chain.invoke({"question": q, "q_a_pairs": q_a_pairs})
    print(answer)
    q_a_pair = format_qa_pair(q, answer)
    q_a_pairs = q_a_pairs + "\n---\n" + q_a_pair
Answer Individually
Answer Individually also generates several sub-questions from the original query. For each sub-question, similar documents are retrieved from the vector store and used as Context for the LLM to answer that sub-question. The resulting answers are merged into a single string in Q-A format, which is passed to the LLM as Context to produce the final answer.
Fig. 4 Answer Individually framework diagram
Sub-questions
From the original query, several sub-questions with different perspectives are generated. Vector retrieval is then performed for each sub-question and the results serve as its context; the language model generates an answer, and both the question and the answer are recorded. The result is two lists: the sub-questions and their corresponding answers.
def retrieve_and_rag(question, prompt_rag, sub_question_generator_chain):
    """RAG on each sub-question"""
    # Use our decomposition chain to generate the sub-questions
    sub_questions = sub_question_generator_chain.invoke({"question": question})
    # Initialize a list to hold RAG chain results
    rag_results = []
    for sub_question in sub_questions:
        # Retrieve documents for each sub-question
        retrieved_docs = retriever.get_relevant_documents(sub_question)
        # Use retrieved documents and sub-question in RAG chain
        answer = (prompt_rag | chat | StrOutputParser()).invoke({
            "context": retrieved_docs,
            "question": sub_question
        })
        rag_results.append(answer)
    return rag_results, sub_questions
question = "What is the purpose of time series forecasting?"
template = """
You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise.
Question: {question}
Context: {context}
"""
prompt_rag = ChatPromptTemplate.from_template(template)
# Run retrieval and RAG over each sub-question
answers, questions = retrieve_and_rag(question, prompt_rag, generate_queries)
Merge
Merge the Questions and Answers into a single Context with rich semantic information.
def format_qa_pairs(questions, answers):
    """Format Q and A pairs"""
    formatted_string = ""
    for i, (question, answer) in enumerate(zip(questions, answers), start=1):
        formatted_string += f"Question {i}: {question}\nAnswer {i}: {answer}\n\n"
    # Strip whitespace (spaces, newlines, tabs) from the beginning and end of the string
    return formatted_string.strip()
context = format_qa_pairs(questions, answers)
Text Generation
# Prompt
template = """Here is a set of Q+A pairs:
{context}
Use these to synthesize an answer to the question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
rag_chain = (
    prompt
    | chat
    | StrOutputParser()
)
answer = rag_chain.invoke({"context": context, "question": question})
print(answer)
Step Back
The original query is often too specific, which can cause vector retrieval to match only literally similar content while missing more valuable concept-level information. Step Back bridges the gap between user intent and document semantics by lifting the query up one level of abstraction, aiming to elicit "broader" information that covers more potentially relevant context.
For example:
Original query: What is the purpose of time series forecasting?
Step Back: What is time series forecasting?
Fig. 5 Step Back
Few Shot Examples
The examples variable provides a few demonstrations whose purpose is to teach the model, via few-shot examples, how to turn a specific question into a more abstract step-back question with a broader background.
examples = [
    {
        "input": "How does PatchTST improve time series forecasting?",
        "output": "What is PatchTST?",
    },
    {
        "input": "What is the purpose of time series forecasting?",
        "output": "What is time series forecasting?",
    },
]
# We now transform these to example messages
example_prompt = ChatPromptTemplate.from_messages(
    [
        ("human", "{input}"),
        ("ai", "{output}"),
    ]
)
few_shot_prompt = FewShotChatMessagePromptTemplate(
    example_prompt=example_prompt,
    examples=examples,
)
prompt = ChatPromptTemplate.from_messages(
    [
        (
            "system",
            """You are an expert at world knowledge. Your task is to step back and paraphrase a question to a more generic step-back question, which is easier to answer. Here are a few examples:""",
        ),
        # Few shot examples
        few_shot_prompt,
        # New question
        ("user", "{question}"),
    ]
)
generate_queries_step_back = prompt | chat | StrOutputParser()
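As a quick illustrative check (the exact wording depends on the model), the chain can be invoked directly to see the step-back rewrite before any retrieval:
# Illustrative: generate the step-back question for the running example
question = "What is the purpose of time series forecasting?"
step_back_question = generate_queries_step_back.invoke({"question": question})
print(step_back_question)  # expected to be something like "What is time series forecasting?"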
Text Generation
For the generated Step-Back Query, related Documents are retrieved from the vector store and passed to the LLM as Context to produce the final answer.
response_prompt_template = """You are an expert of world knowledge. I am going to ask you a question. Your response should be comprehensive and not contradicted with the following context if they are relevant. Otherwise, ignore them if they are not relevant.
# {normal_context}
# {step_back_context}
# Original Question: {question}
# Answer:"""
response_prompt = ChatPromptTemplate.from_template(response_prompt_template)
chain = (
    {
        # Retrieve context using the normal question
        "normal_context": itemgetter('question') | retriever,
        # Retrieve context using the step-back question
        "step_back_context": generate_queries_step_back | retriever,
        # Pass on the question
        "question": itemgetter('question'),
    }
    | response_prompt
    | chat
    | StrOutputParser()
)
question = "What is the purpose of time series forecasting?"
answer = chain.invoke({"question": question})
print(answer)
HyDE
HyDE is another abstraction-based approach: it generates a hypothetical answer document and then uses that document for vector retrieval, improving the relevance and semantic coverage of the retrieved context.
Fig. 6 HyDE framework diagram
Hypothetical Document
An LLM is used to generate a scientific-paper-style passage answering the question; vector retrieval is then performed based on this passage, and the results are passed to the LLM as Context.
template = """Please write a scientific paper passage to answer the question
Question: {question}
Passage:"""
prompt_hyde = ChatPromptTemplate.from_template(template)
generate_docs_for_retrieval = (
    prompt_hyde | chat | StrOutputParser()
)
retrieval_chain = generate_docs_for_retrieval | retriever
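Optionally (illustrative only), the intermediate hypothetical passage can be inspected before running the full retrieval chain:
# Illustrative: look at the hypothetical passage generated for retrieval
question = "What is the purpose of time series forecasting?"
hypothetical_passage = generate_docs_for_retrieval.invoke({"question": question})
print(hypothetical_passage)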
Text Generation
# RAG
template = """Answer the following question based on this context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
final_rag_chain = (
    prompt
    | chat
    | StrOutputParser()
)
question = "What is the purpose of time series forecasting?"
answer = final_rag_chain.invoke({
    "context": retrieval_chain.invoke({"question": question}),
    "question": question
})
print(answer)
GitHub
https://github.com/FranzLiszt-1847/LLM
References
[1] https://github.com/langchain-ai/rag-from-scratch/blob/main/rag_from_scratch_5_to_9.ipynb