
The Moat of AI Agents: Context Engineering

This article explains context engineering—the key edge for AI agents—and outlines retrieval, isolation, pruning, and compression strategies.

Author: Wang Chen

The evolution of technical terminology is not merely a change in wording; it represents a shift in thinking paradigms. The new term context engineering resonates within the industry because it reflects the growing complexity of agents and the corresponding shift in coping strategies; it is a collective response to the algorithmic and engineering challenges faced in practice, especially those around vertical, domain-specific agents.

[Image source: Baoyu@X]

Existing large models are already very intelligent. However, even the smartest person struggles to deliver satisfying results without understanding the context of the task at hand. Two products might do exactly the same thing, yet one feels magical while the other seems like a cheap demo. What makes the difference? The construction of context engineering.

1. Starting from a Scenario: Feeling the Magic of Context Engineering

Scenario: You are the product manager of an agent product, and you receive a private message from the R&D team on DingTalk: "I want to confirm something: does the new import function only support CSV? We need to start writing the interface on our side."

A regular smart assistant might directly draft a reply for you: "Yes, currently it only supports CSV; it may be expanded later." On the surface this seems correct, but it ignores details such as the current stage of the project, upstream and downstream dependencies, tone and style, and team consensus, which can easily lead to misunderstanding or rework.

In contrast, an agent equipped with "contextual awareness" would proactively search for:

Project Status: According to the project plan, the import feature will enter the development stage this week.

Requirement Document: The design clearly states that V1 supports CSV and JSON, but the latter will be launched a week later.

Team Atmosphere: The R&D side is short-staffed and worries that scope changes will affect progress.

Task History: A previous rework was caused by unclear requirement details and was just reviewed in a retrospective.

Personalized Tone Setting: Direct, clear details, reducing asynchronous exchanges.

The final generated message might be: "Currently, V1 is planned to support CSV and JSON, but JSON won't be able to interface until next week. You can go ahead and work on CSV in the meantime; I'll add the interface format to the requirement list shortly."

Where's the Magic?

It is not because the model's algorithm is better, but because it understands:

● The current task planning

● Past communication pitfalls within the team

● The other party's work status and concerns

● The real-time status of documents/knowledge bases

This is precisely the magic of "context engineering": feed the model sufficiently structured information, and you receive more natural, controllable, and satisfying outputs. Context engineering designed this way makes agents think about tasks more like humans do.

2. The Evolution from Prompts to Prompt Engineering to Context Engineering

Prompts are the magic pen in the hands of users who excel at wielding large models.

Concrete prompts yield the most satisfying outputs possible. While we teach large models to understand us, we are also learning how to be understood; each prompt is a mirror reflecting our grasp of "understanding" itself. For example, prompt master Li Jigang has shared a large number of high-quality prompts. Let's look at an example:

[Image: an example prompt shared by Li Jigang]

However, as user needs evolve and complex tasks emerge, such as multi-turn dialogue, multi-role collaboration, deep research, and user-state tracking, model applications move from short commands to long processes and from static calls to dynamic interactions. It becomes apparent that what truly determines output quality is the systematic design of the entire input structure, not just the user or system prompts themselves. Thus context engineering came into being. It is not merely a complement to prompt engineering; it concerns the organization of the whole context: where information comes from, how it is structured, and how it is dynamically scheduled. This marks a fundamental shift in methodology around the core goal of "how to obtain better outputs."

For instance, consider the conversation between a product manager and an engineer at the beginning of this article. A high-quality agent does not simply let the large model answer the user's question; through context engineering, it helps the model acquire more structured inputs before responding, including project status, requirement documents, task history, and even team atmosphere, so that it better understands the current task plan, the team's past communication pitfalls, the other party's work status and concerns, and the real-time state of documents and knowledge bases. Context engineering thus becomes a structured information pool, a thinking workstation that operates before the model produces output. Think about it: when two people perform the same task, beyond their respective expertise and skills, what matters most is their understanding of the task's context, right?
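To make this concrete, here is a minimal sketch of how such a structured information pool might be assembled before the model is called. Every data source below is a hypothetical stand-in for a real integration (project tracker, document store, chat history, style settings):

```python
# A minimal sketch of structured context assembly before the model call.
# Every data source here is a hypothetical stand-in for a real integration
# (project tracker, document store, chat history, style settings).

def build_context(question: str) -> str:
    sources = {
        "Project status": "Import feature enters development this week.",
        "Requirement doc": "V1 supports CSV and JSON; JSON ships one week later.",
        "Task history": "A previous rework was caused by ambiguous requirements.",
        "Tone guide": "Direct, concrete details, minimize asynchronous exchanges.",
    }
    blocks = [f"[{name}]\n{content}" for name, content in sources.items()]
    blocks.append(f"[Question]\n{question}")
    return "\n\n".join(blocks)

print(build_context("Does the new import function only support CSV?"))
```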

We are entering a more structured era.

3. Context Engineering Is Not the Same as Context

Building context sounds as simple as filling information into a prompt, so why do we need a dedicated engineering discipline? (Note that this engineering also covers the model's reinforcement learning; the common strategies discussed in Chapter 5 mention specific cases.) In fact, anyone with experience in multi-turn interaction, system collaboration, or information scheduling knows that information can fail not only by being absent but also by being confused, conflicting, or incoherent.

We have previously made large models smarter through external information interfaces such as RAG, Function Calling, and MCP. However, it has become clear that too many tools carry consequences: more supplementary information and longer contexts do not necessarily produce better responses. Context overload may cause agents to fail in unexpected ways. Context can become poisonous, distracting, confusing, or contradictory, which is particularly problematic for agents that rely on context to gather information, integrate findings, and coordinate actions. This has also somewhat dampened enthusiasm for RAG, Function Calling, and MCP. For example, when your agent interfaces with dozens of MCP servers, managing tool information becomes a new engineering challenge (Nacos MCP Router and Higress's Toolset are different technical means of addressing it).

Drew Breunig [1] classifies common context failures into four types, which help us further understand why "context engineering is not equivalent to context."

Context Poisoning

In the past, when training and evaluating large models, researchers long focused on single-turn, fully specified task scenarios. In real use, however, users often describe their needs incompletely and supplement them gradually across multi-turn dialogue, which increases the likelihood of erroneous information. When erroneous information, such as a user typo or an abnormal tool output, is written into the context, the model takes it as fact and references it repeatedly, stubbornly pursuing impossible or irrelevant goals, with subsequent decisions building on the error and drifting further off course.

It's similar to how the internet has increased our efficiency and range of information access while also making us more easily misled by false information. When we mistake false information for truth and share it with others, it becomes part of our subconscious and may be referenced in later discussions and reasoning.

Context Distraction

This refers to contexts that grow so long, or contain so much redundant information, that the model's attention is diluted and it loses focus on what it learned during training, degrading output quality. Although current large models support contexts at the million-token level, the Gemini 2.5 technical report [2] noted that as the context grows well beyond 100,000 tokens, agents tend to favor repeating actions from their long history rather than synthesizing novel plans. We all have first-hand experience of something similar: cramming a huge pile of mock questions right before an exam can scramble the knowledge accumulated so far and hurt exam performance.

According to a report by Databricks [3], most models' performance declines beyond a certain context size: Llama-3.1-405b starts to decline after 32k tokens, GPT-4-0125-preview after 64k tokens, and only a few models maintain consistent long-context RAG performance across all datasets.

[Image: Databricks chart of long-context RAG performance across models]

Context Confusion

This occurs when the model uses superfluous content from the context and generates low-quality responses. For example, when an agent is configured with 16 MCP servers, even if the user's request relates to only 2 of them, the large model may scan all 16, potentially pulling unrelated fragments into its reasoning and producing off-topic or low-quality responses.

The screenshot below is the function-calling leaderboard released by Berkeley in June this year [4], a benchmark for assessing how well models respond when using external tool calls. It shows that when large models introduce function calls or tool use, almost all of them perform worse than on plain text-generation tasks, and accuracy is significantly lower in multi-function-call scenarios than with single function calls.

[Image: Berkeley Function-Calling Leaderboard]

Context Clash

This refers to newly accumulated information in the context conflicting with information from external tools or with the model's trained knowledge, making outputs unstable. We encounter this in real life too: if a ride-hailing driver runs two navigation apps and they present different routes to the same destination, the driver faces a dilemma and picks one at random.

Overall, context is both the magic pen that improves outputs and, when information piles up, the source of context poisoning, context distraction, context confusion, and context clash. Constructing context engineering to ensure the quality of the delivered context is therefore key to building high-quality agents.

Context engineering involves questions such as: Where does the context come from? What should be retained? What should be discarded? Should it be compressed, and how? Is isolation necessary? Who writes it? Who stitches it together? These questions together define the scope of context engineering.

4. Building Context Engineering Starts with Orchestration Design

Currently, there is no standard methodology for building agents that are smarter than generic large models. However, insights shared by leading practitioners offer some inspiration, at least about how to start. For example, a recent technical blog published by Cognition [5] is a good reference. It outlines four orchestration patterns for agents handling complex long tasks and compares their output reliability.

[Image: Cognition's four orchestration patterns for long-running agents]

1. Parallel Execution, No Context Sharing

The main task is broken into multiple sub-tasks, each executed independently by a different sub-agent, with no shared context or interaction between them. The advantages are simple implementation, easy parallelization, more economical token consumption, and full use of different agents' specialist capabilities. However, if any sub-task goes wrong, the integration phase becomes difficult and there is no remediation mechanism. It is like a jigsaw puzzle where each piece is found by a different person, but no one knows what the final picture looks like, and in the end the pieces do not fit together.

2. Parallel Execution, With Context Sharing

Before execution, all sub-tasks can access a shared context, but they still run independently with no interaction between them. This is more reliable than the first pattern, with a more unified understanding of the task, but output conflicts or lack of coordination can still occur. For instance, in preparing a dish, the procurement, chopping, and cooking roles might all read the same menu; each knows its part, but without communication they execute on their own interpretations, and the final dish suffers.

3. Serial Execution, Gradually Building Context

Tasks are completed serially by multiple sub-agents, with each agent's output becoming the next one's input, forming a "task relay" pattern. The advantage is that each sub-task sees everything produced before it, giving a more unified understanding of the task. However, on long tasks the serially accumulated context can grow rapidly, eventually exceeding the model's context window and overflowing. The same principle applies in engineering: the longer the pipeline you design, the higher the chance of cache overflow.

4. Serial Execution, Introducing Compression Models

This approach adds a dedicated "context compression model" to the serial process, condensing past dialogue and action history into a few key details and events for the next model to use, maintaining decision coherence. The advantage is that it breaks through context-window limits, enabling longer tasks and improving the agent's long-term memory. Doing it well is hard, however: it takes real effort to determine what counts as key information and to design a system that is good at this. If compression quality is low, details are lost and output stability suffers. It is like technology services work: one must understand both the product and the technology while also deeply grasping the client's business context; otherwise satisfying delivery is difficult.
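As a rough illustration of this fourth pattern, here is a minimal sketch; `call_llm` is a hypothetical placeholder for whatever chat-completion client you use, and the character budget is an arbitrary choice:

```python
# A sketch of pattern 4: serial execution with a compression step.
# `call_llm` is a hypothetical stand-in for any chat-completion client.

def call_llm(prompt: str) -> str:
    # Placeholder: replace with a real model call.
    return f"(model output for: {prompt[:40]}...)"

def compress(history: list[str], budget_chars: int = 2000) -> str:
    """Distill the running history into key facts once it grows too large."""
    joined = "\n---\n".join(history)
    if len(joined) <= budget_chars:
        return joined  # still small enough, pass through verbatim
    return call_llm(
        "Summarize the key decisions, facts, and open issues below "
        f"in under {budget_chars} characters:\n{joined}"
    )

def run_pipeline(task: str, subtasks: list[str]) -> str:
    history = [f"Overall task: {task}"]
    for sub in subtasks:
        context = compress(history)  # compressed memory, not the raw log
        output = call_llm(f"{context}\n\nNow complete this subtask: {sub}")
        history.append(f"Subtask: {sub}\nResult: {output}")
    return history[-1]
```

The key design choice is that each step receives a distilled summary of everything before it rather than the raw accumulated log, which is what keeps the pipeline within the context window on long tasks.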

5. Some Common Strategies in Context Engineering

In the four orchestration patterns described by Cognition, the root of every failure traces back to an unreasonable context strategy: insufficient sharing, information fragmentation, or uncontrolled redundancy. Herein lies the true mission of context engineering for complex long tasks: helping the system "see more clearly."

1. Intelligent Retrieval

Too many external tools can degrade large-model output; tools defined via MCP are a typical example. In this paper [6], the team found that once the number of tools exceeds 30, tool descriptions start to overlap and become easy to confuse, and beyond 100 tools the model is almost certain to fail. Selecting the right tool through intelligent retrieval is therefore crucial.

The team proposed RAG-MCP, which alleviates the burden of tool discovery with a retrieval-augmented generation framework. Before invoking the LLM, RAG-MCP uses semantic retrieval to identify the MCP servers most relevant to the given query from an external index; only the selected tool descriptions are passed to the model, significantly reducing prompt size and simplifying decision-making. Experiments in the paper show that RAG-MCP cuts prompt tokens by more than half and more than triples the accuracy of tool selection on benchmark tasks (43.13% versus a 13.62% baseline), enabling scalable and accurate tool integration for LLMs.

Nacos recently released its MCP Router, an open-source implementation of the RAG-MCP idea. Nacos MCP Router retrieves suitable MCP servers based on task descriptions and keywords, which large models then call.
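To make the retrieval idea concrete, here is a toy sketch of RAG-MCP-style tool selection. A real system would rank by embedding similarity; the word-overlap scorer and tool names below are stand-ins chosen only to keep the example self-contained:

```python
# A toy sketch of RAG-MCP-style tool retrieval. A real system would rank
# by embedding similarity; word overlap here is a self-contained stand-in.
# All tool names are hypothetical.

TOOLS = {
    "weather_lookup": "Get the current weather and forecast for a city.",
    "flight_search": "Search flight schedules and prices between two airports.",
    "unit_convert": "Convert values between measurement units.",
}

def score(query: str, description: str) -> float:
    q, d = set(query.lower().split()), set(description.lower().split())
    return len(q & d) / max(len(q), 1)

def select_tools(query: str, k: int = 1) -> list[str]:
    """Return only the top-k most relevant tool names; pass just their
    descriptions to the model instead of the full catalog."""
    ranked = sorted(TOOLS, key=lambda name: score(query, TOOLS[name]), reverse=True)
    return ranked[:k]

print(select_tools("what is the current weather in Hangzhou"))
# ['weather_lookup']
```

The point of the design is that the model only ever sees the top-k tool descriptions, never the full catalog, which is what keeps the prompt small as the number of tools grows.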

2. Context Isolation

Context isolation splits different tasks or sub-goals into independent threads or agents, letting each operate in its own "lane." Tasks are decomposed into smaller, more independent units, each with its own context, preventing the information interference and semantic conflicts that context sharing can cause.

Analogous techniques exist in traditional software engineering.

Anyone who has worked with microservices is familiar with "swimlanes." Imagine a multi-tenant microservice platform where every tenant accesses an order service. Without an isolation mechanism, a sudden surge of requests from one large client can overwhelm the whole system; with swimlanes, each client gets independent resources and interference is prevented. Similarly, in large-model applications, if multiple users or tasks share one context space, the model may confuse role instructions and task states, producing frequent hallucinations. An isolation structure addresses this well. The difference is that in microservice governance the swimlane controls access traffic, whereas in large-model applications it controls the flow of semantic information.
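A minimal sketch of the idea, with each lane holding its own isolated message history (all names are hypothetical):

```python
# A sketch of context isolation: each sub-task keeps its own message
# history (its "lane"), so instructions and state never leak across tasks.
from dataclasses import dataclass, field

@dataclass
class Lane:
    system: str  # role instruction scoped to this lane only
    messages: list = field(default_factory=list)

    def prompt_for(self, user_msg: str) -> list:
        """Build the isolated prompt this lane sends to the model;
        it never includes other lanes' messages."""
        self.messages.append({"role": "user", "content": user_msg})
        return [{"role": "system", "content": self.system}, *self.messages]

lanes = {
    "billing": Lane("You handle billing questions only."),
    "search": Lane("You handle document retrieval only."),
}
prompt = lanes["billing"].prompt_for("Why was I charged twice?")
print(prompt)  # contains only the billing lane's system prompt and history
```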

3. Context Pruning

Context pruning removes irrelevant, outdated, or redundant context segments to reduce the risk of context confusion. As a conversation or task progresses, context accumulates, but the model's context window is limited, and piling everything in causes two problems: diluted model attention and uncontrolled token costs.

This process is similar to managing storage on our phones: at first we keep every app and all our history, but when the phone slows down we start clearing space, deleting large downloaded files or unimportant chat histories in WeChat. The difference is that in large-model applications, context pruning is not a single instruction to execute: it requires cutting by importance score, removing irrelevant retrieval results by intent matching, and even using small models to identify redundant sentences. The quality of these strategies determines how effective pruning is.

This paper [7] describes the Provence method, which uses a large model for context pruning together with a pretrained reranking model (for rescoring) to create synthetic training targets, then fine-tunes the pretrained reranker on the synthetic data so that a single unified model can efficiently perform both context pruning and reranking. Provence removes sentences from context passages that are unrelated to the user's question, reducing context noise, speeding up generation, and working plug-and-play with any large model or retrieval system.
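As a toy illustration of score-based pruning (not the Provence model itself), the sketch below drops context segments whose relevance to the query falls under a threshold; word overlap stands in for an embedding or reranker score:

```python
# A toy sketch of score-based context pruning (not the Provence model).
# Segments scoring below the threshold are dropped; word overlap stands
# in for an embedding or reranker relevance score.

def relevance(query: str, segment: str) -> float:
    q, s = set(query.lower().split()), set(segment.lower().split())
    return len(q & s) / max(len(q), 1)

def prune(query: str, segments: list[str], threshold: float = 0.1) -> list[str]:
    return [seg for seg in segments if relevance(query, seg) >= threshold]

history = [
    "The import feature supports CSV in V1.",
    "Lunch is at noon on Fridays.",
    "JSON import ships one week after CSV.",
]
print(prune("which formats does the import feature support", history))
# keeps the two import-related segments, drops the lunch note
```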

4. Context Compression & Summarization

Unlike context pruning, compression is a more advanced engineering strategy. After pruning the context, we further compress what remains into expressions of higher semantic density to free up context space and let the model focus on truly critical information.

The mainstream compression methods include:

Extractive Compression: directly selecting key sentences or paragraphs from the original context and stitching them into a new prompt without rewriting. Reported lab results show compression ratios of up to 10× while maintaining nearly all model accuracy, with good performance across tasks (single-document QA, multi-document QA, summarization). Advantages: original wording is preserved, information is reliable, output quality is high, and execution is simple with little modification. The downside is that the compression ratio is limited by the original passage lengths, and an effective key-content identification mechanism is required (a toy sketch follows after this list).

Abstractive Compression: using a generative model (such as a summarization model) to condense the context into new text that replaces the original without preserving its sentences. Compression ratios are high, but performance is slightly worse (especially on long-horizon tasks); in multi-document QA scenarios, "query-aware summaries," i.e., summaries generated with the question in mind, can improve performance by about 10%. The advantages are high compression ratios and much shorter prompts; the drawbacks are susceptibility to information loss or distortion and dependence on the quality of the summarization model.

Token Pruning: removing redundant or low-value content token by token, keeping only key tokens without reordering sentence structure. It brings only marginal gains on summarization tasks and overall lags behind extractive compression. It is simple to implement and requires no additional text generation, but it is unstable and poorly suited to tasks that need semantic completeness, where complex semantic dependencies are hard to preserve.
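As a minimal sketch of the extractive approach, the function below keeps the highest-scoring sentences verbatim until a token budget is hit, then restores their original order. In practice the scores would come from a reranker or relevance model; here they are supplied by hand for illustration:

```python
# A minimal sketch of extractive compression: keep the highest-scoring
# sentences verbatim until a token budget is reached, then restore the
# original order. Scores are hand-supplied stand-ins for a reranker.

def extractive_compress(sentences: list[str], scores: list[float],
                        budget_tokens: int = 50) -> str:
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    kept, used = set(), 0
    for i in ranked:
        n = len(sentences[i].split())  # crude token estimate
        if used + n <= budget_tokens:
            kept.add(i)
            used += n
    return " ".join(sentences[i] for i in sorted(kept))

doc = [
    "V1 supports CSV and JSON.",
    "The office moved last year.",
    "JSON ships a week later.",
]
print(extractive_compress(doc, scores=[0.9, 0.1, 0.8], budget_tokens=12))
# "V1 supports CSV and JSON. JSON ships a week later."
```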

These four strategies do not exist in isolation; together they form a complete set of context engineering capabilities, from retrieval (input) to modularization (isolation) to cleaning (pruning) to distillation (compression), serving as the foundation for building large-model applications. Understanding and creatively applying them will become central to the competitiveness of agent builders: crafting the product means building its context engineering.

[1] https://www.dbreunig.com/

[2] https://storage.googleapis.com/deepmind-media/gemini/gemini_v2_5_report.pdf

[3] https://www.databricks.com/blog/long-context-rag-performance-llms

[4] https://gorilla.cs.berkeley.edu/leaderboard.html

[5] https://cognition.ai/blog/dont-build-multi-agents#a-theory-of-building-long-running-agents

[6] https://arxiv.org/abs/2505.03275

[7] https://huggingface.co/blog/nadiinchi/provence

If you want to learn more about Alibaba Cloud API Gateway (Higress), please visit: https://higress.ai/en/
