Don’t Just Build Agents, Build Memory-Augmented AI Agents
Insight Breakdown:
This piece argues that, whether you follow Anthropic's multi-agent coordination or Cognition's single-threaded consolidation, sophisticated memory management emerges as the fundamental determinant of agent reliability, believability, and capability. It marks the evolution from stateless AI applications toward truly intelligent, memory-augmented systems that learn and adapt over time.
AI agents are intelligent computational systems that can perceive their environment, make informed decisions, use tools, and, in some cases, maintain persistent memory across interactions, evolving beyond stateless chatbots toward autonomous action. Multi-agent systems coordinate multiple specialized agents to tackle complex tasks, like a research team where different agents handle searching, fact-checking, citations, and research synthesis.
Recently, two major players in the AI space released differing perspectives on how to build these systems. Anthropic released an insightful piece highlighting their learnings from building multi-agent systems for deep research use cases. Cognition also released a post titled "Don't Build Multi-Agents," which appears to contradict Anthropic's approach directly.
Two things stand out:
Both pieces are right
Yes, this sounds contradictory. But from working with customers building agents of all scales and sizes in production, we find that the use case and, in particular, the application mode are the key factors to consider when determining how to architect your agent(s).
Anthropic's multi-agent approach makes sense for deep research scenarios where sustained, comprehensive analysis across multiple domains over extended periods is required.
Cognition's single-agent approach is optimal for conversational agents or coding tasks where consistency and coherent decision-making are paramount. The application mode—whether research assistant, conversational agent, or coding assistant—fundamentally shapes the optimal memory architecture. Anthropic also highlights this point when discussing the downsides of multi-agent architectures.
For instance, most coding tasks involve fewer truly parallelizable tasks than research, and LLM agents are not yet great at coordinating and delegating to other agents in real time.
Anthropic, Building Multi-Agent Research System
Both pieces are saying the same thing
Memory is the foundational challenge that determines agent reliability, believability, and capability. Anthropic emphasizes sophisticated memory management techniques (compression, external storage, context handoffs) for multi-agent coordination. Cognition emphasizes context engineering and continuous memory flow to prevent the fragmentation that destroys agent reliability.
Both teams arrived at the same core insight: agents fail without robust memory management. Anthropic chose to solve memory distribution across multiple agents, while Cognition chose to solve memory consolidation within single agents.
The key takeaway from both pieces, for AI engineers or anyone developing an agentic platform, is this: don't just build agents, build memory-augmented AI agents.
With that out of the way, the rest of this piece distills the essential insights from both pieces and points to the memory management principles and design patterns we've observed among our customers building agents.
The key insights
If you are building your agentic platform from scratch, you can extract much value from Anthropic's approach to building multi-agent systems, particularly their sophisticated memory management principles, which are essential for effective agentic systems.
Their implementation reveals critical design considerations, including techniques to overcome context window limitations through compression, function calling, and storage functions that enable sustained reasoning across extended multi-agent interactions. These are foundational elements that any serious agentic platform must address from the architecture phase.
Key insights:
Agents are overthinkers
Multi-agent systems trade efficiency for capability
Systematic agent observation reveals failure patterns
Context windows remain insufficient for extended sessions
Context compression enables distributed memory management
Let's go a bit deeper into how these insights translate into practical implementation strategies.
Agents are overthinkers
Anthropic's researchers mention using explicit guidelines to steer agents toward allocating the right amount of resources (tool calls, sub-agent creation, etc.); without them, agents tend to overengineer solutions. Absent proper constraints, the agents would spawn excessive subagents for simple queries, conduct endless searches for nonexistent information, and apply complex multi-step processes to tasks requiring straightforward responses.
Explicit guidance for agent behavior isn't entirely new—system prompts and instructions are typical parameters in most agent frameworks. However, the key insight here goes deeper than traditional prompting approaches.
When agents are given access to resources such as data, tools, and the ability to create sub-agents, there needs to be explicit, unambiguous direction on how these resources are expected to be leveraged to address specific tasks. This goes beyond system prompts and instructions into resource allocation guidance, operational constraints, and decision-making boundaries that prevent agents from overengineering solutions or misusing available capabilities.
Take, for example, the OpenAI Agents SDK, which exposes several parameters for describing resource behavior to the agent. One is handoff_description, which specifies how a subagent should be leveraged in a multi-agent system built with the SDK. Another is the explicit tool_use_behavior argument, which, as the name suggests, describes how the agent should handle a tool's output. The key takeaway for AI engineers is that implementing a multi-agent system requires extensive thinking about which tools the agents are expected to leverage, which subagents belong in the system, and how resource utilization is communicated to the calling agent, as the sketch below illustrates.
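Here is a minimal sketch of how those two parameters are wired up, assuming the openai-agents Python package; the agent names, instructions, and stub tool are hypothetical, not Anthropic's or OpenAI's actual setup:

```python
# Minimal sketch, assuming the OpenAI Agents SDK (pip install openai-agents).
# Agent names, instructions, and the stub tool are illustrative only.
from agents import Agent, function_tool

@function_tool
def search_docs(query: str) -> str:
    """Search internal documentation for a query."""
    return f"Results for: {query}"  # stub implementation

# handoff_description tells the calling agent when this subagent is worth
# delegating to, which constrains unnecessary handoffs.
citation_agent = Agent(
    name="Citation Agent",
    handoff_description="Only for formatting citations of already-gathered sources.",
    instructions="Format the provided sources as citations. Do not do new research.",
)

# tool_use_behavior="stop_on_first_tool" treats the first tool result as the
# final output instead of letting the agent keep reasoning and re-searching.
search_agent = Agent(
    name="Search Agent",
    instructions="Answer with a single documentation search.",
    tools=[search_docs],
    tool_use_behavior="stop_on_first_tool",
)

coordinator = Agent(
    name="Research Coordinator",
    instructions="Delegate narrowly. Prefer direct answers over spawning subagents.",
    handoffs=[citation_agent],
)
```

Note how the constraints live in the resource descriptions themselves, not just the system prompt: the calling agent sees them at the point where it decides whether to delegate or call a tool.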
When implementing resource allocation constraints for your agents, consider that traditional approaches of managing multiple specialized databases (a vector DB for embeddings, a graph DB for relationships, a relational DB for structured data) compound the complexity problem and introduce tech stack sprawl, an anti-pattern for rapid AI innovation.
Multi-agent systems trade efficiency for capability
While multi-agent architectures can utilize more tokens and parallel processing for complex tasks, Anthropic found operational costs significantly higher due to coordination overhead, context management, and the computational expense of maintaining coherent state across multiple agents. In multi-agent systems, two heads may be better than one, but they are also considerably more expensive.
One thing we note here is that the use case in Anthropic's multi-agent system is deep research. This use case requires extensive exploration of resources, including dense research papers, websites, and documentation, to accumulate enough information to produce the final output, which is typically a 2,000+ word report on the user's starting prompt.
In other use cases, such as automated workflows with agents representing processes within the workflow, there might not be as much token consumption, especially if each process encapsulates deterministic steps such as database read and write operations and its output is an execution result consisting of a sentence or short summary.
The coordination overhead challenge becomes particularly acute when agents need to share state across different storage systems. Rather than managing complex data synchronization between specialized databases, MongoDB's native ACID compliance ensures that multi-agent handoffs maintain data integrity without external coordination mechanisms. This unified approach reduces both the computational overhead of distributed state management and the engineering complexity of maintaining consistency across multiple storage systems.
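To make that concrete, here is a minimal handoff sketch using the PyMongo driver. The collection names and document shapes are hypothetical, and a replica set is assumed, since MongoDB multi-document transactions require one:

```python
# Minimal sketch, assuming a MongoDB replica set and the pymongo driver.
# Collection names ("agent_state", "handoffs") are hypothetical.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/?replicaSet=rs0")
db = client["agents"]

def hand_off(from_agent: str, to_agent: str, context_summary: str) -> None:
    """Record a handoff and update both agents' state in one atomic transaction."""
    with client.start_session() as session:
        with session.start_transaction():
            db.handoffs.insert_one(
                {"from": from_agent, "to": to_agent, "summary": context_summary},
                session=session,
            )
            db.agent_state.update_one(
                {"agent": from_agent}, {"$set": {"status": "idle"}}, session=session
            )
            db.agent_state.update_one(
                {"agent": to_agent},
                {"$set": {"status": "active", "context": context_summary}},
                session=session,
            )
        # The transaction commits on exiting the block: all three writes
        # become visible together, or none do.
```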
Context compression enables distributed memory management
Beyond reducing inference costs, compression techniques allow multi-agent systems to maintain shared context across distributed agents. Anthropic's approach involves summarizing completed work phases and storing essential information in external memory before agents transition to new tasks. This, coupled with the insight that context windows remain insufficient for extended sessions, shows that prompt compression and compaction techniques are still relevant and useful in a world where LLMs have extensive context windows.
Even with a 200K token (approximately 150,000 words) capacity, Anthropic’s agents in multi-round conversations require sophisticated context management strategies, including compression, external memory offloading, and spawning fresh agents when limits are reached.
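A minimal compaction sketch under stated assumptions: tiktoken serves as a stand-in tokenizer (Anthropic's own tokenizer differs), summarize() is a helper you supply (for example, a cheap LLM call), and external_memory is a MongoDB collection:

```python
# Minimal compaction sketch. tiktoken is a stand-in tokenizer; summarize()
# and the external_memory collection are assumptions you supply yourself.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")
CONTEXT_BUDGET = 150_000  # leave headroom below a 200K-token window
KEEP_RECENT = 10          # always keep the most recent turns verbatim

def compact(messages: list[dict], summarize, external_memory) -> list[dict]:
    """Summarize and offload older messages once the token budget is exceeded."""
    total = sum(len(enc.encode(m["content"])) for m in messages)
    if total <= CONTEXT_BUDGET or len(messages) <= KEEP_RECENT:
        return messages
    head, tail = messages[:-KEEP_RECENT], messages[-KEEP_RECENT:]
    summary = summarize(head)  # compress the completed work phase
    ref = external_memory.insert_one({"summary": summary, "raw": head}).inserted_id
    # Replace old turns with a compact summary plus a reference to full detail.
    compacted = {"role": "system",
                 "content": f"Summary of earlier work (full record: {ref}): {summary}"}
    return [compacted] + tail
```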
We previously partnered with Andrew Ng and DeepLearning.AI on a course covering prompt compression techniques and retrieval-augmented generation (RAG) optimization.
Systematic agent observation reveals failure patterns
Systematic agent observation represents one of Anthropic's most practical insights. Essentially, rather than relying on guesswork (or vibes), the team built detailed simulations using identical production prompts and tools, then systematically observed step-by-step execution to identify specific failure modes. This phase of building an agentic system carries substantial operational cost.
From our perspective, working with customers building agents in production, this methodology addresses a critical gap most teams face:
understanding how your agents actually behave versus how you think they should behave
. Anthropic's approach immediately revealed concrete failure patterns that many of us have encountered but struggled to diagnose systematically.
Their observations uncovered agents overthinking simple tasks, as mentioned earlier, using verbose search queries that reduced effectiveness, and selecting inappropriate tools for specific contexts.
As they note in their piece: "This immediately revealed failure modes: agents continuing when they already had sufficient results, using overly verbose search queries, or selecting incorrect tools. Effective prompting relies on developing an accurate mental model of the agent."
The key insight here is moving beyond trial-and-error prompt engineering toward purposeful debugging. Instead of making assumptions about what should work, Anthropic demonstrates the value of systematic behavioral observation to identify the root causes of poor performance. This enables targeted prompt improvements based on actual evidence rather than intuition.
We find that gathering, tracking, and storing agent process memory serves a dual critical purpose: not only is it vital for agent context and task performance, but it also provides engineers with the essential data needed to evolve and maintain agentic systems over time. Agent memory and behavioral logging remain the most reliable methods for understanding system behavior patterns, debugging failures, and optimizing performance, regardless of whether you implement a single comprehensive agent or a system of specialized subagents collaborating to solve problems. MongoDB's flexible document model naturally accommodates the diverse logging requirements for both operational memory and engineering observability within a single, queryable system.
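As one illustration of that dual purpose, the sketch below logs each agent step as a document and then queries the same collection for a failure signature. The agent_traces collection, the step types, and the threshold of 20 tool calls are all hypothetical choices:

```python
# Minimal behavioral-logging sketch, assuming pymongo and a single
# "agent_traces" collection (the name is hypothetical).
from datetime import datetime, timezone
from pymongo import MongoClient

traces = MongoClient()["agents"]["agent_traces"]

def log_step(run_id: str, agent: str, step_type: str, payload: dict) -> None:
    """Append one agent step (tool call, handoff, LLM turn) as a document."""
    traces.insert_one({
        "run_id": run_id,
        "agent": agent,
        "type": step_type,   # e.g. "tool_call", "handoff", "llm_turn"
        "payload": payload,  # flexible schema: whatever the step produced
        "ts": datetime.now(timezone.utc),
    })

# The same collection serves engineering observability. For example, find
# runs with an unusually high tool-call count (the "overthinking" signature):
overthinkers = traces.aggregate([
    {"$match": {"type": "tool_call"}},
    {"$group": {"_id": "$run_id", "calls": {"$sum": 1}}},
    {"$match": {"calls": {"$gt": 20}}},
])
```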
One thing that would be interesting to learn from the Anthropic research team is what evaluation metrics they use. We've spoken extensively about evaluating LLMs in RAG pipelines, but what new agentic system evaluation metrics are developers working toward?
We are answering these questions ourselves and have partnered with Galileo, a key player in the AI stack whose focus is purely on evaluating RAG and agentic applications and making these systems reliable for production. Our learnings will be shared in an upcoming webinar taking place on July 17, 2025.
However, for anyone building agentic systems, this represents a shift in development methodology: building agents requires building the infrastructure to understand them, and sandbox environments might become a key component of the evaluation and observability stack for agents.
Advanced implementation patterns
Beyond the aforementioned core insights, Anthropic's research reveals several advanced patterns worth examining:
The Anthropic piece hints at the implementation of advanced retrieval mechanisms that go beyond vector-based similarity between query vectors and stored information. Their multi-agent architecture enables sub-agents to call tools (an approach also seen in MemGPT) to store their work in external systems, then pass lightweight references back to the coordinator, presumably unique identifiers for summarized memory components.
We generally emphasize the importance of hybrid retrieval to our customers and developers: combining multiple retrieval methods, such as using vector search to understand intent while simultaneously performing text search for specific product details. MongoDB's native support for vector similarity search and traditional indexing within a single system eliminates the need for complex reference management across multiple databases, simplifying the coordination mechanisms that Anthropic's multi-agent architecture requires.
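As a rough sketch of such hybrid retrieval on MongoDB Atlas, the pipeline below runs a $vectorSearch stage for intent and a $search text stage for exact terms, then fuses the two ranked lists with reciprocal rank fusion in application code. The index names, field paths, and embed() function are assumptions:

```python
# Minimal hybrid-retrieval sketch for MongoDB Atlas. "vector_index",
# "text_index", the field paths, and embed() are assumptions.
from pymongo import MongoClient

col = MongoClient()["catalog"]["products"]

def hybrid_search(query: str, embed, k: int = 10) -> list:
    # Semantic leg: vector similarity over stored embeddings.
    vector_hits = list(col.aggregate([
        {"$vectorSearch": {"index": "vector_index", "path": "embedding",
                           "queryVector": embed(query),
                           "numCandidates": 100, "limit": k}},
    ]))
    # Lexical leg: full-text search for exact product terms.
    text_hits = list(col.aggregate([
        {"$search": {"index": "text_index",
                     "text": {"query": query, "path": "description"}}},
        {"$limit": k},
    ]))
    # Reciprocal rank fusion: score each doc 1 / (60 + rank) per list.
    scores: dict = {}
    for hits in (vector_hits, text_hits):
        for rank, doc in enumerate(hits):
            scores.setdefault(doc["_id"], [0.0, doc])
            scores[doc["_id"]][0] += 1.0 / (60 + rank)
    ranked = sorted(scores.values(), key=lambda s: s[0], reverse=True)
    return [doc for _, doc in ranked[:k]]
```

Because both legs run against the same collection, there is no cross-database reference bookkeeping: a document's _id is valid in both result lists.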
The Anthropic team implements continuity in the agent execution process by establishing clear boundaries between task completion and summarizing the current phase before moving to the next task. This creates a scalable system where memory constraints don't bottleneck the research process, allowing for truly deep and comprehensive analysis that spans beyond what any single context window could accommodate.
In a multi-agent pipeline, each sub-agent produces partial results—intermediate summaries, tool outputs, extracted facts—and then hands them off into a shared "memory" database. Downstream agents then read those entries, append their analyses, and write updated records back. Because these handoffs happen in parallel, you must ensure that one agent's commit doesn't overwrite another's work and that a reader doesn't pick up a half-written summary.
Without atomic transactions and isolation guarantees, you risk:
Lost updates, where two agents load the same document, independently modify it, and then write back, silently discarding one agent's changes.
Dirty or non-repeatable reads, where an agent reads another's uncommitted or rolled-back write, leading to decisions based on phantom data.
Coordinating these handoffs purely in application code would force you to build locking layers or distributed consensus, which quickly becomes a brittle, error-prone web of external orchestrators. Instead, you want your database to provide those guarantees natively, so that each read-modify-write cycle appears to execute in isolation and either fully succeeds or fully rolls back. MongoDB's ACID compliance becomes crucial here, ensuring that these boundary transitions maintain data integrity across multi-agent operations without requiring external coordination mechanisms that could introduce failure points.
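For the per-document case, one way to get that isolation (sketched below, with a hypothetical shared_memory collection and document shape) is an optimistic version check: single-document updates in MongoDB are atomic, so filtering on a version field turns the read-modify-write into a safe compare-and-swap that directly prevents the lost-update scenario above.

```python
# Minimal lost-update guard via optimistic concurrency, assuming pymongo.
# The "shared_memory" collection and its document shape are hypothetical.
from pymongo import MongoClient, ReturnDocument

mem = MongoClient()["agents"]["shared_memory"]

def append_analysis(doc_id, agent: str, analysis: str):
    """Read-modify-write that retries instead of clobbering concurrent work."""
    while True:
        doc = mem.find_one({"_id": doc_id})
        updated = mem.find_one_and_update(
            # The version filter only matches if nobody committed in between.
            {"_id": doc_id, "version": doc["version"]},
            {"$push": {"analyses": {"agent": agent, "text": analysis}},
             "$inc": {"version": 1}},
            return_document=ReturnDocument.AFTER,
        )
        if updated is not None:
            return updated  # committed; another agent's write was not lost
        # Otherwise another agent won the race: re-read and retry.
```

For handoffs that span several records, the multi-document transactions shown earlier provide the same guarantee across documents.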
Application mode is crucial when discussing memory implementation. In Anthropic's case, the application functions as a research assistant, while in other implementations, like Cognition's approach, the application mode is conversational. This distinction significantly influences how agents operate and manage memory based on their specific application contexts. Through our internal work and customer engagements, we extend this insight to suggest that application mode affects not only agent architecture choices but also the distinct memory types used in the architecture.
AI agents need augmented memory
Anthropic’s research makes one thing abundantly clear: context window is not all you need. This extends to the key point that memory and agent engineering are two sides of the same coin. Reliable, believable, and truly capable agents depend on robust, persistent memory systems that can store, retrieve, and update knowledge over long, complex workflows.
As the AI ecosystem continues to innovate on memory mechanisms, mastering sophisticated context and memory management approaches will be the key differentiator for the next generation of successful agentic applications. Looking ahead, we see "memory engineering" or "memory management" emerging as a key specialization within AI engineering, focused on building the foundational infrastructure that lets agents remember, reason, and collaborate at scale.
For hands-on guidance on memory management, check out our webinar on YouTube, which covers essential concepts and proven techniques for building memory-augmented agents.
Head over to the MongoDB AI Learning Hub to learn how to build and deploy AI applications with MongoDB.
July 9, 2025