From Prompt Stuffing to Queryable Knowledge for Agents
As interest in agent memory, retrieval, and context efficiency grows, a practical pattern is emerging: keep raw sources outside the prompt, maintain a structured knowledge corpus, and let agents query it precisely through MCP when needed.
Introduction
There has been growing interest in how agents should use memory well: what belongs in model weights, what belongs in the context window, and what should live in external systems the model can query. This is partly about capability, but also about efficiency. Tokens are expensive, context is limited, and retrieval quality matters. One practical pattern is to keep raw sources outside the prompt, build a structured knowledge layer over them, and let the agent query that layer when needed.
For people building retrieval-augmented generation systems, agent memory systems, or MCP servers, the question is increasingly the same: how do you expose the right evidence with as little noise as possible?
The point is not to avoid context windows. The point is to use them for the highest-value material: the specific evidence and constraints relevant to the task at hand.
None of this is especially new. Retrieval-augmented generation, tool use, external memory, and long-context evaluation have all been pushing in roughly this direction for some time. What feels different now is that the surrounding tooling is making it easier to build this pattern cleanly.
That is the design direction we find most compelling.
Why This Matters
A common pressure in agent work is that, once the corpus gets large, the prompt starts carrying too many jobs at once. It holds instructions, retrieved passages, metadata, partial synthesis, and whatever evidence the model is supposed to reason over. Sometimes this works well enough. Sometimes it does not. And even when it works, it is often wasteful.
The waste is not only cost. It is attention.
Every extra paragraph in context is competing for the model's focus. Every noisy retrieval result makes it a little harder for the model to see the few things that actually matter. Every question that forces the system to rediscover the same facts from raw documents is doing work that probably should have happened earlier in the stack.
This is why memory and context design have become more important topics. The constraint is not simply "how much can fit?" It is "what is worth putting in front of the model at all?"
Weights, Context, and External Knowledge
A useful way to think about this is to separate three places knowledge can live.
One is in the model weights. This is the model's trained knowledge: broad, powerful, and often surprisingly capable. But it is also diffuse. It is not designed to be a precise record of a particular corpus, and it is not easy to update in a targeted way.
Another is the context window. This is where evidence becomes sharp. Jeff Dean made this point nicely in his conversation with Dwarkesh Patel and Noam Shazeer: information placed directly in the model's input is much clearer than knowledge that has been absorbed indirectly through training, because the model can attend to the exact text it is processing. That distinction matters a lot in practice. If context is where evidence becomes crisp, then context should be reserved for the best evidence, not filled with broad search noise.
The third place is external knowledge: systems the model can query when it needs something more exact, more current, more structured, or more traceable than what lives in weights alone.
That third layer is where the architecture gets interesting.
Long Context Helps, but It Does Not Solve Retrieval
It is easy to see why long context windows are attractive. They relax hard limits. They make it possible to include more source material. They let the model operate over larger inputs without immediate truncation.
But long context is not the same thing as reliable access to relevant information inside that context.
A good paper here is Lost in the Middle (TACL 2024), which showed that model performance can degrade depending on where relevant information appears in a long input. In particular, relevant material in the middle of long contexts can be harder for models to use well than material near the beginning or end. The larger point is simple: fitting information into context does not guarantee the model will use it robustly.
This does not make long context unimportant. It just means long context is not, by itself, a retrieval strategy.
Why Retrieval and External Memory Keep Coming Back
The case for explicit external memory has been around for a while. The original RAG paper from NeurIPS 2020 made a clean version of the argument: language models store a lot in parameters, but knowledge-intensive tasks benefit from access to explicit non-parametric memory. Retrieval improves factuality, specificity, and provenance.
That line of thinking has only become more relevant as models have become more agentic. Once an agent is expected to answer questions over a real body of documents, the architecture matters. The system needs some way to fetch exact information without pretending that everything important should already be in the weights, or that the right answer is always to paste a larger pile of text into the prompt.
Tool use points in the same direction. Toolformer (NeurIPS 2023) is useful here not because it solves the whole problem, but because it reflects the same underlying idea: models work better when they can call external systems instead of relying entirely on internal knowledge.
The Problem With "Just Search More"
Even once retrieval is in the loop, there is another question: retrieval of what shape?
This is where raw search starts to show its limits. Many useful agent questions are not really keyword-search questions. They depend on structure.
- show all statements after a certain date mentioning a particular concept
- compare how one speaker framed an issue across speeches and press conferences
- list documents linked to one entity, then narrow by source and subtype
- compare a document against the previous version
These are not just "find me some text" problems. They are structured retrieval problems. The answer depends on metadata, normalization, relationships, and exact filtering.
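To make the distinction concrete, here is a minimal sketch of structured retrieval in Python. The schema, field names, and the `ECB`/`Fed` sample data are illustrative assumptions, not a real corpus or API; the point is only that the queries above reduce to exact metadata filters rather than text matches:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Doc:
    doc_id: str
    entity: str        # canonical entity the document is linked to
    published: date
    source: str        # e.g. "speech" or "press_conference"
    text: str

def query(docs, *, entity=None, after=None, source=None):
    """Structured retrieval: filter on metadata before any text reaches the model."""
    out = docs
    if entity is not None:
        out = [d for d in out if d.entity == entity]
    if after is not None:
        out = [d for d in out if d.published > after]
    if source is not None:
        out = [d for d in out if d.source == source]
    return out

corpus = [
    Doc("d1", "ECB", date(2024, 3, 1), "speech", "..."),
    Doc("d2", "ECB", date(2024, 9, 1), "press_conference", "..."),
    Doc("d3", "Fed", date(2024, 9, 2), "speech", "..."),
]

# "statements after a certain date linked to a particular entity"
# becomes an exact filter instead of a keyword search:
hits = query(corpus, entity="ECB", after=date(2024, 6, 1))
print([d.doc_id for d in hits])  # ['d2']
```

Note that the model never sees `d1` or `d3`: the filtering happened entirely in the retrieval layer, on normalized metadata, before any tokens were spent.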
This is one reason there has been continued interest in knowledge graphs, memory systems, and maintained intermediate layers. Andrej Karpathy's recent LLM Wiki gist is a nice practitioner expression of the same pressure. His point is not that raw-document retrieval is useless. It is that many systems force the model to rediscover knowledge from scratch on every question, instead of letting that knowledge accumulate into a maintained layer that sits between raw sources and query-time reasoning.
That framing feels directionally right, and his original post is useful because it describes the workflow plainly instead of treating it like a grand theory.
Why Structure Matters
Once the corpus is represented as structured knowledge rather than a loose pile of documents, several things get easier.
- The first is precision. The system can query for exactly the slice it needs: entity, date range, source class, document subtype, or comparison target.
- The second is efficiency. Fewer irrelevant tokens need to be pulled into context.
- The third is composability. The same knowledge layer can serve multiple tasks: answering questions, generating pages, comparing versions, producing summaries, or supporting agent workflows.
- The fourth is traceability. It becomes much easier to say where something came from and why it was selected.
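Traceability in particular becomes cheap once each retrieved slice carries its own provenance. A sketch, with an entirely illustrative schema (the field names and the sample query string are assumptions, not a prescribed format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Claim:
    text: str          # the extracted statement itself
    source_doc: str    # document id the claim was extracted from
    span: tuple        # (start, end) character offsets in the source
    selected_by: str   # the structured query that surfaced it

claim = Claim(
    text="Rates were held at 4.0%.",
    source_doc="d2",
    span=(120, 145),
    selected_by="entity=ECB AND published>2024-06-01",
)

# Every item placed in context can answer "where did this come from,
# and why was it selected?" without re-reading the raw document:
print(f"{claim.text}  [from {claim.source_doc}, via {claim.selected_by}]")
```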
Recent work keeps reinforcing the importance of context quality here. Influence Guided Context Selection for Effective Retrieval-Augmented Generation (NeurIPS 2025 poster) argues that noisy or poor-quality retrieved context can significantly hurt RAG systems, and that better selection matters. That supports a practical intuition many people already have: more retrieval is not automatically better retrieval.
So the goal is not just to retrieve. It is to retrieve narrowly, with structure doing as much of the filtering work as possible before the model ever sees the result.
Why MCP Fits This Pattern
This is where MCP becomes useful.
MCP is not valuable because it magically makes a model smarter. It is valuable because it gives the model a clean way to access external systems. When the external system is a structured knowledge corpus, that access pattern becomes especially compelling.
For teams evaluating MCP for agent workflows, this is the practical shift: move from broad search and prompt stuffing toward exact, tool-mediated access to a structured knowledge corpus.
Instead of asking the model to reason over a large search dump, the model can issue a more precise query:
- search this corpus
- list matching items in a date range
- fetch this document
- diff this document against a previous version
- find the canonical entity and then enumerate related materials
That moves work out of the prompt and into the retrieval layer, where it usually belongs.
The context window still matters. But its role becomes narrower and more valuable. It is where the model reasons over selected evidence, not where it performs broad discovery over the entire corpus.
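The tool surface above can be sketched with plain Python functions; in an actual MCP server, each would be registered as a tool the agent calls over the protocol. The toy corpus, document ids, and version layout here are illustrative assumptions:

```python
import difflib

# Toy corpus keyed by document id; "versions" holds successive revisions.
CORPUS = {
    "d1": {"entity": "ECB", "versions": ["Rates held at 4.0%."]},
    "d2": {"entity": "ECB", "versions": ["Rates held at 4.0%.",
                                         "Rates cut to 3.75%."]},
}

def fetch_document(doc_id: str) -> str:
    """Fetch the latest version of one document."""
    return CORPUS[doc_id]["versions"][-1]

def diff_document(doc_id: str) -> str:
    """Diff the latest version against the previous one."""
    versions = CORPUS[doc_id]["versions"]
    if len(versions) < 2:
        return ""  # nothing to compare against
    old, new = versions[-2:]
    return "\n".join(difflib.unified_diff([old], [new], lineterm=""))

def related_documents(entity: str) -> list[str]:
    """Find the canonical entity, then enumerate related materials."""
    return [i for i, d in CORPUS.items() if d["entity"] == entity]

print(related_documents("ECB"))  # ['d1', 'd2']
print(diff_document("d2"))
```

Each call returns a small, exact result. The agent composes them as needed, and only the final selected evidence occupies the context window.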
Why Efficiency Changes the Quality of Work
There is also a more practical point here. Better retrieval architecture does not just save tokens. It changes what kinds of workflows are practical.
If an agent spends too much of its budget rediscovering facts, it has less budget left for comparison, evaluation, synthesis, and judgment. If retrieval is sharp and cheap, more of the system's effort can go toward the parts humans actually care about.
This is one reason papers like xRAG are interesting. The specific method is less important here than the broader signal: efficient grounding is becoming a first-class concern, not an implementation detail.
The same is true of more recent work on agent memory. Papers like A-Mem suggest that people are increasingly treating memory organization itself as part of the agent design problem, not as an afterthought.
That feels right as well. As agent workflows get more serious, memory stops being a convenience and starts becoming infrastructure.
Why We Are Building This Way
For us, the interesting part is not claiming a new paradigm. It is taking this design direction seriously in a real domain.
If the corpus is rich, the questions are precise, and the user wants traceable answers, then relying on weights alone is not enough. At the same time, stuffing larger and larger amounts of text into context is rarely the cleanest answer. The more useful pattern is to keep raw sources outside the prompt, maintain a structured knowledge layer over them, and expose that layer through a query interface the agent can use directly.
That is why a structured corpus exposed through MCP makes sense.
Not because context windows are unimportant. Because context windows are too valuable to waste on work that structure can do better upstream.
References
- Jeff Dean, interview with Dwarkesh Patel and Noam Shazeer
- Transcribed summary of Jeff Dean remarks
- Lewis et al., Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks (NeurIPS 2020)
- Schick et al., Toolformer: Language Models Can Teach Themselves to Use Tools (NeurIPS 2023)
- Liu et al., Lost in the Middle: How Language Models Use Long Contexts (TACL 2024)
- xRAG: Extreme Context Compression for Retrieval-Augmented Generation with One Token (NeurIPS 2024)
- Influence Guided Context Selection for Effective Retrieval-Augmented Generation (NeurIPS 2025 poster)
- A-Mem: Agentic Memory for LLM Agents (NeurIPS 2025 poster)
- Karpathy, LLM Wiki gist