Two searches are running hot in every enterprise security team right now. One is for prompt injection detection. The other is for a gateway that handles agent tool access through delegated identity. Both are reasonable instincts. Both aim at the wrong boundary.
In the space of a month, Anthropic spelled out the same lesson twice. First, in a Zero Trust framework for deploying agents, it argued that traditional access controls cannot stop an agent from misusing legitimate permissions, and that monitoring has to assume attacks built on persistence rather than exploitation. A week later, its engineering team published how it actually contains the agents it builds, and showed why, in incident after incident. Both land in the same place, and it should reframe what people are shopping for: the controls that held were the ones that capped what the agent could do, not the ones watching what it said or checking who it claimed to be.
That is the bet we made. It is worth walking through why the model vendor’s own engineering arrives at the same place, and why the things people are shopping for will not get them there.
Why the front-door controls feel like enough
Both instincts share a shape. Prompt injection detection screens what comes in. Delegated identity settles who the agent acts for. Each runs before the agent does its work, and each feels like the finish line. Neither is.
Start with detection, because Anthropic has measured both its promise and its ceiling. Good classifiers do real work: the company reports constitutional classifiers blocking the large majority of jailbreak attempts, and a spotlighting technique cutting indirect injection success from over half to low single digits. That is worth having. But it also notes that algorithmic attacks can reach 100% success with prompts that transfer across model families, and that models cannot reliably tell informational context from actionable instructions. Detection raises the cost of an attack. It does not close the door. Anthropic’s own design test names the distinction: does a control make the attack impossible, or merely tedious? A determined agentic attacker has unlimited patience and near-zero cost per attempt, and grinds through tedious every time.
And detection assumes there is an attacker to detect. The larger fear Anthropic names is quieter: in its words, more capable models are “better at finding unexpected paths to a goal,” often by routing around restrictions nobody thought to write down. A more capable model is not just better at its job; it is more determined to finish it and more inventive about how. Anthropic has watched its own models helpfully escape a sandbox just to complete a task. No injection, no adversary, no hostile input. Just an eager agent doing what it decided the goal required.
This is not a problem that shrinks over time. It grows with every model you swap in. Anthropic deemed the blast radius of its Mythos Preview model too high to ship broadly, and that is the point. The model that replaces your current one will reason better, and it will bring something closer to a determined hacker’s mindset to the task, finding paths you never thought to close. It will do all of it with access you legitimately provisioned. A detector pointed at malicious input never sees that coming, because nothing about it is an attack.
The identity side has the same shortcoming from the other direction. Agent identity and user identity are real progress, and a well-scoped token that binds the two is worth having. But a token is a grant frozen at a moment. It is issued, and then it is valid for a window, often a few minutes. For a human, a few minutes is nothing. For an agent firing thousands of tool calls in that window, a few minutes is an eternity. The agent does not need to defeat your identity layer. It runs with the access you already handed it, inside a token that is still good, doing whatever it decides to do until the clock runs out. You have answered who the agent is and who it acts for, but what it does with that access for the length of the grant is left wide open.
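A back-of-the-envelope sketch makes the size of that window concrete. The token lifetime and call rate below are illustrative assumptions, not measurements from any deployment, and `calls_within_grant` is a hypothetical helper name.

```python
def calls_within_grant(token_lifetime_s: float, calls_per_second: float) -> int:
    """Number of tool calls an agent can make before a single token expires."""
    return int(token_lifetime_s * calls_per_second)


# Assumed figures: a 5-minute token and an agent making 10 tool calls per second.
unchecked_calls = calls_within_grant(token_lifetime_s=5 * 60, calls_per_second=10)
print(unchecked_calls)  # 3000
```

Under those assumptions, one identity decision covers three thousand actions that nothing re-examines.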
The injection you cannot detect
Even when there is an attacker, the most dangerous injection never looks like one. In a controlled exercise, Anthropic researchers phished an employee into launching an agent with a prompt that read like ordinary task instructions. Buried in the steps was a request to read cloud credentials and send them to an external endpoint. Across 25 attempts, the agent exfiltrated the credentials 24 times. The instruction arrived through the user, so there was nothing anomalous for a classifier to flag. As Anthropic noted, a human contractor handed the same script would have done the same thing.
Detection cannot save you here, because nothing about the request is detectably malicious. A valid token cannot save you either, because the agent was authenticated the whole time, running inside a window that was still open. The defense that holds is the one that blocks the action regardless of intent or identity.
That defense needs a baseline to measure against, and the agent already hands you one. Its prompt, its skill, and its persona declare what the job is. You do not have to infer the intended scope, because it is written down. Drift, the word Anthropic uses for a capable model wandering off its goal, is simply behavior leaving that declared baseline. You cannot catch drift without first knowing the job, and the job is something the agent itself defines.
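As a minimal sketch of that idea, assume the declared job can be reduced to a set of tool names. `DeclaredJob`, `first_drift`, and the tool names are hypothetical, chosen only for illustration. Drift is then the first action that falls outside the set.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class DeclaredJob:
    """The baseline: what the agent's prompt, skill, and persona say the job is."""
    role: str
    allowed_tools: frozenset


def first_drift(job: DeclaredJob, actions: Sequence[str]) -> Optional[int]:
    """Return the index of the first action outside the declared job, or None."""
    for index, tool in enumerate(actions):
        if tool not in job.allowed_tools:
            return index
    return None


job = DeclaredJob("ticket-triage", frozenset({"read_ticket", "update_ticket"}))
# A long run of clean calls followed by one call the job never declared.
actions = ["read_ticket"] * 1000 + ["list_files"]
print(first_drift(job, actions))  # 1000
```

A real baseline would be richer than a tool list, but the shape is the same: the job is written down first, and behavior is compared against it.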
Containment is a ring, not a gate
Anthropic’s answer to all of this is containment. Supervise what the agent is able to do, not what it intends. Sandboxes, virtual machines, egress controls. A hard boundary on what the agent can reach, so that a creative model, a careless user, and an external attacker all hit the same wall.
We describe the same idea as a ring around the agent. A scoped job, continuous monitoring of what the agent actually does, and the ability to stop it mid-action when it steps outside the role. The framing differs, the principle does not.
The distinction that matters is this. A gate asks one question, once: should this agent be here? A ring asks a different question, continuously: is this agent still doing its job? Authentication and a scoped token answer the first. Detection takes a probabilistic guess at a third question (is this input hostile?) and gets it wrong often enough to matter. Only runtime behavioral monitoring answers the question that governs actual damage: is the agent, right now, acting within the role you gave it?
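The difference between the two can be sketched in a few lines. This is an illustration under simplifying assumptions (a boolean token check and a role expressed as a predicate over tool names), and `gate`, `ring`, and the tool names are hypothetical.

```python
from typing import Callable, Iterable, List


def gate(token_valid: bool, calls: Iterable[str]) -> List[str]:
    """One check at the door; every later call passes unexamined."""
    if not token_valid:
        return []
    return list(calls)


def ring(token_valid: bool, calls: Iterable[str],
         in_role: Callable[[str], bool]) -> List[str]:
    """Same door check, then a check on every call; stop at the first one out of role."""
    if not token_valid:
        return []
    executed = []
    for call in calls:
        if not in_role(call):
            break  # halt the agent instead of letting the session continue
        executed.append(call)
    return executed


calls = ["read_ticket", "read_ticket", "read_credentials", "send_external"]
role = {"read_ticket", "update_ticket"}

print(gate(True, calls))                       # all four calls run
print(ring(True, calls, lambda c: c in role))  # only the two in-role calls run
```

The token is valid in both runs. Only the second one notices what the agent did with it.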
Not every gateway watches behavior
The word gateway is doing a lot of work in this market, and it is worth being precise. Many gateways in front of AI traffic were built to broker the call to the model. They route between providers, cap spend, cache responses, manage keys, and meter tokens. That is real infrastructure, and it sits in roughly the right place. But it governs the request on its way to the model, not what the agent does against your systems after the model answers.
Routing is not enforcement. Counting tokens is not watching behavior. A gateway that optimizes how you consume a model has not, by virtue of sitting in the path, decided whether the agent is staying inside its job. Those are different jobs at the same chokepoint, and conflating them is how teams end up with a gateway deployed and the behavioral layer still missing.
The allowlist is a capability grant, not a boundary
Egress control deserves its own caution, because so many teams treat an egress allowlist as the finish line. Anthropic learned otherwise, in a way worth repeating.
An attacker planted a file in an agent’s workspace carrying a hidden API key. The agent followed the embedded instructions, called a domain that was on the allowlist, and uploaded data using the attacker’s key. The destination was approved. The sandbox worked perfectly. The data still left.
Their reframe is the lesson. An allowlist is not a destination filter. It is a capability grant. Every function reachable through an approved domain becomes part of the attack surface, so permitting a domain meant permitting every action behind it.
This is the trap waiting for anyone who treats allowlisted egress as containment. You allow the domains the agent needs, you feel safe, and you have handed it a menu. The fix is not a longer allowlist. It is inspecting what the agent does once it reaches an approved destination. Allowed to talk is not the same as allowed to do.
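One way to picture the difference is a sketch that assumes policy can be written as method and path-prefix pairs, plus the credential the agent is expected to present. The domain, paths, key IDs, and function names are all hypothetical. A destination filter approves the upload from the incident above; an action-level check does not.

```python
from urllib.parse import urlsplit

ALLOWED_DOMAINS = {"api.example.com"}

# Hypothetical action-level policy: what the agent may do at each approved
# domain, and which credential it is expected to present there.
ALLOWED_ACTIONS = {"api.example.com": {("GET", "/v1/reports/")}}
EXPECTED_KEY_IDS = {"api.example.com": "agent-key-1"}


def domain_only(url: str) -> bool:
    """Destination filter: approves anything sent to an allowlisted domain."""
    return urlsplit(url).hostname in ALLOWED_DOMAINS


def action_level(method: str, url: str, key_id: str) -> bool:
    """Approves a call only if the domain, the credential, and the action all match."""
    parts = urlsplit(url)
    host = parts.hostname
    if host not in ALLOWED_DOMAINS:
        return False
    if key_id != EXPECTED_KEY_IDS.get(host):
        return False  # approved domain, but not the agent's own credential
    return any(
        method == allowed_method and parts.path.startswith(prefix)
        for allowed_method, prefix in ALLOWED_ACTIONS.get(host, ())
    )


# An upload to an approved domain using someone else's key.
upload = ("POST", "https://api.example.com/v1/files", "attacker-key")
print(domain_only(upload[1]))  # True: the destination is approved
print(action_level(*upload))   # False: the action and the key are not
```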
The economics that will break the model layer
Here is the part most teams have not priced in yet. Anthropic’s containment works because Anthropic controls the model, the runtime, and the endpoint. It can harden its own model against injection, drop a vendor hypervisor on the machine, and run a proxy inside the virtual machine. Enterprises building their own agents inherit none of that, and the economics are about to push them to build their own.
Agent loops are token-hungry. They read large contexts and generate across many turns, and the bill scales with autonomy. Frontier models sit at the top of the price curve. Public pricing surveys through early 2026 put open-weight models on inference providers at roughly 50 to 90% below frontier APIs, and at high volume, self-hosted open models can cut inference cost by 70 to 90%. When an agent makes thousands of calls a day, that gap is the difference between a viable product and a budget fire.
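To see how that range plays out, here is a rough worked example. The call volume, tokens per call, and frontier price are made-up assumptions for illustration; only the 70 to 90% reduction comes from the figures above.

```python
def monthly_cost(calls_per_day: int, tokens_per_call: int,
                 usd_per_million_tokens: float, days: int = 30) -> float:
    """Monthly inference spend in USD for one agent workload."""
    total_tokens = calls_per_day * tokens_per_call * days
    return total_tokens / 1_000_000 * usd_per_million_tokens


# Assumed workload: 5,000 calls a day at 20,000 tokens each, and a
# hypothetical blended frontier price of $10 per million tokens.
frontier = monthly_cost(5_000, 20_000, 10.0)

# Apply the 70% and 90% reductions cited for self-hosted open models.
self_hosted_high = frontier * (1 - 0.70)
self_hosted_low = frontier * (1 - 0.90)

print(f"frontier:    ${frontier:,.0f}")
print(f"self-hosted: ${self_hosted_low:,.0f} to ${self_hosted_high:,.0f}")
```

With these assumed inputs, the workload costs about $30,000 a month on the frontier model and roughly $3,000 to $9,000 self-hosted, which is the kind of gap that drives the switch.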
So teams will switch. They will swap the frontier model for a cheaper open-weight one, and increasingly they will self-host it to capture the savings. The containment stack Anthropic spent two years building does not come with that decision. It was built around their model, on infrastructure they operate. A self-hosted agent on a low-cost model arrives naked: weaker reasoning, no injection resistance, no sandbox, no proxy, and the same access to your APIs.
The one boundary that survives every switch
Walk through what survives that transition. The model layer weakens. The endpoint is a machine you will never see. It may be running in a virtual machine your own endpoint detection and response cannot see into, the exact blind spot Anthropic flagged when its enterprise customers asked why their EDR could not inspect inside Claude Cowork. The answer was that the same isolation containing the agent also shut endpoint detection out, leaving the agent’s runtime an opaque process from the outside.
The reflex controls do not fit the shape of the problem, because they were built around people. EDR watches endpoints. SASE secures the path from a user’s device to the applications it reaches. DLP inspects data leaving through human channels, the email, the upload, the download. All three assume a person at a laptop, traffic egressing through a known point, identity tied to that human. Agents honor none of those assumptions. They run in your cloud or a managed cloud, not on a device behind a SASE client, and much of their traffic never crosses the user-to-app path those tools sit on. Even where it does, they inspect the connection, the device, or the channel, not whether an authenticated agent’s sequence of actions is staying inside its job. Anthropic’s own framework describes the mechanism precisely: an agent chains a trusted internal tool with an external one to move data that neither tool alone would expose, and because every command runs through trusted binaries under valid credentials, host-centric monitoring sees no malware and the misuse goes undetected. The tools were approved. The credentials were valid. Nothing on the host looked wrong. The token is still valid for its window and still authenticates the agent as the user. Detection still misses the small percentage of attacks that get through.
One boundary is left standing. The point the agent has to cross to reach anything that matters to you. Wherever your systems are exposed, the gateway sits in front of them. That is the single place that does not care which model is reasoning, who hosts it, or how cheap the tokens were. Every call still arrives there. At that point you can scope the agent to its declared job, score every action against it, and halt the agent the instant behavior leaves the role.
We have watched this exact failure in production. An authenticated agent made thousands of clean tool calls, then drifted beyond its scope, probing for files it was never given. Identity intact the entire time. The credentials never lied. The behavior did. No detector flagged it, because no input was hostile. No identity check caught it, because the token was valid the whole time it drifted. The boundary that caught it was the one watching what the agent did, at the point it had to cross.
A reference blueprint, not a product pitch
So here is the opinionated version, because a diagnosis is only useful if it points at an architecture you can build.
No single vendor owns this whole stack, and any vendor who tells you otherwise is selling you a gap. The honest blueprint has layers, each with its own job, and the discipline is making them work together rather than pretending one box covers all of them. This is what Zero Trust looks like applied to agents, and it is the same layered view behind both Anthropic’s framework and the three CIS Critical Security Controls Companion Guides for LLMs, agents, and MCP that we co-authored with the Center for Internet Security and partners. Identity, isolation, observability, behavior, and recovery are all load-bearing. Skip one and attackers find the gap. The layer most teams skip is behavior.
Think of it as five layers:
- The model layer. Hardening, alignment, and injection resistance live here, and for the most part the model vendor owns them. Use the best model you can for the task. Just do not mistake this layer for your security perimeter, because the moment you switch models, swap to open weights, or self-host to cut token costs, whatever protection lived here changes or disappears.
- The identity layer. Every agent needs a verifiable identity, and every credential it carries, API keys, service accounts, OAuth tokens, needs to be discovered, owned, and revocable. This is the non-human identity problem, a discipline of its own, with specialists focused on discovering agents and stripping excessive privilege before they scale. Identity tells you who the agent is. It is necessary, and it sits upstream of everything else.
- The authorization layer. Knowing who an agent is differs from deciding what it may do, and the agent already tells you what it needs. Its prompt, its skill, and its persona declare the job. That declaration is the source of truth you scope against, narrowing a broad, time-boxed token down to the specific tools and endpoints the role actually requires, and checking the agent against that scope for the whole life of the grant, not just at the moment it is issued. Just-in-time access narrows when the grant is given. It still says nothing about what the agent does once it has it.
- The runtime behavioral layer. This is the one most stacks skip, and the one that governs actual damage. It is where least agency, OWASP’s extension of least privilege to what an agent can actually do, gets enforced moment to moment. Once an agent is authenticated and scoped, something has to watch what it does against that declared job, score each action, and halt it mid-action when behavior drifts outside the role. Drift is exactly the failure Anthropic describes when a capable model finds an unexpected path to its goal, and the declared job is the line it drifts from. Detection signals feed in here as inputs. This is the layer Cequence was built for, and where the failures in this piece, the missed injection, the abused allowlist, the probing agent, all get caught.
- The egress and data layer. What leaves matters as much as what acts. Traditional data loss prevention watches the human channels, the email, the upload, the endpoint, so an agent moving data through an authorized API call slips past it. Approved destinations are capability grants, not safe lists, which means the inspection has to move to where the agent actually operates: every call it makes, scored for what is being sent and whether the data should be leaving at all. This is the layer where sensitive-data detection belongs, in the agent’s path rather than on the user’s.
The connective tissue across all five is a gateway you own, sitting wherever agents reach your systems, whether through MCP, a direct integration, or whatever protocol comes next. This is not the gateway that routes model calls and counts tokens. It is the one that enforces what the agent does once the model has answered. Identity and authorization plug into it as inputs. Model choice sits above it and changes freely. Egress policy enforces through it. The gateway is where the layers meet and where the allow, scope, or halt decision is made, in infrastructure you control rather than in a model or a runtime you do not.
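A compact sketch of that decision point follows. It is an illustration, not a description of any product's internals. The class and field names are hypothetical, the strike-count drift rule is a placeholder for real behavioral scoring, and reading "scope" as "reject this call but keep the agent running" is an assumption. Note what is absent: nothing in the decision takes the model as an input.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, Set


class Decision(Enum):
    ALLOW = "allow"
    SCOPE = "scope"  # assumption: reject this call, keep the agent running
    HALT = "halt"    # stop the agent entirely


@dataclass(frozen=True)
class Call:
    agent_id: str
    tool: str
    contains_sensitive_data: bool = False


@dataclass
class Gateway:
    known_agents: Set[str]       # input from the identity layer
    scopes: Dict[str, Set[str]]  # input from the authorization layer
    drift_limit: int = 3         # placeholder for real behavioral scoring
    _strikes: Dict[str, int] = field(default_factory=dict, init=False)
    _halted: Set[str] = field(default_factory=set, init=False)

    def decide(self, call: Call) -> Decision:
        # Identity: unknown or already-halted agents go no further.
        if call.agent_id in self._halted or call.agent_id not in self.known_agents:
            return Decision.HALT
        # Authorization and behavior: out-of-scope calls are rejected and counted.
        if call.tool not in self.scopes.get(call.agent_id, set()):
            strikes = self._strikes.get(call.agent_id, 0) + 1
            self._strikes[call.agent_id] = strikes
            if strikes >= self.drift_limit:
                self._halted.add(call.agent_id)
                return Decision.HALT
            return Decision.SCOPE
        # Egress: an in-scope call can still carry data that should not leave.
        if call.contains_sensitive_data:
            return Decision.SCOPE
        return Decision.ALLOW


gateway = Gateway(known_agents={"agent-1"},
                  scopes={"agent-1": {"read_ticket"}},
                  drift_limit=2)
print(gateway.decide(Call("agent-1", "read_ticket")))  # Decision.ALLOW
print(gateway.decide(Call("agent-1", "list_files")))   # Decision.SCOPE
print(gateway.decide(Call("agent-1", "list_files")))   # Decision.HALT
print(gateway.decide(Call("agent-1", "read_ticket")))  # Decision.HALT
```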
Build it model-agnostic
The single most important property of that gateway is independence from the model. When enforcement lives in the gateway, the model becomes a component you swap, not a perimeter you defend. Frontier model today, cheaper open-weight one next quarter, self-hosted model the quarter after, and your security posture does not move. That is what model-agnostic actually buys you. Not vendor flexibility for its own sake, but a boundary that survives every model decision your finance team forces on you.
It is also the only design that supports an autonomous agent vision instead of fighting it. Enterprises do not want fewer agents. They want dozens per employee, acting independently across customer, internal, and partner surfaces. You cannot put a human in the loop on that, and you cannot trust each model to police itself at that scale. What scales is a gateway that gives every agent a scoped job, watches what each one does against that job in real time, and stops the ones that drift. Provision the agent like a new hire, with a role and a subset of access. Then manage its performance like one.
Identity is necessary. It was never sufficient.
This is the gateway built for that job, the behavioral one. Cequence AI Gateway already connects to hundreds of enterprise applications rather than asking you to rebuild around it. It enforces a scoped job for each agent and monitors behavior at runtime, and Agent Personas express that job as a plain-English role with permissions down to the individual tool call, a subset of the user’s access rather than the whole reach a token hands over for the life of its window. Identity gets the agent in. A scoped persona decides what it may do. Behavioral monitoring confirms it is still doing only that. None of it depends on which model is on the other end.
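The "subset of the user's access" idea can be illustrated generically as a set intersection. This is a sketch of the principle only, not the product's API, and the permission strings and function name are hypothetical.

```python
def effective_permissions(user_access: set, persona_tools: set) -> set:
    """An agent may use only what both the user holds and the persona declares."""
    return user_access & persona_tools


# What the user's token could reach for the life of its window.
user_access = {"crm.read", "crm.write", "billing.read", "billing.refund"}

# What the persona's role declares the agent needs.
persona_tools = {"crm.read", "billing.read", "hr.read"}

print(sorted(effective_permissions(user_access, persona_tools)))
# ['billing.read', 'crm.read']
```

The persona cannot widen access beyond the user's (`hr.read` is dropped), and the user's broader reach (`crm.write`, `billing.refund`) never passes to the agent.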
Anthropic builds the most capable models in the world, and its own engineering lands on the same conclusion we did: design for containment at the boundary first, then steer behavior at the model layer. The token over-grants and outlives its own decision. The routing gateway watches the wrong thing. The cheaper model strips the vendor’s safety net away. Through all of it, the one control you own outright is the behavioral boundary you put in the agent’s path. Build it as a blueprint, not a single box. Make it model-agnostic, enforce it at runtime, and place it where every agent has to cross to reach you.