Three years later: AI can (now) defend AI
In 2022, Simon Willison argued that "adding more AI" was the wrong fix for prompt injection and related failures. He was mostly right at the time. What people tried then were brittle ideas that either overblocked or were easy to trick. This post explains what has changed since, what has not, and why builders can now use AI to meaningfully defend their agents in production.
What 2022 got right
Two patterns dominated early attempts:
- Binary text classifiers. Small models or rules that labeled input as "malicious" or "not malicious." These broke on nuance and caused high false positives.
- General models playing guard. GPT-3 asked to act as a filter via a system prompt. Attackers could bypass it with the same tricks used against the target model.
Simon's criticism targeted these patterns. If you relied on them, you would miss real attacks or block legitimate requests, and you would never know when you were safe.
What has changed since 2022
Three shifts make today's approach different:
- Purpose-built defender models. Instead of repurposing a general model, you can now deploy small models trained only for defense tasks. They reason about intent and consequences, not just patterns. SuperagentLM is one example of this new class of defender models.
- Continuous updates from the wild. Defenders are retrained as new attack techniques appear, keeping pace with the landscape instead of freezing a snapshot in time.
- Reinforcement from real usage. On top of global updates, defenders adapt to customer traffic. They learn what is normal in a given deployment and what is risky in context, cutting false positives without opening holes.
This is not "GPT checking GPT" and it's not a static filter. It's a reasoning model that learns globally and locally, combined with runtime controls.
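To make "reasons about intent and consequences" concrete, here is a minimal sketch of the difference between a 2022-style binary filter and the structured verdict a purpose-built defender can return. The DefenderVerdict shape, its field names, and the example values are illustrative assumptions, not SuperagentLM's actual interface.

```python
from dataclasses import dataclass

def naive_filter(text: str) -> bool:
    # 2022-style guard: one boolean, no context, no explanation.
    return "ignore previous instructions" not in text.lower()

@dataclass
class DefenderVerdict:
    decision: str      # "allow", "block", or "escalate"
    intent: str        # what the input or action is trying to accomplish
    consequence: str   # what would happen if it were allowed to proceed
    rationale: str     # the defender's reasoning, kept for audit trails
    confidence: float  # how sure the model is, useful for tuning thresholds

# The kind of verdict a reasoning defender might return for a tool call
# that tries to exfiltrate data (all values invented for illustration).
example = DefenderVerdict(
    decision="block",
    intent="send the contents of the user's inbox to an external address",
    consequence="private data leaves the trust boundary",
    rationale="the email body was constructed from retrieved documents, "
              "not from anything the user asked for",
    confidence=0.93,
)
```

The exact fields matter less than the shape: the decision carries reasoning you can audit and tune against, which a binary label never does.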
How to reduce risk at runtime
A credible defense system for agents operates while they run. That means putting checks at the right points:
- Inputs are analyzed before they reach the model, stopping adversarial prompts that try to jailbreak or rewrite policy.
- Tool calls are scored for intent and consequence, with unsafe actions blocked, sandboxed, or escalated.
- Outputs are checked for leaks or unsafe code before they leave the system.
- Decisions are logged with traces so teams can audit, measure, and prove that unsafe behavior never reached production.
The result is a runtime defense layer — not a single filter, but a system where a reasoning model, policies, and observability work together to keep agents safe.
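As a sketch of how those four checkpoints compose, here is a minimal wrapper around a single agent step. The `agent` and `defender` objects and their method names (`plan`, `execute`, `check_input`, `score_tool_call`, `check_output`) are assumptions for illustration, not any specific product's API; the structure is the point.

```python
import logging

logging.basicConfig(level=logging.INFO)
audit = logging.getLogger("runtime-defense")  # every decision gets traced

def guarded_step(agent, defender, user_input: str) -> str:
    # 1. Analyze the input before it reaches the model.
    in_verdict = defender.check_input(user_input)
    audit.info("input decision=%s rationale=%s",
               in_verdict.decision, in_verdict.rationale)
    if in_verdict.decision == "block":
        return "Request refused by policy."

    # 2. Score each proposed tool call for intent and consequence.
    plan = agent.plan(user_input)
    for call in plan.tool_calls:
        tc_verdict = defender.score_tool_call(call)
        audit.info("tool=%s decision=%s", call.name, tc_verdict.decision)
        if tc_verdict.decision == "block":
            plan.drop(call)              # the unsafe action never executes
        elif tc_verdict.decision == "escalate":
            plan.require_approval(call)  # route to a human or a sandbox

    # 3. Check the output before it leaves the system.
    output = agent.execute(plan)
    out_verdict = defender.check_output(output)
    audit.info("output decision=%s", out_verdict.decision)
    if out_verdict.decision == "block":
        return "Response withheld: possible leak or unsafe content."

    # 4. Everything above was logged, so the run can be audited end to end.
    return output
```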
Is it 100 percent? No — but close enough to matter
These systems will never be 100 percent secure because they're probabilistic by nature, but combining reasoning with continuous learning gets you close. The comparison to deterministic defenses can also be misleading: even in traditional software, nothing has ever been truly 100 percent secure. If it had been, we wouldn't have a massive security industry built around patching, monitoring, and layered controls. AI security will follow the same path: not perfect, but strong enough to make agentic apps viable at scale.
What this means for builders
You don't need to rely on vibes in a system prompt. You can put a defender in the loop, wrap tool calls and outputs with checks, and see exactly what decisions were made. Start where the risk is highest, measure outcomes, and tighten over time.
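One hedged way to "start where the risk is highest" is to guard a single dangerous tool first and widen coverage as you measure outcomes. The decorator below assumes the same illustrative defender interface as the sketches above, and the `send_email` tool is a made-up example of a high-risk action.

```python
import functools

def guarded_tool(defender, tool_name: str):
    """Wrap one high-risk tool so every call is scored before it runs."""
    def wrap(fn):
        @functools.wraps(fn)
        def inner(*args, **kwargs):
            verdict = defender.score_tool_call(
                {"name": tool_name, "args": args, "kwargs": kwargs}
            )
            if verdict.decision != "allow":
                raise PermissionError(f"{tool_name} blocked: {verdict.rationale}")
            return fn(*args, **kwargs)
        return inner
    return wrap

# Usage sketch: guard the riskiest tool first, leave low-risk tools
# unwrapped, and expand coverage as you measure outcomes.
# @guarded_tool(defender, "send_email")
# def send_email(to: str, subject: str, body: str): ...
```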
This is the approach we've taken with Superagent: a small reasoning model, updated continuously, reinforced by real usage, and embedded in runtime controls. It doesn't claim perfection, but it does something more important: it reasons. Because attacks themselves are reasoning-based, defenses need to reason too; traditional software engineering guardrails are not enough in an adversarial environment. With a system that learns and adapts as it runs, AI can now defend AI.