Red TeamingDecember 9, 20255 min read

Red Teaming AI Agents: What We Learned From 50 Assessments

After red teaming 50 AI agents across different companies, industries, and setups, we've identified critical patterns that teams need to understand. Here's what actually matters when securing AI agents in production.

Ismail PelaseyedCo-founder & CTO
Share:

After red teaming about 50 AI agents across different companies, industries, and setups, I wanted to document the patterns we've seen. Many teams are making similar mistakes, and it's useful to lay them all out.

Every Agent Is Its Own Thing

When we started, I thought we could build a universal test suite. Run it against any agent, get a report, done. This turned out to be completely wrong.

The problem is that no two agents are the same. You have the base model (GPT-4, Claude, Llama, whatever), but then you have the harness around it, the tools it can call, the data it has access to, the business logic, the regulatory environment. All of this matters. A healthcare scheduling agent and an IT helpdesk bot might both use the same underlying model, but the attack surface is totally different.

The healthcare agent? Biggest risk was indirect prompt injection through patient forms. People submit appointment requests, the agent reads them, and if you put instructions in there the agent just... follows them. The IT bot was different. It had access to a ticketing system and we could craft ticket descriptions that would trick it into doing admin stuff it shouldn't do.

So yeah. No universal test suite. You have to actually understand what the agent does, what it connects to, what data flows through it. Then you design tests for that specific thing.

Pre-Prod Evals Are Kind of a Lie

This is the one that keeps coming up. Teams will show us their eval results. "Look, we ran all these benchmarks, the agent passes everything." And then we find issues in production within like an hour.

Why? Because sandbox testing is testing a different system. You have synthetic data, mocked integrations, controlled inputs. Production has real database connections, real APIs, real users doing weird things that nobody anticipated. The attack surface is just different.

So we always test as a real user. Not through some admin backdoor, not with special access. We use the agent exactly like a customer would. Or exactly like an attacker would.

One example: agent passes all safety benchmarks. Looks great. But in production we found it would leak API keys if you asked questions in a specific sequence. That sequence never existed in the eval dataset because it depended on conversation history that only happens in real usage.

Another one: guardrails work fine in testing, fail under load. The safety checks had timeouts, and when the system got slow during peak traffic, they'd just... not run. Fall back to permissive defaults. An attacker who figured this out could just wait for busy hours.

Pre-prod testing isn't useless. It catches stuff. But it's not enough. You need to test in production conditions or you're going to miss the things that actually matter.

Automation Is Hard

Doing this manually doesn't scale. 50 agents with human testers doing everything by hand would take forever. So you want to automate. But it's genuinely difficult.

The reason is that agents come in totally different forms. Some are chatbots. Text in, text out. Some are voice agents on phone lines. Some are these new browser-use agents that literally navigate websites, click buttons, fill out forms.

Each type needs different testing approaches. Chatbot prompt injection is different from voice agent exploitation (turns out you can do interesting things with phonetic patterns that confuse speech-to-text). Browser agents are a whole other thing. What happens if you put malicious instructions on a webpage the agent is supposed to interact with?

You can't just build one tool and call it done. You need testing that can interact with agents via API, through chat widgets, over phone lines, through browser sessions. And the agent landscape keeps changing. New frameworks, new modalities, new deployment patterns.

It's doable. But it's a real engineering problem, not something you solve with a quick script.

Bottom Line

AI agents are not normal software. They make decisions. They have access to data and tools. They take actions. And they do this across text, voice, browser, whatever.

Securing them is a different game. Treat every agent as unique. Test in production. And if you want to do this at scale, be prepared to invest in automation that actually matches how agents get deployed in the real world.

That's basically it.

Join our newsletter

We'll share announcements and content regarding AI safety.