SecurityJanuary 25, 20264 min read

What Can Go Wrong with AI Agents

AI agents fail in ways traditional software doesn't. Data leaks, compliance violations, unauthorized actions. Here's what to watch for.

Alan ZabihiCo-founder & CEO
Share:

AI agents fail in ways traditional software doesn't.

When your agent reads data, follows instructions, calls tools, and produces outputs, it creates failure modes that your system prompt can't prevent. These failures happen even when your system "works".

Here's what can go wrong.

Data leaks

Your AI agent has access to context: customer data, internal documents, API keys, conversation history. When it responds, any of that context can end up in the output.

A user asks a reasonable question. The agent answers helpfully, including an API key it found in the context. Or it references another customer's data. Or it exposes internal pricing that was never meant to be shared.

The agent didn't malfunction. It did exactly what it was designed to do: be helpful with the information available. The failure is that "helpful" and "safe" aren't the same thing.

What this looks like in practice:

  • Customer PII appearing in responses to other users
  • Credentials or API keys included in agent outputs
  • Internal business context leaking into external conversations
  • RAG pipelines surfacing confidential documents

Compliance violations

Your AI agent generates text. That text represents your company. When the agent says something that violates policy, regulations, or brand guidelines, you own it.

This isn't about the agent being malicious. It's about the agent not understanding the boundaries. It gives medical advice when it shouldn't. It makes promises your legal team would never approve. It states something as fact that's actually a compliance violation in your industry.

The model doesn't know your policies. It knows language patterns. The gap between those two is where compliance failures happen.

What this looks like in practice:

  • Unauthorized medical, legal, or financial advice
  • Statements that violate industry regulations
  • Brand-damaging language or tone
  • Claims that misrepresent your product or service

Unauthorized actions

AI agents don't just generate text. They call tools, hit APIs, access databases, and take actions in the real world. When those actions happen without proper authorization, you have a problem.

An attacker embeds instructions in an email, document, or user input. The agent interprets those instructions as legitimate and acts on them. It forwards sensitive data to an external address. It executes a database query it shouldn't. It triggers a workflow that bypasses your normal approval process.

The agent followed instructions. They just weren't your instructions.

What this looks like in practice:

  • Tool calls triggered by malicious inputs
  • Database queries that access unauthorized data
  • API calls that exfiltrate information
  • Workflows executed without human approval

Why your system prompt isn't enough

Most teams start their AI security strategy by adding instructions to the system prompt: "Never reveal secrets. Always follow company policies. Don't take actions without approval".

That might help with UX, but as a security boundary it's paper-thin.

A system prompt is just another input. It isn't enforced cryptographically or via sandboxing. It competes with other instructions in the context window. It behaves non-deterministically across calls and model versions.

When an attacker crafts a prompt that sounds like a legitimate request, or embeds instructions in a document your agent retrieves, the model has no built-in way to distinguish "trusted policy" from "malicious instruction". Everything gets processed together.

The gap

The gap isn't between your intentions and your implementation. The gap is between what your AI can do and what you can prove about it.

You built something that works. Now you need to prove it won't fail your users.

That's what red team testing is for. We attack your production system to find the failures before your users do. Data leaks. Compliance violations. Unauthorized actions. You get findings, evidence, and remediation guidance.

Book a call to discuss your AI agent's attack surface.

Join our newsletter

We'll share announcements and content regarding AI safety.