Research · March 24, 2026 · 5 min read

Frontier models miss 57% of threats in agent context

We ran 485 real artifacts through Claude 4.6 Opus with a security-focused system prompt. The model missed 57% of the threats brin had already identified. Here's the full breakdown.

Alan Zabihi, Co-founder & CEO

Most agent builders assume the model itself will catch malicious content. We wanted to know if that's actually true.

We built brin-bench to measure the gap. 485 real artifacts from brin's scanning database, tested against Claude 4.6 Opus with a strong security system prompt. The model saw the same raw content a coding agent would encounter in production: packages, web pages, skills, repositories, contributors.

The model missed 57.3% of the threats brin had already identified. 106 out of 185 flagged artifacts went undetected.

The benchmark and full dataset are open-source at github.com/superagent-ai/brin-bench.

What we measured

Each artifact gets its full content (HTML, registry metadata, skill definitions, READMEs, contributor profiles) and a security-focused system prompt. The model classifies each artifact as safe or dangerous based on content alone. That's what a coding agent does today.
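The loop above can be sketched as a small harness. This is a hypothetical shape, not brin-bench's actual API: the function names, prompt text, and artifact fields here are illustrative, and the model call is abstracted behind a callable so the harness stays testable.

```python
# Hypothetical evaluation loop: each artifact's raw content goes to the
# model under a security system prompt, and the reply is reduced to a
# binary safe/dangerous verdict. Names are illustrative, not brin-bench's
# real interface.

SYSTEM_PROMPT = (
    "You are a security reviewer. Given the raw content of an artifact "
    "(package metadata, web page, skill definition, README, or contributor "
    "profile), answer DANGEROUS if it poses a threat, otherwise SAFE."
)

def classify_artifacts(artifacts, ask_model):
    """Run every artifact through the model and record a boolean verdict.

    `ask_model(system, content) -> str` abstracts the actual API call, so
    any client (or a stub in tests) can be plugged in.
    """
    verdicts = {}
    for artifact in artifacts:
        reply = ask_model(SYSTEM_PROMPT, artifact["content"])
        verdicts[artifact["id"]] = reply.strip().upper().startswith("DANGEROUS")
    return verdicts
```

The key constraint the benchmark imposes is visible here: the model sees only `artifact["content"]`, nothing else.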

brin's own verdicts are the ground truth. We compare what the model flagged against what brin flagged, and count the misses.

Three metrics, reported overall and per category:

  • Model coverage: of the artifacts brin flagged, what percentage did the model also flag?
  • Model gap: what the model missed, broken down by which brin signal drove the detection.
  • False positive rate: the percentage of brin-safe artifacts the model incorrectly flagged.
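The three metrics reduce to straightforward counting. A minimal sketch, assuming each artifact record carries a brin verdict and a model verdict as booleans (True = flagged as dangerous) plus the brin signal that drove the detection; the field names are assumptions, not brin-bench's schema.

```python
# Sketch of the three metrics. Each result is a dict with "brin" and
# "model" boolean verdicts and a "signal" label (e.g. "graph", "identity").
# Field names are illustrative.

def coverage(results):
    """Of artifacts brin flagged, the fraction the model also flagged."""
    flagged = [r for r in results if r["brin"]]
    return sum(1 for r in flagged if r["model"]) / len(flagged)

def false_positive_rate(results):
    """Fraction of brin-safe artifacts the model incorrectly flagged."""
    safe = [r for r in results if not r["brin"]]
    return sum(1 for r in safe if r["model"]) / len(safe)

def gap_by_signal(results):
    """Model misses, counted by which brin signal drove the detection."""
    misses = {}
    for r in results:
        if r["brin"] and not r["model"]:
            misses[r["signal"]] = misses.get(r["signal"], 0) + 1
    return misses
```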

Results by category

| Category | brin flagged | Model caught | Model missed |
| --- | --- | --- | --- |
| Repositories | 18 | 11.1% (2) | 88.9% (16) |
| Packages | 7 | 28.6% (2) | 71.4% (5) |
| Web pages | 50 | 30.0% (15) | 70.0% (35) |
| Domains | 50 | 34.0% (17) | 66.0% (33) |
| Contributors | 10 | 40.0% (4) | 60.0% (6) |
| Skills | 50 | 78.0% (39) | 22.0% (11) |

Skills had the lowest miss rate because the threats are content-visible: prompt injection sitting in plaintext. The model can read it and flag it.

Everything else depends on signals the model can't access.

Results by signal type

brin scores artifacts across four dimensions. The model has zero visibility into three of them.

| Signal type | brin flagged | Model caught | Model missed |
| --- | --- | --- | --- |
| Graph (dependency chains, cross-repo trust) | 5 | 0% | 100% |
| Identity (domain reputation, account age) | 49 | 30.6% | 69.4% |
| Behavior (install hooks, runtime patterns) | 10 | 40.0% | 60.0% |
| Content (what's in the artifact) | 121 | 49.6% | 50.4% |

A model can only judge what's in front of it at one point in time. It can't check whether a domain was registered yesterday on a bulletproof host, whether a package was published two hours ago by a throwaway account, or whether a contributor went dormant for six months before suddenly submitting PRs to popular repos.
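One identity signal makes the point concrete: how recently a domain was registered is invisible in the artifact's content, but trivial to check once you hold the registration record. A minimal sketch, assuming you already have the registration date (e.g. from a WHOIS lookup); the function name and threshold are illustrative.

```python
# Identity-signal sketch: flag domains registered very recently. The model
# never sees this data; a scanner with WHOIS access does. The 30-day
# threshold is an illustrative assumption.

from datetime import date

def is_fresh_domain(registered: date, today: date, max_age_days: int = 30) -> bool:
    """True if the domain was registered within the last `max_age_days` days."""
    return (today - registered).days < max_age_days
```

The same pattern applies to account age, publish timestamps, and dormancy gaps: each is a simple comparison over metadata the model's context window never contains.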

Results by threat type

| Threat | Count | Model missed |
| --- | --- | --- |
| Blocklisted entities | 22 | 100% |
| TLS failures (dead infrastructure) | 4 | 100% |
| Install attacks | 5 | 80.0% |
| Credential harvesting | 5 | 80.0% |
| Encoded payloads | 26 | 76.9% |
| Phishing | 46 | 67.4% |
| Exfiltration | 103 | 61.2% |
| Cloaking | 8 | 50.0% |
| Typosquat | 21 | 42.9% |
| Prompt injection | 17 | 41.2% |

100% of blocklisted entities passed undetected. 80% of credential harvesting. 77% of encoded payloads. These aren't exotic attacks. They're common, and they need external context to catch.

The false positive rate was 1.3% (4 out of 300 safe artifacts). The model isn't noisy. It just can't see enough.

What this means if you build agents

If your security model is "the model will catch it," you have a 57% hole. The threats that get through are specifically the ones that require reputation data, behavioral history, and trust graphs to identify.

Coding agents fetch packages, load web pages, install skills, and read contributor profiles without reviewing any of it. Content-only classification catches about half the threats in the content itself, and nothing else.

The model spots a skill file with prompt injection. It does not spot a package published by a two-day-old account that mimics a popular library name. It does not spot a domain on a blocklist. It does not see that a contributor's activity graph looks like a sleeper account.

Those are the signals brin covers. Identity, behavior, content, and graph. The benchmark measures the cost of not having them.

Limitations

This benchmark measures what you miss without brin. It does not measure what brin misses: artifacts that both systems fail to detect are invisible by design. The contributors (10 flagged) and packages (7 flagged) categories have small flagged samples and should be read with that caveat.

Methodology and data

The full methodology, per-threat breakdowns, raw results, and dataset are at github.com/superagent-ai/brin-bench. The model path runs inside a Shuru microVM for isolation, since the malicious artifacts in the dataset are real.

For background on what brin is and how it works: Launching brin.sh.
