Frontier models miss 57% of threats in agent context
We ran 485 real artifacts through Claude 4.6 Opus with a security-focused system prompt. The model missed 57% of the threats brin had already identified. Here's the full breakdown.
Most agent builders assume the model itself will catch malicious content. We wanted to know if that's actually true.
We built brin-bench to measure the gap. 485 real artifacts from brin's scanning database, tested against Claude 4.6 Opus with a strong security system prompt. The model saw the same raw content a coding agent would encounter in production: packages, web pages, skills, repositories, contributors.
The model missed 57.3% of the threats brin had already identified. 106 out of 185 flagged artifacts went undetected.
The benchmark and full dataset are open-source at github.com/superagent-ai/brin-bench.
What we measured
Each artifact gets its full content (HTML, registry metadata, skill definitions, READMEs, contributor profiles) and a security-focused system prompt. The model classifies each artifact as safe or dangerous based on content alone. That's what a coding agent does today.
brin's own verdicts are the ground truth. We compare what the model flagged against what brin flagged, and count the misses.
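The loop above can be sketched in a few lines. The prompt text, field names, and the `classify()` callable below are illustrative stand-ins, not the actual brin-bench harness (which is in the linked repository):

```python
# Minimal sketch of the evaluation loop: each artifact's raw content plus a
# security system prompt goes to the model, which must answer safe/dangerous.
# SECURITY_PROMPT, the artifact fields, and classify() are assumptions for
# this sketch, not the real brin-bench code.

SECURITY_PROMPT = (
    "You are a security reviewer. You will receive the raw content of an "
    "artifact: a package, web page, skill, repository, or contributor "
    "profile. Answer with exactly one word: SAFE or DANGEROUS."
)

def evaluate(artifacts, classify):
    """Run every artifact through the model; classify(system, content) -> str."""
    verdicts = {}
    for art in artifacts:
        answer = classify(SECURITY_PROMPT, art["content"])
        # Content-only verdict: True means the model flagged the artifact.
        verdicts[art["id"]] = answer.strip().upper() == "DANGEROUS"
    return verdicts
```

In the benchmark, `classify` is a call to the model under test; swapping in a stub makes the harness itself testable.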
Three metrics, reported overall and per category:
- Model coverage: of artifacts brin flagged, what percentage did the model also flag?
- Model gap: what the model missed, broken down by which brin signal drove the detection
- False positive rate: percentage of brin-safe artifacts the model incorrectly flagged
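Scoring reduces to comparing the two verdict sets. A minimal sketch, assuming per-artifact records with illustrative field names (`brin_flagged`, `model_flagged` are not the actual brin-bench schema):

```python
# Compute the three reported metrics from paired verdicts.
# Record fields are assumptions for this sketch.

def score(records):
    flagged = [r for r in records if r["brin_flagged"]]   # brin's positives
    safe = [r for r in records if not r["brin_flagged"]]  # brin's negatives
    caught = sum(r["model_flagged"] for r in flagged)
    false_pos = sum(r["model_flagged"] for r in safe)
    coverage = caught / len(flagged) if flagged else 0.0
    return {
        "coverage": coverage,                 # model also flagged it
        "miss_rate": 1 - coverage,            # the model gap
        "false_positive_rate": false_pos / len(safe) if safe else 0.0,
    }
```

Running this over the headline numbers (185 flagged, 79 caught, 4 false positives out of 300 safe) reproduces the 57.3% gap and 1.3% false positive rate reported below.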
Results by category
| Category | brin flagged | Model caught | Model missed |
|---|---|---|---|
| Repositories | 18 | 11.1% (2) | 88.9% (16) |
| Packages | 7 | 28.6% (2) | 71.4% (5) |
| Web pages | 50 | 30.0% (15) | 70.0% (35) |
| Domains | 50 | 34.0% (17) | 66.0% (33) |
| Contributors | 10 | 40.0% (4) | 60.0% (6) |
| Skills | 50 | 78.0% (39) | 22.0% (11) |
Skills had the lowest miss rate because the threats are content-visible: prompt injection sitting in plaintext. The model can read it and flag it.
Everything else depends on signals the model can't access.
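To make "content-visible" concrete: a plaintext injection string in a skill file can be matched directly against the text. The patterns below are toy examples, not brin's detection logic, which goes far beyond a regex list:

```python
import re

# Toy illustration of a content-visible threat check: plaintext prompt
# injection is right there in the artifact, so even simple pattern matching
# (or a model reading the text) can surface it. Patterns are examples only.

INJECTION_PATTERNS = [
    r"ignore (all )?previous instructions",
    r"disregard your system prompt",
    r"exfiltrate|send .* to http",
]

def content_visible_hits(text):
    """Return the injection patterns found in the artifact's raw text."""
    return [p for p in INJECTION_PATTERNS if re.search(p, text, re.IGNORECASE)]
```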
Results by signal type
brin scores artifacts across four dimensions. The model has zero visibility into three of them.
| Signal type | brin flagged | Model caught | Model missed |
|---|---|---|---|
| Graph (dependency chains, cross-repo trust) | 5 | 0% | 100% |
| Identity (domain reputation, account age) | 49 | 30.6% | 69.4% |
| Behavior (install hooks, runtime patterns) | 10 | 40.0% | 60.0% |
| Content (what's in the artifact) | 121 | 49.6% | 50.4% |
A model can only judge what's in front of it at one point in time. It can't check whether a domain was registered yesterday on a bulletproof host, whether a package was published two hours ago by a throwaway account, or whether a contributor went dormant for six months before suddenly submitting PRs to popular repos.
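Those identity checks are trivial once you have the external data, and impossible without it. A minimal sketch, where the thresholds and field names are assumptions for illustration, not brin's actual scoring dimensions or cutoffs:

```python
from datetime import datetime, timedelta

# Illustrative identity-signal checks that a content-only model cannot make:
# they require registration and account metadata from outside the artifact.
# Thresholds and field names are assumptions, not brin's actual logic.

def identity_risk(domain_registered, account_created, now=None):
    """Return the identity red flags for an artifact's origin, if any."""
    now = now or datetime.utcnow()
    reasons = []
    if now - domain_registered < timedelta(days=30):
        reasons.append("domain registered less than 30 days ago")
    if now - account_created < timedelta(days=7):
        reasons.append("publisher account less than 7 days old")
    return reasons
```

The point is not the thresholds; it is that the inputs (WHOIS dates, account history) never appear in the content the model reads.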
Results by threat type
| Threat | Count | Model missed |
|---|---|---|
| Blocklisted entities | 22 | 100% |
| TLS failures (dead infrastructure) | 4 | 100% |
| Install attacks | 5 | 80.0% |
| Credential harvesting | 5 | 80.0% |
| Encoded payloads | 26 | 76.9% |
| Phishing | 46 | 67.4% |
| Exfiltration | 103 | 61.2% |
| Cloaking | 8 | 50.0% |
| Typosquat | 21 | 42.9% |
| Prompt injection | 17 | 41.2% |
100% of blocklisted entities went undetected, along with 80% of credential harvesting attempts and 77% of encoded payloads. These aren't exotic attacks. They're common, and they need external context to catch.
The false positive rate was 1.3% (4 out of 300 safe artifacts). The model isn't noisy. It just can't see enough.
What this means if you build agents
If your security model is "the model will catch it," you have a 57% hole. The threats that get through are specifically the ones that require reputation data, behavioral history, and trust graphs to identify.
Coding agents fetch packages, load web pages, install skills, and read contributor profiles without reviewing any of that content. Content-only classification catches about half of the threats visible in the content itself, and none of the rest.
The model spots a skill file with prompt injection. It does not spot a package published by a two-day-old account that mimics a popular library name. It does not spot a domain on a blocklist. It does not see that a contributor's activity graph looks like a sleeper account.
Those are the signals brin covers. Identity, behavior, content, and graph. The benchmark measures the cost of not having them.
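One of those checks, a package name mimicking a popular library, reduces to edit distance against a registry-wide popularity list the model never sees. The popular-name set and threshold below are illustrative, not brin's actual logic:

```python
# Toy typosquat check: flag a package whose name sits within edit distance
# 1-2 of a well-known library. POPULAR and the threshold are assumptions
# for this sketch.

POPULAR = {"requests", "numpy", "lodash", "express"}

def edit_distance(a, b):
    """Classic Levenshtein distance via a rolling DP row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution
        prev = cur
    return prev[-1]

def looks_like_typosquat(name):
    # Distance 0 is the genuine package; 1-2 is the classic squatting range.
    return any(0 < edit_distance(name, p) <= 2 for p in POPULAR)
```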
Limitations
This benchmark measures what you miss without brin. It does not measure what brin misses: artifacts that both systems fail to detect are invisible by design. The contributor (10 flagged) and package (7 flagged) categories have small flagged samples and should be read with that caveat.
Methodology and data
The full methodology, per-threat breakdowns, raw results, and dataset are at github.com/superagent-ai/brin-bench. The model path runs inside a Shuru microVM for isolation, since the malicious artifacts in the dataset are real.
For background on what brin is and how it works: Launching brin.sh.