Safety Benchmark for AI

Compare how frontier LLMs perform on safety evaluations. We test prompt injection resistance, data protection, and factual accuracy to help you choose the safest models for your product.

Rank | Model             | Provider  | Safety Score
#1   | gpt-5             | OpenAI    | 83/100
#2   | gpt-oss-120b      | OpenAI    | 81/100
#3   | claude-3.5-haiku  | Anthropic | 81/100
#4   | claude-haiku-4.5  | Anthropic | 81/100
#5   | grok-4            | xAI       | 79/100
#6   | gpt-4o-mini       | OpenAI    | 79/100
#7   | claude-3.7-sonnet | Anthropic | 78/100
#8   | claude-opus-4.1   | Anthropic | 78/100
#9   | gpt-oss-20b       | OpenAI    | 77/100
#10  | gpt-4.1-mini      | OpenAI    | 77/100

How is the Safety Score calculated?

The Safety Score is the average of three core metrics that evaluate a model's readiness for production deployment.

Prompt Resistance

Measures defense against adversarial prompts, jailbreaks, and injection attacks that attempt to bypass safety guidelines.

Data Protection

Evaluates how well the model keeps sensitive information, including PII, API keys, secrets, and database records, from leaking into its responses.

Factual Accuracy

Tests truthfulness against the specific dataset used in the deployment environment, penalizing hallucinations and false claims.
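
As a rough illustration, and assuming each metric is reported on the same 0-100 scale as the overall score, the aggregation is a simple mean of the three values. The metric numbers in this sketch are hypothetical, not actual benchmark results.

```python
from statistics import mean

# Hypothetical per-metric results for a single model (0-100 scale).
# The keys mirror the three metrics above; the values are illustrative.
metrics = {
    "prompt_resistance": 84,
    "data_protection": 76,
    "factual_accuracy": 80,
}

# Safety Score: the average of the three core metrics.
safety_score = round(mean(metrics.values()))
print(f"Safety Score: {safety_score}/100")  # -> Safety Score: 80/100
```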

How do we test these models?

Our benchmark uses an adversarial testing framework that simulates real-world attack scenarios. We run a purpose-trained attack agent against a standard test agent to evaluate how models perform under pressure.

The test agent attempts to complete everyday tasks using the model being evaluated (such as GPT-4, Claude, or Gemini). It operates like a typical production agent: processing requests, accessing data, and generating responses.

The attack agent is powered by Superagent's purpose-trained attack model, developed in-house specifically for this benchmark. This agent continuously probes the test agent to expose vulnerabilities, attempting prompt injections, data exfiltration, and factual manipulation—mimicking sophisticated adversarial techniques you'd encounter in production.

Through this adversarial process, we measure how well each model maintains safety across prompt resistance, data protection, and factual accuracy under realistic attack conditions.
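
The sketch below outlines that adversarial loop at a high level, assuming an attack agent that can both generate attacks and judge whether they succeeded. The agent interfaces, round counts, and scoring rule are placeholders for illustration, not the actual Superagent harness or attack model.

```python
from dataclasses import dataclass, field

# Categories match the three metrics that make up the Safety Score.
CATEGORIES = ["prompt_resistance", "data_protection", "factual_accuracy"]

@dataclass
class BenchmarkResult:
    attempts: dict = field(default_factory=lambda: {c: 0 for c in CATEGORIES})
    defended: dict = field(default_factory=lambda: {c: 0 for c in CATEGORIES})

    def score(self, category: str) -> int:
        # Share of attacks the test agent withstood, on a 0-100 scale.
        if self.attempts[category] == 0:
            return 0
        return round(100 * self.defended[category] / self.attempts[category])

def run_benchmark(attack_agent, test_agent, rounds_per_category: int = 50) -> BenchmarkResult:
    result = BenchmarkResult()
    for category in CATEGORIES:
        for _ in range(rounds_per_category):
            # Attack agent crafts an adversarial input targeting one category.
            attack = attack_agent.generate(category)
            # Test agent handles it like any other production request.
            response = test_agent.respond(attack)
            # Judge whether the attack succeeded (e.g. leaked data, followed
            # an injected instruction, or asserted a false claim).
            succeeded = attack_agent.judge(category, attack, response)
            result.attempts[category] += 1
            if not succeeded:
                result.defended[category] += 1
    return result
```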

What is your safety score?

We can evaluate your specific AI product to help you identify vulnerabilities and understand what safety gaps exist.