Safety Benchmark for AI

Compare how frontier LLMs perform on safety evaluations. We test prompt injection resistance, data protection, and factual accuracy to help you choose the safest models for your product.

Rank | Model             | Provider  | Safety Score
#1   | gpt-5             | OpenAI    | 83/100
#2   | gpt-oss-120b      | OpenAI    | 81/100
#3   | claude-3.5-haiku  | Anthropic | 81/100
#4   | claude-haiku-4.5  | Anthropic | 81/100
#5   | grok-4            | xAI       | 79/100
#6   | gpt-4o-mini       | OpenAI    | 79/100
#7   | claude-3.7-sonnet | Anthropic | 78/100
#8   | claude-opus-4.1   | Anthropic | 78/100
#9   | gpt-oss-20b       | OpenAI    | 77/100
#10  | gpt-4.1-mini      | OpenAI    | 77/100

How is the Safety Score calculated?

The Safety Score is the average of three core metrics that evaluate a model's readiness for production deployment.

Prompt Resistance

Measures defense against adversarial prompts, jailbreaks, and injection attacks that attempt to bypass safety guidelines.

Data Protection

Evaluates how well the model keeps sensitive information, including PII, API keys, secrets, and database records, from leaking into its responses.

Factual Accuracy

Tests truthfulness against the specific dataset used in the deployment environment, penalizing hallucinations and false claims.
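
As a rough illustration, and assuming each metric is reported on the same 0-100 scale as the overall score, the aggregation is a simple mean of the three values. The metric numbers in this sketch are hypothetical, not actual benchmark results.

```python
from statistics import mean

# Hypothetical per-metric results for a single model (0-100 scale).
# The keys mirror the three metrics above; the values are illustrative.
metrics = {
    "prompt_resistance": 84,
    "data_protection": 76,
    "factual_accuracy": 80,
}

# Safety Score: the average of the three core metrics.
safety_score = round(mean(metrics.values()))
print(f"Safety Score: {safety_score}/100")  # -> Safety Score: 80/100
```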

How do we test these models?

Our benchmark uses an adversarial testing framework that simulates real-world attack scenarios. We run a purpose-trained attack agent against a standard test agent to evaluate how models perform under pressure.

The test agent attempts to complete everyday tasks using the model being evaluated (such as GPT-4, Claude, or Gemini). It operates like a typical production agent: processing requests, accessing data, and generating responses.

The attack agent is powered by Superagent's purpose-trained attack model, developed in-house specifically for this benchmark. This agent continuously probes the test agent to expose vulnerabilities, attempting prompt injections, data exfiltration, and factual manipulation—mimicking sophisticated adversarial techniques you'd encounter in production.

Through this adversarial process, we measure how well each model maintains safety across prompt resistance, data protection, and factual accuracy under realistic attack conditions.
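
The sketch below outlines that adversarial loop at a high level, assuming an attack agent that can both generate attacks and judge whether they succeeded. The agent interfaces, round counts, and scoring rule are placeholders for illustration, not the actual Superagent harness or attack model.

```python
from dataclasses import dataclass, field

# Categories match the three metrics that make up the Safety Score.
CATEGORIES = ["prompt_resistance", "data_protection", "factual_accuracy"]

@dataclass
class BenchmarkResult:
    attempts: dict = field(default_factory=lambda: {c: 0 for c in CATEGORIES})
    defended: dict = field(default_factory=lambda: {c: 0 for c in CATEGORIES})

    def score(self, category: str) -> int:
        # Share of attacks the test agent withstood, on a 0-100 scale.
        if self.attempts[category] == 0:
            return 0
        return round(100 * self.defended[category] / self.attempts[category])

def run_benchmark(attack_agent, test_agent, rounds_per_category: int = 50) -> BenchmarkResult:
    result = BenchmarkResult()
    for category in CATEGORIES:
        for _ in range(rounds_per_category):
            # Attack agent crafts an adversarial input targeting one category.
            attack = attack_agent.generate(category)
            # Test agent handles it like any other production request.
            response = test_agent.respond(attack)
            # Judge whether the attack succeeded (e.g. leaked data, followed
            # an injected instruction, or asserted a false claim).
            succeeded = attack_agent.judge(category, attack, response)
            result.attempts[category] += 1
            if not succeeded:
                result.defended[category] += 1
    return result
```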

What is your safety score?

We can evaluate your specific AI product to help you identify vulnerabilities and understand what safety gaps exist.