We Bypassed Grok Imagine's NSFW Filters With Artistic Framing
Note: This research was conducted prior to Grok's updated terms of service. We are not publishing the generated artifacts for legal reasons, but will share them with research labs upon request.
Text-to-image safety is broken. Not in some theoretical, academic sense—broken in a "we just generated explicit content of a real person using basic compositional tricks" sense.
Here's what we found, why it worked, and what this means for anyone building AI safety systems.
The Attack
The technique combined three elements:
- Artistic reframing: Present the target content within a legitimate artistic context—a gallery, museum, or art book setting
- Context manipulation: Use language that mimics system states or establishes false premises
- Multilingual fragmentation: Split the request across languages to evade pattern matching
The result: explicit content of a real public figure, generated and displayed without intervention.
No sophisticated jailbreak. No prompt engineering wizardry. Basic compositional tricks that anyone could discover.
Why It Worked
Modern image generators use a two-stage safety pipeline:
[Prompt] → [Prompt Guard] → [Generation] → [Image Classifier] → [Output]

Our attack bypassed both layers simultaneously.
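In code terms, the two stages look roughly like the sketch below. The function and class names are illustrative placeholders, not Grok's internals; the point is the shape of the pipeline, where each stage makes an isolated pass/fail call.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ModerationResult:
    allowed: bool
    reason: str = ""
    image: Optional[bytes] = None

# Placeholder stages: a real system plugs in a keyword/ML prompt guard, a
# diffusion model, and a vision classifier here. Names are illustrative.
def prompt_guard_allows(prompt: str) -> bool:
    return "explicit" not in prompt.lower()

def generate_image(prompt: str) -> bytes:
    return b"<image bytes>"

def image_classifier_allows(image: bytes) -> bool:
    return True

def moderated_generation(prompt: str) -> ModerationResult:
    # Stage 1: a cheap prompt-level check runs before any compute is spent.
    if not prompt_guard_allows(prompt):
        return ModerationResult(False, "prompt rejected")
    image = generate_image(prompt)              # the expensive generation step
    # Stage 2: a post-hoc classifier scores only the finished image.
    if not image_classifier_allows(image):
        return ModerationResult(False, "image rejected")
    return ModerationResult(True, image=image)
```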
Layer 1: Prompt Guard
Prompt guards are lightweight keyword filters with limited inference budgets. They're looking for explicit terms, not semantic intent.
Artistic framing reads as legitimate. Ambiguous modifiers slip through. Mixed-language requests fragment across different pattern-matching rules. System-state language establishes false context.
The prompt guard sees: cultural content, multilingual user, apparent system state. Nothing triggers.
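As a concrete illustration, here is a minimal sketch of that kind of keyword-level guard. The blocklist and the example prompt are illustrative assumptions, not Grok's actual filter; the point is that framing language contains none of the tokens such a filter looks for.

```python
import re

# A deliberately naive keyword guard of the kind described above; the
# blocklist and example prompt are illustrative, not a production filter.
BLOCKLIST = [r"\bnude\b", r"\bexplicit\b", r"\bnsfw\b"]

def naive_prompt_guard(prompt: str) -> bool:
    """Return True if the prompt is allowed through."""
    lowered = prompt.lower()
    return not any(re.search(pattern, lowered) for pattern in BLOCKLIST)

# Gallery/museum framing contains none of the blocked tokens, so it passes
# even when the underlying request is not benign.
print(naive_prompt_guard("museum catalogue photograph of a classical figure study"))  # True
```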
Layer 2: Image Classifier
Here's where it gets interesting. Post-generation classifiers (typically NudeNet or CLIP-based) are trained on photographs. They detect skin tones, body part shapes, specific pixel patterns.
Artistic framing changes everything.
When the classifier analyzes the output, it doesn't see the content—it sees the context. Artistic styling, decorative elements, compositional framing. The classifier scores the overall image composition, not the content within embedded elements or artistic presentations.
Research from UnsafeBench confirms this vulnerability: NudeNet has the lowest robustness of all tested classifiers, with a Robust Accuracy of 0.293. Classifiers trained on real-world photos show a 10-13% F1-score degradation on stylized or AI-generated content. Art styles cause progressive accuracy drops across all tested safety systems.
The artistic framing essentially launders the content through a context that classifiers aren't trained to decompose.
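To see why, consider a minimal zero-shot check built on CLIP, the style of classifier mentioned above. The checkpoint and label prompts below are assumptions for illustration, not the production classifier. Because the whole image is embedded as a single vector, the score reflects global composition: a gallery frame, painterly texture, or decorative border dominates the signal that an isolated region would otherwise trigger.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Illustrative checkpoint and labels; not the vendor's safety classifier.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
LABELS = ["a tasteful artistic image", "explicit adult content"]

def clip_unsafe_probability(image: Image.Image) -> float:
    """Score the WHOLE image against the label set; local regions are never isolated."""
    inputs = processor(text=LABELS, images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        logits = model(**inputs).logits_per_image   # shape (1, len(LABELS))
    probs = logits.softmax(dim=-1)[0]
    return probs[LABELS.index("explicit adult content")].item()
```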
The Deeper Problem
This isn't a Grok-specific issue. It's an architectural failure in how we build AI safety systems.
Prompt guards are fundamentally bypassable. They operate under compute constraints that limit their ability to understand semantic intent. Controlled-release prompting research shows 100% bypass rates on Grok-3 using encoding techniques that exceed the prompt guard's inference budget.
Image classifiers are trained on the wrong data. Real-world photos don't prepare classifiers for AI-generated content with artistic framing, unusual color palettes, or compositional tricks. The distribution shift is massive.
The two systems don't talk to each other. Even when both layers are "working," they're solving different problems. The prompt guard doesn't know what image was generated. The classifier doesn't know what was requested. Neither understands the other's context.
This is defense in depth that isn't deep—it's just two shallow systems stacked on top of each other.
What Would Actually Work
This attack exposes a flawed assumption baked into the pipeline: that a classifier scoring global composition will also catch local content violations. It won't.
Fixes that would matter:
Content-aware decomposition: Classifiers need to detect compositional elements and embedded content, then analyze what's inside them separately from the overall image.
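A minimal sketch of the idea, assuming you already have a per-image unsafe-probability function such as the CLIP example above; the grid size and threshold are arbitrary, and a production system would use a region detector rather than a fixed grid:

```python
from typing import Callable, Iterator
from PIL import Image

def tile_image(image: Image.Image, grid: int = 3) -> Iterator[Image.Image]:
    """Split the image into a grid of crops so each region can be scored on its own."""
    w, h = image.size
    for row in range(grid):
        for col in range(grid):
            yield image.crop((col * w // grid, row * h // grid,
                              (col + 1) * w // grid, (row + 1) * h // grid))

def decomposed_check(image: Image.Image,
                     unsafe_prob: Callable[[Image.Image], float],
                     threshold: float = 0.5) -> bool:
    """Allow the image only if the whole frame AND every crop score below the threshold."""
    crops = [image, *tile_image(image)]
    return max(unsafe_prob(crop) for crop in crops) < threshold
```

The change is small, but it breaks the laundering trick: a border or gallery wall can no longer dilute the score of the region that matters.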
Cross-layer context sharing: The image classifier should know what the user requested. The prompt guard should see what was generated. Right now they're blind to each other.
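A sketch of what sharing context could look like at the decision layer; the thresholds and weighting are illustrative, not a tuned policy:

```python
def joint_decision(prompt_risk: float, image_risk: float) -> bool:
    """
    Combine both layers' risk scores (each in [0, 1]) instead of letting
    each layer decide alone. Return True to allow, False to block.
    """
    if prompt_risk > 0.9 or image_risk > 0.9:
        return False                  # either layer is confident on its own
    # Two borderline signals are jointly a strong signal: a mildly suspicious
    # prompt plus a mildly suspicious image should not average out to "fine".
    return prompt_risk + image_risk < 1.0
```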
Semantic prompt analysis: Move beyond keyword matching to actual intent understanding. Requests with clear harmful intent survive keyword filters when phrased with ambiguous terms or split across languages.
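One way to approximate intent-level screening is zero-shot classification with an NLI model. The checkpoint and label set below are assumptions for illustration, and handling requests fragmented across languages would require a multilingual checkpoint:

```python
from transformers import pipeline

# Illustrative model and labels; swap in a multilingual NLI checkpoint to
# cover prompts split across languages.
intent_classifier = pipeline("zero-shot-classification",
                             model="facebook/bart-large-mnli")

def prompt_intent_risk(prompt: str) -> float:
    labels = ["a request for explicit imagery of a real person",
              "a benign artistic or cultural request"]
    result = intent_classifier(prompt, candidate_labels=labels)
    scores = dict(zip(result["labels"], result["scores"]))
    return scores["a request for explicit imagery of a real person"]
```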
Adversarial training on compositional tricks: Train classifiers specifically on artistic presentations and embedded content scenarios.
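A sketch of the kind of augmentation this implies, using stock torchvision transforms to mimic artistic framing and embedded-content presentations during classifier training; the specific transforms and parameters are illustrative starting points, not a validated recipe:

```python
import random
from PIL import Image, ImageOps
import torchvision.transforms as T

# Mimic "artistic" presentation: posterized palettes, shifted colors, heavy borders.
artistic_framing = T.Compose([
    T.RandomPosterize(bits=3, p=0.5),
    T.ColorJitter(brightness=0.3, contrast=0.4, saturation=0.6, hue=0.1),
    T.Lambda(lambda img: ImageOps.expand(
        img, border=img.width // 8,
        fill=random.choice(["white", "black", "goldenrod"]))),
])

def embed_in_scene(img: Image.Image, scale: float = 0.4) -> Image.Image:
    """Paste the image as a small 'framed artwork' inside a larger neutral scene."""
    canvas = Image.new("RGB", (img.width * 2, img.height * 2), "lightgray")
    small = img.resize((max(1, int(img.width * scale)), max(1, int(img.height * scale))))
    canvas.paste(small, (canvas.width // 3, canvas.height // 4))
    return canvas
```

Applied to safe and unsafe training images alike, augmentations like these teach the classifier that a gilt border or posterized palette says nothing about the content inside it.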
The Takeaway
We bypassed a production AI safety system using basic compositional tricks and multilingual text. The attack required no technical expertise, no special access, no sophisticated tooling.
If your safety architecture can be defeated by artistic framing, it's not a safety architecture—it's a suggestion.
The current generation of prompt guards and image classifiers isn't protecting anyone. It's creating a false sense of security while the actual attack surface remains wide open.