Redact PII or PHI from ingested PDFs before processing

Documents can contain personal or sensitive data. Guardrails detect and remove PII (personally identifiable information) or PHI (protected health information) before the model reads or uses the file, which supports GDPR-compliant ingestion.

What's at stake

  • PDFs uploaded to AI systems often contain names, addresses, Social Security numbers, or medical records
  • Processing unredacted documents means your LLM sees sensitive personal data, which can then surface in outputs or logs, or be memorized if that data is later used for training
  • GDPR requires data minimization: processing only the personal data that is necessary for the stated purpose
  • A single document with PHI can lead to a HIPAA violation if it is processed without the required safeguards
  • Enterprise customers in healthcare and finance will block procurement if you can't demonstrate compliant document handling

How to solve this

Before any document reaches your AI agent, it needs to pass through a detection and redaction layer. This layer scans the document for patterns that indicate personal or health information: names, dates of birth, addresses, phone numbers, medical record numbers, diagnoses, and more.

The challenge is that PII and PHI appear in countless formats. A Social Security number might be hyphenated or not. A name might appear in a header, footer, or buried in a paragraph. Medical information can be coded or written in plain language.

The solution requires both pattern matching for structured data (SSNs, phone numbers, dates) and semantic understanding for unstructured data (names, addresses, medical conditions). Once detected, the data must be masked or removed before the document is passed to the model.
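The pattern-matching half of this can be sketched in a few lines of Python. This is a minimal illustration, not a production detector: the patterns, the labels, and the redact_structured function are assumptions made up for this example, they cover only US-style formats, and they do nothing for names, addresses, or medical conditions, which need the semantic step described above.

```python
import re

# Hypothetical patterns for illustration only; real detectors need far
# broader coverage (international formats, OCR noise, context checks).
PATTERNS = {
    # Matches 123-45-6789 as well as the unhyphenated 123456789.
    "SSN": re.compile(r"\b\d{3}-?\d{2}-?\d{4}\b"),
    # Matches 555-123-4567, 555.123.4567, and 555 123 4567.
    "PHONE": re.compile(r"\b\d{3}[-. ]\d{3}[-. ]\d{4}\b"),
    # Matches dates written as MM/DD/YYYY or M/D/YYYY.
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}


def redact_structured(text):
    """Mask every pattern match and return (redacted_text, findings)."""
    findings = []
    for label, pattern in PATTERNS.items():

        def _mask(match, label=label):
            # Record what was found, then replace it with a typed token.
            findings.append((label, match.group()))
            return f"[{label}]"

        text = pattern.sub(_mask, text)
    return text, findings


sample = "SSN 123-45-6789 or 123456789, call 555-123-4567, born 04/12/1980."
redacted, findings = redact_structured(sample)
print(redacted)
```

Text has to be extracted from the PDF before a function like this can run, and regular expressions alone will miss anything that does not follow a fixed format.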

How Superagent prevents this

Superagent provides guardrails for AI agents—small language models purpose-trained to detect and prevent failures in real time. These models sit at the boundary of your agent and inspect inputs, outputs, and tool calls before they execute.

For document ingestion, Superagent's Redact model scans PDFs and other documents before they reach your model. It identifies PII (names, addresses, emails, phone numbers, government IDs) and PHI (medical record numbers, diagnoses, treatment information) using pattern recognition and semantic analysis.

Detected information is automatically masked with configurable replacement tokens—preserving document structure while removing sensitive data. Your agent processes the redacted version, never seeing the original personal information.
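To make the idea of configurable replacement tokens concrete, here is a small Python sketch. It is not Superagent's implementation or API: the Span type, the apply_mask function, and the token_format parameter are hypothetical names, and the sketch assumes a detector has already produced non-overlapping character spans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A detected entity: character offsets plus a type label (hypothetical)."""
    start: int
    end: int
    label: str


def apply_mask(text, spans, token_format="<{label}>"):
    """Replace each span with a token built from token_format.

    Spans are applied right to left so that earlier offsets stay valid
    after each replacement. Assumes spans do not overlap.
    """
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        token = token_format.format(label=span.label)
        text = text[:span.start] + token + text[span.end:]
    return text


def span_of(text, value, label):
    """Helper for the example: locate the first occurrence of value."""
    start = text.index(value)
    return Span(start, start + len(value), label)


text = "Patient: Jane Doe\nMRN: 00123456\nDiagnosis: asthma"
spans = [
    span_of(text, "Jane Doe", "NAME"),
    span_of(text, "00123456", "MRN"),
    span_of(text, "asthma", "DIAGNOSIS"),
]

masked = apply_mask(text, spans)
custom = apply_mask(text, spans, token_format="[REDACTED:{label}]")
print(masked)
```

Because only the matched characters are swapped out, line breaks and field labels such as "Patient:" survive, so the agent can still tell what kind of value was removed.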

Redact integrates into your document ingestion pipeline with a single API call. Documents go in, clean versions come out, and your compliance team has audit logs proving what was detected and removed.
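As an illustration of what an audit entry could contain, the sketch below builds a record that counts redactions by type without storing the sensitive values themselves. The field names and the build_audit_record function are assumptions made for this example; they do not describe Superagent's actual log format.

```python
import hashlib
import json
from collections import Counter
from datetime import datetime, timezone


def build_audit_record(document_id, original_bytes, findings):
    """Summarize a redaction run.

    findings is an iterable of (label, matched_text) pairs from a detector.
    Only labels are counted; matched_text is deliberately not stored.
    """
    counts = Counter(label for label, _ in findings)
    return {
        "document_id": document_id,
        # A hash ties the record to the source file without copying it.
        "original_sha256": hashlib.sha256(original_bytes).hexdigest(),
        "redacted_at": datetime.now(timezone.utc).isoformat(),
        "counts_by_type": dict(counts),
        "total_redactions": sum(counts.values()),
    }


record = build_audit_record(
    "doc-001",  # hypothetical document identifier
    b"example file contents",
    [("SSN", "123-45-6789"), ("NAME", "Jane Doe"), ("SSN", "123456789")],
)
print(json.dumps(record, indent=2))
```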

Ready to protect your AI agents?

Get started with Superagent guardrails and prevent this failure mode in your production systems.