The March of Nines
The gap between a working demo and a reliable product is vast. Andrej Karpathy calls this the "march of nines": every additional nine of reliability takes as much work as the one before it. This is the hidden engineering challenge behind every production AI system.
When Andrej Karpathy says he's "very unimpressed by demos", he isn't being dismissive — he's being realistic. In his conversation with Dwarkesh Patel, Karpathy explains that the gap between a working demo and a reliable product is vast, especially in domains where failure carries serious consequences.
"For some kinds of tasks… there's a very large demo-to-product gap where the demo is very easy, but the product is very hard. It's especially the case in cases like self-driving where the cost of failure is too high."
He calls this slow, unforgiving process the "march of nines." The phrase comes from reliability engineering: when a system works 90% of the time, that's the first nine. Achieving 99%, 99.9%, or 99.99% means stacking nines, and each new nine takes as much work as the one before it. Visible progress slows even as reliability compounds, and adding just one more nine can take years.
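To make the arithmetic concrete, here is a toy calculation (the request volume is hypothetical, not from Karpathy's talk) showing how the failure budget shrinks tenfold with each added nine:

```python
# Toy illustration of the "march of nines": each added nine cuts the
# allowed failure budget by 10x, while (per Karpathy) the engineering
# work per nine stays roughly constant. Request volume is hypothetical.
requests_per_day = 1_000_000

for nines, reliability in enumerate([0.90, 0.99, 0.999, 0.9999], start=1):
    failures = requests_per_day * (1 - reliability)
    print(f"{nines} nine(s) -> {reliability:.2%} reliable, "
          f"~{failures:,.0f} failures/day")
```

At a million requests a day, the jump from one nine to four nines is the difference between 100,000 visible failures and about 100.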
The Demo-to-Product Gap
Karpathy learned this firsthand at Tesla. Early self-driving demos looked nearly perfect, yet production reality was far messier. "Every single nine is the same amount of work," he said. "When you get a demo and something works 90% of the time, that's just the first nine. Then you need the second nine, a third nine, a fourth nine."
The same pattern holds for today's AI systems. Chatbots, copilots, and coding agents can seem magical in controlled settings — until they encounter the unpredictable edge cases of the real world. Inputs get messy. Prompts turn adversarial. Hallucinations, data leaks, and silent failures creep in.
Karpathy draws a sharp parallel between self-driving and software:
"In self-driving, if things go wrong, you might get injured. There are worse outcomes. But in software, it's almost unbounded how terrible something could be."
The takeaway is sobering. As AI systems become more autonomous and more embedded in sensitive workflows, the cost of error rises — and the margin for error shrinks.
Why Bigger Models Aren't Enough
It's tempting to believe that bigger models or larger datasets can close the reliability gap. But research shows otherwise. Scaling helps, yet it doesn't solve the problem.
A 2024 Stanford study found that even state-of-the-art models "continue to produce false or unverifiable claims under realistic conditions." OpenAI acknowledges the same: hallucination remains "a fundamental challenge for all large language models."
Anthropic's Constitutional AI research also points to limits. Rule-based alignment and reinforcement tuning improve safety, but can't eliminate unsafe or incorrect outputs entirely.
The lesson is clear: scaling improves accuracy, not reliability. It might move you from 90% to 95%, but the next nine demands a different kind of work — systems engineering, not more parameters.
The Hidden Work Behind Each Nine
Every extra nine represents a layer of engineering that protects the system from itself. In self-driving, that meant redundant sensors, fallback logic, and extensive validation. In AI, it means building runtime infrastructure that detects, contains, and corrects errors before they reach users.
At Superagent, we think about this infrastructure in three core capabilities:
- Guard — catching prompt injections, jailbreaks, and unsafe tool calls in real time.
- Verify — grounding outputs against trusted data or APIs before they're shown.
- Redact — preventing leaks by removing sensitive data from inputs, outputs, and logs.
These mechanisms don't make the model smarter; they make the system safer. They're what transform an impressive demo into a dependable product. We've built each capability as a purpose-trained small language model that operates at the runtime layer — fast enough for production, precise enough for compliance. This is the trust infrastructure that turns 90% accuracy into 99.9% reliability.
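As a rough sketch of what that runtime layer looks like in practice, here is a minimal guard-then-verify-then-redact pipeline wrapped around a model call. The function names, regex checks, and control flow are illustrative placeholders under simplifying assumptions, not Superagent's actual API or models:

```python
import re

# Minimal sketch of a guard -> model -> verify -> redact pipeline.
# All patterns and function names are illustrative, not a real product API.

INJECTION_PATTERNS = [r"ignore (all|previous) instructions", r"reveal the system prompt"]
EMAIL_PATTERN = re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+")

def guard(user_input: str) -> None:
    """Guard: block obvious prompt-injection attempts before the model sees them."""
    for pattern in INJECTION_PATTERNS:
        if re.search(pattern, user_input, re.IGNORECASE):
            raise ValueError("Blocked: input matched an injection pattern")

def verify(output: str, trusted_facts: set[str]) -> bool:
    """Verify: naive grounding check against trusted data before showing output."""
    return any(fact.lower() in output.lower() for fact in trusted_facts)

def redact(text: str) -> str:
    """Redact: strip sensitive data (here, just email addresses) from outputs and logs."""
    return EMAIL_PATTERN.sub("[REDACTED]", text)

def run_pipeline(user_input: str, call_model, trusted_facts: set[str]) -> str:
    guard(user_input)                      # reject unsafe inputs in real time
    output = call_model(user_input)        # any LLM call goes here
    if not verify(output, trusted_facts):  # ground against trusted data or APIs
        return "I can't verify that answer against our records."
    return redact(output)                  # scrub before returning or logging
```

In production each stage would be its own trained model or service rather than a regex, but the control flow is the point: every output passes through checks before a user ever sees it.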
The same evolution happened in aviation and medicine. Planes didn't become safe because engines got smarter — they became safe because we built systems that detect, log, and recover from errors. AI reliability will follow that same playbook.
From Accuracy to Reliability
Early AI progress was measured in benchmark scores — accuracy, fluency, or code completion rates. In production, those metrics matter less. What matters is how often the system fails, how fast it recovers, and how transparent those failures are.
Teams are starting to measure mean time between incidents (MTBI) and failure rates under real-world conditions to quantify reliability. As AI systems move into healthcare, finance, and enterprise software, this shift will only accelerate. Reliability becomes both an engineering goal and a market differentiator.
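For teams beginning to track these numbers, a minimal sketch of the two metrics might look like the following; the incident timestamps and request counts are made up for illustration:

```python
from datetime import datetime

# Hypothetical incident log for one AI service (timestamps are invented).
incidents = [
    datetime(2024, 6, 1, 9, 30),
    datetime(2024, 6, 8, 14, 0),
    datetime(2024, 6, 20, 2, 45),
]

# Mean time between incidents (MTBI): average gap between consecutive incidents.
gaps = [(b - a).total_seconds() / 3600 for a, b in zip(incidents, incidents[1:])]
mtbi_hours = sum(gaps) / len(gaps)

# Failure rate under real-world conditions: failed requests / total requests.
failed_requests, total_requests = 127, 250_000
failure_rate = failed_requests / total_requests

print(f"MTBI: {mtbi_hours:.1f} hours")
print(f"Failure rate: {failure_rate:.4%}")  # ~0.05%, roughly three nines of reliability
```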
The implication is simple: AI progress is no longer about how capable your model is. It's about how predictable, observable, and recoverable your system is.
Conclusion — Earning the Next Nine
Karpathy's insight cuts through the hype: the real work begins after the demo. Each new nine demands as much effort as the last, and there are no shortcuts.
"It's still a huge amount of work to do… we're going to see all this stuff play out. It's a march of nines."
For anyone building AI systems, that's both a warning and a roadmap. The challenge isn't to make models smarter — it's to make systems trustworthy.
Because in the real world, 90% isn't good enough. The path forward is the same as it's always been: earn your next nine.