Prove It or Shut Up: The AGI/Agent Receipts Standard (Common-Good Edition)

You’re being sold a religion with a GPU budget.

The sermon goes like this: Agents are here. AGI is close. The hard parts are basically solved. Stand back while we “transform society.” And then your calendar invite breaks, your customer support bot lies, your “AI teammate” loops itself into a coma, and somehow the moral of the story is that you lack vision.

No. The public’s skepticism is the only adult in the room.

So here’s the deal: if leaders want belief back, they don’t get it by announcing timelines. They get it by publishing receipts—verified, repeatable, audited impact—including what it costs, how it fails, and who pays the externalities.


Hook

The only honest definition of “it’s here” is: ordinary people can feel it, and independent evaluators can verify it.

Not a demo. Not a keynote. Not an “early access” screenshot. A measurable win that survives contact with:

  • adversarial conditions,
  • boring operational reality,
  • budgets,
  • and the physics of the power grid.

Context

The con isn’t that AI is useless. The con is the category error.

The industry keeps laundering three different axes into one mystical word (“AGI”):

  1. Capability: can it do the task at all?
  2. Reliability: can it do it repeatably? (p95/p99, not your best run)
  3. Autonomy: can it do it unsupervised without unacceptable damage?

Most public talk about “agents” and “AGI” is really just people swapping between these when convenient.

When the system works once, it’s “capability.” When it fails in production, it’s “edge cases.” When asked for accountability, it’s “early days.” When fundraising, it’s “inevitable.”


Analysis

What’s actually hard isn’t “intelligence.” It’s long-horizon correctness.

Agents don’t die because they can’t write. They die because the real world requires 50+ correct steps in a row while the interface changes, the tools fail, the instructions are ambiguous, and the state of the system is partially observed.
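The arithmetic here is unforgiving. A minimal sketch of compounding per-step reliability (the probabilities are illustrative, not measured figures):

```python
# If each step succeeds independently with probability p, the chance of
# completing an n-step task with zero errors is p ** n.
def chain_success(p: float, n: int) -> float:
    return p ** n

# Even a 99%-reliable step collapses over a long horizon:
print(f"{chain_success(0.99, 50):.3f}")  # ~0.605
print(f"{chain_success(0.95, 50):.3f}")  # ~0.077
```

Fifty 99%-reliable steps finish end-to-end barely three times in five, which is why single-run demos say almost nothing about long workflows.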

And when we measure this in realistic benchmarks, we see the gap between demos and reality:

  • WebArena (realistic websites, outcome-verified tasks): the best GPT-4-based agent achieved ~14.41% success vs ~78.24% for humans.
  • OSWorld (open-ended desktop tasks): humans ~72.36% vs the best model at ~12.24%.

That’s not “almost there.” That’s a different sport.

“But AlphaFold!”

Good. Yes. Keep that energy—because AlphaFold is exactly what real progress looks like.

The AlphaFold Protein Structure Database provides open access to 200M+ predicted protein structures and is widely used by researchers. This is not vibes. It’s infrastructure.

But notice what made AlphaFold “real”:

  • a hard, well-defined scientific target,
  • benchmarked against experimental reality (CASP),
  • massive adoption,
  • and downstream usefulness.

AlphaFold is a receipt. It’s not a prophecy.

And also: AlphaFold doesn’t magically produce “free food” or “cure cancer next year.”

Because structure ≠ solution.

It speeds up parts of discovery. It doesn’t replace biology, toxicology, clinical trials, manufacturing, distribution, or regulatory proof.

Even Isomorphic Labs—DeepMind/Google’s drug effort built in AlphaFold’s shadow—has pushed “first clinical trials” to end of 2026, per reporting from Davos. That’s not a failure. That’s the real world refusing to be impressed by slides.


Framework

The Prove-It Standard: 10 Receipts or You Don’t Get to Say “Agents”

If someone claims “agents are here” or “AGI is near,” they publish this—publicly—on every major release.

Receipt 1 — Open, adversarial benchmark results (not private evals)

Across long-horizon environments (web + desktop). Include out-of-distribution variants (layout changes, new flows, friction).

Receipt 2 — Loop rate + silent failure rate

Success rate is the pretty metric. Loop rate is the truth. Silent failure is the nightmare.
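A hypothetical sketch of how these three metrics separate, assuming episodes have already been tagged by an outcome verifier (the log and labels below are invented for illustration):

```python
from collections import Counter

# Invented episode log. "loop" = agent repeated the same action past a
# threshold; "silent_failure" = agent reported success but verification
# disagreed; "explicit_failure" = agent admitted it couldn't finish.
episodes = ["success", "loop", "silent_failure", "success", "explicit_failure",
            "loop", "success", "silent_failure", "success", "success"]

counts = Counter(episodes)
n = len(episodes)
print(f"success rate:        {counts['success'] / n:.0%}")         # 50%
print(f"loop rate:           {counts['loop'] / n:.0%}")            # 20%
print(f"silent failure rate: {counts['silent_failure'] / n:.0%}")  # 20%
```

A vendor quoting only the first number hides that, here, one run in five fails while claiming it succeeded.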

Receipt 3 — Cost per verified completion

Not cost per prompt. Not token price theater. $ per successful end-to-end task (median, p95, p99).
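One way this could be computed, on invented per-attempt data: all spend, including failed and looping runs, is divided over verified successes only.

```python
import statistics

# Hypothetical per-attempt records: (dollars_spent, verified_success).
attempts = [(0.42, True), (0.55, False), (1.10, True), (0.38, True),
            (2.40, False), (0.47, True), (0.90, True), (3.10, False)]

total_cost = sum(cost for cost, _ in attempts)
success_costs = sorted(cost for cost, ok in attempts if ok)

# The denominator is verified completions only; failed attempts still
# count in the numerator, because someone paid for them.
cost_per_verified = total_cost / len(success_costs)
print(f"$ per verified completion: ${cost_per_verified:.2f}")
print(f"median successful run:     ${statistics.median(success_costs):.2f}")
```

Note how the headline "median successful run" can look cheap while the true cost per verified completion is several times higher, because the failures are where the money goes.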

Receipt 4 — Time per completion (wall-clock)

If it takes 4 hours of “thinking,” you didn’t automate work—you rescheduled it.

Receipt 5 — Generalization under change

New UI, new vendor portal, slightly different form, ambiguous instructions. Show performance anyway.

Receipt 6 — Calibration

Confidence must match accuracy. If it speaks with certainty while wrong, it’s not “smart.” It’s dangerous.
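One standard way to quantify this is expected calibration error (ECE): bucket predictions by stated confidence, then compare average confidence to observed accuracy in each bucket. A minimal sketch on invented data:

```python
# Hypothetical predictions: (stated_confidence, was_correct).
preds = [(0.95, True), (0.9, True), (0.9, False), (0.8, True),
         (0.7, False), (0.95, False), (0.6, True), (0.85, True)]

def ece(preds, n_bins=5):
    """Expected calibration error: weighted gap between confidence and
    accuracy across confidence bins. 0.0 means perfectly calibrated."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in preds:
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, ok))
    total, err = len(preds), 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(ok for _, ok in b) / len(b)
        err += (len(b) / total) * abs(avg_conf - accuracy)
    return err

print(f"ECE: {ece(preds):.3f}")  # lower is better
```

A system that averages 89% confidence while being right 67% of the time is overconfident by definition, and no amount of fluent prose fixes that gap.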

Receipt 7 — Incident reports (production)

Classes of failure, frequency, detection time, mitigation. Treat it like aviation, not influencer marketing.

Receipt 8 — Third-party autonomy audits

If you say “unsupervised,” independent evaluators get to try to break it.

Receipt 9 — A declared deployment threshold

Example: “No autonomous execution below 99.9% success on this workflow family.”

Receipt 10 — A no-go zone list

Name what you won’t do (and enforce it). That’s what competent engineering looks like.


The Common Good Bar

What would actually make society say: “Okay. It’s here.”

Not “AGI.” Not “superintelligence.” Public-proof milestones that are hard to fake and easy to feel.

A) Health: “Show endpoints, not vibes.”

The receipt: an AI-discovered or AI-designed drug demonstrates published clinical benefit (Phase II/III) for a specific cancer subtype with hard endpoints (survival / progression-free survival). No TED talk can counterfeit that.

Secondary receipts that hit home:

  • multi-hospital reductions in sepsis mortality or medication errors with audited results,
  • antibiotic discovery that survives resistance pressure and reaches late-stage trials.

B) Food: “Not free food—cheaper food without environmental debt.”

The receipt: multi-season, multi-region yield gains under drought/heat with independent replication. Or fertilizer dependency materially reduced without yield collapse.

People don’t need to understand agriculture. They understand grocery bills.

C) Power + Infrastructure: “Keep the lights on.”

The receipt: measurable reductions in outages / restoration time using AI planning and ops (replicated across regions). And data center growth paired with transparent load management and power sourcing that doesn’t dump risk on everyone else.

D) Government: “Kill the paperwork monster.”

The receipt: benefits/permit/tax workflows completed end-to-end with verified accuracy, safe human fallback, and measurable backlog reduction.

The fastest way to restore public trust is to destroy the bureaucratic time tax.

E) Science: “AlphaFold-class wins.”

The receipt: a system cracks a hard scientific task, is benchmarked against reality, becomes standard infrastructure, and enables downstream breakthroughs. AlphaFold is the canonical example—and that’s why it should be used as a weapon against AGI mysticism, not as a justification for it.


Leader Narrative Translation

Stop reacting to personalities. Translate claims into testable commitments.

  • “Agents are here” → publish p95 success, loop rate, silent failure rate, and $/verified completion on open benchmarks.
  • “AGI by 2027” → define AGI as a battery of generalization + autonomy thresholds, then submit to third-party audits.
  • “AI will cure cancer” → publish trial timelines, endpoints, and replication. Anything else is storytelling.

Strategic Implications

For builders

If you’re building “agents,” your differentiator isn’t clever prompting. It’s operational discipline:

  • instrument everything,
  • treat failures as first-class objects,
  • ship with explicit thresholds and no-go zones,
  • and measure cost per verified completion like your life depends on it (because your business does).

For buyers and operators

Demand receipts. You are not buying intelligence. You are buying reliability under constraints.

If a vendor can’t tell you:

  • expected failure modes,
  • p95 completion rate,
  • and the cost of remediation,

you’re not buying software. You’re buying a liability.

For society

We don’t need to “slow down AI” in the abstract. We need to civilize AI claims with standards, audits, and accountability—so the commons isn’t forced to subsidize hype.


Wrap

Belief isn’t a vibe. It’s a contract.

If AI leaders want the world to stop rolling its eyes, they can do one simple thing:

Stop selling prophecies. Start publishing scoreboards.

And if the scoreboards look like AlphaFold—benchmarked, adopted, useful—people will believe. Because the world is not anti-technology.

It’s anti-bullshit.
