Verification Boundary: What Pancake Actually Verifies

2026-05-29 · Michael Mustopo

The 3-tuple every receipt carries

Every Pancake receipt records its epistemic scope as a 3-tuple. Not a pass/fail grade. Not a quality score. A precise accounting of three distinct categories:

verified: The math and data integrity checks Pancake actually ran. This is what the engine computed from inputs it controlled: schema validation, lookahead checks, monotonicity, cash-ledger arithmetic, fee and slippage application, formula outputs (Sharpe, CAGR, max drawdown, win rate, confidence intervals, permutation p-value). If it is in the verified block, Pancake re-derived it from raw inputs and stands behind the computation.
agent_supplied_evidence: Inputs the agent (or user) provided that Pancake accepted but could not independently re-derive. Entry price sources, feature columns, liquidity assumptions, market-resolution data. Pancake validated their format and ran its formulas on them — but the provenance of those values is the agent's claim, not Pancake's. The receipt names them explicitly so the reader knows exactly what was taken on faith.
unmodeled_risks: What the backtest window cannot model. Market impact, execution latency, bid/ask spread, resolver risk, regime shifts, survivorship bias. These are not oversights — they are structural limits of any historical simulation. Pancake names them on every receipt so they cannot be quietly ignored.

The 3-tuple is not a legal disclaimer. It is a precision instrument. Reading a receipt means reading all three fields, not just the Sharpe number at the top.
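
To make the shape concrete, here is a minimal sketch of the 3-tuple as a data structure. The three field names come from the receipt fields described above; the class name, the `render` method, and the sample items are hypothetical and are not Pancake's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VerificationBoundary:
    """Hypothetical container for the 3-tuple; not Pancake's actual schema."""

    verified: tuple[str, ...]
    agent_supplied_evidence: tuple[str, ...]
    unmodeled_risks: tuple[str, ...]

    def render(self) -> str:
        """Render all three fields as plain text, one labeled list per field."""
        sections = (
            ("verified", self.verified),
            ("agent_supplied_evidence", self.agent_supplied_evidence),
            ("unmodeled_risks", self.unmodeled_risks),
        )
        lines = []
        for label, items in sections:
            lines.append(f"{label}:")
            lines.extend(f"  - {item}" for item in items)
        return "\n".join(lines)


# Sample items are taken from the category descriptions in this article.
boundary = VerificationBoundary(
    verified=("schema validation", "lookahead check", "Sharpe"),
    agent_supplied_evidence=("entry price source", "liquidity assumptions"),
    unmodeled_risks=("market impact", "execution latency"),
)
print(boundary.render())
```

Keeping the three fields side by side in one immutable object mirrors the point of the receipt: a reader cannot pick up the verified numbers without the other two lists coming along.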

Why the separation matters

Consider receipt ImLZ67Sx: 49 trades, 100% win rate, Sharpe 7.48. Every number on that receipt is mathematically correct. The Sharpe computation used Bessel-corrected variance, 252-day annualization, and a bootstrap CI from 10,000 resamples. The formulas are right.
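
The three ingredients named above (Bessel-corrected variance, 252-day annualization, a 10,000-resample bootstrap) can be sketched as follows. This is an illustration under stated assumptions, not Pancake's runner: the function names, the percentile-style interval, the 95% level, the fixed seed, and the synthetic returns are all assumptions, and the inputs are treated as daily excess returns.

```python
import math
import random


def sharpe(returns, periods_per_year=252):
    """Annualized Sharpe ratio using Bessel-corrected (n - 1) variance.

    Assumes `returns` are per-period excess returns.
    """
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two returns")
    if max(returns) == min(returns):
        raise ValueError("zero variance: Sharpe is undefined")
    mean = sum(returns) / n
    var = sum((r - mean) ** 2 for r in returns) / (n - 1)  # Bessel correction
    return mean / math.sqrt(var) * math.sqrt(periods_per_year)


def bootstrap_ci(returns, resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the Sharpe ratio (assumed method)."""
    rng = random.Random(seed)
    n = len(returns)
    stats = []
    for _ in range(resamples):
        sample = rng.choices(returns, k=n)  # resample with replacement
        try:
            stats.append(sharpe(sample))
        except ValueError:
            continue  # skip degenerate resamples with zero variance
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


# Synthetic daily returns, for illustration only (not the receipt's data).
demo_rng = random.Random(1)
demo_returns = [demo_rng.gauss(0.001, 0.01) for _ in range(60)]
point = sharpe(demo_returns)
lo, hi = bootstrap_ci(demo_returns)
print(f"Sharpe {point:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

Nothing in this computation knows where the returns came from. The formula can be exactly right while the sample feeding it is biased.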

But look at the agent_supplied_evidence block. The dataset covers Polymarket NO-side positions with a gate of NO ≥ 0.99. That threshold means the sample is dominated by markets that were already at the rail, near-certain to resolve. The strategy's 100% win rate was earned on bets where the market had already priced the outcome at 99 cents on the dollar or better. It was not edge; it was collecting pennies that were already on the ground.
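
The arithmetic behind "collecting pennies" is worth spelling out. The sketch below assumes a binary contract that pays $1 per share when NO resolves, an entry exactly at the 0.99 gate, and no fees or slippage; the function names are hypothetical.

```python
def break_even_win_rate(price):
    """Win rate at which buying a $1 binary contract at `price` nets zero.

    Expected profit is w * (1 - price) - (1 - w) * price, which is zero
    when w == price. Fees and slippage are ignored (assumption).
    """
    if not 0 < price < 1:
        raise ValueError("price must be strictly between 0 and 1")
    return price


def pnl_per_share(wins, losses, price):
    """Total profit per share: each win earns 1 - price, each loss costs price."""
    return wins * (1 - price) - losses * price


gate = 0.99
print(break_even_win_rate(gate))           # win rate needed just to break even
print(round(pnl_per_share(49, 0, gate), 2))  # 49 wins, no losses
print(round(pnl_per_share(48, 1, gate), 2))  # same 49 trades with one loss
```

Under these assumptions, 49 straight wins earn about $0.49 per share in total, and a single loss costs $0.99. One miss in 49 trades would turn the record into a loss of about $0.51 per share, so a 100% win rate at this gate says little about edge.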

A conventional backtester would have returned the same Sharpe. It would not have named the gate condition as agent-supplied. It would not have flagged selection bias as something the reader needs to interrogate. Pancake does not have a selection-bias detector — but it does surface the gate condition as evidence the agent supplied, which makes the next question obvious: what happens when you widen the gate?

The answer is in the next two receipts in the series. We cover the full story in “The 100% Win Rate Was Fake.”

How the boundary shows up on a receipt

The verification-boundary block renders on every receipt page, below the headline metrics. It is not hidden in a JSON blob or buried in a modal. You can see it on the ImLZ67Sx receipt at /r/ImLZ67Sx.

The block presents the three categories as labeled sections, each with a list of named items. The verified category is split in two: structural checks (schema, lookahead, monotonicity) appear under “Verified — structural,” and formula outputs (Sharpe, CAGR, drawdown) appear under “Verified — runner math.” The agent's inputs appear under “Agent-supplied evidence.” The gaps appear under “Unmodeled risks.”

The Markdown version of the receipt (append .md to the URL) renders the same sections as plain text, which is what LLM agents read when they fetch a receipt to cite. The verification boundary is not human-only information; it is part of the machine-readable format too.
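
For an agent, getting to the Markdown version is a one-line URL transformation. The helper below is a hypothetical sketch: the function name is invented, and `pancake.example` is a placeholder host, since this article gives only the /r/ImLZ67Sx path.

```python
from urllib.parse import urlsplit


def markdown_url(receipt_url: str) -> str:
    """Return the Markdown variant of a receipt URL by appending .md to its path."""
    parts = urlsplit(receipt_url)
    path = parts.path.rstrip("/")
    if not path.endswith(".md"):
        path += ".md"
    # Rebuild the URL so any query string stays after the new path.
    return parts._replace(path=path).geturl()


# Placeholder host; only the /r/ImLZ67Sx path comes from the article.
print(markdown_url("https://pancake.example/r/ImLZ67Sx"))
```

Editing the path rather than concatenating strings keeps the result valid if the receipt URL ever carries a query string.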

What other backtesting tools claim vs what they verify

Most backtesting tools return a Sharpe number. Some return a full metrics dashboard. A few let you inspect the trade log. None of them — to our knowledge — produce a stable document that separates “the engine computed this from data it controlled” from “the user told us this was the entry price.”

The reason is not that other tools are dishonest. The reason is that they were designed for a world where a human expert was responsible for the inputs. If you are a quant running your own backtest in a Jupyter notebook, you know what data you used. You do not need the notebook to remind you.

But when an LLM agent is generating the strategy, selecting the evidence, and submitting the backtest, the human expert is no longer in the loop at input time. The audit trail needs to be in the output, because the person reviewing the result may have no visibility into what the agent did upstream.

The verification boundary is the artifact that bridges that gap. It does not make a bad strategy good. It does not catch every form of selection bias. What it does is make the evidence explicit, so that the next person who reads the receipt can ask the right questions instead of trusting a number they cannot interrogate.

That is what verification actually means at Pancake. Not a guarantee. An honest accounting.