The 100% Win Rate Was Fake

2026-05-29 · Michael Mustopo

The setup

We were testing Pancake on ourselves. The point of dogfooding is to find the gaps your users will hit before they hit them — so we took a real strategy idea and ran it through the full loop: spec, backtest, receipt, interrogate.

The strategy was a Polymarket prediction-market play on the NO side. The thesis: in highly liquid markets where the crowd has already priced resolution at near-certainty, there might be small but consistent arbitrage from being on the right side of residual uncertainty. The initial spec set the gate at NO ≥ 0.99. Very tight. Very conservative. Only enter when the market is already saying “this is almost certainly going to resolve NO.”

On paper, this looked like a low-risk approach. In practice, the first backtest result looked almost too good to be true.

v0 — the backtest that looked perfect

Receipt ImLZ67Sx is the v0 result. 49 trades. 100% win rate. Sharpe 7.48.

Every formula on that receipt is correct. The Sharpe computation uses Bessel-corrected variance and 252-day annualization — the same math you would use in a production risk system. The bootstrap CI is from 10,000 resamples. The permutation p-value tests the null hypothesis that the Sharpe came from random noise. The math is right.
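For readers who want to see what those statistics involve, here is a minimal sketch in plain Python. It is not Pancake's engine code. The function names are hypothetical, the risk-free rate is assumed to be zero, the interval is a simple percentile bootstrap, and the permutation test is implemented as a sign-flip test, which is one common choice and may differ from what the receipt uses.

```python
import math
import random
import statistics


def annualized_sharpe(returns, periods_per_year=252):
    """Mean over sample standard deviation (n - 1 denominator), annualized.

    Assumes a zero risk-free rate and one return per trading period.
    """
    if len(returns) < 2:
        raise ValueError("need at least two returns")
    sd = statistics.stdev(returns)  # Bessel-corrected: divides by n - 1
    if sd == 0:
        raise ValueError("zero variance: Sharpe is undefined")
    return statistics.mean(returns) / sd * math.sqrt(periods_per_year)


def bootstrap_ci(returns, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the annualized Sharpe."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        sample = rng.choices(returns, k=len(returns))  # resample with replacement
        try:
            stats.append(annualized_sharpe(sample))
        except ValueError:
            continue  # skip degenerate resamples with zero variance
    stats.sort()
    lo = stats[int((alpha / 2) * len(stats))]
    hi = stats[int((1 - alpha / 2) * len(stats)) - 1]
    return lo, hi


def sign_flip_p_value(returns, n_permutations=10_000, seed=0):
    """One-sided p-value: share of sign-flipped samples with Sharpe >= observed."""
    rng = random.Random(seed)
    observed = annualized_sharpe(returns)
    hits = 0
    valid = 0
    for _ in range(n_permutations):
        flipped = [r if rng.random() < 0.5 else -r for r in returns]
        try:
            sharpe = annualized_sharpe(flipped)
        except ValueError:
            continue
        valid += 1
        if sharpe >= observed:
            hits += 1
    return (hits + 1) / (valid + 1)  # add-one correction keeps p above zero
```

Every one of these functions takes the list of returns as given. None of them can know how the trades in that list were selected, which is the gap the rest of this post is about.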

If Pancake had been a conventional backtester — one that returns a dashboard with a Sharpe number and a win-rate chart — this is where the story would have ended. 49-for-49. Great Sharpe. Ship it.

But the receipt also carries its verification boundary. And the verification boundary said something we had to stop and read carefully.

The catch

Look at the sample composition. 145 of the 157 markets in the evidence dataset had NO ≥ 0.99 at time-of-entry. That is the gate condition — the strategy only enters when the market is at 99 cents or above on the NO side.

A market priced at NO = 0.995 is saying: there is a 0.5% chance this resolves YES. If that price is well calibrated and you bet NO and collect the pennies, you win about 99.5% of the time by construction. The sample was dominated by markets that were already at the rail, already near-certain. The 100% win rate was not evidence of edge. It was evidence that the gate had selected for markets where the bet was nearly free.
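The arithmetic behind that claim is short enough to check by hand. The sketch below assumes a $1 binary contract, no fees, well-calibrated prices, and independent trades all entered at exactly 0.99; the function names are illustrative, not part of Pancake.

```python
def no_side_payoff(price):
    """Profit and loss per $1 contract bought on NO at `price` (fees ignored)."""
    win = 1.0 - price  # the contract pays $1 if the market resolves NO
    loss = price       # the whole stake is lost if it resolves YES
    return win, loss


def breakeven_win_rate(price):
    """Win rate at which expected profit is zero; algebraically equal to price."""
    win, loss = no_side_payoff(price)
    return loss / (win + loss)


def prob_all_wins(win_prob, n_trades):
    """Chance of winning every one of n independent trades."""
    return win_prob ** n_trades


print(round(breakeven_win_rate(0.99), 4))
print(round(prob_all_wins(0.99, 49), 3))
```

Under these assumptions, a strategy with no edge at all still goes 49-for-49 roughly 61% of the time, and entries above 0.99 push that figure higher. The payoff is also lopsided: at a price of 0.99, a single loss costs about as much as 99 wins earn.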

This is selection bias at the gate level. Pancake did not detect it automatically — there is no oracle that can tell you whether your gate condition is economically meaningful or tautologically safe. But Pancake surfaced the gate condition as agent_supplied_evidence on the receipt. The gate was something we told the backtester, not something it derived. That label made the question obvious: what if we change the gate?

v1 — widen the band

The fix for “your sample is dominated by near-certain outcomes” is to widen the band. Instead of only entering when NO ≥ 0.99, enter when 0.95 ≤ NO ≤ 0.985. Now you are capturing markets where the crowd thinks the outcome is likely but not certain. You are betting on near-settled markets, not already-settled ones.

Receipt MupOp1tS is the v1 result.

14 upsets. Return: −34.6%.

The win rate collapsed. The strategy that had been 49-for-49 in a tight band was −34.6% in a slightly wider one. The markets in the 0.95 to 0.985 range were not the ones that always resolved correctly — they were the ones the crowd had high confidence in but where reality occasionally disagreed. That is where the real uncertainty lives.

Without the receipt chain, we would have had no way to trace this. We would have compared two Sharpe numbers and shrugged. The receipts made the comparison exact: same engine version, same formulas, same data format — different gate condition, different sample, drastically different result.

v2 — tighten and shift

One more iteration: 0.97 ≤ NO ≤ 0.985. Narrower than v1, but shifted away from the near-certain rail. You are still in the high-confidence zone, but you are not at 99 cents anymore.

Receipt 6KnLKlza is the v2 result. Return: −20.4%.

Better than v1, but still a loss, and nowhere near what the 100% win rate of v0 appeared to promise. The pattern is clear: the tight NO ≥ 0.99 gate was selecting for markets that were already functionally resolved. The edge was not in the strategy. It was in the sample construction. Once we moved to markets with genuine residual uncertainty, the strategy lost money.
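One way to keep the gate from hiding inside a backtest is to make it an explicit, named input and to report how many markets each version admits. The sketch below writes the three gates from v0, v1, and v2 as predicates over the NO price. The prices in the example are made up for illustration and are not the evidence dataset; `GATES` and `composition` are hypothetical names.

```python
from typing import Callable, Dict, List

Gate = Callable[[float], bool]

# The three gate conditions described above, keyed by strategy version.
GATES: Dict[str, Gate] = {
    "v0": lambda no: no >= 0.99,
    "v1": lambda no: 0.95 <= no <= 0.985,
    "v2": lambda no: 0.97 <= no <= 0.985,
}


def composition(no_prices: List[float], gates: Dict[str, Gate] = GATES) -> Dict[str, int]:
    """Count how many markets each gate admits at time of entry."""
    return {
        name: sum(1 for price in no_prices if gate(price))
        for name, gate in gates.items()
    }


# Made-up NO prices, for illustration only.
prices = [0.995, 0.992, 0.99, 0.98, 0.975, 0.96, 0.90]
print(composition(prices))  # {'v0': 3, 'v1': 3, 'v2': 2}
```

Note that the v0 gate and the other two are disjoint: no market can pass both, so each version is tested on a different population, not a larger or smaller slice of the same one.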

The lesson

The v0 backtest was not a lie. The math was right. The Sharpe was computed correctly from the data it was given. If you had been handed only the number, 7.48, and told “this strategy has Sharpe 7.48,” you would have had no way to know that the sample was dominated by near-trivial bets.

The receipt made the problem visible. It did not catch it automatically. It surfaced the gate condition as agent-supplied evidence, which made the next question obvious, and the v1 and v2 receipts then answered that question with the same provenance so you could compare them directly.

A backtester that is honest about its sample composition is more valuable than one that hides it. The number can be right and the strategy can still be worthless. The receipt is the document that lets you tell the difference.

This is not a hypothetical scenario we constructed to illustrate the point. This is the dogfood story — the first real strategy we ran through Pancake before opening it to other teams. We found the gap in our own work, with our own tool. That is exactly what a receipt layer is supposed to do.

Try it yourself

All three receipts are public. Click through, read the verification boundary on each one, and compare the agent_supplied_evidence blocks across v0, v1, and v2. The gate condition is spelled out on every receipt. The sample composition numbers are in the evidence metadata. The formulas are the same — only the inputs changed.

ImLZ67Sx: v0, NO ≥ 0.99, 49 trades, 100% win rate, Sharpe 7.48
MupOp1tS: v1, 0.95 ≤ NO ≤ 0.985, 14 upsets, return −34.6%
6KnLKlza: v2, 0.97 ≤ NO ≤ 0.985, return −20.4%
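If you copy the relevant fields out of two receipts by hand, a few lines of Python can confirm that only the inputs changed. This is a sketch with a hypothetical flat layout: the field names below are invented for illustration and are not Pancake's receipt schema. The values are the ones quoted in this post.

```python
def changed_fields(a: dict, b: dict) -> dict:
    """Return {field: (value_in_a, value_in_b)} for every field that differs."""
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}


# Hypothetical field names; values taken from the text of this post.
v0 = {
    "annualization_days": 252,
    "bootstrap_resamples": 10_000,
    "gate": "NO >= 0.99",
}
v1 = {
    "annualization_days": 252,
    "bootstrap_resamples": 10_000,
    "gate": "0.95 <= NO <= 0.985",
}

print(changed_fields(v0, v1))
# {'gate': ('NO >= 0.99', '0.95 <= NO <= 0.985')}
```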

If you are building strategies with LLMs — or if you are reviewing strategies that LLMs generated — the receipts are the audit trail. The Sharpe number is not enough. The verification boundary is the part that tells you whether to trust it.