Pancake Methodology

usepancake.com/methodology · Last updated 2026-05-27

Verification boundary

Every Pancake receipt carries a 3-tuple verification statement: what the engine verified, what it accepted as agent-supplied evidence, and what it did not model. This is not a pass/fail grade — it is an honest accounting of the receipt's epistemic scope, so the agent and its user can judge the result against the right standard.

Verified — structural

Pre-flight invariants the runner enforces before the main loop: schema_match (all declared columns present with matching types), lookahead (decision_time < resolution_time per row), monotonicity (prices non-negative, no reversed timestamps within a market), range (values within declared bounds), required_columns (all five semantic roles present exactly once). Failure in any of these aborts the run with a structured error — no partial receipt is emitted.
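As an illustration, two of these checks (lookahead and monotonicity) might be sketched as follows. This is a minimal sketch, not batter's actual code: the PreflightError class, the check_preflight function, and the row keys market_id and price are hypothetical names chosen for the example.

```python
class PreflightError(Exception):
    """Structured error for a failed invariant (hypothetical shape)."""

    def __init__(self, invariant: str, row_index: int, detail: str):
        super().__init__(f"{invariant} failed at row {row_index}: {detail}")
        self.invariant = invariant
        self.row_index = row_index
        self.detail = detail


def check_preflight(rows: list[dict]) -> None:
    """Raise on the first violated invariant; return None if every row passes."""
    last_seen: dict[str, float] = {}  # latest decision_time seen per market
    for i, row in enumerate(rows):
        # lookahead: the decision must strictly precede the resolution
        if not row["decision_time"] < row["resolution_time"]:
            raise PreflightError("lookahead", i, "decision_time >= resolution_time")
        # monotonicity: prices non-negative, no reversed timestamps within a market
        if row["price"] < 0:
            raise PreflightError("monotonicity", i, "negative price")
        market = row["market_id"]
        if market in last_seen and row["decision_time"] < last_seen[market]:
            raise PreflightError("monotonicity", i, "reversed timestamp")
        last_seen[market] = row["decision_time"]
```

Raising on the first failure mirrors the abort-with-structured-error behaviour described above: no result is produced for a dataset that violates an invariant.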

Verified — runner math

Deterministic computations the runner performs from inputs alone: cash_ledger (position accounting closes to zero on exit), fee_application (fee_bps applied per trade as declared), slippage_application (slippage_bps applied symmetrically on entry and exit), event_ordering (trades sorted by decision_time before the main P&L loop). These are the structural facts of the run — every reader can reproduce them by re-running the open-source engine against the cited dataset.
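One possible reading of fee_application and slippage_application for a single long trade is sketched below. The function name net_pnl and its parameters are hypothetical, and two choices are assumptions rather than documented engine behaviour: slippage worsens both fills by the same fraction, and the fee is charged once per trade on the entry notional.

```python
def net_pnl(entry_price: float, exit_price: float, quantity: float,
            fee_bps: float, slippage_bps: float) -> float:
    """Net P&L of one long trade after slippage and fees (illustrative only)."""
    slip = slippage_bps / 10_000  # basis points to fraction
    fee = fee_bps / 10_000
    # Assumption: symmetric slippage means buying slightly higher
    # and selling slightly lower by the same fraction.
    fill_in = entry_price * (1 + slip)
    fill_out = exit_price * (1 - slip)
    gross = (fill_out - fill_in) * quantity
    # Assumption: the fee is charged once per trade on the entry notional.
    fees = fill_in * quantity * fee
    return gross - fees
```

Because the function depends only on its inputs, re-running it with the same declared fee_bps and slippage_bps gives the same figure, which is the property this section relies on.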

Agent-supplied evidence

Data the agent uploaded as EvidenceDataset rows. Pancake validates the schema and enforces the structural invariants above, but it does not independently verify the feature column values, the entry price source, or the liquidity source. These are accepted as declared by the agent and surfaced verbatim in the receipt so a reader can assess them directly. The agent_supplied_evidence block names the feature columns, entry price source (observed / agent_estimate / last_trade / mid / vwap), and liquidity source used.

Unmodeled risks

Categorical risks the engine does not model in the current version: market_impact (the strategy's own order flow affects prices), resolution_lag (final resolution may differ from the price used at resolution_time), resolver_risk (the venue may resolve differently than the market implies), small_sample (metrics are statistical noise below 10 trades — suppressed in the receipt). The unmodeled_risks list appears verbatim in every receipt so a reader is never left guessing what the engine does and does not cover.

Math foundations

Sharpe ratio (Sharpe 1994; annualized √252, Bessel-corrected)

Mean daily excess return divided by the Bessel-corrected sample standard deviation of daily returns, then annualized by multiplying by √252. The Bessel correction (N−1 denominator) is used in all variance estimates so that the sample variance is unbiased. Returns are computed from the daily equity curve with the risk-free rate set to zero, a common simplification for short-horizon prediction-market strategies where no bond equivalent is available. Suppressed when N < 10 (see small-sample handling).

Sortino ratio (Sortino & Price 1994; target=0, full-N denominator)

Annualised excess return divided by the downside deviation using a 0% target return. The downside deviation uses the full-N denominator (not N−1) to match the Sortino & Price 1994 definition. Only negative deviations from the target contribute to the denominator. Null when all returns are non-negative (infinite Sortino is uninformative). Suppressed when N < 10.

CAGR (Bacon 2008; piecewise with RUINED + OVERFLOW handling)

Compound Annual Growth Rate: (ending_value / starting_capital)^(365.25 / holding_days) − 1. Holding days is the calendar span from first trade entry to last trade exit. RUINED: if ending_value ≤ 0, returns −1.0 (−100%). OVERFLOW: if the computed value exceeds ±10 000 (1 000 000%), returns ±1.0 (±100%) as a sentinel. The special-case values are preserved in the receipt so readers are never shown a silently-clipped number. Suppressed when N < 10.
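The piecewise rule can be sketched as a function that returns both the value and a status label, so the special cases stay distinguishable from ordinary results. The function name cagr, the tuple return shape, and the "OK" label are hypothetical; treating a floating-point overflow as OVERFLOW is also an assumption.

```python
import math


def cagr(ending_value: float, starting_capital: float,
         holding_days: float) -> tuple[float, str]:
    """Return (cagr, status) with status in {"OK", "RUINED", "OVERFLOW"}."""
    if starting_capital <= 0 or holding_days <= 0:
        raise ValueError("starting_capital and holding_days must be positive")
    if ending_value <= 0:
        return -1.0, "RUINED"  # total loss reported as -100%
    try:
        value = (ending_value / starting_capital) ** (365.25 / holding_days) - 1
    except OverflowError:
        # Assumption: a float overflow is reported like any other overflow.
        return 1.0, "OVERFLOW"
    if abs(value) > 10_000:
        return math.copysign(1.0, value), "OVERFLOW"  # sentinel, not a real CAGR
    return value, "OK"
```

Carrying the status next to the number is one way to keep a sentinel of 1.0 from being mistaken for a genuine 100% growth rate.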

Wilson CI95 (Wilson 1927)

95% confidence interval on the win rate using the Wilson score interval. Preferred over the normal-approximation interval (Wald) because it has better coverage at small sample sizes and near the boundary values (0% and 100%). The centre of the Wilson interval is used as the point estimate when the raw proportion is in the boundary region.

Brier crowd score (Brier 1950)

Mean squared error between the strategy's implied probability (the side price at decision time) and the binary resolution outcome. Lower is better. The crowd score baseline uses the market's closing price as the probability estimate — a strategy that beats the crowd score consistently is improving on the market's collective forecast.
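The Brier score itself is a short computation. In this sketch the function name brier_score is hypothetical, and the same function is applied to the strategy's probabilities and to the crowd baseline so the two can be compared.

```python
def brier_score(probabilities: list[float], outcomes: list[int]) -> float:
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if not probabilities or len(probabilities) != len(outcomes):
        raise ValueError("need equally sized, non-empty inputs")
    squared_errors = ((p - o) ** 2 for p, o in zip(probabilities, outcomes))
    return sum(squared_errors) / len(probabilities)


# Illustrative numbers only: two markets, the first resolved YES, the second NO.
outcomes = [1, 0]
strategy = brier_score([0.8, 0.3], outcomes)  # side price at decision time
crowd = brier_score([0.6, 0.4], outcomes)     # market closing price
```

In this made-up example the strategy's score is lower than the crowd's, which is the comparison the baseline is meant to support.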

Bootstrap percentile CI (Efron 1979)

10 000 bootstrap resamples of the trade-level P&L series, with replacement, seeded via PCG64 for reproducibility. The 2.5th and 97.5th percentiles of the bootstrapped Sharpe distribution yield the 95% CI. The CI is suppressed when N < 10 (bootstrap degenerates at very small samples).

Permutation test for Sharpe null (Good 2005)

1 000 permutations of the trade outcome labels (win/loss) under the null hypothesis of no skill. The p-value is the fraction of permuted Sharpe values at least as extreme as the observed Sharpe. A result with p < 0.05 is flagged as statistically significant at the 5% level, with the caveat that prediction-market receipts at N < 50 have low power. The permutation test uses the same PCG64 seed as the bootstrap for cross-run consistency.
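The text above does not spell out how permuted win/loss labels map back to a Sharpe value, so the following is only one possible reading, sketched under stated assumptions: the sign of each trade's P&L is treated as its label, the labels are shuffled across the absolute P&L magnitudes, and "at least as extreme" is taken as one-sided (permuted Sharpe greater than or equal to observed). The function name permutation_p_value is hypothetical.

```python
import numpy as np


def permutation_p_value(pnl, seed: int, n_permutations: int = 1_000) -> float:
    """Fraction of label permutations whose Sharpe is >= the observed Sharpe."""
    pnl = np.asarray(pnl, dtype=float)
    magnitudes = np.abs(pnl)
    labels = np.sign(pnl)  # assumption: +1 is a win, -1 is a loss

    def per_trade_sharpe(x: np.ndarray) -> float:
        sd = x.std(ddof=1)
        return float(x.mean() / sd) if sd > 0 else 0.0

    observed = per_trade_sharpe(pnl)
    rng = np.random.Generator(np.random.PCG64(seed))
    hits = 0
    for _ in range(n_permutations):
        # Shuffle which trades carry the win and loss labels.
        permuted = magnitudes * rng.permutation(labels)
        if per_trade_sharpe(permuted) >= observed:
            hits += 1
    return hits / n_permutations
```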

Determinism

Canonical JSON serialization

The strategy spec is canonicalized before hashing: keys sorted lexicographically at every nesting level, no whitespace, UTF-8 encoding. The spec_hash (SHA-256 of the canonical bytes) is pinned in the receipt and in the database row. Any change to the spec's content produces a different spec_hash, while key reordering or whitespace differences alone do not, because canonicalization removes them. This prevents silent drift between what was run and what is receipted.
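A minimal sketch of canonical serialization and hashing with the standard library. The function names are hypothetical, and the handling of non-ASCII characters (ensure_ascii=False) and of NaN values (rejected) are assumptions about details the text above does not cover.

```python
import hashlib
import json


def canonical_bytes(spec: dict) -> bytes:
    """Serialize with sorted keys, no whitespace, UTF-8 encoding."""
    text = json.dumps(
        spec,
        sort_keys=True,           # sorts keys at every nesting level
        separators=(",", ":"),    # no whitespace between tokens
        ensure_ascii=False,       # assumption: emit raw UTF-8, not \u escapes
        allow_nan=False,          # assumption: NaN/Infinity are rejected
    )
    return text.encode("utf-8")


def spec_hash(spec: dict) -> str:
    """SHA-256 hex digest of the canonical bytes."""
    return hashlib.sha256(canonical_bytes(spec)).hexdigest()
```

Changing any value in the spec changes the canonical bytes and therefore the digest.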

PCG64 seeded RNG (cross-platform byte-stable on Python 3.12+)

All stochastic operations (bootstrap, permutation test, any future sampling) use the PCG64 generator from numpy.random with a fixed seed derived from the spec_hash. The PCG64 output is byte-stable across operating systems and CPU architectures on Python 3.12+. This means any reader can reproduce the exact same bootstrap CI and permutation p-value by running the open-source batter engine with the same spec and dataset.
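How the seed is derived from the spec_hash is not stated here. The sketch below assumes the first 64 bits of the hex digest are used as an integer seed; that derivation and the function name rng_from_spec_hash are hypothetical.

```python
import numpy as np


def rng_from_spec_hash(spec_hash_hex: str) -> np.random.Generator:
    """Build a PCG64-backed generator seeded from a hex digest."""
    # Assumption: take the first 16 hex digits (64 bits) as the seed.
    seed = int(spec_hash_hex[:16], 16)
    return np.random.Generator(np.random.PCG64(seed))
```

Two generators built from the same digest produce the same stream of draws, which is what makes the bootstrap CI and permutation p-value repeatable for a given spec.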

Engine identity stamp

Every receipt carries three version fields: engine_version (the batter package version), compiler_version (the spec compiler version), and renderer_version (the envelope renderer version). These are pinned at run time — a receipt from batter@0.3.1 is reproducible only by batter@0.3.1 under Python 3.12+. Version mismatches in the bootstrap/permutation subsystem produce a warning in the receipt rather than a silent difference.

Small-sample handling

Why N=10 threshold

CAGR, Sharpe, Sortino, the bootstrap CI, and the permutation test p-value are suppressed when the trade count is below 10. At N < 10 the variance of the Sharpe estimator is so large that the point estimate is dominated by noise: a strategy that went 3-for-3 looks identical on paper to one that went 9-for-9. The 10-trade floor is a conservative threshold that prevents the receipt from asserting statistical precision it does not have.

What gets suppressed

When N < 10: CAGR, Sharpe ratio, Sortino ratio, bootstrap CI on Sharpe, and the permutation test p-value. What is never suppressed: total return (sum of realized P&L), win rate (fraction of winning trades), trade count, and the verification boundary statement. The suppression is surfaced in the receipt's metrics table as "insufficient_data (N=X < 10)" — not as a blank cell or a zero — so the agent's LLM reads the honest framing.
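The suppression rule can be sketched as a small rendering helper. The function name render_metric and the metric keys are hypothetical; only the placeholder text follows the format quoted above.

```python
# Hypothetical metric keys for the values that are suppressed at small N.
SUPPRESSIBLE = {"cagr", "sharpe", "sortino", "bootstrap_ci", "permutation_p"}


def render_metric(name: str, value, n_trades: int, min_trades: int = 10):
    """Return the value, or an explicit placeholder when it is suppressed."""
    if name in SUPPRESSIBLE and n_trades < min_trades:
        # An explicit string, never a blank cell or a zero.
        return f"insufficient_data (N={n_trades} < {min_trades})"
    return value
```

Metrics outside the suppressible set, such as total return, win rate, and trade count, pass through unchanged at any sample size.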

How to read suppressed receipts

A receipt with suppressed metrics is not a failed receipt; it is an honest receipt at a small sample size. The total return and win rate are real numbers. The unmodeled_risks list includes "small_sample" as a reminder. The appropriate response is to gather more evidence (more rows, more markets) and run again, not to interpret the total return as a CAGR proxy or to read the insufficient_data placeholder as a Sharpe of 0.

Open-source engine

batter on GitHub

The Pancake execution engine is published as github.com/usepancake/batter under the Apache 2.0 license. The engine is a Python 3.12+ package implementing the EvidenceDataset schema, the structural invariant checks, the runner math, and all the statistical computations described above. The spec compiler is included; the receipt renderer is in the Next.js frontend repository.

The math layer is published as the batter Python package (pip install batter). Verify our formulas yourself or read the engine page for install instructions, the 12 verified formulas, determinism guarantees, and a citable BibTeX entry.

Independently auditable from first principles

Any reader can reproduce a receipt by: (1) downloading the cited EvidenceDataset rows (rows_sha256 is the content hash), (2) loading the spec_viewer.spec from the receipt, (3) running batter at the engine_version stated in the receipt, and (4) comparing the output envelope's ir_hash and spec_hash to the values in the receipt. A match confirms the receipt is an honest representation of what was run.
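Steps (1) and (4) can be sketched as two small checks. Everything here is illustrative: the function names are hypothetical, the receipt and envelope are assumed to be plain dictionaries carrying spec_hash and ir_hash keys, and rows_sha256 is assumed to be the SHA-256 of the downloaded bytes exactly as served.

```python
import hashlib


def rows_match(raw_rows: bytes, rows_sha256: str) -> bool:
    """Step 1: check the downloaded rows against the receipt's content hash."""
    # Assumption: the hash covers the raw downloaded bytes.
    return hashlib.sha256(raw_rows).hexdigest() == rows_sha256.lower()


def mismatched_hashes(receipt: dict, envelope: dict,
                      fields: tuple[str, ...] = ("spec_hash", "ir_hash")) -> list[str]:
    """Step 4: list the hash fields that differ between receipt and re-run."""
    mismatches = []
    for field in fields:
        expected = receipt.get(field)
        actual = envelope.get(field)
        # A missing field counts as a mismatch rather than a silent pass.
        if expected is None or expected != actual:
            mismatches.append(field)
    return mismatches
```

An empty list from mismatched_hashes corresponds to the match described above.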