How reproducible are Pancake backtests?

Pancake backtests are byte-reproducible on Python 3.12+ across Ubuntu, macOS, and Windows. Given the same strategy spec and evidence dataset, re-running batter at the same engine_version produces an identical SHA-256 result_hash. Python 3.11 is permanently out of scope due to a CPython sum() precision change.

Reproducibility in Pancake is engineered at the byte level. Every receipt carries a result_hash: the SHA-256 digest of the canonical execution output envelope. Any reader can verify a receipt by downloading the cited evidence rows, loading the spec from the receipt, running batter at the stated engine_version, and comparing the output hash.
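The final comparison step can be sketched as follows. This is a minimal illustration, not batter's actual API: it assumes the re-run hands back the canonical output envelope as raw bytes and that the receipt is a dict with a result_hash field holding a hex digest. The helper names sha256_hex and verify_receipt are hypothetical.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_receipt(receipt: dict, rerun_envelope: bytes) -> bool:
    """Compare the hash of a re-run's output envelope with the receipt's hash.

    Assumption: receipt["result_hash"] is a hex-encoded SHA-256 digest and
    rerun_envelope is the canonical envelope exactly as the engine emitted it.
    """
    return sha256_hex(rerun_envelope) == receipt["result_hash"]


# Illustrative data only; a real envelope comes from running the engine.
envelope = b'{"metric":1.25,"n":3}'
receipt = {"engine_version": "0.4.2", "result_hash": sha256_hex(envelope)}

print(verify_receipt(receipt, envelope))         # True
print(verify_receipt(receipt, envelope + b" "))  # False
```

Because the hash covers raw bytes, even a trailing space in the envelope counts as a mismatch, which is the point of byte-level verification.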

The batter engine achieves byte-stability through three design decisions. First, canonical JSON serialization: the keys of the strategy spec are sorted lexicographically at every nesting level before hashing, so reordering keys does not change the spec_hash. Second, a seeded PCG64 RNG: all stochastic operations (bootstrap CI, permutation test) use NumPy's PCG64 generator with a seed derived from the spec_hash, so the resamples are identical across platforms. Third, Python 3.12+ exclusively: a CPython change (gh-100946) altered the sum() implementation for homogeneous float lists in Python 3.12. The result can differ from 3.11 by 1 ULP, and that difference propagates through the bootstrap into a completely different result_hash.
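The first two decisions can be illustrated with a short sketch. The exact canonicalization rules and the way the seed is derived from the spec_hash are not spelled out here, so the choices below (compact separators, UTF-8 encoding, the first 128 bits of the digest as the seed) are assumptions, and the function names are hypothetical.

```python
import hashlib
import json

import numpy as np


def spec_hash(spec: dict) -> str:
    """Hash a spec after canonical serialization.

    sort_keys=True sorts dict keys at every nesting level. The compact
    separators and UTF-8 encoding are assumptions of this sketch.
    """
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def rng_from_spec_hash(digest_hex: str) -> np.random.Generator:
    """Build a PCG64-backed generator seeded from the spec hash.

    Assumption: the seed is the first 128 bits of the digest; the engine's
    real derivation may differ.
    """
    seed = int(digest_hex[:32], 16)
    return np.random.Generator(np.random.PCG64(seed))


# Two specs with the same content but different key order.
spec_a = {"window": 20, "rules": {"entry": "x", "exit": "y"}}
spec_b = {"rules": {"exit": "y", "entry": "x"}, "window": 20}
assert spec_hash(spec_a) == spec_hash(spec_b)

# The same hash yields the same resampling indices on every run.
idx_1 = rng_from_spec_hash(spec_hash(spec_a)).integers(0, 100, size=5)
idx_2 = rng_from_spec_hash(spec_hash(spec_b)).integers(0, 100, size=5)
assert np.array_equal(idx_1, idx_2)
```

Deriving the seed from the spec_hash means the spec alone determines the random stream, so no separate seed has to be stored or transmitted.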

The Python 3.12 requirement is a hard constraint, not a preference. For all four reference fixtures (toy, jakarta_temperature, rapture_family, btc_pred_hedge), the engine has been verified to produce different hashes on 3.11 and 3.12. The /engine/determinism page documents the investigation in full.
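To see why a last-bit difference is enough to change a hash, the sketch below compares plain left-to-right float summation with Neumaier compensated summation, the algorithm CPython's sum() adopted for floats in 3.12. Both are re-implemented by hand so the example behaves the same on any interpreter. The float_digest helper is hypothetical and simply hashes the 8-byte IEEE 754 encoding of a value; it stands in for whatever the engine hashes downstream.

```python
import hashlib
import struct


def naive_sum(values):
    """Plain left-to-right accumulation, as in sum() before Python 3.12."""
    total = 0.0
    for x in values:
        total += x
    return total


def neumaier_sum(values):
    """Neumaier compensated summation: track the rounding error separately."""
    total = 0.0
    comp = 0.0  # running compensation for lost low-order bits
    for x in values:
        t = total + x
        if abs(total) >= abs(x):
            comp += (total - t) + x  # low-order bits of x were lost
        else:
            comp += (x - t) + total  # low-order bits of total were lost
        total = t
    return total + comp


def float_digest(value: float) -> str:
    """SHA-256 of the little-endian IEEE 754 double encoding of value."""
    return hashlib.sha256(struct.pack("<d", value)).hexdigest()


data = [0.1] * 10
a = naive_sum(data)
b = neumaier_sum(data)
print(a, b)                                # 0.9999999999999999 1.0
print(float_digest(a) == float_digest(b))  # False
```

The two totals differ only in the last bit, yet their digests share nothing, so any statistic built on such a sum ends up with a different hash.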

Engine version is pinned in every receipt. A receipt from batter@0.4.2 is reproducible only with batter@0.4.2 under Python 3.12+.
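A verifier could enforce both pins before re-running anything. The guard below is a hypothetical sketch: it assumes the receipt exposes engine_version as a plain version string such as "0.4.2" and that the caller supplies the installed engine version. It is not part of batter.

```python
import sys


def check_environment(receipt: dict, installed_engine_version: str,
                      python_version=sys.version_info) -> None:
    """Raise RuntimeError if this environment cannot reproduce the receipt.

    Assumptions: receipt["engine_version"] is a version string, and an exact
    string match is required (no compatible-range logic).
    """
    major_minor = tuple(python_version[:2])
    if major_minor < (3, 12):
        raise RuntimeError(
            f"Python 3.12+ required, found {major_minor[0]}.{major_minor[1]}"
        )
    pinned = receipt["engine_version"]
    if installed_engine_version != pinned:
        raise RuntimeError(
            f"receipt pins engine {pinned}, installed {installed_engine_version}"
        )


receipt = {"engine_version": "0.4.2"}
check_environment(receipt, "0.4.2", python_version=(3, 12, 0))  # passes silently
```

Failing fast here is cheaper than discovering a hash mismatch after a full re-run, and the error message says which pin was violated.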

Related