Why We Built Pancake

2026-05-29 · Michael Mustopo

The problem

LLMs are writing trading strategies. Not theoretically — right now, today, agents running on any of the major frontier models are producing strategy specs, submitting them to backtesting engines, and receiving Sharpe ratios in return. Some of those Sharpe ratios are good. Some are hallucinated. Most teams cannot tell the difference after the fact.

The output of an LLM-generated strategy run through a conventional backtester is a number. Just a number. No record of what data was used. No record of what the agent claimed versus what the backtester actually derived. No way to check whether the impressive Sharpe came from a genuinely edge-positive strategy or from a dataset that was quietly dominated by near-certain outcomes.

That gap is not hypothetical. We found it in our own work — in a strategy we built ourselves, with receipts we signed ourselves. We will show you exactly where the lie was hiding in another post. The short version: a 100% win rate is not the same as a 100% edge. Without an audit trail, you cannot know which one you are looking at.
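To make the win-rate-versus-edge distinction concrete, here is a toy sketch with made-up numbers. It is not the strategy from our own work, and the definition of edge used here (realized win rate minus the probability already implied by the entry price) is an assumption chosen for illustration:

```python
# Toy numbers for illustration only; not data from any real strategy.
# Each trade buys a binary contract at `price` that pays 1.0 if it wins.
trades = [
    {"price": 0.97, "won": True},
    {"price": 0.98, "won": True},
    {"price": 0.99, "won": True},
    {"price": 0.98, "won": True},
]

# Fraction of trades that won.
win_rate = sum(t["won"] for t in trades) / len(trades)

# The entry price of a binary contract acts as the market-implied win probability.
implied = sum(t["price"] for t in trades) / len(trades)

# Assumed definition of edge: how much more often we won than the price implied.
edge = win_rate - implied

print(f"win rate: {win_rate:.2%}, implied: {implied:.2%}, edge: {edge:.2%}")
```

Every trade wins, yet the edge over what the market already priced in is about two percentage points. A bare win rate hides that; a record of the entry prices exposes it.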

The wedge: receipts

The core primitive in Pancake is the receipt — a verifiable artifact produced every time a backtest runs. Each receipt records three things explicitly:

What Pancake verified. The math it ran, the data integrity checks it performed, the formulas it computed from raw inputs.
What the agent supplied as evidence. Inputs the agent provided that Pancake accepted but could not re-derive on its own — entry price sources, liquidity assumptions, feature columns. These appear on the receipt as agent_supplied_evidence, not as verified facts.
What the receipt does not cover. Market impact, execution latency, regime shifts, resolver risk — real risks the backtest window cannot model. These appear as unmodeled_risks, not buried in fine print.
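As a rough illustration, the three sections could be modeled as below. The field names agent_supplied_evidence and unmodeled_risks come from the description above; the name verified, the Receipt class, and the example values are assumptions for this sketch, not Pancake's actual schema:

```python
from dataclasses import dataclass, field, asdict
import json


@dataclass
class Receipt:
    # Hypothetical field name: checks the engine ran and re-derived itself.
    verified: list[str] = field(default_factory=list)
    # Inputs accepted from the agent but not independently re-derived.
    agent_supplied_evidence: list[str] = field(default_factory=list)
    # Real risks the backtest window cannot model.
    unmodeled_risks: list[str] = field(default_factory=list)


receipt = Receipt(
    verified=["data integrity checks", "formulas computed from raw inputs"],
    agent_supplied_evidence=[
        "entry price sources",
        "liquidity assumptions",
        "feature columns",
    ],
    unmodeled_risks=[
        "market impact",
        "execution latency",
        "regime shifts",
        "resolver risk",
    ],
)

# Serialize to JSON so the three sections stay visibly separate.
receipt_json = json.dumps(asdict(receipt), indent=2)
print(receipt_json)
```

Keeping the three lists as separate top-level fields means a reader can never mistake an agent-supplied input for a verified fact.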

Each receipt lives at a stable URL, /r/<short_id>. Its result_hash is cryptographically tied to the exact inputs that produced it. Anyone with the URL (the agent, the user, a future auditor) can inspect what was actually verified versus what was assumed. Receipts do not expire, and they do not change.
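One common way to tie a hash to exact inputs is to hash a canonical serialization of them. The sketch below assumes SHA-256 over sorted-key JSON; the function compute_result_hash and the input fields are hypothetical, and Pancake's actual hashing scheme may differ:

```python
import hashlib
import json


def compute_result_hash(inputs: dict) -> str:
    # Canonical form: sorted keys and fixed separators, so the same inputs
    # always serialize to the same bytes regardless of key order.
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical backtest inputs, for illustration only.
inputs = {"strategy": "example", "window": ["2025-01-01", "2025-06-30"], "fee_bps": 10}
reordered = {"fee_bps": 10, "window": ["2025-01-01", "2025-06-30"], "strategy": "example"}
changed = dict(inputs, fee_bps=11)

# Same inputs in a different key order produce the same hash.
same_hash = compute_result_hash(inputs) == compute_result_hash(reordered)
# Changing any single input produces a different hash.
different_hash = compute_result_hash(inputs) != compute_result_hash(changed)

print(same_hash, different_hash)
```

Under this scheme, anyone holding the inputs can recompute the hash and compare it to the one on the receipt.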

This is not a new idea in software. Version control, build artifacts, signed commits — developers have had audit trails for code for decades. Backtesting has not had the equivalent. Pancake is the receipt layer that closes that gap.

Why now

MCP (the Model Context Protocol) changed the distribution calculus for developer tools. An agent can now call Pancake the same way it calls a calculator — via a typed tool interface, in the same loop where it generates the strategy. That means verification can happen inside the generation loop, not as a downstream audit step that never actually happens.

Before MCP, baking verification into an LLM workflow required bespoke integration: a wrapper around the LLM, a custom API call, a separate results page the user had to visit manually. The friction was high enough that teams skipped it. The backtest number landed in the chat window and got copy-pasted into a doc. No receipt. No audit trail.

With MCP, the agent calls run_evidence_backtest and gets a receipt URL back in the same tool response. The receipt is part of the conversation artifact. It gets cited, linked, shared. The verification is not an optional step the user has to remember to take — it is the output format.
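At the protocol level, an MCP tool call is a JSON-RPC 2.0 request with the method tools/call. The sketch below builds such a request for run_evidence_backtest and pulls a receipt path out of a stubbed response. The argument name strategy_spec, the response text, and the short ID abc123 are assumptions for illustration, not Pancake's documented interface:

```python
import json
import re


def build_tool_call(request_id: int, arguments: dict) -> dict:
    # Shape of an MCP "tools/call" request (JSON-RPC 2.0).
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "run_evidence_backtest", "arguments": arguments},
    }


def extract_receipt_path(result: dict):
    # MCP tool results carry a list of content items; scan the text items
    # for a path of the form /r/<short_id>.
    for item in result.get("content", []):
        if item.get("type") == "text":
            match = re.search(r"/r/[A-Za-z0-9_-]+", item["text"])
            if match:
                return match.group(0)
    return None


# Hypothetical argument name and value.
request = build_tool_call(1, {"strategy_spec": "buy-and-hold example"})
print(json.dumps(request, indent=2))

# Stubbed tool result standing in for a real server response.
stub_result = {
    "content": [{"type": "text", "text": "Backtest complete. Receipt: /r/abc123"}],
    "isError": False,
}
receipt_path = extract_receipt_path(stub_result)
print(receipt_path)
```

Because the receipt path arrives in the same tool result as the backtest output, the agent can cite it in its next message without any extra step.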

The timing matters for another reason: the agents are getting better fast. The gap between “an LLM generated this strategy” and “a human expert designed this strategy” is shrinking. The gap between “this backtest was verified” and “this backtest was not verified” is not shrinking — it is a structural property of whether you have a receipt or you do not. We think that distinction is going to matter more, not less, as agents get more capable.

What we are building

Pancake is a verifiable math layer for LLMs. The engine is open-source (Apache-2.0), deterministic across platforms, and citable as a software artifact. The MCP surface is six tools. Each receipt is available in three formats: human-readable HTML, Markdown for LLM consumption, and JSON for programmatic access.

We are not building a trading platform. We are not building a signals marketplace. We are building the infrastructure that lets an agent say, honestly: here is what I verified, here is what I assumed, here is what I could not model — and produce a document that makes that statement checkable by anyone with a URL.

If you are building with LLMs and you are touching financial strategies, prediction markets, or any domain where “the model said so” is not good enough — Pancake is the receipt layer you are missing.