Oil model v18.1 beats Polymarket by +2.16pp Brier score

Before: a stale, 10-day-old model emitting degenerate 0%/100% probabilities caused by a Monte Carlo time-window bug. After: R²=0.9134, MAPE=2.26%, a uniform Brier score that beats Polymarket by 2.16pp, and a 55% win probability across all thresholds.

April 3, 2026

breakthrough, oil, quantitative

Context

The oil model had been sitting at v17 for ten days, long enough that the forward-month Monte Carlo was quietly broken in a way the fit metrics hid. Polymarket, the public prediction market on crude-price thresholds, was absorbing my mis-priced signal for free. Worse, the model was outputting degenerate probabilities (0% or 100%) on a non-trivial share of days, which means it had stopped expressing uncertainty at all. A model that never says “maybe” is a model that cannot be wrong in a useful way. Beating the crowd on a prediction market requires both calibration and sharpness. Polymarket had better calibration than v17.
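To make the calibration-versus-sharpness point concrete, here is a minimal sketch of how a Brier score punishes a forecaster that never hedges. The outcome and probability values are made up for illustration, and the helper names brier and degenerate_share are hypothetical, not taken from the model's code.

```python
import numpy as np

def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    probs = np.asarray(probs, dtype=float)
    outcomes = np.asarray(outcomes, dtype=float)
    return float(np.mean((probs - outcomes) ** 2))

def degenerate_share(probs, eps=1e-9):
    """Fraction of forecasts pinned at 0 or 1 (within eps)."""
    probs = np.asarray(probs, dtype=float)
    return float(np.mean((probs <= eps) | (probs >= 1.0 - eps)))

# Illustrative data only: the threshold was hit on 3 of 4 days.
outcomes = [1, 1, 0, 1]
overconfident = [1.0, 1.0, 1.0, 1.0]  # never says "maybe"
hedged = [0.8, 0.8, 0.4, 0.8]         # expresses uncertainty

print("overconfident:", brier(overconfident, outcomes), degenerate_share(overconfident))
print("hedged:       ", brier(hedged, outcomes), degenerate_share(hedged))
```

In this toy case the single confident miss costs the all-or-nothing forecaster more than the hedged forecaster loses across all four days, which is the sense in which degenerate probabilities cannot be wrong in a useful way.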

What Changed

I ran a 28-step Karpathy ratchet (a pattern where every candidate change must either improve the target metric or be rejected, with each decision logged in a visible ledger) after finding a critical time-window bug inside the Monte Carlo engine. The MC was tracking a peak variable keyed to mar31Day but comparing it against April Polymarket markets. The condition day > mar31Day made every simulation lock to the actual realized peak, producing the degenerate 0% and 100% probabilities. Once an MC has seen the future, its variance collapses.
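The variance collapse can be illustrated with a toy simulation. This is a sketch under stated assumptions, not the engine's actual code: the random-walk price model, the prices, the volatility, and the function names are all hypothetical. It only shows why locking every path to one realized peak forces each threshold probability to 0 or 1.

```python
import numpy as np

def simulate_peaks(start_price, n_days, n_paths, daily_vol, rng):
    """Peak of each simulated path over the forecast window (log random walk)."""
    shocks = rng.normal(0.0, daily_vol, size=(n_paths, n_days))
    paths = start_price * np.exp(np.cumsum(shocks, axis=1))
    return paths.max(axis=1)

def threshold_probs(peaks, thresholds):
    """Share of paths whose peak reaches each threshold."""
    return {t: float(np.mean(peaks >= t)) for t in thresholds}

rng = np.random.default_rng(0)
thresholds = [100.0, 105.0, 110.0, 120.0]  # illustrative price levels

# Buggy behaviour (assumed form): every path carries the same realized peak,
# so there is no spread left across simulations.
realized_peak = 107.0
locked_peaks = np.full(5000, realized_peak)
buggy = threshold_probs(locked_peaks, thresholds)

# Fixed behaviour: the peak is simulated over the market's own window.
sim_peaks = simulate_peaks(100.0, 21, 5000, 0.02, rng)
fixed = threshold_probs(sim_peaks, thresholds)

print("buggy:", buggy)
print("fixed:", fixed)
```

With the locked peak, each threshold is either below or above a single number, so the probabilities can only be 0 or 1. With simulated peaks, the spread across paths yields intermediate probabilities.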

Fixing the time window was step one. Step two externalized five previously-hardcoded forward MC parameters into parameters.json so the ratchet could perturb them systematically. Step three ran the automated ratchet to completion: 6 accepted parameter changes, 22+ rejected. The model now outperforms Polymarket on a uniform Brier score for the first time across the full threshold grid. The repository was initialized and pushed to github.com/Alex-Zeo/cl-futures-hormuz.
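The accept-or-reject loop over externalized parameters can be sketched as follows. This is a minimal illustration, assuming a flat JSON file of numeric parameters and a score where lower is better. The parameter names (vol, drift), the candidate values, and the toy score function are hypothetical stand-ins for the real forward MC parameters and target metric.

```python
import json
import os
import tempfile

def ratchet(params, candidates, score_fn):
    """Try each candidate change; keep it only if the score strictly improves."""
    best = score_fn(params)
    ledger = []
    for name, value in candidates:
        trial = dict(params)
        trial[name] = value
        score = score_fn(trial)
        accepted = score < best
        ledger.append({"param": name, "value": value,
                       "score": score, "accepted": accepted})
        if accepted:
            params, best = trial, score
    return params, best, ledger

def toy_score(p):
    """Stand-in for the real metric; minimized at vol=0.3, drift=0.01."""
    return (p["vol"] - 0.3) ** 2 + (p["drift"] - 0.01) ** 2

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "parameters.json")
    with open(path, "w") as f:
        json.dump({"vol": 0.5, "drift": 0.0}, f)  # hypothetical starting values

    with open(path) as f:
        start = json.load(f)

    candidates = [("vol", 0.4), ("vol", 0.6), ("drift", 0.01),
                  ("vol", 0.3), ("drift", 0.05)]
    final, best, ledger = ratchet(start, candidates, toy_score)

    with open(path, "w") as f:
        json.dump(final, f, indent=2)  # persist only the accepted state

n_accepted = sum(entry["accepted"] for entry in ledger)
print(final, n_accepted, len(ledger) - n_accepted)
```

Because rejected candidates never touch the saved parameters, the score can only move in one direction, and the ledger keeps a record of every attempt, including the failures.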

Impact

R²: 0.9046 to 0.9134, a gain of 0.0088 (about 0.9 percentage points) on an already high baseline
MAPE: 2.39% to 2.26%
Residual autocorrelation (AC): 0.512 to 0.380, repairing a broken AC structure that had been hiding in the residuals
Late MAE: $3.58 to $3.22
Uniform Brier beats Polymarket by 2.16 percentage points
55% win probability across all thresholds, up from roughly a coin flip
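For reference, the metrics listed above can be computed along these lines. This is a sketch with made-up numbers. It assumes that "uniform Brier" means the per-threshold Brier scores averaged with equal weight, that the margin is the difference in Brier scores multiplied by 100, and that the autocorrelation is taken at lag 1. None of these definitions is confirmed by the source.

```python
import numpy as np

def r_squared(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    ss_res = np.sum((actual - predicted) ** 2)
    ss_tot = np.sum((actual - actual.mean()) ** 2)
    return float(1.0 - ss_res / ss_tot)

def mape(actual, predicted):
    actual, predicted = np.asarray(actual, float), np.asarray(predicted, float)
    return float(np.mean(np.abs((actual - predicted) / actual)) * 100.0)

def lag1_autocorr(residuals):
    r = np.asarray(residuals, float)
    r = r - r.mean()
    return float(np.sum(r[1:] * r[:-1]) / np.sum(r * r))

def uniform_brier(probs, outcomes):
    """Equal-weight mean of per-threshold Brier scores.

    probs, outcomes: arrays of shape (n_thresholds, n_days).
    """
    probs, outcomes = np.asarray(probs, float), np.asarray(outcomes, float)
    per_threshold = np.mean((probs - outcomes) ** 2, axis=1)
    return float(per_threshold.mean())

# Illustrative prices only.
actual = [100.0, 102.0, 101.0, 105.0]
predicted = [101.0, 101.0, 102.0, 104.0]
residuals = np.subtract(actual, predicted)

# Illustrative forecasts: two thresholds, two days.
outcomes = [[1, 1], [0, 0]]
model = [[0.8, 0.7], [0.3, 0.2]]
market = [[0.6, 0.6], [0.4, 0.4]]
margin_pp = (uniform_brier(market, outcomes) - uniform_brier(model, outcomes)) * 100.0

print(r_squared(actual, predicted), mape(actual, predicted),
      lag1_autocorr(residuals), margin_pp)
```

A positive margin_pp means the model's Brier score is lower (better) than the market's under this equal-weight convention.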

The v18.1 release is the first version where the model is both a research artifact and a competitive forecaster. Every later improvement (v19, v20, the auto-refresh launchd plist, the R²=0.904 MAPE=2.42% production build) rides on the time-window fix and the externalized parameters introduced here. Without the MC fix, every downstream optimization was noise on top of a corrupt signal.

Source

experiments/oil/2026-04-04-timesfm-plateau-detection