
Calibration-proper scoring eliminates Goodhart gaming: model probabilities within 5-28pp of market consensus

Before (v25.1, gamed): model 5-25x more bullish than the market ($120: 76% vs. Polymarket (PM) 16.5%; $150: 49% vs. PM 4%). Monte Carlo (MC) forecast Day 46 median $122 with WTI at $92. CalPen 0.216. No event markers.

After (260414b): MC forecast Day 46 median $97 (was $122). CalPen 0.060. $1 granularity (141 thresholds). Proportional anchor + known ceasefire injection + 5%/day recovery. Public Lab live with R2 pipeline; event markers on all charts.

Tags: breakthrough, oil, quantitative, calibration

What Changed

Discovered that the Karpathy ratchet was gaming the Brier scoring function by exploiting a uniform reference distribution. Replaced it with ICE-exchange-style calibration-proper scoring: a market-implied midpoint reference, a squared-deviation calibration penalty (CalPen) against both Polymarket and Kalshi, an April-Yearly monotonicity gate, and $1 threshold granularity matching Robinhood contracts.
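The scoring components named above can be sketched roughly as follows. This is a minimal illustration, not the lab's actual implementation: the function names, the equal weighting of the two markets, and the reading of the April-Yearly gate (the yearly probability of reaching a threshold must be at least the April probability) are all assumptions. The Kalshi and yearly values in the usage lines are placeholders, since this note does not give them.

```python
from statistics import mean

def market_midpoint(pm, kalshi):
    """Midpoint of two market-implied probabilities for one threshold (assumed reference)."""
    return (pm + kalshi) / 2.0

def calibration_penalty(model, pm, kalshi):
    """Mean squared deviation of model probabilities from both markets.

    Each argument maps threshold ($) -> probability in [0, 1].
    Equal weighting of the two markets is an assumption.
    """
    devs = []
    for threshold, p in model.items():
        devs.append((p - pm[threshold]) ** 2)
        devs.append((p - kalshi[threshold]) ** 2)
    return mean(devs)

def monotonicity_gate(april, yearly, tol=1e-9):
    """True if the yearly probability is >= the April probability at every shared threshold."""
    return all(yearly[k] + tol >= april[k] for k in april if k in yearly)

# April model and Polymarket figures are taken from the metrics in this note.
april_model = {120: 0.449, 150: 0.178, 200: 0.056}
polymarket = {120: 0.165, 150: 0.040, 200: 0.011}
kalshi = dict(polymarket)                          # placeholder: Kalshi quotes are not given here
yearly_model = {120: 0.60, 150: 0.30, 200: 0.10}   # hypothetical values for illustration

print(round(calibration_penalty(april_model, polymarket, kalshi), 4))
print(monotonicity_gate(april_model, yearly_model))
```

With only three thresholds and placeholder Kalshi data, this is not expected to reproduce the reported CalPen of 0.060, which is computed over the full threshold grid.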

The model went from absurdly overconfident (49% chance of $150 oil in 16 days) to structurally justified edges (17.8% for $150, backed by Hormuz blockade dynamics and real-time traffic data).

Key Metrics

  • CalPen: 0.216 -> 0.060 (72% reduction in market deviation)
  • tailProb: 0.50 -> 0.20 (from 50% tail paths to 20%)
  • April $120: 76.2% -> 44.9% (PM: 16.5%, edge: +28pp)
  • April $150: 48.9% -> 17.8% (PM: 4.0%, edge: +14pp)
  • April $200: 27.4% -> 5.6% (PM: 1.1%, edge: +5pp)
  • MC thresholds: 17 -> 141 ($1 granularity)
  • Pitfall documented: topics/pitfalls/brier-uniform-reference-gaming
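
The edge and reduction figures in the list above can be re-derived with simple arithmetic. The short sketch below uses only numbers stated in this note; the helper names are illustrative.

```python
def edge_pp(model_pct, market_pct):
    """Model edge over the market, in percentage points."""
    return model_pct - market_pct

def relative_reduction(before, after):
    """Fractional reduction from `before` to `after`."""
    return (before - after) / before

# (model %, Polymarket %) for the April contracts after the fix
april = {120: (44.9, 16.5), 150: (17.8, 4.0), 200: (5.6, 1.1)}

for threshold, (model, pm) in april.items():
    print(f"${threshold}: edge {edge_pp(model, pm):+.1f}pp")
# $120: edge +28.4pp
# $150: edge +13.8pp
# $200: edge +4.5pp

print(f"CalPen reduction: {relative_reduction(0.216, 0.060):.1%}")
# CalPen reduction: 72.2%
```

The bullets report these edges rounded to whole percentage points.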

Lifecycle Chain

Pitfall (Brier gaming) -> Experiment (calibration fix) -> Breakthrough (calibrated model)

This is the first oil model pitfall caught by manual probability inspection rather than automated metrics. The automated Brier scores all said “MODEL WINS” while the actual probabilities were nonsensical.
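
One way a uniform reference can produce a misleading "MODEL WINS" is sketched below. This is a hypothetical illustration, not a reconstruction of the actual scoring run: the outcomes and the two low thresholds are invented, while the three tail probabilities are the v25.1 and Polymarket figures quoted above. Because a uniform 50% reference always scores 0.25, a model can beat it on easy thresholds while still being far worse than the market on the tail.

```python
def brier(probs, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)

# Hypothetical resolution: two low thresholds are hit, three tail thresholds are not.
outcomes = [1, 1, 0, 0, 0]

# First two entries are invented; last three are the $120/$150/$200 figures from this note.
model_v25_1 = [0.99, 0.95, 0.762, 0.489, 0.274]
market = [0.97, 0.80, 0.165, 0.040, 0.011]
uniform = [0.5] * len(outcomes)

scores = {
    "model": brier(model_v25_1, outcomes),
    "market": brier(market, outcomes),
    "uniform": brier(uniform, outcomes),
}
for name, score in scores.items():
    print(f"{name}: {score:.4f}")

# Under these assumed outcomes the model beats the uniform reference
# yet scores much worse than the market-implied reference.
print("beats uniform:", scores["model"] < scores["uniform"])
print("beats market:", scores["model"] < scores["market"])
```

Scoring against a market-implied reference removes the free win that the uniform baseline hands out, which is the motivation for the change described in this note.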