Visual Inspection Catches What Metrics Miss

April 13, 2026

pitfall calibration quant-finance validation

A Human Looking at the Chart Caught What Every Automated Metric Missed

What Happened

The oil model’s Karpathy ratchet ran two rounds of optimization and reported “BEATING BOTH PLATFORMS ON ALL THRESHOLDS”, with a +25.52pp Brier advantage over Polymarket and +2.23pp over Kalshi. Every automated metric said the model was winning.

A human looked at the forecast chart and asked: “why are we forecasting oil will be $149 by April 15? isn’t that like a 1.5x multiplier from where we are today?”

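WTI was at $92.38. The model’s median 1-week forecast was $118, the P75 was $131, and the mean peak was $139. The mean peak is roughly 1.5x spot, and even the median implies a rise of about 28% in one week. A 1.5x overnight move would be the largest single-day oil move in history, yet the model was treating moves of this scale as its central expectation.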
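To make the scale concrete, the implied moves can be computed directly from the prices quoted above. This is a minimal sketch: the names `SPOT`, `forecasts`, and `implied_move` are illustrative and not taken from the model’s code.

```python
# Implied percentage moves relative to spot, using the prices quoted in the post.
SPOT = 92.38  # WTI spot price in USD

forecasts = {
    "median (1-week)": 118.0,
    "P75": 131.0,
    "mean peak": 139.0,
    "April 15 figure": 149.0,  # the number the human reviewer questioned
}


def implied_move(target: float, spot: float = SPOT) -> float:
    """Return the fractional move from spot to target (0.25 means +25%)."""
    return target / spot - 1.0


for label, price in forecasts.items():
    print(f"{label:>16}: ${price:.0f} is {implied_move(price):+.1%} vs spot")
```

The four figures work out to rises of roughly 28%, 42%, 50%, and 61% over spot.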

No automated metric flagged this because:

The Brier score under a uniform reference rewarded maximum bullishness (Goodhart’s Law).
R² and MAPE measured historical fit (backward-looking), not forward reasonableness.
The calibration penalty wasn’t implemented yet.
Per-threshold comparisons showed “model wins” at every threshold because the scoring function was gameable.

Root Cause

The automated pipeline had no “sanity check” step. Every metric was either backward-looking (R², MAPE, direction accuracy) or computed against a gameable reference (Brier under uniform). No metric asked the simple question: “is a 60% rally in 16 days a reasonable median forecast?”

The human spotted it instantly by looking at the Y-axis of the forecast chart: the line went from $92 today to $118-$149 within days. No quant model should project a 1.5x move as the median outcome without extraordinary justification.

Fix

Three layers:

Calibration-proper scoring: a market-implied reference plus a squared-deviation penalty prevents the ratchet from gaming the metric.
Structural MC fixes: a proportional anchor, known-ceasefire injection, and faster recovery rates keep the forward forecast consistent with the spot price.
Visual validation: the forecast chart on the public dashboard serves as a permanent sanity check. If the model projects absurd moves, anyone can see it.
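The first layer can be sketched as follows. The post does not give the exact formula, so this is one plausible reading: the model is scored against the market-implied probability instead of a uniform one, and pays a penalty proportional to its squared deviation from the market. The function names and the weight `lam` are assumptions made for illustration.

```python
def brier(p: float, outcome: int) -> float:
    """Standard Brier score for one binary event (lower is better)."""
    return (p - outcome) ** 2


def penalized_score(p_model: float, p_market: float, outcome: int,
                    lam: float = 1.0) -> float:
    """Brier score plus a squared-deviation penalty from the market.

    `lam` is a hypothetical weight; the post does not state one.
    """
    return brier(p_model, outcome) + lam * (p_model - p_market) ** 2


def skill_vs_market(p_model: float, p_market: float, outcome: int,
                    lam: float = 1.0) -> float:
    """Positive means the model beat the market reference after the penalty."""
    return brier(p_market, outcome) - penalized_score(p_model, p_market, outcome, lam)


# Illustrative numbers only: the market prices a threshold at 10%,
# while an overly bullish model says 90%.
for outcome in (1, 0):
    s = skill_vs_market(p_model=0.9, p_market=0.1, outcome=outcome)
    print(f"outcome={outcome}: skill vs market = {s:+.2f}")
```

Under this reading, a forecast that strays far from the market keeps only a small part of its gain when it is right and loses heavily when it is wrong, so pushing every probability to the bullish extreme is no longer a cheap way to raise the score.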

Lessons

Automated metrics are necessary but not sufficient. A human looking at the output for 5 seconds caught what 60 ratchet iterations, 5 diagnostic scripts, and 18 per-threshold Brier comparisons all missed.
Forward forecasts need sanity bounds. Any model that projects a median move of more than 20% in a week should trigger an automatic flag, regardless of what the scoring metrics say.
Deploy the dashboard early. The public dashboard at alejandro-gutierrez.com/projects/oil/ makes the model’s predictions visible. If it shows absurd numbers, someone will notice. That’s a feature, not a bug.
The simplest question is the most powerful audit. “Does this number make sense?” beats any scoring rule.
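The sanity-bound lesson above translates into a very small check. This is a sketch of how such a flag might look, assuming a one-week horizon and the 20% limit named in the lesson; `flag_median_move` and its parameters are hypothetical names, not part of the existing pipeline.

```python
def flag_median_move(spot: float, median_forecast: float,
                     max_move: float = 0.20) -> bool:
    """Return True when the median 1-week forecast moves more than
    `max_move` (as a fraction of spot) in either direction."""
    if spot <= 0:
        raise ValueError("spot must be positive")
    move = abs(median_forecast / spot - 1.0)
    return move > max_move


# The forecast from this incident: spot $92.38, median $118.
if flag_median_move(spot=92.38, median_forecast=118.0):
    print("SANITY FLAG: median 1-week move exceeds 20% of spot")
```

Run before any scoring step, a check like this would have flagged the forecast regardless of what the Brier comparison reported.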