Oil v16 quant audit: +11.21pp Brier edge vs Polymarket, 82% win probability

The model beats Polymarket by 11.21 percentage points on Brier scoring, with an 82% win probability. That’s not in-sample performance; it’s external validation against real market prices. The model is production-grade.
Context
In-sample R² is a necessary quality check, but it’s not sufficient for a pricing model used for active trading decisions. R² measures how well the model fits its own training data. What matters for practical use is whether the model’s probabilistic forecasts are better calibrated than the market itself. The gold standard for probabilistic calibration is Brier scoring against a liquid prediction market like Polymarket.
Before this experiment, the oil model’s v13-v16 chain had produced strong in-sample improvements, but there was no external benchmark. The 3-round quant audit was designed to answer: is this model actually good, measured against what the market believes?
What Changed
A 3-round quantitative audit protocol compared the model’s probability forecasts directly against Polymarket market prices on identical questions (WTI price above/below thresholds at specific dates). Brier scores were computed for both the model and the market. For binary questions like these, the Brier score is the mean squared difference between the forecast probability and the 0/1 outcome, so it runs from 0 to 1 and lower is better. A positive Brier edge means the model’s score is lower than the market’s, i.e. the model’s forecasts were more accurate.
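The comparison described above can be sketched in a few lines of Python. This is a minimal illustration, not the audit's actual code: the function names are hypothetical, the sample probabilities are invented, and it assumes the edge is reported as market Brier minus model Brier, scaled to percentage points.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probabilities and 0/1 outcomes."""
    if len(probs) != len(outcomes) or len(probs) == 0:
        raise ValueError("probs and outcomes must be non-empty and equal length")
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_edge_pp(
    model_probs: Sequence[float],
    market_probs: Sequence[float],
    outcomes: Sequence[int],
) -> float:
    """Market Brier minus model Brier, in percentage points.

    Assumed sign convention: positive means the model scored better.
    """
    market = brier_score(market_probs, outcomes)
    model = brier_score(model_probs, outcomes)
    return 100.0 * (market - model)


# Invented illustration data: four "WTI above threshold at date" questions.
model_probs = [0.9, 0.2, 0.7, 0.1]   # model's probability of "yes"
market_probs = [0.6, 0.4, 0.5, 0.3]  # market price read as probability of "yes"
outcomes = [1, 0, 1, 0]              # 1 = resolved yes, 0 = resolved no

print(f"model Brier:  {brier_score(model_probs, outcomes):.4f}")
print(f"market Brier: {brier_score(market_probs, outcomes):.4f}")
print(f"edge:         {brier_edge_pp(model_probs, market_probs, outcomes):+.2f}pp")
```

Scoring both forecasters on the same questions and the same outcomes is what makes the difference meaningful; a Brier score on its own depends heavily on how hard the questions were.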
The audit also validated the sell model (introduced in v16) independently: 100% trigger accuracy, 0 false sells across all 3 rounds.
Impact
Before: model accuracy measured only by in-sample R². No external probabilistic benchmark.
After: +11.21pp Brier edge vs Polymarket, 82% win probability. One-step-ahead MAE = $0.833, direction accuracy = 93.8%. Sell model: 100% trigger accuracy, 0 false sells, 24.6% CVaR99 reduction.
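For readers unfamiliar with the two point-forecast metrics, the sketch below shows one common way to compute them. It is an assumption-laden illustration: the function names and prices are made up, and it assumes direction accuracy compares the sign of the predicted move with the sign of the realized move, both measured from the previous price. The audit may define it differently.

```python
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Average absolute gap between predicted and realized prices, in dollars."""
    if len(predicted) != len(actual) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def direction_accuracy(
    previous: Sequence[float],
    predicted: Sequence[float],
    actual: Sequence[float],
) -> float:
    """Share of steps where predicted and realized moves have the same sign.

    Assumption: a step with zero predicted or zero realized move counts as a miss.
    """
    if not (len(previous) == len(predicted) == len(actual)) or len(actual) == 0:
        raise ValueError("inputs must be non-empty and equal length")
    hits = sum(
        1
        for prev, p, a in zip(previous, predicted, actual)
        if (p - prev) * (a - prev) > 0
    )
    return hits / len(actual)


# Invented one-step-ahead WTI prices in USD, for illustration only.
previous = [70.0, 71.0, 69.5, 70.5]   # last observed price before each forecast
predicted = [70.8, 70.2, 70.4, 70.0]  # model's next-step forecast
actual = [71.0, 69.5, 70.5, 71.2]     # realized next-step price

print(f"MAE: ${mean_absolute_error(predicted, actual):.3f}")
print(f"direction accuracy: {direction_accuracy(previous, predicted, actual):.1%}")
```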
The +11.21pp Brier edge is the headline result. An 82% win probability means the model beats the market’s probabilistic forecast in roughly 4 of 5 comparable prediction windows. This is the first external validation indicating that the model is production-grade, not just optimized in-sample.
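One way to read the "4 of 5 windows" interpretation is as a per-window win rate: the share of prediction windows in which the model's Brier score was lower than the market's. The sketch below assumes that reading; the audit may instead derive the 82% figure by another method (a bootstrap, for example), and the function name and scores here are hypothetical.

```python
from typing import Sequence


def window_win_rate(model_briers: Sequence[float], market_briers: Sequence[float]) -> float:
    """Fraction of windows where the model's Brier score is strictly lower.

    Assumption: ties count as losses for the model.
    """
    if len(model_briers) != len(market_briers) or len(model_briers) == 0:
        raise ValueError("inputs must be non-empty and equal length")
    wins = sum(1 for m, k in zip(model_briers, market_briers) if m < k)
    return wins / len(model_briers)


# Invented per-window Brier scores, for illustration only.
model_briers = [0.04, 0.10, 0.20, 0.05, 0.08]
market_briers = [0.16, 0.12, 0.15, 0.09, 0.20]

print(f"win rate: {window_win_rate(model_briers, market_briers):.0%}")
```

A win rate and an average edge answer different questions: the first says how often the model was better, the second by how much, so the two figures are reported together.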