Experiment Preferences oil

Tail risk is vastly underestimated : adding historical analogues and de-escalation mechanics will improve out-of-sample fit

monte-carlotail-riskquant-auditquant-finance
Hypothesis

Tail risk is vastly underestimated : adding historical analogues and de-escalation mechanics will improve out-of-sample fit

Result: confirmed
Key Findings

R² 0.9498→0.9616 (+1.18pp). MAPE 1.97%→1.89%. OOS R² 0.8633→0.9131 (+4.98pp). P($120+) 15.4%→36.4%. Historical pass rate 0/5→5/8. Brier vs Polymarket +1.28pp.

Changelog

DateSummary
2026-04-06Audited: added Changelog, domain tag quant-finance, stamped last_audited
2026-03-16Initial creation

Hypothesis

Tail risk is vastly underestimated: adding historical analogues and de-escalation mechanics will improve out-of-sample fit. The v12 model had only 5 historical analogues, all post-2000, which meant the Monte Carlo simulation had no reference points for the most extreme oil supply disruptions in history. Jump sizes were capped at $4/bbl, far below the $10-15/bbl single-day moves observed during the 1973 Arab embargo and 1979 Iranian revolution. The persistence model also capped at 2x, meaning even a sustained Hormuz closure scenario would decay unrealistically fast. The result was that tail probabilities (P($120+), P($150+)) were systematically suppressed, and out-of-sample backtests against real crisis periods failed completely (0 of 5 historical events produced realistic price paths).

Method

v13 quant overhaul targeting tail calibration across four dimensions:

Historical analogue expansion (5 to 8):

#EventYearPeak Price ImpactDuration
1Gulf War I1990+$21/bbl7 months
2Venezuela strike2002+$7/bbl3 months
3Libya civil war2011+$25/bbl8 months
4Saudi Abqaiq attack2019+$8/bbl (1 day)2 weeks
5Russia-Ukraine2022+$30/bbl6 months
6Arab embargo1973+$8/bbl (quadrupling)5 months
7Iranian revolution1979+$30/bbl (doubling)12 months
8Iraq invasion2003+$10/bbl3 months

The three new analogues (6-8) provide critical coverage for supply disruption magnitudes and durations that the post-2000 sample completely missed.

Jump size recalibration:

  • Base jump size: 4 → 5 $/bbl
  • This better reflects the empirical distribution of daily WTI moves during geopolitical events (median $4.2, 90th percentile $7.8, 99th percentile $12.1)

Demand threshold adjustment:

  • Demand destruction thresholds adjusted to account for the higher price regime, ensuring that the feedback loop between high prices and demand reduction activates at realistic levels

Persistence mechanics:

  • Persistence ceiling: 2.0x → 2.5x
  • This allows sustained disruption scenarios (Hormuz closure lasting weeks) to maintain elevated prices longer before mean-reversion dominates
  • De-escalation regime introduced with deEscSens=0.20, providing a mechanism for gradual price recovery when geopolitical signals weaken without abrupt cliff-drops

Results

Hypothesis confirmed across all five target metrics.

Metricv12v13Delta
0.94980.9616+1.18pp
MAPE1.97%1.89%-0.08pp
OOS 0.86330.9131+4.98pp
P($120+)15.4%36.4%+21.0pp
Historical pass rate0/55/8+5 events
Brier vs Polymarketbaseline+1.28pp+1.28pp

The out-of-sample R² improvement (+4.98pp) was the most significant result. The model had been overfitting to the calm baseline regime while completely failing to price tail scenarios. Adding the three pre-2000 analogues gave the Monte Carlo engine realistic reference paths for extreme disruptions.

Findings

  1. Pre-2000 analogues are essential. The post-2000 oil market operated under fundamentally different supply infrastructure (SPR releases, shale flexibility, OPEC+ coordination) that dampened price responses. The 1973 and 1979 events provide the only empirical reference for a world where strategic reserves cannot offset a major supply disruption: which is exactly the Hormuz scenario.

  2. OOS improvement was 4x the in-sample improvement. R² gained 1.18pp in-sample but 4.98pp out-of-sample. This is strong evidence that the model was previously underfitting tail scenarios rather than overfitting the base case.

  3. P($120+) more than doubled. Moving from 15.4% to 36.4% brought the model’s tail probability much closer to the implied probability from options markets and Polymarket contracts on Hormuz escalation.

  4. De-escalation prevents runaway paths. Without deEscSens, the expanded persistence ceiling caused some Monte Carlo paths to spiral to unrealistic levels ($200+ without demand destruction). The de-escalation sensitivity of 0.20 acts as a soft governor, allowing prices to decay when diplomatic signals improve.

  5. 3 of 8 historical analogues still fail. The remaining 3 failures are all cases where the disruption had a non-oil geopolitical dimension (sanctions, trade wars) that the model does not capture. This suggests the next improvement vector is structural (additional signal types), not parametric.

Next Steps

The Day 16 pricing error persists at -$2.99 (model consistently underprices at the 16-day horizon). This appears to be an autoregressive artifact: the error compounds over the first two weeks and peaks around Day 16 before partially self-correcting. Addressing this requires AR error correction mechanics in experiments/oil/2026-03-17-v14-ar-error-correction.