Journal

Five Experiments, Three Days, One Fixed Criterion

March 15, 2026

ai-agents, ml, data-eng, ops

Five experiments over three days took a WTI crude pricing model from 0 of 5 historical backtests passing to autonomous hourly recalibration, with a +11.21pp Brier-score edge over Polymarket and cumulative mean absolute error (MAE) down 49%. The fixed evaluation criterion is what made five iterations converge instead of thrash.

The oil model chain ran v13 through v17 on March 16-18. Each experiment had one structural target. All five were measured against the same criterion: beat Polymarket's accuracy on WTI CL.

v13 (Mar 16): tail risk recalibration. Added three historical analogues (Arab 1973, Iran 1979, Iraq 2003) and raised jump size from $4 to $5/bbl. OOS r-squared improved +4.98pp (0.8633 to 0.9131). Historical backtest pass rate rose from 0/5 to 5/8. P($120+) more than doubled, from 15.4% to 36.4%. The fix was adding canonical historical analogues, not parameter-tuning.

v14 (Mar 17): AR error correction. A persistent -$2.99 bias at Day 16 was visible in all backtests. arLambda=0.4 and spike reversion mechanics eliminated it (Day 16 error: -$2.99 to +$0.05). One-step MAE dropped 33% ($1.63 to $1.09). Direction accuracy rose +6.6pp to 93.3%. Active parameters were cut from 16 to 8.
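A minimal sketch of what a first-order AR error correction can look like, assuming arLambda scales the previous step's residual before it is added to the next raw forecast. The function names and sample prices are hypothetical, and the spike reversion mechanics are not modelled here. A correction of this form shrinks a persistent bias; it does not by itself reproduce the Day 16 result above.

```python
from typing import Sequence

AR_LAMBDA = 0.4  # value quoted in the text; its exact role in the model is an assumption


def ar_corrected_forecasts(raw: Sequence[float], actual: Sequence[float],
                           lam: float = AR_LAMBDA) -> list[float]:
    """Shift each raw forecast by lam times the previous step's raw residual."""
    corrected = []
    prev_residual = 0.0  # nothing to correct with on the first step
    for forecast, observed in zip(raw, actual):
        corrected.append(forecast + lam * prev_residual)
        prev_residual = observed - forecast  # residual carried into the next step
    return corrected


def mean_abs_error(pred: Sequence[float], actual: Sequence[float]) -> float:
    """Plain mean absolute error over paired sequences."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)


# Hypothetical series with a constant -$3 forecast bias.
actual_prices = [100.0, 101.0, 102.0, 103.0]
raw_forecasts = [97.0, 98.0, 99.0, 100.0]
corrected_forecasts = ar_corrected_forecasts(raw_forecasts, actual_prices)
```

On this toy series the correction lowers MAE because each residual carries information about the next one, which is exactly the situation a persistent bias creates.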

v14b (Mar 17): sigmoid calibration. Replaced the binary Polymarket blend threshold with a smooth sigmoid (center=145, width=5, ceil=0.85). Expanded validation dataset to 10,114 daily WTI prices (1986-2026). MAE dropped 42% ($1.633 to $0.951). Autocorrelation eliminated (0.1651 to -0.001). Largest single-experiment MAE gain in the chain; eliminating autocorrelation is structural, not parametric.
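A sketch of how a sigmoid can replace a hard blend threshold, using the three quoted parameters. The text does not say which quantity the sigmoid is applied to, so the input is simply called x here; the function names and the linear blend of model and market probabilities are also assumptions.

```python
import math


def polymarket_blend_weight(x: float, center: float = 145.0,
                            width: float = 5.0, ceil: float = 0.85) -> float:
    """Smooth weight in (0, ceil): half of ceil at x == center, saturating at ceil."""
    z = (x - center) / width
    return ceil / (1.0 + math.exp(-z))


def blended_probability(model_p: float, market_p: float, x: float) -> float:
    """Linear blend of model and market probability using the sigmoid weight."""
    w = polymarket_blend_weight(x)
    return (1.0 - w) * model_p + w * market_p
```

Unlike a binary threshold, the weight changes continuously near the center, so a small move in x cannot flip the blend from all-model to all-market in one step.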

v16 (Mar 18): 11-trigger sell model built from scratch. Triggers include: take-profit, P&L stops, conviction hold (P>90%, edge>25pp), hopeless position, edge erosion, dust, stale lottery, portfolio circuit breaker. Result: 100% trigger accuracy, zero false sells, CVaR99 reduction of 24.6%. Brier edge vs Polymarket: +11.21pp. A pricing model that cannot systematically exit gives back its entry edge on every trade.

v17 (Mar 18): autonomous real-time consensus pipeline. 6 parallel sub-agents (WTI price, Polymarket odds, Hormuz traffic, military intel, diplomatic intel, supply/economic) running hourly. 8-layer validation stack (evidence quality, tier-based filtering, anti-oscillation, adaptive step sizing, coherence constraints, forward sandbox, backward sandbox, per-parameter drift caps). Manual update cadence (2x/day, 2-4h each) replaced by autonomous 24x/day. Removing the human cadence constraint is a phase change, not an optimization.
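One of the validation layers, per-parameter drift caps, is easy to illustrate. This is a hypothetical sketch, not the pipeline's code: the parameter names, cap values, and the rule that uncapped parameters stay frozen are all assumptions.

```python
def capped_update(current: dict[str, float], proposed: dict[str, float],
                  max_drift: dict[str, float]) -> dict[str, float]:
    """Move each parameter toward its proposed value, limited by its own cap."""
    updated = {}
    for name, value in current.items():
        target = proposed.get(name, value)
        cap = max_drift.get(name, 0.0)  # assumption: no cap listed means frozen
        delta = max(-cap, min(cap, target - value))  # clamp the hourly step
        updated[name] = value + delta
    return updated


# Hypothetical hourly proposal from the sub-agents.
params = {"jump_size": 5.0, "ar_lambda": 0.4}
proposal = {"jump_size": 9.0, "ar_lambda": 0.35}
caps = {"jump_size": 0.5, "ar_lambda": 0.1}
next_params = capped_update(params, proposal, caps)
```

A cap like this bounds how far a single bad hourly run can move the model, which matters more once updates run 24x/day without a human in the loop.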

Cumulative v13-v17: MAE -49% ($1.633 to $0.833), OOS r-squared +6.24pp, Brier score +9.93pp vs Polymarket.

The fixed criterion determined when to stop and what counted as progress. Without a fixed criterion, five experiments would have produced five sets of incomparable numbers. With it, each experiment either advances the metric or doesn’t. That discipline is what turns iterations into convergence.

Transferable insight: A fixed evaluation criterion turns iterations into convergence. Changing the metric between experiments is the structural cause of thrashing.
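A fixed criterion can be as small as one function that every experiment is scored with. The sketch below assumes the "pp edge" is the market's Brier score minus the model's, times 100; the text does not define it, and the function names and sample probabilities are hypothetical.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared difference between forecast probability and 0/1 outcome."""
    return sum((p - o) ** 2 for p, o in zip(probs, outcomes)) / len(probs)


def brier_edge_pp(model_probs: Sequence[float], market_probs: Sequence[float],
                  outcomes: Sequence[int]) -> float:
    """Positive when the model's Brier score is lower (better) than the market's."""
    return 100.0 * (brier_score(market_probs, outcomes)
                    - brier_score(model_probs, outcomes))


# Hypothetical resolved contracts: 1 = event happened, 0 = it did not.
resolved = [1, 0, 1, 0]
model = [0.9, 0.1, 0.8, 0.2]
market = [0.6, 0.4, 0.6, 0.4]
edge = brier_edge_pp(model, market, resolved)
```

Because the function and the resolved outcomes stay the same across versions, the edge from one experiment is directly comparable to the edge from the next.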

Accumulate-Then-Flush: 76% of Sessions Were Dying

Mixin: real-time AI processing anti-pattern

The naive hypothesis: automated lint and review on every file write will catch issues before they accumulate. Partially refuted: per-write AI processing killed 76% of sessions via the 1,500ms hook timeout.

The fix was architectural. Accumulate events during the session at ~1ms each. Single AI call at SessionEnd. Content-hash dedup (SHA-256) prevents the correction loop where the review’s own fix triggers a re-review on the unchanged content.
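The pattern can be sketched in a few lines. This is an illustration under assumptions, not the shipped hook: the class and method names are hypothetical, and `review` stands in for whatever single AI call runs at SessionEnd.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class SessionBuffer:
    review: Callable[[dict[str, str]], None]  # hypothetical batched AI call
    pending: dict[str, str] = field(default_factory=dict)   # path -> latest content
    reviewed: dict[str, str] = field(default_factory=dict)  # path -> SHA-256 of last reviewed content

    def on_file_write(self, path: str, content: str) -> None:
        """Cheap per-event step: remember the latest content, make no AI call."""
        self.pending[path] = content

    def on_session_end(self) -> int:
        """Flush once: send only files whose content hash has changed."""
        batch = {}
        for path, content in self.pending.items():
            digest = hashlib.sha256(content.encode("utf-8")).hexdigest()
            if self.reviewed.get(path) != digest:
                batch[path] = content
                self.reviewed[path] = digest
        self.pending.clear()
        if batch:
            self.review(batch)  # one call for the whole session
        return len(batch)
```

Repeated writes to the same file collapse to one entry, and a file whose content hash matches the last reviewed version is skipped, which is what breaks the re-review loop.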

The accumulate-then-flush pattern generalizes to any development tool making AI calls on high-frequency events. File writes, keystrokes, linter passes: all produce events faster than any reasonable AI latency budget. The natural batch boundary is the session end, not the event.

This is the same design principle behind write-ahead logs, event buffers, and batch ETL over streaming. Real-time processing is an anti-pattern when the downstream consumer (model call) is slower than the upstream producer (file write) by an order of magnitude.

PeonNotify v1.0.0 shipped this architecture alongside CodeGuard v2: 26 fixes (W1-W26) addressing linter reliability, file guards, AI review hardening, observability, validation, dedup, timeout handling, and blocking mode.

Transferable insight: When events arrive an order of magnitude faster than an AI call can complete, accumulate at the source and flush at a natural boundary. The session end is almost always the right boundary for development tools.

88% of AI Scoring Spend Produced Nothing

Mixin: deterministic pre-filters

An audit pipeline was paying for AI scoring of 2,860 entities per week. Only 347 (12.1%) passed the quality threshold, so 87.9% of AI calls produced no useful output.

A deterministic pre-scoring gate was implemented: entities failing cheap local checks never reach the AI scorer. Estimated elimination: $52.86/week out of $60.78/week total, or about 87% of spend removed without losing any signal.

The local checks are not approximations of the AI scoring. They are orthogonal filters: null fields, malformed dates, missing required attributes. An entity that fails these checks would not pass the AI scorer either, but the AI call is not needed to confirm that.
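A sketch of such a gate, assuming a hypothetical entity schema with `id`, `name`, and an ISO-formatted `updated` date. The field names and the `ai_score` callable are placeholders for illustration, not the pipeline's real interface.

```python
from datetime import date
from typing import Callable, Iterable

REQUIRED_FIELDS = ("id", "name", "updated")  # hypothetical schema


def passes_local_checks(entity: dict) -> bool:
    """Cheap deterministic checks: required fields present, date well-formed."""
    for key in REQUIRED_FIELDS:
        value = entity.get(key)
        if value is None or value == "":
            return False
    try:
        date.fromisoformat(entity["updated"])
    except (TypeError, ValueError):
        return False
    return True


def score_with_gate(entities: Iterable[dict],
                    ai_score: Callable[[dict], float]) -> dict:
    """Call the expensive scorer only for entities that survive the gate."""
    scores = {}
    for entity in entities:
        if passes_local_checks(entity):
            scores[entity["id"]] = ai_score(entity)
    return scores
```

The gate never estimates the AI score; it only rejects entities that are structurally unusable, so the scorer's output on everything that passes is unchanged.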

The lesson generalizes beyond AI cost. Any expensive gate should be preceded by cheap gates on orthogonal dimensions. Model calls, database queries, network requests, and cryptographic operations all have cheap upstream filters that can eliminate the majority of candidates before the expensive step runs.

A secondary audit on the same day found 804 short dates and 1,316 datetime issues in a production dataset. A timezone-aware parsing and normalization pipeline was built. The gate and the normalization are both upstream-first fixes.
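A minimal sketch of timezone-aware normalization, assuming "short dates" means date-only strings such as `2026-03-15` and that values without an offset should be read as UTC. Both are assumptions, as is the function name; the real pipeline's rules are not described here.

```python
from datetime import datetime, timezone, tzinfo


def normalize_timestamp(raw: str, default_tz: tzinfo = timezone.utc) -> str:
    """Return an ISO 8601 UTC timestamp for a date-only or ISO datetime string."""
    text = raw.strip()
    if len(text) == 10:
        # Date-only input: interpret as midnight on that day.
        parsed = datetime.strptime(text, "%Y-%m-%d")
    else:
        parsed = datetime.fromisoformat(text)
    if parsed.tzinfo is None:
        # Naive input: attach the assumed default zone rather than guess later.
        parsed = parsed.replace(tzinfo=default_tz)
    return parsed.astimezone(timezone.utc).isoformat()
```

Anything that parses comes out in a single canonical form, and anything that does not raises `ValueError`, which can feed the same deterministic gate described above.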

Transferable insight: The most impactful AI cost optimization is not model selection. It is skipping the model call whenever a deterministic check can reject first.

Zeitgeist

@CloudflareDev

Cloudflare /crawl: Entire website crawled with a single API call, no browser management required

19.9K likes, 10.6M views

@oliviscusAI

Marker: CPU-only PDF at 100 pages/sec: No GPU, no browser, 100 pages per second PDF to Markdown conversion

7.9K likes, 1.5M views

@techNmak

LangExtract: structured data with provenance: Structured data extraction from unstructured documents with source provenance tracking

8.2K likes, 727K views

By the Numbers

Metric	Value
Sessions	119
Compute cost	$220.10
Git commits	151
Deploys	0
MQI avg	0.1852
MQI delta	+0.0568
Cache hit rate	99.5%
Oil sessions	36 ($128.71, $3.57/session)
Autosearch sessions	11 ($52.66, $4.79/session)
Coverage tier	SPARSE (2 of 7 days measured)

Changelog

260507: Generated by journalize-weekly (topic-first format, v2 regeneration)