A rolling z-score anomaly detector (2-sigma threshold) on Claude Code session token costs will identify runaway sessions with >= 90% precision and < 5% false positive rate

Changelog
| Date | Summary |
|---|---|
| 2026-04-06 | Audited: chain linked to screenshot-verification (iteration 4), domain tag, last_audited stamped |
| 2026-04-04 | Initial creation |
Hypothesis
We bet that session token costs follow a log-normal distribution: most sessions are cheap (research, quick edits), and expensive sessions are rare outliers. A rolling z-score detector (window: 30 sessions, threshold: 2 sigma) should reliably distinguish “genuinely expensive work” from “runaway agentic loop or configuration error.” The key premise: legitimate expensive sessions cluster predictably (large refactors, deep research), while anomalous sessions spike in unexpected patterns (infinite tool loops, misconfigured agent budget).
The html2canvas and screenshot FIFO sort pitfalls both surfaced during high-activity BloomNet sessions where cost visibility would have provided earlier warning of the excessive debugging cycles. In both cases, the session consumed 3-4x its expected token budget before the issue was noticed. A z-score alert at 2 sigma would have fired before the second hour of debugging.
The hypothesis is that this detector achieves >= 90% precision (flagged sessions are real anomalies) and < 5% false positive rate (legitimate expensive sessions are not flagged): both are required for the alert to be trusted rather than ignored.
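The log-normal premise can be sanity-checked before building anything. A minimal sketch using only the standard library, with a hypothetical cost sample (real values would come from `history.jsonl`):

```python
import math
from statistics import NormalDist, mean, stdev

# Hypothetical per-session costs in USD; real values come from history.jsonl.
costs = [0.12, 0.30, 0.08, 0.45, 0.22, 3.80, 0.15, 0.60, 0.09, 0.33]

# If costs are log-normal, their logs should look normal.
log_costs = sorted(math.log(c) for c in costs)
fitted = NormalDist(mean(log_costs), stdev(log_costs))

# One-sample Kolmogorov-Smirnov statistic: largest gap between the
# empirical CDF of the log-costs and the fitted normal CDF.
n = len(log_costs)
ks = max(
    max((i + 1) / n - fitted.cdf(x), fitted.cdf(x) - i / n)
    for i, x in enumerate(log_costs)
)
print(f"KS statistic: {ks:.3f}")  # small values favor log-normality
```

A proper goodness-of-fit test (e.g. SciPy's `kstest` with a p-value) would replace this once the real data is extracted; the sketch only shows the shape of the check.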
Method
- Data extraction: parse `history.jsonl` (BloomNet’s raw session log) to extract per-session metrics: total tokens, duration, tool call count, cost estimate
- Distribution analysis: fit session costs to a log-normal distribution. Compute the rolling mean and standard deviation of log-costs (window: 30 sessions)
- Z-score computation: for each new session, compute `z = (log(cost) - rolling_mean) / rolling_std`. Flag sessions with z > 2.0
- Validation: compare flagged sessions against known incidents (the screenshot debugging sessions that produced the html2canvas pitfall, and the FIFO sort debugging session)
- Dashboard integration: add a “Cost Anomaly” badge to the BloomNet session list. Red badge for z > 2.0, orange for z > 1.5
- Precision measurement: manually review all flagged sessions from the past 30 days. True positive = session where something went wrong. False positive = legitimately expensive session
- Optional peon-notify: if precision is >= 90%, add a peon-notify hook that fires on z > 2.5
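The extraction and scoring steps above can be sketched together as follows. The record field names (`session_id`, `cost`) are assumptions, since the actual `history.jsonl` schema isn't documented here:

```python
import json
import math
from collections import deque

WINDOW = 30      # rolling window of recent sessions
THRESHOLD = 2.0  # z-score alert level

def detect_anomalies(path="history.jsonl"):
    """Yield (session_id, z) for sessions whose log-cost z-score exceeds THRESHOLD.

    Assumes each JSONL record has 'session_id' and 'cost' fields; the real
    schema of BloomNet's history.jsonl may differ.
    """
    window = deque(maxlen=WINDOW)
    with open(path) as f:
        for line in f:
            rec = json.loads(line)
            log_cost = math.log(rec["cost"])
            if len(window) >= 2:
                m = sum(window) / len(window)
                var = sum((x - m) ** 2 for x in window) / (len(window) - 1)
                std = math.sqrt(var)
                if std > 0:
                    z = (log_cost - m) / std
                    if z > THRESHOLD:
                        yield rec["session_id"], z
            # Anomalies still enter the baseline; a production version might
            # exclude confirmed incidents to avoid inflating the rolling std.
            window.append(log_cost)
```

The dashboard badge step would then map z > 2.0 to red and z > 1.5 to orange on the yielded scores.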
Results
Pending. Will measure:
- Distribution fit (log-normal goodness-of-fit)
- Precision and recall at 2-sigma threshold
- False positive rate
- Time-to-detection vs post-hoc discovery
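Once flagged sessions have been hand-labeled, precision and false positive rate reduce to simple set arithmetic. A sketch with hypothetical session IDs and counts:

```python
def precision_and_fpr(flagged, confirmed_anomalies, legitimate_total):
    """Precision over flagged sessions; FPR over all legitimate sessions.

    flagged: session ids the detector alerted on
    confirmed_anomalies: ids that manual review confirmed as real incidents
    legitimate_total: number of legitimate (non-anomalous) sessions reviewed
    """
    tp = len(flagged & confirmed_anomalies)
    fp = len(flagged - confirmed_anomalies)
    precision = tp / len(flagged) if flagged else 0.0
    fpr = fp / legitimate_total if legitimate_total else 0.0
    return precision, fpr

# Hypothetical review: 4 flags, 3 confirmed incidents, 40 legitimate sessions.
p, fpr = precision_and_fpr({"s3", "s17", "s22", "s29"}, {"s3", "s17", "s22"}, 40)
print(f"precision={p:.2f}, fpr={fpr:.3f}")  # precision=0.75, fpr=0.025
```

Under this framing, the hypothesis passes only if precision >= 0.90 and fpr < 0.05 on the 30-day review window.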
Findings
Pending.
Next Steps
If confirmed, promote the anomaly detection pattern to a reusable component in the bloomnet-refresh skill. Consider applying to other resource metrics (API call volume, git commit frequency).