Experiment Memory: bloomnet

A rolling z-score anomaly detector (2-sigma threshold) on Claude Code session token costs will identify runaway sessions with >= 90% precision and < 5% false positive rate

Tags: monitoring, analytics, anomaly-detection, ops

Result: pending

Changelog

Date        Summary
2026-04-06  Audited: chain linked to screenshot-verification (iteration 4), domain tag, last_audited stamped
2026-04-04  Initial creation

Hypothesis

We bet that session token costs follow a log-normal distribution: most sessions are cheap (research, quick edits), and expensive sessions are rare outliers. A rolling z-score detector (window: 30 sessions, threshold: 2 sigma) should reliably distinguish “genuinely expensive work” from “runaway agentic loop or configuration error.” The key premise: legitimate expensive sessions cluster predictably (large refactors, deep research), while anomalous sessions spike in unexpected patterns (infinite tool loops, misconfigured agent budget).

The html2canvas and screenshot FIFO sort pitfalls both surfaced during high-activity BloomNet sessions where cost visibility would have provided earlier warning of the excessive debugging cycles. In both cases, the session ran 3-4x the expected token budget before the issue was noticed. A z-score alert at 2 sigma would have fired before the second hour of debugging.

The hypothesis is that this detector achieves >=90% precision (flagged sessions are real anomalies) and <5% false positive rate (legitimate expensive sessions are not flagged); both are required for the alert to be trusted rather than ignored.
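The detector described above can be sketched in a few lines. This is a minimal illustration, not the production implementation; the class name and API are hypothetical, and it assumes costs are strictly positive so log(cost) is defined.

```python
import math
from collections import deque
from statistics import mean, stdev


class CostAnomalyDetector:
    """Rolling z-score over log(cost), matching the experiment's
    parameters: window of 30 sessions, 2-sigma threshold."""

    def __init__(self, window=30, threshold=2.0):
        self.history = deque(maxlen=window)  # log-costs of recent sessions
        self.threshold = threshold

    def observe(self, cost):
        """Score a new session cost against the rolling window,
        then add it to the window. Returns (z, flagged)."""
        log_cost = math.log(cost)
        z = None
        if len(self.history) >= 2:
            mu = mean(self.history)
            sigma = stdev(self.history)
            if sigma > 0:
                z = (log_cost - mu) / sigma
        self.history.append(log_cost)
        flagged = z is not None and z > self.threshold
        return z, flagged
```

Note that the anomalous observation itself enters the window, so a single runaway session temporarily inflates the rolling standard deviation; whether to exclude flagged sessions from the window is a design choice the experiment would need to settle.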

Method

  1. Data extraction: parse history.jsonl (BloomNet’s raw session log) to extract per-session metrics: total tokens, duration, tool call count, cost estimate
  2. Distribution analysis: fit session costs to log-normal distribution. Compute rolling mean and standard deviation (window: 30 sessions)
  3. Z-score computation: for each new session, compute z = (log(cost) - rolling_mean) / rolling_std. Flag sessions with z > 2.0
  4. Validation: compare flagged sessions against known incidents (screenshot debugging sessions that produced the html2canvas pitfall, the FIFO sort debugging session)
  5. Dashboard integration: add a “Cost Anomaly” badge to the BloomNet session list. Red badge for z > 2.0, orange for z > 1.5
  6. Precision measurement: manually review all flagged sessions from the past 30 days. True positive = session where something went wrong. False positive = legitimately expensive session
  7. Optional peon-notify: if precision is >= 90%, add a peon-notify hook that fires on z > 2.5
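Steps 1, 3, and 5 can be combined into one pass over the session log. The sketch below assumes history.jsonl is one JSON object per line with `cost` and `id` fields; those field names are assumptions, not confirmed from the actual log schema.

```python
import json
import math
from statistics import mean, stdev


def badge_for(z):
    """Map a z-score to the dashboard badge colors from step 5."""
    if z is None:
        return None
    if z > 2.0:
        return "red"
    if z > 1.5:
        return "orange"
    return None


def score_sessions(lines, window=30):
    """Score each session line against the rolling window of the
    sessions before it. Field names 'cost' and 'id' are assumed."""
    log_costs = []
    results = []
    for line in lines:
        session = json.loads(line)
        log_cost = math.log(session["cost"])
        recent = log_costs[-window:]
        z = None
        if len(recent) >= 2 and stdev(recent) > 0:
            z = (log_cost - mean(recent)) / stdev(recent)
        log_costs.append(log_cost)
        results.append({"id": session.get("id"), "z": z, "badge": badge_for(z)})
    return results
```

Unlike the detector sketch above, this version scores each session against only the sessions before it, which is what a retrospective validation pass (step 4) needs.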

Results

Pending. Will measure:

  • Distribution fit (log-normal goodness-of-fit)
  • Precision and recall at 2-sigma threshold
  • False positive rate
  • Time-to-detection vs post-hoc discovery
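Once the manual review of step 6 labels each flagged-or-not session as a real anomaly or not, precision, recall, and false positive rate fall out of the confusion counts. A minimal sketch (function name is hypothetical):

```python
def detection_metrics(flags, labels):
    """Precision, recall, and false positive rate from detector
    flags vs. manual-review labels (True = real anomaly)."""
    tp = sum(f and l for f, l in zip(flags, labels))
    fp = sum(f and not l for f, l in zip(flags, labels))
    fn = sum(not f and l for f, l in zip(flags, labels))
    tn = sum(not f and not l for f, l in zip(flags, labels))
    precision = tp / (tp + fp) if tp + fp else None
    recall = tp / (tp + fn) if tp + fn else None
    fpr = fp / (fp + tn) if fp + tn else None
    return precision, recall, fpr
```

The experiment's success criteria translate directly: precision >= 0.90 and fpr < 0.05.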

Findings

Pending.

Next Steps

If confirmed, promote the anomaly detection pattern to a reusable component in the bloomnet-refresh skill. Consider applying to other resource metrics (API call volume, git commit frequency).