Breakthrough Memory bloomnet

MQI per-model baselines + multiclaude account detection: quality signals became self-reporting

->

breakthroughmldata-engai-agents
Key Metric
Before
After

MQI Plans F and G shipped together, turning quality metrics from something you query into something that reports itself. Per-(model, metric) top-K baselines in transformed space replaced the single global threshold. Multiclaude account-aware automation detection (Fix C) and account_name passthrough through the quality parser (Fix D) closed two attribution gaps that had been silently misclassifying sessions.

What Happened

Three problems converged in the same week:

Problem 1: Global thresholds were wrong for every model. A single baseline per metric meant Opus sessions were judged by Haiku norms and vice versa. Metrics like depth_ratio and tool_diversity have fundamentally different distributions across model tiers. Plan F introduced per-(model, metric) top-K baselines computed in transformed space (log-normal for skewed metrics, z-score for symmetric ones). The transformation matters: raw percentile ranks collapse tails, hiding the difference between “slightly below average” and “catastrophically degraded.”

Problem 2: Multiclaude sessions were invisible. When multiple Claude Code sessions run under the same OS user (a normal pattern: the user runs parallel sessions constantly per MEMORY.md), the quality parser couldn’t distinguish them. Fix C added account-aware automation detection: each session’s API key fingerprint is matched against known account identities, so a Dakka orchestrator session and a manual coding session on the same machine produce separate quality traces.

Problem 3: Default-account sessions lost their name. The quality parser extracted account_name from the file path, which works for explicitly named sessions but returns empty for the default account. Fix D threaded an account_name_hint parameter through parse_quality_file, falling back to the hint when path extraction fails. Without this, ~30% of sessions were attributed to an empty string and excluded from per-model baseline computation.

Why It Matters

The combination moved MQI from “queryable data store” to “self-reporting system.” The staleness pill on the dashboard now shows time-since-last-healthy-baseline-update, not just time-since-last-row. The /api/health endpoint returns per-model baseline freshness. When a model’s quality degrades, the system surfaces it without anyone running a query.

This is also the first time the baseline computation is aware of the data’s statistical shape. Prior baselines were computed on raw values, which meant a model with 10% more tool calls per session looked dramatically worse on io_ratio even when the underlying behavior was identical: the metric’s log-normal tail inflated the gap.

Evidence

  • Plan F: per-(model, metric) baselines live in baselines table, keyed on (model_id, metric_name), with transform_type and top_k columns.
  • Plan G: staleness pill renders amber at 4h, red at 12h since last baseline refresh.
  • Fix C: multiclaude detection tested with 3 concurrent sessions: each attributed correctly.
  • Fix D: account_name_hint param added to parse_quality_file; default-account sessions now carry correct attribution.
  • /api/health returns per-model baseline ages in response body.