MQI per-model baselines + multiclaude account detection: quality signals became self-reporting
->
MQI Plans F and G shipped together, turning quality metrics from something you query into something that reports itself. Per-(model, metric) top-K baselines in transformed space replaced the single global threshold. Multiclaude account-aware automation detection (Fix C) and account_name passthrough through the quality parser (Fix D) closed two attribution gaps that had been silently misclassifying sessions.
What Happened
Three problems converged in the same week:
Problem 1: Global thresholds were wrong for every model. A single baseline per metric meant Opus sessions were judged by Haiku norms and vice versa. Metrics like depth_ratio and tool_diversity have fundamentally different distributions across model tiers. Plan F introduced per-(model, metric) top-K baselines computed in transformed space (log-normal for skewed metrics, z-score for symmetric ones). The transformation matters: raw percentile ranks collapse tails, hiding the difference between “slightly below average” and “catastrophically degraded.”
Problem 2: Multiclaude sessions were invisible. When multiple Claude Code sessions run under the same OS user (a normal pattern: the user runs parallel sessions constantly per MEMORY.md), the quality parser couldn’t distinguish them. Fix C added account-aware automation detection: each session’s API key fingerprint is matched against known account identities, so a Dakka orchestrator session and a manual coding session on the same machine produce separate quality traces.
Problem 3: Default-account sessions lost their name. The quality parser extracted account_name from the file path, which works for explicitly named sessions but returns empty for the default account. Fix D threaded an account_name_hint parameter through parse_quality_file, falling back to the hint when path extraction fails. Without this, ~30% of sessions were attributed to an empty string and excluded from per-model baseline computation.
Why It Matters
The combination moved MQI from “queryable data store” to “self-reporting system.” The staleness pill on the dashboard now shows time-since-last-healthy-baseline-update, not just time-since-last-row. The /api/health endpoint returns per-model baseline freshness. When a model’s quality degrades, the system surfaces it without anyone running a query.
This is also the first time the baseline computation is aware of the data’s statistical shape. Prior baselines were computed on raw values, which meant a model with 10% more tool calls per session looked dramatically worse on io_ratio even when the underlying behavior was identical: the metric’s log-normal tail inflated the gap.
Evidence
- Plan F: per-(model, metric) baselines live in
baselinestable, keyed on(model_id, metric_name), withtransform_typeandtop_kcolumns. - Plan G: staleness pill renders amber at 4h, red at 12h since last baseline refresh.
- Fix C: multiclaude detection tested with 3 concurrent sessions: each attributed correctly.
- Fix D:
account_name_hintparam added toparse_quality_file; default-account sessions now carry correct attribution. /api/healthreturns per-model baseline ages in response body.